What price is that chair?

Now that we are able to get the contents of a page, there isn't much we can't do! We are just one step away from making our programs understand almost any piece of information on the internet.

What's left is telling the program what the information is and what it looks like, so it can be found.

In order to do this, we need to be able to understand the content of the pages. The page content is written in HTML, which is extremely similar in structure to XML. We can use a library called BeautifulSoup to allow our program to search through the HTML code.

Installing the required library

As before, we first have to install the library and then import it. Add the following line to your requirements.txt:

beautifulsoup4==4.4.0

Then, go into your main Python file (which I tend to call app.py), and import the library at the top, like so:

from bs4 import BeautifulSoup

What this import statement means is that installing the beautifulsoup4 library actually downloads a package called bs4. This package contains a section of code (called a class, but we will look at this later on) called BeautifulSoup. That is the code we are importing, and now we can start using it!

Parsing page contents

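The class BeautifulSoup accepts at least one parameter: the page content to parse. An optional second parameter names the parser to use; "html.parser" is the one built into Python. Let's include it in our program: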

import requests
from bs4 import BeautifulSoup

r = requests.get("http://google.com")
content = r.content
soup = BeautifulSoup(content, "html.parser")  # "html.parser" is Python's built-in HTML parser

Now the soup variable contains ways to search through the content (by tag name and by attributes). For example, if the page content looked like this:

<data>
  <students>
    <student id="3514m">
      <name>Jose</name>
      <subject>Computer Science</subject>
    </student>
    <student id="881749h">
      <name>Rolf</name>
      <subject>Computer Science</subject>
    </student>
  </students>
</data>

We could perform a search using BeautifulSoup like so:

soup.find("student", {"id": "3514m"})

This would find a <student> tag (of which there are two) that has the attribute "id" with the value "3514m". Easy, isn't it?
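To try this search without making a request, we can parse the sample content straight from a string. This is only a small sketch: the variable name page_content is made up for the illustration, and find_all() is the BeautifulSoup method that returns every matching tag rather than just the first one.

```python
from bs4 import BeautifulSoup

# The sample content from above, stored in a string (hypothetical variable name)
page_content = """
<data>
  <students>
    <student id="3514m">
      <name>Jose</name>
      <subject>Computer Science</subject>
    </student>
    <student id="881749h">
      <name>Rolf</name>
      <subject>Computer Science</subject>
    </student>
  </students>
</data>
"""

soup = BeautifulSoup(page_content, "html.parser")

# find() returns the first tag that matches the name and the attributes
student = soup.find("student", {"id": "3514m"})
print(student["id"])  # 3514m

# We search for the <name> tag with find("name"), because student.name
# would give us the name of the tag itself ("student")
print(student.find("name").text)  # Jose

# find_all() returns a list of every matching tag
for each_student in soup.find_all("student"):
    print(each_student["id"], each_student.find("subject").text)
```

The loop at the end prints one line per student, so we can see that both <student> tags were found.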

What chair do we want?

Now, let's go to an online store's website and find an item you are interested in. We are going to make a program that will tell us whether we should buy the item or not, based on our budget.

Note: some websites may not work because they block traffic coming from robots, which is what our program is. If you finish the program and it does not work, this may be the reason. Check the page content, as it may give you some information.
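One way to check what a website sent back is to look at the status code and the first few characters of the page. The helper below is only a sketch: describe_response is a made-up name, and treating the status codes 403 (Forbidden) and 429 (Too Many Requests) as signs of being blocked is an assumption, since every website behaves differently.

```python
def describe_response(status_code, body, preview_length=200):
    """Return a short summary of a response (hypothetical helper)."""
    if status_code == 200:
        verdict = "OK"
    elif status_code in (403, 429):
        # Assumption: these codes often mean the site refused our robot
        verdict = "possibly blocked or rate-limited"
    else:
        verdict = "unexpected status"

    # Only keep the start of the page so the output stays readable
    preview = body[:preview_length].strip()
    return "{} ({}): {}".format(status_code, verdict, preview)


print(describe_response(403, "<html><body>Access denied</body></html>"))
```

With a real response from requests.get(), we would pass in r.status_code and r.text.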

The site I found was http://johnlewis.com/items/. I know this page works, so it may be a good idea to follow along using that, and then change it afterwards when you know it all works.

Open the page in your browser, right-click the large price text, and press "Inspect Element". This will bring up the contents of the page in a panel. It looks very complicated, but it has exactly the same kind of structure as the XML we looked at previously, only on a larger scale.

In the page content, we want to find a line which contains the large price tag that we 'inspected'. It will look something like this:

<span itemprop="price" class="now-price"> £115.00 </span>

Then, let's go into our Python program and make the necessary changes:

import requests
from bs4 import BeautifulSoup

r = requests.get("http://johnlewis.com/items/")
content = r.content
soup = BeautifulSoup(content, "html.parser")
element = soup.find("span", {"itemprop": "price", "class": "now-price"})

print(element.text)

We still have to import both required libraries, but now our request is going to the online store website. We're looking for a "span" tag that has two attributes: itemprop, with value "price" and class, with value "now-price".

When you run this program, you should see something like this printed out!


                  £115.00

It is a bit strange how it has all that whitespace around the price! Let's get rid of it using the strip() method, which we can call on any string in Python.

import requests
from bs4 import BeautifulSoup

r = requests.get("http://johnlewis.com/items/")
content = r.content
soup = BeautifulSoup(content, "html.parser")
element = soup.find("span", {"itemprop": "price", "class": "now-price"})

print(element.text.strip())

And now, running this should give us the expected result:

£115.00
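To see exactly what strip() removes, it can help to try it on a string we write ourselves. In this sketch the variable raw_text is made up to imitate the price text from the page, and repr() is used so that the whitespace is visible when printed.

```python
# A made-up string imitating the price text, with newlines and spaces around it
raw_text = "\n\n        £115.00\n    "

print(repr(raw_text))           # '\n\n        £115.00\n    '
print(repr(raw_text.strip()))   # '£115.00'

# lstrip() and rstrip() only remove whitespace from one side
print(repr(raw_text.lstrip()))  # '£115.00\n    '
print(repr(raw_text.rstrip()))  # '\n\n        £115.00'

# strip() returns a new string; the original is left unchanged
print(raw_text == raw_text.strip())  # False
```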

Removing the pound sign

Slicing

In Python it is easy to split a string into parts, or to get only the part of the string that we want. It works like this:

my_string = "hello, world!"
my_string_too = my_string[:]  # This copies the string 'my_string' into 'my_string_too'

hello = my_string[0:5]  # This copies the characters from index 0 up to (but not including) index 5 into 'hello'
print(hello)  # Would print "hello"

world = my_string[7:12]  # This copies the characters from index 7 up to (but not including) index 12 into 'world'
print(world)  # Would print "world"

all_hello = my_string[:5]  # Just like hello, but the first 0 can be omitted
print(all_hello)  # Would print "hello"

world_all = my_string[7:]  # This copies from character 7 onwards
print(world_all)  # Would print "world!"

Modifying the price string

So now it is easy to go back and remove the pound sign from the price string:

import requests
from bs4 import BeautifulSoup

r = requests.get("http://johnlewis.com/items/")
content = r.content
soup = BeautifulSoup(content, "html.parser")
element = soup.find("span", {"itemprop": "price", "class": "now-price"})

price = element.text.strip()
price_no_currency = price[1:]  # This would copy all except the index 0, which is the pound sign

print(price_no_currency)

The price is still a string though. This means we cannot add it to another price to calculate, for example, total prices. Let's convert it to a number.

Remember the price has a decimal point, so we cannot convert it to an int, as integers are whole numbers. We need to convert it to a float instead: a floating-point number.

import requests
from bs4 import BeautifulSoup

r = requests.get("http://johnlewis.com/items/")
content = r.content
soup = BeautifulSoup(content, "html.parser")
element = soup.find("span", {"itemprop": "price", "class": "now-price"})

price = element.text.strip()
price_no_currency = price[1:]
price_number = float(price_no_currency)

print(price_number)
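One thing to watch out for: float() fails if the text contains anything other than a plain number, such as the comma in "1,115.00". The sketch below wraps the three steps (strip, slice, convert) in a function and removes commas first. The name parse_price is made up, and it assumes the first character is always a currency symbol, which may not hold for every website.

```python
def parse_price(price_text):
    """Turn text like ' £1,115.00 ' into a float (hypothetical helper)."""
    cleaned = price_text.strip()

    # Assumption: the first character is the currency symbol
    without_currency = cleaned[1:]

    # float() cannot handle thousands separators, so remove them
    without_commas = without_currency.replace(",", "")
    return float(without_commas)


print(parse_price(" £115.00 "))  # 115.0
print(parse_price("£1,115.00"))  # 1115.0
```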

That's it! Our program now knows the price of the chair, so we can write a little more code to compare it against a budget. We already know how to do this!

import requests
from bs4 import BeautifulSoup

user_budget = int(input("What is your budget? Enter a whole number: "))

r = requests.get("http://johnlewis.com/items/")
content = r.content
soup = BeautifulSoup(content, "html.parser")
element = soup.find("span", {"itemprop": "price", "class": "now-price"})

price = element.text.strip()
price_no_currency = price[1:]
price_number = float(price_no_currency)

if price_number > user_budget:
    print("This item is over your budget... Sorry!")
else:
    print("Let's buy it!")
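If the website blocks our program or changes its layout, soup.find() returns None, and calling .text on it stops the program with an error. The sketch below shows one possible way to guard against that. The function name extract_price and the two sample pages are made up so the example runs without making a request.

```python
from bs4 import BeautifulSoup


def extract_price(html):
    """Return the price as a float, or None if no price tag is found."""
    soup = BeautifulSoup(html, "html.parser")
    element = soup.find("span", {"itemprop": "price", "class": "now-price"})

    # find() gives back None when nothing matches
    if element is None:
        return None

    price = element.text.strip()
    return float(price[1:])  # Skip index 0, the pound sign


# Made-up page contents for the illustration
sample_page = '<div><span itemprop="price" class="now-price"> £115.00 </span></div>'
blocked_page = "<html><body>Access denied</body></html>"

print(extract_price(sample_page))   # 115.0
print(extract_price(blocked_page))  # None
```

In the budget program we could then check for None first and print a message asking the user to look at the page content, instead of comparing the price.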

Good job getting here. This is the end of the second section, so next we'll be working on creating a blog!

The next section will introduce Object-Oriented Programming as well as our first database system that we are going to use: MongoDB.
