Monday, April 18, 2016

Web Scraping in Python for beginners

My need for web scraping grew out of an inconvenience I ran into while searching for a trimmer on an e-commerce website. When I entered the keyword 'trimmer', the website showed information such as the item name and the price, but I also wanted to know the warranty for each item. To find that out, I had to click on each link in turn, which was a painful process. If you have run into similar situations and would like to know the quick steps for a simple web scrape, this blog is an attempt at just that.
So, let's consider the steps to retrieve this information.
  1. Go to the website and search for trimmers. Save the contents of the page somewhere.
  2. From the search results, open the first link.
  3. Extract the name of the item, the price of the item and the warranty.
  4. Open a file and save the results as a new line in a CSV file.
  5. Repeat steps 2 to 4.
I have tried my best to avoid technical jargon, but it will be easier if you know a little about HTML tags and functions. Let's get started.
Tools of the trade... To write code in Python you will need an editor (a window where you can write the code). I would recommend downloading an editor such as 'atom' or 'sublime' for this purpose. You will also need a folder to store your files, so go ahead and create one. Open either 'sublime' or 'atom', create a new file, and save it in the newly created folder with a '.py' extension. Let's name this file 'my_first_web_scrap.py'. The code also uses two packages, requests and beautifulsoup4, which you can install with pip if you do not already have them.
The CODE... Let's get back to our steps and approach the code one step at a time.
Step 1. Go to the website and search for trimmers. Save the contents of the page somewhere.
#Import two packages that allow Python to go to the website and get the source code
        import requests
        from bs4 import BeautifulSoup
#Instruct Python to get the source from the URL in double quotes and save it in a variable called 'request'
        request = requests.get("http://www.flipkart.com/search?q=trimmers&otracker=start&as-show=on&as=off").text
#BeautifulSoup helps organize the webpage into a nested structure. You don't need to know HTML/CSS to follow along, but a quick read on how elements are organized in a page will help you while extracting data. We store the result in a variable called 'soup'. We also name the parser, 'html.parser' (built into Python), so that BeautifulSoup does not have to guess one.
        soup = BeautifulSoup(request, "html.parser")
Please take some time to read this block, as the rest of the code is based on it.
Now we have the HTML file. This is the same file you see when you right-click on the webpage and click ‘View Page Source’ in a Chrome browser. Go ahead and open that page. The first item in the search results was ‘Philips QT4005/15 Trimmer (Black)’ for me. When you search for this item in the source page, you can see that there is a link that takes you to the item’s own page. Now we need to tell Python to go to this link, and to do that we need to extract the link.
As you can see, this link is inside an “href”. If you have no idea what an href is, ignore it for now; you can google it later. On closer inspection you will see that the href belongs to a block named “fk-display-block”, which in turn sits inside a larger block named “pu-details lastUnit”. These names are called classes. If you’re not familiar with HTML you can ignore this for now, but I strongly encourage you to learn a little about HTML and CSS.
So, back to our code. Now I am going to tell Python to find all the blocks (each block starts with a ‘<>’ and ends with a ‘</>’) that start with <div and are followed by class = “pu-details lastUnit”. In Step 2 we will pull the <a link with class = “fk-display-block” out of each of these blocks.
#Go through 'soup' and find all the 'div' tags whose class is 'pu-details lastUnit'. Each hit is stored, one at a time, in the variable 'line'.
        for line in soup.find_all('div', class_='pu-details lastUnit'):
Now the variable ‘line’ holds one of these blocks, and inside it is a link like ‘<a class="fk-display-block" data-tracking-id="prd_title" href="/philips-qt4005-15-trimmer/p/itmdze53vthypqhb?pid=SHVDGGZPC8PXJ7HR&al=WEMoZ3qy9WifHC5MlOdF%2FcldugMWZuE7eGHgUTGjVrpjizeD%2FNvlpAEwWx6I1Qy9R9ViMaFmI%2Bc%3D&ref=L%3A556047148293374932&srno=b_1" title="Philips QT4005/15 Trimmer (Black) ">’.
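To see the Step 1 search on its own, here is a minimal sketch that runs on a small hand-written HTML string instead of the live page, so it works offline. The sample HTML, its two item links and the function name find_item_links are made up for illustration; only the class names 'pu-details lastUnit' and 'fk-display-block' come from the page described above, and the real page may well look different today.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the search results page (not real Flipkart markup).
SAMPLE_SEARCH_HTML = """
<html><body>
  <div class="pu-details lastUnit">
    <a class="fk-display-block" href="/item-one/p/abc" title="Item One">Item One</a>
  </div>
  <div class="pu-details lastUnit">
    <a class="fk-display-block" href="/item-two/p/def" title="Item Two">Item Two</a>
  </div>
  <div class="something-else"><a href="/not-an-item">Ad</a></div>
</body></html>
"""


def find_item_links(html):
    """Return the href of the first link inside every 'pu-details lastUnit' block."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for block in soup.find_all("div", class_="pu-details lastUnit"):
        link = block.find("a")
        # Skip blocks that have no link, or a link without an href.
        if link is not None and link.has_attr("href"):
            links.append(link["href"])
    return links


print(find_item_links(SAMPLE_SEARCH_HTML))
```

The third block has a different class, so it is ignored. That is the whole point of searching by class name.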
Step 2. From the search results click on the available links
#Similar to the above step, we ask Python to find the first 'a' tag inside the block and read its 'href' value
            div = line.find('a')['href']
#Now we concatenate this link with Flipkart's homepage URL.
            url_one = "https://www.flipkart.com" + str(div)
#We then ask Python to fetch the HTML of the page and save it in a variable called 'request_one'
            request_one = requests.get(url_one).text
#Like before, we convert 'request_one' into a nested structure and call it 'soup_one'
            soup_one = BeautifulSoup(request_one, "html.parser")
            print("processing info") #Just a small message to let the user know some processing is going on.
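As a side note on the link-building part of Step 2, Python's standard library has a helper, urljoin, that can do the same job as gluing the two strings together. This is only a sketch of an alternative, not what the code in this post uses; the names BASE_URL and build_item_url are made up for illustration.

```python
from urllib.parse import urljoin

# Homepage URL used in this post.
BASE_URL = "https://www.flipkart.com"


def build_item_url(href):
    """Turn an href taken from the search page into a full URL."""
    # urljoin copes with hrefs that start with '/', and it leaves
    # hrefs that are already complete URLs untouched.
    return urljoin(BASE_URL, href)


print(build_item_url("/philips-qt4005-15-trimmer/p/itmdze53vthypqhb"))
```

The advantage over plain concatenation is small here, but it saves you from ending up with a broken address if a link on the page ever turns out to be a complete URL already.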
Step 3 : Extract the name of the item, the price of the item and the warranty.
#Go through 'soup_one' and find all the 'span' tags whose class is 'selling-price omniture-field'. This is exactly how we found the block for each item in Step 1.
            for price in soup_one.find_all('span', class_='selling-price omniture-field'):
#Save the 'text' value of the hit to a variable
                cost = price.text
#Replace the commas in the price with blanks. We do this because a comma inside the price would be read as a column separator in our csv file and mess up the output.
                cost = cost.replace(',', '')
#Go through 'soup_one' and find all the 'span' tags whose class is 'warranty-text'. Similar to Step 1.
                for warranty in soup_one.find_all('span', class_="warranty-text"):
                    print("Fetching warranty") #Just showing a message to the user.
#Fetch the text value and store it in a variable
                    a = warranty.text
#Convert the value to a string.
                    s = str(a)
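Step 3 can also be tried out offline. The sketch below pulls the name, price and warranty out of a small hand-written item page. The sample HTML, the price and warranty values in it, and the function name extract_item are all made up for illustration; only the two class names come from the code above. Unlike the loops above, it uses find, which returns just the first hit, or None when nothing matches, so the sketch checks for None before reading the text.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one item page (not real Flipkart markup).
SAMPLE_ITEM_HTML = """
<html><head><title>Sample Trimmer (Black)  </title></head><body>
  <span class="selling-price omniture-field">Rs. 1,299</span>
  <span class="warranty-text">2 Years Warranty</span>
</body></html>
"""


def extract_item(html):
    """Return (name, price, warranty); a missing field becomes an empty string."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.title
    price_tag = soup.find("span", class_="selling-price omniture-field")
    warranty_tag = soup.find("span", class_="warranty-text")

    name = title_tag.text.strip() if title_tag is not None else ""
    # Drop the commas so the price does not split into two csv columns.
    cost = price_tag.text.replace(",", "").strip() if price_tag is not None else ""
    warranty = warranty_tag.text.strip() if warranty_tag is not None else ""
    return name, cost, warranty


print(extract_item(SAMPLE_ITEM_HTML))
```

Returning empty strings for missing fields is just one possible choice. It keeps every row at three columns even when a page has no warranty listed.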
Step 4 : Open a file and save the results as a csv file in a new line
#Open a file in the local folder, in write mode, where you want to store the extracted values. You don't have to create the file manually in the folder; running this command does that for you. The 'w' tells Python that you are creating/opening this file to write some information in it. In the full code this line sits at the top, before the loop, so that the file is opened only once.
        f = open("file.csv", 'w')
#Write the obtained values to the file that we created. soup_one.title.text.rstrip() gets the name of the item, and the '\n' at the end starts a new line for the next item.
                    f.write("{},{},{}\n".format(soup_one.title.text.rstrip(), cost, s))
#Once the loop has finished, let the user know and close the file.
        print("file written")
        f.close()
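One thing to watch in Step 4: if an item name itself contains a comma, a hand-built line will end up with an extra column. Python's built-in csv module handles this by putting quotes around such values. Here is a minimal sketch of that alternative. The function name write_rows and the sample row are made up for illustration, and it writes to an in-memory buffer so that nothing is created on disk.

```python
import csv
import io


def write_rows(file_obj, rows):
    """Write each (name, price, warranty) row as one line of csv."""
    writer = csv.writer(file_obj)
    for row in rows:
        writer.writerow(row)


# Hypothetical row; note the comma inside the item name.
buffer = io.StringIO()
write_rows(buffer, [("Sample Trimmer, Black", "1299", "2 Years Warranty")])
print(buffer.getvalue())
```

To write to a real file instead, you could pass in a file opened with open("file.csv", "w", newline=""). The newline="" part lets the csv module take care of the line endings itself.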
Step 5 : Repeat 2 to 4
This is taken care of by the ‘for’ loop that we wrote in Step 1. After handling the first result, it moves on through the rest of the document to get the next result, and keeps writing the results into the file until it reaches the end of the document.
See the full code here. Please note that Python is very sensitive about indentation.
import requests
from bs4 import BeautifulSoup
# Open the output file once, before the loop starts.
f = open("file.csv", 'w')
request = requests.get("http://www.flipkart.com/search?q=trimmers&otracker=start&as-show=on&as=off").text
soup = BeautifulSoup(request, "html.parser")
# One pass of this loop per item in the search results.
for classs in soup.find_all('div', class_='pu-details lastUnit'):
    div = classs.find('a')['href']
    url_one = "https://www.flipkart.com" + str(div)
    request_one = requests.get(url_one).text
    soup_one = BeautifulSoup(request_one, "html.parser")
    print("processing info")
    for price in soup_one.find_all('span', class_='selling-price omniture-field'):
        cost = price.text
        cost = cost.replace(',', '')
        print("fetching price")
        for warranty in soup_one.find_all('span', class_="warranty-text"):
            print("Fetching warranty")
            a = warranty.text
            # The '\n' puts each item on its own line in the file.
            f.write("{},{},{}\n".format(soup_one.title.text.rstrip(), cost, str(a)))
print("file written")
f.close()
          
