Web Scraping in Python

Posted on Jan 01, 2014

Web scraping is the process of extracting useful information from a website. It is sometimes also referred to as web mining, web crawling or web parsing, and the terms are often used interchangeably. Python provides many modules and libraries for the purpose. Modules like urllib and requests are used to fetch the complete HTML text of a webpage, from which the data can then be extracted. Beautiful Soup is a Python library that parses that HTML and makes it easy to search. The scraping process starts with requesting a webpage from a Python script, reading the response as HTML text, building a Beautiful Soup object from that text, and finally applying its methods to pull out the data you need. The following tutorial will guide you through scraping a webpage.

Modules required: urllib (or requests, mechanize, etc.) and BeautifulSoup. The snippets below target Python 3 and Beautiful Soup 4 (the bs4 package).
urllib is part of the Python standard library, so only BeautifulSoup has to be installed. To install it on your machine, run the following command.

pip install beautifulsoup4

Setup: First, import the required modules

from bs4 import BeautifulSoup
import urllib.request

HTML Response: The next step is to send a request to a webpage and get the response in the form of HTML text. I have used urllib for this purpose. The basic snippet looks like this:

url = "http://www.google.com"
htmlfile = urllib.request.urlopen(url)
# On Python 3, read() returns bytes; BeautifulSoup accepts bytes or text
htmltext = htmlfile.read()
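The request step can be wrapped in a small helper. This is only a minimal sketch for Python 3: the name fetch_html, the timeout value and the UTF-8 fallback are my assumptions, not something urllib requires. The demonstration uses a data: URL, which carries the page inside the URL itself, so the sketch runs without a network connection; a normal http:// address can be passed the same way.

```python
from urllib.parse import quote
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the body of `url` decoded to text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes on Python 3
        # Use the charset declared in the Content-Type header, else assume UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


# Sample page, percent-encoded into a data: URL so no network is needed.
sample_page = "<html><head><title>Sample</title></head><body></body></html>"
sample_url = "data:text/html;charset=utf-8," + quote(sample_page)

htmltext = fetch_html(sample_url)
print(htmltext)
```

Decoding to text up front is optional, since BeautifulSoup also accepts raw bytes, but it makes the result easier to print and inspect.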

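The soup: Using the HTML text, we create a BeautifulSoup object. Once it is created, the soup variable is ready for the methods and attributes that extract data. Naming the parser explicitly ("html.parser" ships with Python) keeps the result the same from one machine to the next.

soup = BeautifulSoup(htmltext, "html.parser")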

Once the soup object is initialized, we have access to various methods and syntaxes to retrieve the data. Here is a small list:
Selecting by Tags:

soup.head
soup.head.title
soup.body.h2.a
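To see what tag selection returns, here is a small sketch run against an inline page instead of a live site. The HTML string is made up for the illustration. Note that asking for a tag that does not exist gives None, so chaining further (for example soup.body.h3.a) would raise an AttributeError.

```python
from bs4 import BeautifulSoup

# Hypothetical page used only for this illustration.
html = (
    "<html><head><title>Demo page</title></head>"
    "<body><h2><a href='/first'>First link</a></h2>"
    "<p>Some text</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

print(soup.head.title)         # the whole <title> tag
print(soup.head.title.string)  # only the text inside it
print(soup.body.h2.a["href"])  # an attribute of the first <a> inside the first <h2>
print(soup.body.h3)            # None, because the page has no <h3>
```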

Selecting all tags of a particular type:

soup.find_all('p')
for anchor in soup.find_all('a'):
    print(anchor)
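Usually you want something from each tag rather than the tag itself. The sketch below collects the href of every anchor on a made-up page; the HTML and the variable names are assumptions for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical page: two paragraphs and three anchors, one without href.
html = """
<p>First paragraph with a <a href="https://example.com/a">link</a>.</p>
<p>Second paragraph with an <a name="top">anchor without href</a>.</p>
<a href="/relative">Another link</a>
"""
soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.find_all("p")
print(len(paragraphs))

# .get() returns None instead of raising when the attribute is missing.
hrefs = [a.get("href") for a in soup.find_all("a")]
print(hrefs)

# Passing href=True keeps only the anchors that have an href attribute.
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)
```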

Selecting a particular tag with some attribute:

soup.find('div', attrs={"class": "classname"})
soup.find_all('p', attrs={"id": "idname"})
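Here is a sketch of attribute filtering on a made-up page (the class and id names are assumptions). It also shows the keyword forms class_ and id, and what comes back when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for the illustration.
html = """
<div class="card featured"><p id="intro">Hello</p></div>
<div class="card"><p>World</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

# attrs= takes a dict; a class filter matches if any one of the tag's classes matches.
first_card = soup.find("div", attrs={"class": "card"})
print(first_card["class"])

# class_ is the keyword form (class is a reserved word in Python).
cards = soup.find_all("div", class_="card")
print(len(cards))

# id can be passed directly as a keyword argument.
intro = soup.find("p", id="intro")
print(intro.string)

# find() returns None when nothing matches; find_all() returns an empty list.
missing_div = soup.find("div", attrs={"class": "missing"})
missing_ps = soup.find_all("p", attrs={"id": "missing"})
print(missing_div, missing_ps)
```

Because find() can return None, check the result before reading attributes from it.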

Parent, Children and Sibling

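For the following scenario (ignoring the whitespace between the tags, which the parser keeps as separate text nodes):
<head>
<title> titlename </title>
<a>random</a>
</head>

soup.title.parent # head
soup.head.children # iterator over title and a
soup.title.next_sibling # a
soup.a.previous_sibling # title
soup.a.string # random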
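One thing that trips people up: when there are line breaks between tags, the nearest sibling is the line break itself, not the next tag. The sketch below shows both cases on the scenario above; the variable names are mine. find_next_sibling and find_previous_sibling skip over text nodes and return the matching tag.

```python
from bs4 import BeautifulSoup

# No whitespace between tags, so siblings are the tags themselves.
compact = "<head><title>titlename</title><a>random</a></head>"
compact_soup = BeautifulSoup(compact, "html.parser")

print(compact_soup.title.parent.name)                   # head
print([c.name for c in compact_soup.head.children])     # ['title', 'a']
print(compact_soup.title.next_sibling.name)             # a
print(compact_soup.a.previous_sibling.name)             # title
print(compact_soup.a.string)                            # random

# With line breaks between tags, the nearest sibling is a text node.
spaced = "<head>\n<title>titlename</title>\n<a>random</a>\n</head>"
spaced_soup = BeautifulSoup(spaced, "html.parser")

print(spaced_soup.title.next_sibling == "\n")               # True
print(spaced_soup.title.find_next_sibling("a").name)        # a
print(spaced_soup.a.find_previous_sibling("title").name)    # title
```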

Here is the full code to scrape all anchor tags on google.com:

import urllib.request
from bs4 import BeautifulSoup

url = "http://www.google.com"
htmlfile = urllib.request.urlopen(url)
htmltext = htmlfile.read()
soup = BeautifulSoup(htmltext, "html.parser")
# Print the page title, then every anchor tag on the page
print(soup.head.title.string)
for link in soup.find_all('a'):
    print(link)
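The parsing half of the script can also be written as a reusable function that returns plain data. This is a sketch under my own assumptions: the name extract_links, the sample page and the example.com addresses are made up. It resolves relative links against the page address with urljoin and runs on an inline string, so it needs no network access.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(htmltext, base_url):
    """Return (page title, list of absolute link URLs); hypothetical helper."""
    soup = BeautifulSoup(htmltext, "html.parser")
    # soup.title is None when the page has no <title> tag.
    title = soup.title.string if soup.title else None
    # href=True skips anchors without an href; urljoin makes relative links absolute.
    links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
    return title, links


# Hypothetical page standing in for a downloaded response.
sample = """
<html><head><title>Sample</title></head>
<body>
<a href="/about">About</a>
<a href="https://example.org/docs">Docs</a>
<a name="no-href">Skipped</a>
</body></html>
"""
title, links = extract_links(sample, "https://example.com/index.html")
print(title)
for link in links:
    print(link)
```

Keeping the download and the parsing separate makes the parsing easy to try on saved pages before pointing the script at a live site.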

Check out my web scraping series in Python repository on GitHub. I have scraped websites like google.finance, yahoo.finance, bloomberg, google.movies, horoscope websites, nytimes.com, irctc.gov.in, weather websites etc. Here is the link. Feel free to discuss and share.