Monday, August 29, 2011

Python Web Scraping

There is time where there is information in govt website of is very useful, but unfortunately the data is in form of website, it could be worst as it can be in PDF. So it can be a pain if we wanted to use information for programming, but there is no API.

On the other python is a pretty powerful language. It comes with many library, include those that can be use to do HTTP request. Introducing urllib2, it is part of standard library. To use it to download data from a website can be done in 3 line of code
import urllib2 
page = urllib2.urlopen("url")
The problem, then is you get a whole set of HTML, which a bit hard to process. Then python have a few third party library, the one I use is Beautiful Soup. Beautiful Soup is nice that it is very forgiving in processing bad markup in HTML. So you don't need to worry about bad format and focus to get things done. The library itself can also parse XML, among other thing.

To use Beautiful Soup,
from BeautifulSoup
import BeautifulSoup
page = "html goes here"soup BeautifulSoup(page)
value = soup.findAll('div')
print value[0].text
But you need to get the html first don't you?

import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("url")
soup = BeautifulSoup(page)
value = soup.findAll('div')
print value[0].text
To use it, just download the data using urllib2 and pass to to beautiful soup. To use it is pretty easy, to me anyway. Though, urllib2 is going to be re organized in python 3. So code need some modification.

To see how the scraper fare, here is a real world example, in github part of a bigger project. But hey it is open source. Just fork and use it, in the this link.

So enjoy go forth and extract some data, and promise to be nice, don't hammer their server.

No comments:

Post a Comment