On the other hand, Python is a pretty powerful language. It comes with many libraries, including ones that can be used to make HTTP requests. Enter urllib2, which is part of the standard library. Using it to download data from a website takes only a couple of lines of code:

import urllib2
page = urllib2.urlopen("url")

The problem, then, is that you get back a whole blob of HTML, which is a bit hard to process. Luckily Python has a few third-party libraries for that; the one I use is Beautiful Soup. What is nice about Beautiful Soup is that it is very forgiving when parsing bad HTML markup, so you don't need to worry about broken formatting and can focus on getting things done. The library can also parse XML, among other things.
As for using Beautiful Soup, you need to get the HTML first, don't you?

from BeautifulSoup import BeautifulSoup

page = "html goes here"
soup = BeautifulSoup(page)
value = soup.findAll('div')
print value[0].text
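For contrast, and purely as an illustration that is not from the original post, here is what extracting the text of div tags looks like with only the standard library's html.parser (Python 3 spelling). It takes far more ceremony, and it will not repair bad markup the way Beautiful Soup does, which is exactly why the forgiveness matters.

```python
# Stdlib-only sketch (an assumption, not the post's approach): collect
# the text inside <div> tags by hand with html.parser.
from html.parser import HTMLParser

class DivText(HTMLParser):
    """Gather the text found inside each top-level <div>."""
    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <div>s we are currently inside
        self.texts = []   # one string of text per top-level <div>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth == 0:
                self.texts.append("")
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.texts[-1] += data

parser = DivText()
parser.feed('<html><body><div>hello</div><div>world</div></body></html>')
print(parser.texts[0])   # -> hello
```

Note that unlike Beautiful Soup, this parser simply stops making sense on badly nested tags; there is no error recovery at all.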
To combine the two, just download the data using urllib2 and pass it to Beautiful Soup. Using it is pretty easy, to me anyway. Note, though, that urllib2 is going to be reorganized in Python 3, so the code will need some modification.

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("url")
soup = BeautifulSoup(page)
value = soup.findAll('div')
print value[0].text
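Since that reorganization is mentioned above, here is a minimal sketch of what the same scraper could look like on Python 3, assuming urllib2.urlopen maps to urllib.request.urlopen and that Beautiful Soup 4 is installed as the bs4 package (both assumptions to check against your versions; "url" is still just a placeholder, not a real address).

```python
# Python 3 sketch of the same scraper (an assumption, not the original
# post's code): urllib2.urlopen became urllib.request.urlopen, and
# Beautiful Soup 4 ships as the bs4 package.
from urllib.request import urlopen

def fetch_div_text(url):
    """Download the page at url and return the text of its <div> tags."""
    from bs4 import BeautifulSoup   # pip install beautifulsoup4
    page = urlopen(url)             # url must be a real address here
    soup = BeautifulSoup(page, "html.parser")
    return [div.get_text() for div in soup.find_all("div")]
```

In Beautiful Soup 4 the camelCase findAll also became find_all, so old code needs the same kind of light renaming as the urllib2 imports.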
To see how the scraper fares, here is a real-world example on GitHub, part of a bigger project. But hey, it is open source: just fork and use it, at this link.
So enjoy: go forth and extract some data, and promise to be nice; don't hammer their servers.