Recently, thanks to Kaeru, I was introduced to ScraperWiki. This is basically a service for running scrapers, with additional benefits:
- It runs in the cloud.
- It provides infrastructure to store the data, in the form of an SQLite database, which you can download.
- It provides an easy way to dump the data to Excel.
- It provides infrastructure to expose the data as an API.
- Anybody can fork a scraper and enhance it.
- A web-based IDE, so you write your scraper right in the browser.
- Everybody can see the code of public scrapers.
- Scheduled tasks.
One very cool thing about ScraperWiki is that it supports a set of third-party libraries. It supports Ruby and PHP, as well as Python. The ScraperWiki API is pretty extensive: it covers the scrapers themselves, geocoding functions, views of the data hosted on ScraperWiki, and so on.
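To make this concrete, here is roughly the kind of scrape-and-extract step you would write there, sketched in plain Python using only the standard library. This is my own illustration, not ScraperWiki's API: the `LinkExtractor` class and the HTML snippet are made up, and in a real scraper you would fetch a live page (with urllib, or one of the libraries the service provides) instead of a hard-coded string.

```python
from html.parser import HTMLParser

# A minimal link extractor: collect the href of every <a> tag.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# Stand-in for a fetched page; a real scraper would download this.
html = '<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # the extracted hrefs: ['/a', '/b']
```

The same loop, pointed at a real URL and paired with the save call, is essentially all a basic scraper on the service consists of.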
My only concern is that if I ever want to move my scraper out of the service, I will need to rewrite the saving function. But then, the data can be downloaded anyway, and I use Python, so it is not that big of a deal.
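Rewriting the saving function really is a small job. A minimal stand-in, using only Python's built-in sqlite3 module, might look like this (the table name, the `save_row` helper, and the flat-record assumption are my own sketch, not ScraperWiki's implementation):

```python
import sqlite3

def save_row(conn, table, unique_key, data):
    """Insert or replace one flat record, creating the table on first use."""
    cols = list(data)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS %s (%s, PRIMARY KEY (%s))"
        % (table, ", ".join(cols), unique_key)
    )
    conn.execute(
        "INSERT OR REPLACE INTO %s (%s) VALUES (%s)"
        % (table, ", ".join(cols), ", ".join("?" for _ in cols)),
        [data[c] for c in cols],
    )
    conn.commit()

conn = sqlite3.connect(":memory:")  # or a file, like the one you download
save_row(conn, "swdata", "id", {"id": 1, "name": "first"})
save_row(conn, "swdata", "id", {"id": 1, "name": "updated"})  # replaces row 1
rows = conn.execute("SELECT id, name FROM swdata").fetchall()
print(rows)  # [(1, 'updated')]
```

With something like this in place, the rest of the scraper code can move off the service unchanged.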
Below is a scraper that I have written on ScraperWiki. While it is mostly a work in progress, it shows what a scraper there looks like.