Saturday, October 29, 2011

Converting PDF to Text

So I have recently involved with a project to extract data from PDF. Which is actually evil, but that is not important now.

On linux there is a set of utilities comes with xpdf program. It should be part of the default package installation, if not, you just apt-get or yum it.

On windows you can go to the gunwin32 page, I just download the zip just so i would not have to remove it with a uninstaller.

I don't really need the layout information, on it. so I just use pdftotext.

On windows
program_location/pdftotext.exe -layout pdf_file.pdf

On linux, just
pdftotext -layout pdf_file.pdf

The -layout would maintain the layout of the text as from the pdf. Otherwise, the positioning for certain text will be inconsistent.


Monday, October 24, 2011

Saturday, October 08, 2011

A scraper running on the cloud

I have been writing scraper for sometime, as you can see in some of my old post here.

So recently thanks to Kaeru, introduced to me, scraperwiki. This is basically a service for you to run scraper on the cloud, with additional benefits:

  • It runs on the cloud
  • It provide infrastructure to store the data, in form of sqlite database, which you can download.
  • It provide easy way to dump data as excel
  • It provide infrastructure to convert the data into API 
  • Somebody can fork the scraper and do enhancement on it. 
  • A web based IDE, so you just write your scraper on it. 
  • Everybody can see the code of the public scraper. 
  • Scheduled task
One very cool thing about scraper wiki is, it support a set of third large library that can be used. It support Ruby, PHP, as well as Python. The API for scraper wiki is pretty extensive, it both covers it's own scraper, geocoding function, views for the data hosted on scraper wiki etc. 

My only concern is, let say I want bring my scraper out of the service, I will need to rewrite the saving function. But on the the data can be downloaded anyway, and I use python, so it is not that big of a deal. 

Below is a scraper that I have written, on scraper wiki. While it is mostly a work in progress, it show how it would look like.