sweemeng's tech adventure: October 2011

Saturday, October 29, 2011

Converting PDF to Text

So I have recently been involved in a project to extract data from PDF files, which is actually evil, but that is not important now.

On Linux there is a set of utilities that comes with the xpdf program. It should be part of the default package installation; if not, you can just apt-get or yum it.

On Windows you can go to the GnuWin32 page. I just downloaded the zip so that I would not have to remove it with an uninstaller later.
http://gnuwin32.sourceforge.net/packages/xpdf.htm

I don't really need the layout information in it, so I just use pdftotext.

On Windows

program_location/pdftotext.exe -layout pdf_file.pdf

On Linux, just

pdftotext -layout pdf_file.pdf

The -layout option keeps the text laid out as it is in the PDF. Without it, the positioning of certain text will be inconsistent.
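If the extraction needs to happen inside a script, the same command can be wrapped from Python. This is only a minimal sketch: the function names build_pdftotext_command and pdf_to_text are made up for illustration, and it assumes pdftotext is on the PATH (on Windows, pass the full path to pdftotext.exe as executable). Giving "-" as the output file asks pdftotext to write the text to standard output instead of a .txt file.

```python
import subprocess


def build_pdftotext_command(pdf_path, layout=True, executable="pdftotext"):
    """Build the argument list for pdftotext (hypothetical helper).

    The trailing "-" tells pdftotext to send the text to stdout.
    """
    command = [executable]
    if layout:
        # Same -layout option as on the command line above.
        command.append("-layout")
    command.extend([pdf_path, "-"])
    return command


def pdf_to_text(pdf_path, layout=True, executable="pdftotext"):
    """Run pdftotext and return the extracted text as a string.

    Assumes pdftotext is installed; raises CalledProcessError if it fails.
    """
    command = build_pdftotext_command(pdf_path, layout, executable)
    result = subprocess.run(command, stdout=subprocess.PIPE, check=True)
    # Decode leniently, since the output encoding can vary between PDFs.
    return result.stdout.decode("utf-8", errors="replace")
```

Usage would be something like text = pdf_to_text("pdf_file.pdf"), after which the text can be split into lines and parsed.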

Cheers

Monday, October 24, 2011

My Post On Robots Making

I have started a series of posts on making robots with Arduino on the hackerspacekl website. You can find out more at this link:
http://www.hackerspace.my/2011/10/24/making-a-robot-with-arduino-part-1-intro-to-motor-controller/trackback

Saturday, October 08, 2011

A scraper running on the cloud

I have been writing scrapers for some time, as you can see in some of my old posts here.

Recently, Kaeru introduced me to ScraperWiki. It is basically a service for running scrapers on the cloud, with additional benefits:

It runs on the cloud.
It provides infrastructure to store the data in the form of a SQLite database, which you can download.
It provides an easy way to dump the data as Excel.
It provides infrastructure to turn the data into an API.
Anyone can fork a scraper and enhance it.
It has a web-based IDE, so you just write your scraper in it.
Everybody can see the code of a public scraper.
It supports scheduled tasks.

One very cool thing about ScraperWiki is that it supports a large set of third-party libraries. It supports Ruby and PHP as well as Python. The ScraperWiki API is pretty extensive: it covers its own scrapers, geocoding functions, views for the data hosted on ScraperWiki, etc.

My only concern is that if I want to move my scraper out of the service, I will need to rewrite the saving function. But then the data can be downloaded anyway, and I use Python, so it is not that big of a deal.
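To give an idea of how small that rewrite could be, here is a rough sketch of a stand-in saving function built on Python's standard sqlite3 module. It is not ScraperWiki's own code: the function name save, its arguments and the default table name are assumptions made for illustration, and it assumes every record for a table has the same keys. A record is a dict, and unique_keys names the columns that identify a row, so saving the same key again replaces the old row.

```python
import sqlite3


def save(conn, unique_keys, data, table_name="swdata"):
    """Insert or replace one record in a local SQLite table (hypothetical helper).

    Assumes all records saved to a table share the same set of keys.
    """
    columns = list(data)
    column_list = ", ".join('"%s"' % c for c in columns)
    conn.execute('CREATE TABLE IF NOT EXISTS "%s" (%s)' % (table_name, column_list))

    if unique_keys:
        # Remove any existing row with the same unique key values.
        where = " AND ".join('"%s" = ?' % k for k in unique_keys)
        conn.execute(
            'DELETE FROM "%s" WHERE %s' % (table_name, where),
            [data[k] for k in unique_keys],
        )

    placeholders = ", ".join("?" for _ in columns)
    conn.execute(
        'INSERT INTO "%s" (%s) VALUES (%s)' % (table_name, column_list, placeholders),
        [data[c] for c in columns],
    )
    conn.commit()
```

A scraper would open a connection with sqlite3.connect("scraper.db") and then call save(conn, ["url"], {"url": ..., "title": ...}) for each record it finds.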

Below is a scraper that I have written on ScraperWiki. It is mostly a work in progress, but it shows what one looks like.

https://scraperwiki.com/scrapers/malaysian_parliament_hansard_url/