Saturday, December 03, 2011

Converting PDF to Text with pdftohtml

Previously I tried to extract information from PDFs by converting them to text, as described here.

The problem is that a big wall of text is very hard to process.
Here comes pdftohtml, which is part of the poppler package on Linux. GnuWin does not have it for Windows, though, which is one reason I used pdftotext.

pdftohtml converts PDF to HTML. The simplest usage is
pdftohtml yourpdffile.pdf
You will get your HTML file, but it is a bit plain, as it only extracts the text. If there are images inside the PDF, or your PDF is pretty complicated, like the Malaysian Hansard, you can use the -c option:
pdftohtml -c yourpdffile.pdf
Here is the catch: it generates one HTML file per page of the PDF, with images. The layout is maintained, though. For a document like the Malaysian Hansard, that means hundreds of pages.

Then there is a way to produce XML:
pdftohtml -xml yourpdffile.pdf
You will get an XML file which includes the position information.
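
To give an idea of how that position information could be used, here is a rough Python sketch. It assumes the XML has page elements containing text elements with top and left attributes, which is the shape I would expect from pdftohtml -xml, but check your own output because it may differ between versions. The sample XML and the extract_text_boxes function are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the shape pdftohtml -xml is assumed to produce.
# Real output is usually much longer and also carries font details.
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1263" width="892">
<text top="100" left="80" width="200" height="16" font="0"><b>DEWAN RAKYAT</b></text>
<text top="130" left="80" width="300" height="14" font="1">First line of the page</text>
</page>
<page number="2" position="absolute" top="0" left="0" height="1263" width="892">
<text top="95" left="80" width="250" height="14" font="1">Second page text</text>
</page>
</pdf2xml>"""


def extract_text_boxes(xml_string):
    """Return one dict per text element: page number, position, and text."""
    root = ET.fromstring(xml_string)
    boxes = []
    for page in root.iter("page"):
        page_number = int(page.get("number"))
        for text in page.iter("text"):
            boxes.append({
                "page": page_number,
                "top": int(text.get("top")),
                "left": int(text.get("left")),
                # itertext() also collects text nested in <b> or <i> tags
                "text": "".join(text.itertext()).strip(),
            })
    return boxes


boxes = extract_text_boxes(SAMPLE_XML)
for box in boxes:
    print(box["page"], box["top"], box["left"], box["text"])
```

For a real file you would read it with ET.parse("yourpdffile.xml").getroot() instead of the sample string. With top and left for every piece of text, you can group lines by position rather than guessing from a wall of text.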

P.S. I'm using this for http://opendataday.org/. We will see whether there is a result today.
 
 
