sweemeng's tech adventure: Converting PDF to Text with pdftohtml

Saturday, December 03, 2011

Converting PDF to Text with pdftohtml

Previously I tried to extract information from PDFs by converting them to text, as described here.

The problem is that a big wall of text is very hard to process.
This is where pdftohtml comes in. It is part of the poppler package on Linux, but GnuWin does not provide it for Windows, which is one reason I use pdftotext.

pdftohtml converts PDF to HTML. The simplest usage is:

pdftohtml yourpdffile.pdf

You will get your HTML file, but it is a bit plain because only the text is extracted. If there are images inside the PDF, or your PDF is pretty complicated, like the Malaysian Hansard, you can use the -c option:

pdftohtml -c yourpdffile.pdf

Here is the catch: it will generate one HTML file per page of the PDF, with images, although the layout is maintained. For a document like the Malaysian Hansard, that means hundreds of pages.

Then there is a way to produce XML:

pdftohtml -xml yourpdffile.pdf

You will get an XML file with the position information.
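As a rough illustration of what the position information makes possible, here is a minimal Python sketch that rebuilds lines of text from the XML. It assumes the output has page elements containing text elements with top and left attributes, which is worth checking against your own file. The SAMPLE_XML string and the lines_by_page function are hypothetical names made up for this example, not part of pdftohtml.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample imitating the assumed shape of `pdftohtml -xml` output.
# A real file will be much larger and may carry extra elements.
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="300" width="80" height="16" font="0">world</text>
<text top="100" left="50" width="200" height="16" font="0"><b>Hello</b></text>
<text top="140" left="50" width="300" height="16" font="0">Second line</text>
</page>
</pdf2xml>
"""


def lines_by_page(xml_text):
    """Return {page number: [line, ...]} built from text positions."""
    root = ET.fromstring(xml_text)
    pages = {}
    for page in root.iter("page"):
        rows = defaultdict(list)
        for text in page.iter("text"):
            # itertext() also collects text nested in <b> or <i> children.
            content = "".join(text.itertext()).strip()
            if content:
                top = int(text.get("top"))
                left = int(text.get("left"))
                rows[top].append((left, content))
        # Order rows top to bottom, and chunks in a row left to right.
        pages[int(page.get("number"))] = [
            " ".join(chunk for _, chunk in sorted(rows[top]))
            for top in sorted(rows)
        ]
    return pages


if __name__ == "__main__":
    for number, lines in lines_by_page(SAMPLE_XML).items():
        print(number, lines)
```

For a real document, read the generated XML file into a string and pass it to the function. The sketch only groups chunks that share exactly the same top value, so real PDFs will probably need some tolerance when deciding what counts as the same line.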

P.S. I'm using this for http://opendataday.org/. We will see whether there is any result today.
