Saturday, October 29, 2011

Converting PDF to Text

So I have recently been involved with a project to extract data from PDFs. Which is actually evil, but that is not important now.

On Linux there is a set of utilities that comes with the xpdf program. It should be part of the default package installation; if not, you can just install it with apt-get or yum.

On Windows you can go to the GnuWin32 page. I just downloaded the zip so I would not have to remove it with an uninstaller later.
http://gnuwin32.sourceforge.net/packages/xpdf.htm

I don't really need the full layout information from the PDF, so I just use pdftotext.

On Windows:
program_location/pdftotext.exe -layout pdf_file.pdf

On Linux, just:
pdftotext -layout pdf_file.pdf

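The -layout option keeps the text laid out as closely as possible to how it appears in the PDF. Without it, the positioning of certain text will be inconsistent. The output goes to a text file with the same name as the PDF (pdf_file.txt here).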
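
If you have more than one file to convert, the same command can be wrapped in a small script. This is only a rough sketch in Python, not something from the original project: the function names build_command and convert are made up for illustration, and it assumes pdftotext is on your PATH (on Windows, pass the full path to pdftotext.exe as exe).

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, layout=True, exe="pdftotext"):
    """Return the pdftotext argument list for one PDF file (hypothetical helper)."""
    pdf = Path(pdf_path)
    cmd = [exe]
    if layout:
        # Same -layout flag as in the commands above.
        cmd.append("-layout")
    # pdftotext takes the input PDF followed by an optional output text file.
    # Here the output is named explicitly: same name, .txt extension.
    cmd += [str(pdf), str(pdf.with_suffix(".txt"))]
    return cmd


def convert(pdf_path, layout=True, exe="pdftotext"):
    """Run pdftotext on one file and return the path of the text file written."""
    cmd = build_command(pdf_path, layout, exe)
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(cmd, check=True)
    return Path(cmd[-1])
```

Calling convert("pdf_file.pdf") should do the same as the command above, and you can loop over Path(".").glob("*.pdf") to convert a whole folder.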

Cheers
