Archive for January 9th, 2010

Extracting Text from PDF Documents

Mary had to extract an extensive table from a PDF (think financial statements) and found that a simple cut-and-paste failed with mysterious symptoms. I’ve had that happen, too: PDF documents sometimes have a complete disconnect between the rendered page and the underlying text. Sometimes you can’t even select the text block you want without also grabbing huge chunks of the surrounding document, and sometimes the paste produces meaningless junk.

Easy solution: feed the PDF into pdftotext and extract the table from the ensuing flat text file.

It’s a command-line thing:

pdftotext -layout whatever.pdf

That produces whatever.txt, a plain text file with the characters in more-or-less the same spatial arrangement as the original PDF, minus all the font and graphic frippery. It tends to insert a ton of blanks in an attempt to make the formatting come out right, which may not be quite what you want.
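If the ton of blanks gets in the way (say, you want to paste the table into a spreadsheet), one way to tame it is to squeeze each run of blanks down to a single tab. This is only a sketch: squeeze_columns is a made-up helper name, and it assumes the columns are separated by at least two spaces while the words inside a cell are separated by just one.

```shell
# Hypothetical helper (not part of pdftotext): trim each line, then turn
# every run of two or more spaces into a single tab character.
# Reads the named files, or standard input if no file is given.
squeeze_columns() {
    tab=$(printf '\t')   # a literal tab, portable across sed flavors
    sed -E -e 's/^ +//' -e 's/ +$//' -e "s/ {2,}/${tab}/g" "$@"
}
```

Something like squeeze_columns whatever.txt > whatever.tsv should then give a tab-separated file. Columns that pdftotext happened to separate by only one blank will still run together, so eyeball the result.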

Omitting the -layout option gives you something vaguely resembling the PDF, although precisely arranged tables tend to fare poorly.

If you have a bazillion-page PDF document and need the text from just a page or two, feed it into the pdftk brush chipper, extract the appropriate pages, and then run those files through pdftotext. You can probably get similar results using just pdftk, but pdftotext seems to work better on the files I’ve had to deal with.
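The two-step chipper run might look something like this. It is a sketch rather than a polished tool: pdf_pages_to_text is a made-up function name, the file names and page numbers in the usage line are placeholders, and it assumes pdftk, pdftotext, and mktemp are all on your PATH.

```shell
# Hypothetical helper: pull pages FIRST..LAST out of a big PDF with pdftk,
# then convert just those pages to layout-preserving text with pdftotext.
# Usage: pdf_pages_to_text input.pdf first last output.txt
pdf_pages_to_text() {
    src=$1; first=$2; last=$3; out=$4
    # Scratch directory for the small intermediate PDF
    tmpdir=$(mktemp -d) || return 1
    pdftk "$src" cat "${first}-${last}" output "$tmpdir/pages.pdf" &&
        pdftotext -layout "$tmpdir/pages.pdf" "$out"
    status=$?
    rm -rf "$tmpdir"
    return "$status"
}
```

For example, pdf_pages_to_text whatever.pdf 12 13 table.txt should leave the text of pages 12 and 13 in table.txt. It’s also worth knowing that pdftotext has -f and -l options for the first and last page to convert, so pdftotext -f 12 -l 13 -layout whatever.pdf table.txt skips the intermediate file entirely; whether that does as well on a troublesome PDF is something you’ll have to try.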

This is a GNU/Linux thing; the programs are likely part of your favorite distribution; follow the links if not. If you’re still using Windows, maybe they’ll work for you, but maybe it’d be easier to just go buy something similar.