
Extracting Text from PDF Documents

Mary had to extract an extensive table from a PDF (think financial statements) and found that a simple cut-and-paste failed with mysterious symptoms. I’ve had that happen, too, as PDF documents sometimes have a complete disconnect between the rendered page and the original text; sometimes you can’t even select the text block you want without copying huge chunks of the surrounding document or pasting meaningless junk.

Easy solution: feed the PDF into pdftotext and extract the table from the ensuing flat text file.

It’s a command-line thing:


pdftotext -layout whatever.pdf

That produces whatever.txt, a plain-text file with the characters in more-or-less the same spatial arrangement as the original PDF, minus all the font and graphic frippery. It pads the lines with a ton of blanks in an attempt to make the formatting come out right, which may not be quite what you want.
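
If all those blanks get in the way, one option is to turn each run of blanks into a single tab, which most spreadsheets will split into columns. This is only a sketch: squeeze_blanks is a made-up helper name, and it assumes the table's columns are separated by at least two blanks while the text inside a cell uses single spaces.

```shell
# squeeze_blanks: read pdftotext -layout output on stdin, write tab-separated text.
# Assumption: two or more consecutive blanks mark a column boundary.
squeeze_blanks() {
    awk '{
        sub(/^ +/, "")        # drop the leading indentation
        gsub(/  +/, "\t")     # each run of 2+ blanks becomes one tab
        print
    }'
}

# Hypothetical usage, with whatever.txt produced by pdftotext -layout:
#   squeeze_blanks < whatever.txt > whatever.tsv
```

Tables with empty cells will come out with too few columns, so expect to eyeball the result.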

Omitting the -layout option gives you something vaguely resembling the PDF, although precisely arranged tables tend to fare poorly.

If you have a bazillion-page PDF document and need the text from just a page or two, feed it into the pdftk brush chipper, extract the appropriate pages, and then run those files through pdftotext. You can probably get similar results using just pdftk, but pdftotext seems to work better on the files I’ve had to deal with.
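
A rough sketch of that two-step dance might look like the following. The extract_pages wrapper is hypothetical, and the file names, page numbers, and scratch file location are placeholders rather than anything from the original job.

```shell
# extract_pages SOURCE.pdf OUTPUT.txt PAGES...
# Pull the given pages out with pdftk, then convert just those pages to text.
extract_pages() {
    src=$1
    out=$2
    shift 2                                   # the remaining arguments are page ranges
    tmp="${TMPDIR:-/tmp}/pages.$$.pdf"        # scratch file for the extracted pages

    pdftk "$src" cat "$@" output "$tmp" &&
        pdftotext -layout "$tmp" "$out"
    status=$?

    rm -f "$tmp"
    return $status
}

# Hypothetical usage: pages 12 through 13 plus page 40 of big.pdf
#   extract_pages big.pdf table.txt 12-13 40
```

Passing the page ranges as separate arguments lets pdftk pick out discontiguous pages in one go.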

This is a GNU/Linux thing; the programs are likely part of your favorite distribution; follow the links if not. If you’re still using Windows, maybe they’ll work for you, but maybe it’d be easier to just go buy something similar.

  1. #1 by Hrap on 9-January-2010 - 16:04

    thanks for the tip!

  2. #2 by Memo on 8-May-2013 - 18:37

    In fact you do not even need pdftk to extract text from only some pages of a large PDF. pdftotext accepts the parameters -f and -l to specify the first and last page to process.

    • #3 by Ed on 8-May-2013 - 20:29

      There’s a command-line switch for everything!

      The only advantage of pdftk is that you can specify discontiguous sets of pages, but I think we didn’t need that much flexibility. Next time we need to do that sort of thing, I’ll give the pdftotext parameters a try…

      Thanks for the tip!
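
For the record, the -f and -l approach from the comments would look something like this. The text_from_pages wrapper is a made-up name, and the file names and page numbers are placeholders.

```shell
# text_from_pages FIRST LAST SOURCE.pdf OUTPUT.txt
# Convert only pages FIRST through LAST, skipping the pdftk step entirely.
text_from_pages() {
    pdftotext -f "$1" -l "$2" -layout "$3" "$4"
}

# Hypothetical usage: pages 12 through 13 of big.pdf
#   text_from_pages 12 13 big.pdf table.txt
```

This handles a single contiguous range per run; discontiguous pages still need either several runs or the pdftk route.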
