Cropping Images in a PDF

For reasons not relevant here, I had a PDF made from scanned page images with far too much whitespace around the Good Stuff. As with all scanned pages, the margins contain random artifacts that inhibit automagic cropping, so manual intervention was required.

Extract the images as sequentially numbered JPG files:

pdfimages -j mumble.pdf mumble

Experimentally determine how much whitespace to remove, then:

for f in mumble-0??.jpg ; do convert -verbose "$f" -shave 225x150 "${f%.*}a.jpg" ; done
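For the record, -shave takes its geometry off both sides: 225x150 trims 225 pixels from the left and the right edges, plus 150 pixels from the top and the bottom. A quick sanity check of the resulting size, as a minimal sketch (the shaved_size helper and the 2550x3300 page size are my own illustration, not measurements from the PDF in question):

```shell
# shaved_size WIDTH HEIGHT SHAVE_X SHAVE_Y  (hypothetical helper)
# -shave removes SHAVE_X pixels from each of the left and right edges
# and SHAVE_Y pixels from each of the top and bottom edges.
shaved_size() {
    echo "$(( $1 - 2 * $3 ))x$(( $2 - 2 * $4 ))"
}

# Assumed example: a letter-size page scanned at 300 dpi is 2550x3300 pixels
shaved_size 2550 3300 225 150    # prints 2100x3000
```

If the Good Stuff isn't centered on the page, a symmetric shave won't do and something like -crop with explicit offsets is probably the better tool.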

You could use mogrify to shave the images in place, but leaving the originals untouched simplifies the iteration process: every trial starts from the original images.
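One way to run that iteration is to try several candidate margins on a single representative page and eyeball the results side by side. This is only a sketch: the trial_name helper, the choice of page 005, and the candidate values are all assumptions for illustration.

```shell
# trial_name FILE GEOMETRY  (hypothetical helper)
# Builds an output name tagged with the shave geometry, so trials
# neither overwrite each other nor touch the original image.
trial_name() {
    printf '%s-shave%s.jpg\n' "${1%.*}" "$2"
}

# Arbitrary representative page; pick one with typical margins
page=mumble-005.jpg

# Only run if the page exists and ImageMagick's convert is installed
if [ -f "$page" ] && command -v convert >/dev/null 2>&1; then
    for g in 200x125 225x150 250x175; do
        convert "$page" -shave "$g" "$(trial_name "$page" "$g")"
    done
fi
```

The trial files get names like mumble-005-shave225x150.jpg, which match neither the mumble-0??.jpg nor the mumble-0??a.jpg glob, so they stay out of the way of the real run.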

Stuff the cropped images back into a PDF:

convert mumble-0??a.jpg mumble-shaved.pdf

Profit!


  1. #1 by Vedran on 2016-11-18 - 08:45

    If only you posted this couple of weeks ago :)

    • #2 by Ed on 2016-11-18 - 08:59

      I sorta-kinda knew about ImageMagick’s -shave option: now we both know and, perhaps, can find it more easily the next time.

      Even though I know ImageMagick can do what I need, I must start by finding somebody who did something close to what I want, follow their hint into the huge list of options, rummage around to find the true syntax, blow half an hour fiddling with values, and finally finish the original task. Sometimes, an interruption derails me enough to lose track of what I started out to do, so writing the results here lets me short-circuit that whole process the next time around…

  2. #3 by Keith Neufeld on 2016-11-18 - 10:24

    I’m surprised to see you running scanned images through JPEGs — I find the spray of errant pixels around text and line drawings quite objectionable and I always scan to PNGs (and use them as intermediate images for any processing). Perhaps this was a collection of photographs?

    • #4 by Frans on 2016-11-18 - 11:19

      Partially depends on the JPEG quality setting, but most likely it’s just a copier function you’ve got little control over. If you’re lucky you can get TIFF or some such out of it…

    • #5 by Ed on 2016-11-18 - 11:33

      The PDF wasn’t a free variable: it came with scanned page images in whatever format PDFs use internally. Extracting as PPM or JPG made no eyeballometric difference, despite the JPG files being 256 kB and the PPM over 11 MB, so JPG it was!

      • #6 by Keith Neufeld on 2016-11-18 - 12:14

        PPMs are egregiously inefficient, to quote the spec. I use PNGs where possible — they’re lossless and usually compress as well as JPEGs, sometimes better — but I’m glad you found something that works for you.

        • #7 by Ed on 2016-11-18 - 14:08

          “I use PNGs where possible”

          Aye!

          The gotcha: pdfimages spits out PPM by default, or JPG, and that’s exactly all the choice you (well, I) get.

          • #8 by Frans on 2016-11-18 - 16:04

            Hm? I assure you there’s a -png option as well. See, e.g., http://fransdejonge.com/2014/10/fixing-up-scanned-pdfs-with-scan-tailor/

            • #9 by Ed on 2016-11-18 - 17:31

              Must be a different version, as this 0.24.5 offers only -j. I do RTFM, although I’ll grant that I should bump this box to something more recent than Xubuntu 14.04.

            • #10 by Frans on 2016-11-19 - 04:22

              Weird. When I wrote that blog post I would’ve been using Debian Jessie (in testing) or possibly still Debian Wheezy (but likely with wheezy-backports). That means something in between 0.18 and 0.26 for poppler-utils rather than the current 0.44 in Xubuntu 16.10 and 0.48 in Debian Stretch (testing).

            • #11 by Frans on 2016-11-19 - 04:24

              Ah, I know. It must be some compile-time option or library availability issue. :)

            • #12 by Ed on 2016-11-19 - 09:04

              Must. Stifle. Comment.

  3. #13 by scruss2 on 2016-11-19 - 13:32

    There are, naturally, two branches of pdfimages with very slightly different features. Both support -j, which essentially just hoiks out any DCT-encoded images it finds in the PDF page stream. Not all JPEGs in PDF files can be viewed as such, for inane compatibility reasons. But a simple JPEG→PDF bundling program like https://github.com/josch/img2pdf allows you to archive JPEGs complete with metadata in a PDF that’s actually viewable, and you’ll be able to recover your files intact if you need ’em later.

    If you’re looking to clean up scanned notes, Matt Zucker’s noteshrink.py (https://mzucker.github.io/2016/09/20/noteshrink.html) can do an uncannily good job of image segmentation and background denoising.

    • #14 by Ed on 2016-11-20 - 08:49

      “clean up scanned notes”

      Now that is a neat hack! I’ll try running my shop doodles through that; the current (manual) results aren’t all that good.

      Of course, it requires a version of numpy higher than I can get with Xubuntu 12.04, so this is just another reason why I must update this box.

      Thanks for the pointer!

    • #15 by Frans on 2016-11-20 - 09:03

      Interesting. It’s like unpaper but seems to be better at color.