The Smell of Molten Projects in the Morning

Ed Nisley's Blog: Shop notes, electronics, firmware, machinery, 3D printing, laser cuttery, and curiosities. Contents: 100% human thinking, 0% AI slop.

Cropping Images in a PDF

For reasons not relevant here, I had a PDF made from scanned page images with far too much whitespace around the Good Stuff. As with all scanned pages, the margins contain random artifacts that inhibit automagic cropping, so manual intervention was required.

Extract the images as sequentially numbered JPG files:

pdfimages -j mumble.pdf mumble
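
A quick sanity check after extraction, as a minimal sketch: pdfimages numbers its output mumble-000, mumble-001, and so on, and with -j any image not stored as a JPEG inside the PDF comes out as PPM or PBM instead, so counting the JPG files shows whether every page made it. The count_pages name is made up for this example:

```shell
# count_pages is a hypothetical helper, not part of pdfimages.
# It counts files named ROOT-NNN.jpg (three digits) in the current directory.
count_pages() {
    set -- "$1"-[0-9][0-9][0-9].jpg
    # An unmatched glob stays literal, so check that the first name exists
    if [ -e "$1" ] ; then
        echo "$#"
    else
        echo 0
    fi
}

# Usage, after running: pdfimages -j mumble.pdf mumble
# count_pages mumble
```

Note that the mumble-0??.jpg pattern used in the later steps covers only pages 000 through 099, so a longer document would need a wider glob.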

Experimentally determine how much whitespace to remove, then:

for f in mumble-0??.jpg ; do convert -verbose "$f" -shave 225x150 "${f%.*}a.jpg" ; done
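
For reference, -shave WxH trims W pixels from both the left and right edges and H pixels from both the top and bottom, so the result is 2W narrower and 2H shorter. A minimal sketch of the arithmetic; the shaved_size helper and the 2550x3300 page size (US Letter at 300 dpi) are assumptions for illustration, not values from the actual scans:

```shell
# shaved_size is a hypothetical helper: WIDTH HEIGHT SHAVE-GEOMETRY -> new size
shaved_size() {
    sx=${3%x*}    # pixels removed from each of the left and right edges
    sy=${3#*x}    # pixels removed from each of the top and bottom edges
    echo "$(( $1 - 2 * sx ))x$(( $2 - 2 * sy ))"
}

shaved_size 2550 3300 225x150    # prints 2100x3000
```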

You could use mogrify to shave the images in-place. However, not modifying the files simplifies the iteration process by always starting with the original images.
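
Keeping the originals also makes it cheap to try several shave values on a single page before committing to the whole batch. A minimal sketch, assuming ImageMagick's convert is on the PATH; the try_shaves and trial_name helpers and the output naming are invented for this example:

```shell
# trial_name: hypothetical helper, page.jpg + 225x150 -> page-shave-225x150.jpg
trial_name() {
    printf '%s-shave-%s.jpg\n' "${1%.*}" "$2"
}

# try_shaves: hypothetical helper that shaves one page by each geometry in turn,
# writing a separate output file per geometry and leaving the original alone.
try_shaves() {
    page=$1
    shift
    for geom in "$@" ; do
        convert "$page" -shave "$geom" "$(trial_name "$page" "$geom")"
    done
}

# Usage:
# try_shaves mumble-003.jpg 200x125 225x150 250x175
```

The trial files match neither mumble-0??.jpg nor mumble-0??a.jpg, so they stay out of the way of the batch commands.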

Stuff the cropped images back into a PDF:

convert mumble-0??a.jpg mumble-shaved.pdf
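
As an alternative for this last step, img2pdf (suggested in the comments below) is documented as embedding JPEG data in the PDF without re-encoding it. A minimal sketch, assuming img2pdf is installed; the bundle_jpegs wrapper and its DRY_RUN switch are invented for this example:

```shell
# bundle_jpegs: hypothetical wrapper, OUTPUT.pdf followed by the input images.
# Set DRY_RUN=echo to print the img2pdf command instead of running it.
bundle_jpegs() {
    out=$1
    shift
    ${DRY_RUN:-} img2pdf "$@" -o "$out"
}

# Usage:
# bundle_jpegs mumble-shaved.pdf mumble-0??a.jpg
```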

Profit!

Comments

15 responses to “Cropping Images in a PDF”

  1. Vedran

    If only you had posted this a couple of weeks ago :)

    1. Ed

      I sorta-kinda knew about ImageMagick’s -shave option: now we both know and, perhaps, can find it more easily the next time.

      Even though I know ImageMagick can do what I need, I must start by finding somebody who did something close to what I want, follow their hint into the huge list of options, rummage around to find the true syntax, blow half an hour fiddling with values, and finally finish the original task. Sometimes, an interruption derails me enough to lose track of what I started out to do, so writing the results here lets me short-circuit that whole process the next time around…

  2. Keith Neufeld

    I’m surprised to see you running scanned images through JPEGs — I find the spray of errant pixels around text and line drawings quite objectionable and I always scan to PNGs (and use them as intermediate images for any processing). Perhaps this was a collection of photographs?

    1. Frans

      Partially depends on the JPEG quality setting, but most likely it’s just a copier function you’ve got little control over. If you’re lucky you can get TIFF or some such out of it…

    2. Ed

      The PDF wasn’t a free variable: it came with scanned page images in whatever format PDFs use internally. Extracting as PPM or JPG made no eyeballometric difference, despite the JPG files being 256 kB and the PPM over 11 MB, so JPG it was!

      1. Keith Neufeld

        PPMs are egregiously inefficient, to quote the spec. I use PNGs where possible — they’re lossless and usually compress as well as JPEGs, sometimes better — but I’m glad you found something that works for you.

        1. Ed

          “I use PNGs where possible”

          Aye!

          The gotcha: pdfimages spits out PPM by default or JPG and that’s exactly all the choice you (well, I) get.

          1. Frans

            Hm? I assure you there’s a -png option as well. See, e.g., http://fransdejonge.com/2014/10/fixing-up-scanned-pdfs-with-scan-tailor/

            1. Ed

              Must be a different version, as this 0.24.5 offers only -j. I do RTFM, although I’ll grant that I should bump this box to something more recent than Xubuntu 14.04.

            2. Frans

              Weird. When I wrote that blog post I would’ve been using Debian Jessie (in testing) or possibly still Debian Wheezy (but likely with wheezy-backports). That means something in between 0.18 and 0.26 for poppler-utils rather than the current 0.44 in Xubuntu 16.10 and 0.48 in Debian Stretch (testing).

            3. Frans

              Ah, I know. It must be some compile-time option or library availability issue. :)

            4. Ed

              Must. Stifle. Comment.

  3. scruss2

    There are, naturally, two branches of pdfimages with very slightly different features. Both support -j, which essentially just hoiks out any DCT-encoded images it finds in the PDF page stream. Not all JPEGs in PDF files can be viewed as such, for inane compatibility reasons. But a simple JPEG→PDF bundling program like https://github.com/josch/img2pdf allows you to archive JPEGs complete with metadata in a PDF that’s actually viewable, and you’ll be able to recover your files intact if you need ’em later.

    If you’re looking to clean up scanned notes, Matt Zucker’s noteshrink.py (https://mzucker.github.io/2016/09/20/noteshrink.html) can do an uncannily good job of image segmentation and background denoising.

    1. Ed

      “clean up scanned notes”

      Now that is a neat hack! I’ll try running my shop doodles through that; the current (manual) results aren’t all that good.

      Of course, it requires a version of numpy higher than I can get with Xubuntu 12.04, so this is just another reason why I must update this box.

      Thanks for the pointer!

    2. Frans

      Interesting. It’s like unpaper but seems to be better at color.