Fixing LibreOffice Document Graphic File Paths

It turns out that if you put convenient symlinks in your directories, then use them to build a LibreOffice document, LO will cheerfully put those paths into the graphic file links inside its XML files. That will produce horrible breakage on a new system without those links. We’ve come to the conclusion that the only way to keep LO happy is to create a Pictures directory in whatever directory holds the document file, then put all of the document’s image files into that directory, and make sure LO stores relative paths. Of course, this leaves us with the prospect of updating a whole bunch of existing (and, alas, horribly broken) documents by hand, which is unappealing. My previous solution worked for a single file, but now it’s time for some scripting…

This would probably be easier in Python, but Bash works fine after you get the quoting straightened out. This script builds several other scripts that actually do the heavy lifting, because that way you can inspect the scripts before running them to verify that you’re not about to make a bad situation much, much worse. I recommend copying the presentations into another directory, running this script, checking the output scripts, running them by hand, and then copying the fixed files and the Pictures directory back where they belong.

You must tweak the actual paths to the pictures to match your situation; for these documents, one simple change sufficed for all the image files. Those paths are not variables, because I can barely keep the quoting straight without adding another layer of indirection. Make sure all the paths match up, verify the scripts before you run them, and don’t trust anything you see.

CAUTION: It’s highly likely that the multiple levels of character escaping required to make these listings appear correctly on the screen will produce incorrect results when copied-and-pasted. You can download the script file as FixGraphics.sh.odt, which is a bare-ASCII TXT file (which you must rename to eliminate the ODT extension, then make executable as a shell script), to see how it compares.

The main FixGraphics.sh script:

#!/bin/bash

echo "Extract list of images from all ODP files"
rm -f images.txt
for f in *odp
do
	unzip -p "$f" content.xml | sed 's/></>\n</g' | grep Cameras | cut -d \" -f 2 | sort -u >> images.txt
done

echo "Make source file name list"
# strip off leading relative pathing, set actual absolute path, un-quote blanks and special characters, add quotes
sed 's/..\/..\/..\/../\/mnt/' images.txt | sed 's/%20/ /g' | sed 's/&amp;/\&/g' | sed 's/^.*/\"&\"/' > source.lst

echo "Make target file name list"
# set relative to current directory
sed 's/\/mnt\/bulkdata\/Cameras\/MCWN/\.\/Pictures/' source.lst > target.lst

echo "Make target directory list"
# must add trailing quote stripped by dirname
rm -f dirs.lst
cat target.lst | while read tline ; do
	tdir=`dirname "$tline"`
	echo ${tdir}\"
done > dirs.lst

echo "Create target directory structure script"
rm -f mkdirs.sh
sort -u dirs.lst | while read dline ; do
	echo mkdir --parents ${dline}
done > mkdirs.sh
chmod u+x mkdirs.sh

echo "Create image file copy script"
rm -f cpjpgs.sh
cat dirs.lst | while read dline ; do
	echo cp -n -t ${dline}
done > cptemp.txt
paste cptemp.txt source.lst > cpjpgs.sh
chmod u+x cpjpgs.sh

echo "Create ODP fixup script"
echo "for f in *odp ; do" > fixodp.sh
echo "unzip -p \"\$f\" content.xml > raw.xml" >> fixodp.sh
echo "sed 's/..\/..\/..\/..\/bulkdata\/Cameras\/MCWN/\.\.\/Pictures/g' raw.xml > content.xml"  >> fixodp.sh
echo "zip \"\$f\" content.xml"  >> fixodp.sh
echo "done" >> fixodp.sh
echo "rm raw.xml content.xml" >> fixodp.sh
chmod u+x fixodp.sh

Run mkdirs.sh, cpjpgs.sh, and fixodp.sh in that order, and then it Just Works.
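Before copying anything back, it may be worth confirming that no old links survived. The following is only a sketch, not part of FixGraphics.sh: the function names are made up for illustration, and it assumes (as the extraction step does) that every stale link contains the string Cameras and that GNU sed is available.

```shell
#!/bin/bash

# Reads content.xml on stdin and prints how many lines still mention the
# old image tree. The split on >< is needed because the XML may be one
# huge line; the \n in the replacement assumes GNU sed.
count_old_links() {
	sed 's/></>\n</g' | { grep -c Cameras || true ; }
}

# Hypothetical helper: report the count for every ODP file in the
# current directory. Defined here but not called.
report_old_links() {
	local f
	for f in *odp ; do
		[ -e "$f" ] || continue
		printf '%s: %s\n' "$f" "$(unzip -p "$f" content.xml | count_old_links)"
	done
}

# Invented sample fragments showing a link before and after the fix
broken='<draw:frame><draw:image xlink:href="../../../../bulkdata/Cameras/MCWN/2013/img%201.jpg"/></draw:frame>'
fixed='<draw:frame><draw:image xlink:href="../Pictures/2013/img%201.jpg"/></draw:frame>'

printf '%s\n' "$broken" | count_old_links   # prints 1
printf '%s\n' "$fixed" | count_old_links    # prints 0
```

Calling report_old_links in the scratch directory after fixodp.sh has run should print a zero beside every file name; anything else means a path did not match the substitution.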

Some of the tricky parts:

The content.xml file may be stored in unformatted mode, with everything mushed together into one huge line. To make it readable and parse-able, insert a newline between each pair of adjoining angle brackets:

sed 's/></>\n</g'
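To see what the split buys you, here is a small sketch that runs the same extraction pipeline on a one-line fragment. The sample XML and the function name are invented for illustration; real content.xml files carry many more attributes. The \n in the sed replacement assumes GNU sed.

```shell
#!/bin/bash

# One image path, referenced twice on a single line, the way an
# unformatted content.xml might store it (invented sample).
href='../../../../bulkdata/Cameras/MCWN/2013/img%201.jpg'
sample="<draw:frame><draw:image xlink:href=\"$href\"/></draw:frame><draw:frame><draw:image xlink:href=\"$href\"/></draw:frame>"

# Same steps as the main script: one tag per line, keep the lines that
# mention Cameras, take the text between the first pair of double
# quotes, and drop duplicates.
extract_links() {
	sed 's/></>\n</g' | grep Cameras | cut -d \" -f 2 | sort -u
}

printf '%s\n' "$sample" | extract_links
# prints ../../../../bulkdata/Cameras/MCWN/2013/img%201.jpg (once)
```

Without the split, grep would return the entire line and cut would only ever see the first quoted string in the whole document.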

This burst of line noise un-escapes the file name from the way LO stores it internally. Note that the middle sed command really does have the literal escape sequence ampersand-amp-semicolon in it and the ampersand in the last one is the sed-ism for “the whole matching string”:

sed 's/%20/ /g' | sed 's/&amp;/\&/g' | sed 's/^.*/\"&\"/'
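A quick way to convince yourself the three stages do what they claim is to feed them one made-up file name. The path below is hypothetical and the function wrapper is only for illustration.

```shell
#!/bin/bash

# The three sed stages from the main script, wrapped in a function:
# %20 becomes a blank, &amp; becomes a bare ampersand, and the whole
# line gets wrapped in double quotes.
unescape_name() {
	sed 's/%20/ /g' | sed 's/&amp;/\&/g' | sed 's/^.*/\"&\"/'
}

# Hypothetical name as it might appear after the path prefix is rewritten
printf '%s\n' '/mnt/bulkdata/Cameras/MCWN/Nuts%20&amp;%20Bolts/img%201.jpg' | unescape_name
# prints "/mnt/bulkdata/Cameras/MCWN/Nuts & Bolts/img 1.jpg" (quotes included)
```

The quotes in the output are deliberate: the generated mkdirs.sh and cpjpgs.sh scripts depend on them to survive blanks in directory and file names.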

These two sed commands differ because the first uses the actual relative path to the Pictures subdirectory in the filesystem (./Pictures), while the second uses the faked relative path (../Pictures) from the LO pseudo-subdirectory where the document stores its internal state. The string of periods in the second command shows what LO stored for the original files in our documents; your mileage will certainly differ:

sed 's/\/mnt\/bulkdata\/Cameras\/MCWN/\.\/Pictures/' source.lst > target.lst
sed 's/..\/..\/..\/..\/bulkdata\/Cameras\/MCWN/\.\.\/Pictures/g' raw.xml > content.xml
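The two substitutions are easier to compare when run on sample strings. This sketch uses invented paths and function names. The second function is an alternative spelling rather than what the script uses: it picks | as the sed delimiter, so the slashes need no escaping, and it escapes the periods so they match only literal dots instead of any character.

```shell
#!/bin/bash

# List side, as in the main script: absolute path -> ./Pictures
to_target() {
	sed 's/\/mnt\/bulkdata\/Cameras\/MCWN/\.\/Pictures/'
}

# XML side, alternative form: stored relative path -> ../Pictures
fix_xml_strict() {
	sed 's|\.\./\.\./\.\./\.\./bulkdata/Cameras/MCWN|../Pictures|g'
}

printf '%s\n' '"/mnt/bulkdata/Cameras/MCWN/2013/img 1.jpg"' | to_target
# prints "./Pictures/2013/img 1.jpg" (quotes included)

printf '%s\n' 'xlink:href="../../../../bulkdata/Cameras/MCWN/2013/img%201.jpg"' | fix_xml_strict
# prints xlink:href="../Pictures/2013/img%201.jpg"
```

On this sample the stricter pattern gives the same result as the original; whether the extra escaping is worth another round of quoting inside the echo that writes fixodp.sh is a judgment call.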

I don’t know how they could make the file linkages work better, but it’d be really nice if there were a less horrible way to fix the breakage.

22 thoughts on “Fixing LibreOffice Document Graphic File Paths”

  1. Yeah, it’s hacky, but it does the job and lets you get on with your day. Because I apparently have some odd form of brain damage, I would have used XSLT to tweak the XML. And multiple escaping in XSLT is a truly ugly beast.

    1. lets you get on with your day

      That happened after the steam stopped coming out of my ears…

      1. after the steam

        I got that in my project to turn the old Sony Vaio back into a Linux box. I went to uninstall Microsoft Works and the install wizard wouldn’t let me do it until I found and inserted the CD. Really, Microsoft? You expect no one to uninstall your low end programs? I hate bugs, and lazy software testers… (Giving them the benefit of the doubt. Failing that, #sendinthedrones)

        Good luck with the storm. We’re getting a little cold snap, +7 F now, -3 F for tomorrow morning. Whee.

        1. until I found and inserted the CD

          How very nineties!

          The fireproof safe in the garage has all our program distribution CDs, although I have my doubts about how many will be readable in the (decreasingly likely) event that I need them. At one point I’d ripped the CDs to ISO images on a big hard drive, but haven’t kept up with that.

          In fact, I recently discovered that I’m most of the way through the last spindle of the 700-odd blank CDs I’d bought back in the day. Don’t know if I should buy more, having just picked up a bunch of 4 GB USB drives for booting Linux distros and suchlike…

          1. If you didn’t mind having your documents trapped in a format that no other program could import… [sigh]

            Been there, done that, won’t get fooled again.

        2. That’s true. If I can I only use ODT, HTML, and Latex — the latter of which I’m only experimenting with, but it’s looking pretty good. In Debian and derivatives what you need are texlive and an editor, like Texmaker. You can also export from ODT — or anything else Writer imports — to reasonably clean LaTeX, provided you choose a clean option. You can export from LaTeX to pretty decent HTML with hevea. And in turn you can do a decent import in Writer from that generated HTML, which works better than the straight LaTeX to ODT options I tried.

          1. Latex … I’m only experimenting with, but it’s looking pretty good

            I highly recommend LyX as a quasi-WYSIWYG front end for LaTeX, which is what I’ve been using to produce the Trinity robot contest rules. Trying to do large documents with a word processor will drive you (well, me) mad, but LyX Just Works: the same document produces PDF (in both Letter and A4) and the HTML version for the website.

            Well, mostly it Just Works. There’s a lot of software stacked up between the source and PDF, so when the process falls off the rails the error messages don’t really help much at all.

        3. Oh, and I suppose I should add: for printing HTML I use Prince. In fact I did a little experiment last year of adding some extra CSS to a Gutenberg book from the 1860s and within a couple dozen minutes I ended up with a fantastic looking little book, although for some reason I never did let Lulu print it.

        4. Yep, that’s exactly why I’m experimenting with LaTeX. I don’t like working with Writer terribly much, although I definitely like it better than MS Word, which secretly invents new styles while you’re not looking and never manages to number tables and illustrations right without paying the utmost attention; I might as well update the numbers myself if I have to right-click > update everything. I once even wrote a Visual Basic script to “manually” update all such automatic references, but I’ve since lost it.

          I looked at Lyx, but I was a bit disappointed to find that it used its own format rather than being a more straightforward LaTeX frontend, so I figured I might as well skip the effort of learning Lyx by going straight to e.g. Texmaker or Kile. I basically hoped its WYSIWYM mechanism would work much the same way as a good HTML editor. However, at least for now I find it more distracting than authoring straight HTML, so I might be giving Lyx another chance yet.

          Also I actually ran into a kind of LaTeX disappointment surprisingly quickly:

          
          % See http://en.wikibooks.org/wiki/LaTeX/Tables#Text_wrapping_in_tables
          \begin{tabular}{| p{0.5\textwidth} | p{0.5\textwidth} |}
          \hline
          Jaren ‘60	&	Jaren ‘70\\\hline
          Welvaart, consumptie en vrije tijd & Crisis van de westerse verzorgingstaat en economische crisis\\\hline
          Bruisende jaren (alles was in beweging) & Matte jaren (stagnatie, uitzichtloosheid)\\\hline
          Geloof in de vooruitgang & Jaren van het conservatisme\\\hline
          Dominantie van links = Morele regels staan centraal\\\hline
          Idealisme en bevlogenheid = politieke Yippie & Pragmatisme = Yuppie (Young urban professional)\\\hline
          Engagement & Groot individualisme\\\hline
          Maatschappelijk werk staat centraal & Maatschappelijke desinteresse = Subject en mens staan centraal\\\hline
          \end{tabular}
          

          I had to figure out something seemingly fairly advanced to do something seemingly very basic. That’s not what I had in mind when people were telling me you could just write what you mean and let LaTeX worry about the looks.

          It looks like customizing styles may be significantly more complicated than CSS, so I’m not quite sure what I think of that yet. But like I said, because I get easy export to decent-enough HTML through hevea, latex2html or eLyXer (thanks for that last one!), it’s very safe for me to experiment.

          1. just write what you mean and let LaTeX worry about the looks.

            I had a lot of trouble letting go of the looks. You pick a template and that’s what your text will become: Lyx / LaTeX takes control and that’s just the way it is. If you can’t find a suitable template, then it’s just the wrong hammer for the job, because the amount of manual fiddling will drive you crazy.

            Forcing it to produce a particular look seems fraught with peril, although I have used packages to handle things like wrapping long URLs. I made a few feeble attempts to conjure up a package, but quickly realized that I simply don’t want to learn enough to do that. Or, perhaps, that I can’t learn enough to do that. [wince]

            Of course, stacking up a dozen packages makes the error messages even more impenetrable…

        5. I understand it should be marginally easier to change certain aspects of the looks by using plain LaTeX than by using LyX. The thing is I also want to move up to experimenting with BibTeX references, which should actually simplify certain things a lot, but there’ll be style guides to be obeyed.

        6. LyX now also has a wrap table float

          Good to hear. It doesn’t really matter since this was kind of one of those one-time learning experiences. Or perhaps rather one of those things I now know where to find if I run into them — I doubt I’ll remember the syntax precisely for now. However, it was completely counter to the things people had been saying. In HTML/Prince the same table would’ve been perfectly fine without any column-size settings. Now I’ll admit that in Prince obtaining quite as nice a document would require me to write some CSS and perhaps some JS, but I can do that really, really quickly and besides I can copy over past work. I think Prince also has proper kerning support and the like (better than LO Writer as far as I can tell), but I don’t think it organizes line breaking quite as efficiently, although I’d have to do some tests.

          In any case I’m baffled at just how bad Writer and Word are at actually making text look good on paper.

          1. require me to write some CSS and perhaps some JS

            See, that’s the sort of craziness that leads directly downslope, ending with hand-coded Postscript: when you care enough to position every character. [grin]

            Using Adobe Framemaker to lay out my book burned out most of my craziness. For the Trinity rules, I admit to adding a few specialized packages to control things like URLs, but the result looks pretty much like what the standard layout prescribes… soooo, the rules look exactly like a LaTeX book.

            Of course, Knuth wrote TeX specifically to solve his page layout problem, so you’re part of a long tradition: folks who can’t stand ugly page layout and are crazy enough to do something about it!

        7. See, that’s the sort of craziness that leads directly downslope, ending with hand-coded Postscript: when you care enough to position every character. [grin]

          No no, I just mean that e.g. headings in HTML don’t do automatic numbering without generated content and indexes don’t generate themselves without some simple Javascript.

          It is in fact also possible to implement LaTeX-like line-breaking through some JS, but I’m not sure if Prince supports advanced-enough Javascript for that. But perhaps it already does it by itself. I really should run some tests sometime.

          Of course, Knuth wrote TeX specifically to solve his page layout problem, so you’re part of a long tradition: folks who can’t stand ugly page layout and are crazy enough to do something about it!

          You’d just think that something like Microsoft Word would typeset beautifully, but instead it’s no better or perhaps even worse than what I got from Microsoft Works 3.0 back in the ’90s.

          It’s surprisingly similar to the situation with audio players: most don’t do it very well at all!

        8. I’ll give you some examples.

          HTML (straight from Gutenberg with only some minor CSS tweaks for nice Prince output; you can see how simple it all is)
          PDF (margins are kind of awful because more pages cost more money, hehe. Easy-peasy to play around with, of course)

          So if the HTML is semi-decent, which includes the output from certain LaTeX to HTML converters, I can get Prince to output a nice-looking PDF in minutes. However, LaTeX is a bit smarter in certain aspects (not tables!), plus it comes with the whole BibTeX stuff. Unfortunately styling is nowhere near as easy to tweak as CSS.

          1. Wow, when that guy got rolling he just couldn’t shut off a paragraph. Page 18 of the PDF presents the biggest wall of text I’ve seen in a while; even adding a bit of space between paragraphs wouldn’t salvage that one!

            Looks better than my simpleminded Gutenberg books. No argument about that… [wince]

        9. Oh, I’ll have to take a picture of the book I’m reading right now. It’s inspired by Joyce, so it’s just a gigantic stream of consciousness. It switches around between place and time constantly, much like The Sound And The Fury, but without the helpful italics. *grins*

          Anyway, the really nice thing is that while it probably took me about half an hour to fix those styles up quite the way I wanted them, now the style and script can be transplanted in any Gutenberg HTML and presumably result in a proper PDF within minutes. Thus any classic is ready for Lulu reprint at roughly $10+shipping almost instantly. Except like I said, I haven’t actually tested Lulu yet.

        10. Alright, here’s some information about the book and here’s a picture. It basically goes on like that for the entire book, except where there’s dialog. The scribblings could be 40 years old and on this very page I found out they might’ve been made by a foreigner.

          The annoying past reader of the book* occasionally marked when it switches from one time and place to another (sensible), but also seems to have a strange fascination with references to religion and seemingly completely random things. But on page 65 the word “zweep” (whip) is marked and they wrote “whip” in the margin. So presumably the completely random markings were words they didn’t know.

          * Come on, scribble in your own books. Leave the poor library books alone!

        11. As an update, due to the conversation here I decided to figure out BibTeX today. Apparently BibLaTeX is a better alternative, so after getting BibTeX to work I went with that instead. I actually only found out about BibLaTeX after searching if there perhaps was some package that provided commands like \citetitle. Unfortunately these nice bibliographic systems seem to break the HTML converters — even the comparatively primitive BibTeX does.

          Also, the biblatex and biber packages are not installed yet by the texlive metapackage in Debian and Ubuntu. In Debian Squeeze biber isn’t even packaged.

          I think I’ve got most of the basics figured out now.

  2. I basically hoped its WYSIWYM mechanism would work much the same way as a good HTML editor.

    Whoops, my thoughts became a bit muddled there. I meant I hoped Lyx was to LaTeX as a WYSIWYM HTML editor is to HTML.
