Recent news about Dropbox removing its Public folder feature reminded me to do my every-other-month blog backup. WordPress provides a method to “export” the blog’s text and metadata in their XML-ish format, so you can (presumably) import your blog into another WordPress instance on the server of your choice. However, the XML file (actually, ten of ’em, all tucked into a paltry 8 MB ZIP file) does not include the media files referenced in the posts, which makes sense.
Now, being that type of guy, I have the original media files (mostly pictures) tucked away in a wide variety of directories on the file server. The problem is that there’s no easy way to match the original file to the WordPress instance; I do not want to produce a table by hand.
Fortunately, the entry for each blog post labels the URL of each media file with a distinct XML tag:
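The snippet seems to have been eaten in transit, but judging from the grep and sed commands further down, the element looks something like this (the URL is a made-up placeholder):

```xml
		<wp:attachment_url>https://example.files.wordpress.com/2017/01/photo.jpg</wp:attachment_url>
```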
Note the two leading tabs: it’s prettyprinted XML. (Also, should you see escaped characters instead of < and >, then WordPress has chewed on the source code again.)
While I could gimmick up a script (likely in Python) to process those files, this is simple enough to succumb to a Bash-style BFH:
grep attachment_url *xml > attach.txt
sed 's/^.*http/http/' attach.txt | sed 's/<\/wp.*//' > download.txt
wget --no-verbose --wait=5 --random-wait --force-directories --directory-prefix=/where/I/put/WordPress/Backups/Media/ -i download.txt
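As a sanity check, the two sed passes can be exercised on a single hypothetical line of grep output (filename and URL invented for illustration):

```shell
# A line as grep would emit it: filename prefix, then the prettyprinted
# <wp:attachment_url> element with its two leading tabs.
line=$'blog.000.xml:\t\t<wp:attachment_url>https://example.files.wordpress.com/2017/01/photo.jpg</wp:attachment_url>'

# First sed strips everything up to the start of the URL;
# second sed drops the closing </wp:attachment_url> tag.
url=$(printf '%s\n' "$line" | sed 's/^.*http/http/' | sed 's/<\/wp.*//')
printf '%s\n' "$url"
# -> https://example.files.wordpress.com/2017/01/photo.jpg
```

Note that `^.*http` is greedy, so a URL containing “http” more than once would be mangled; for WordPress media URLs that never seems to happen.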
That fetches 6747 media files totaling 1.3 GB, tucks them into directories corresponding to their WordPress layout, and maintains their original file dates. I rate-limited the download to an average of 5 s per file in the hope of not being banned as a pest, so the whole backup takes the better part of ten hours.
So I wind up blowing an extra gig of disk space on a neatly arranged set of media files that can (presumably) be readily restored to another WordPress instance, should the occasion arise.
Memo to Self: investigate applying the -r option to the base URL, with the -N option to make it incremental, for future updates.
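Should that experiment pan out, the invocation might look something like this untested sketch (-N skips files whose local copies are up to date; -r recurses from the base URL instead of reading a pre-built list; the base URL below is a placeholder, not the real blog):

```shell
# Untested sketch of an incremental media fetch.
wget --no-verbose --wait=5 --random-wait --force-directories -N -r \
     --directory-prefix=/where/I/put/WordPress/Backups/Media/ \
     https://example.wordpress.com/
```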