Blog Backup

Recent news about Dropbox removing its Public folder feature reminded me to do my every-other-month blog backup. WordPress provides a method to “export” the blog’s text and metadata in its XML-ish format, so you can (presumably) import your blog into another WordPress instance on the server of your choice. However, the XML file (actually, ten of ’em, all tucked into a paltry 8 MB ZIP file) does not include the media files referenced in the posts, which makes sense.

Now, being that type of guy, I have the original media files (mostly pictures) tucked away in a wide variety of directories on the file server. The problem is that there’s no easy way to match the original file to the WordPress instance; I do not want to produce a table by hand.

Fortunately, the entry for each blog post labels the URL of each media file with a distinct XML tag:

		<wp:attachment_url>http://…</wp:attachment_url>

Note the two leading tabs: it’s prettyprinted XML. (Also, should you see escaped characters instead of < and >, then WordPress has chewed on the source code again.)

While I could gimmick up a script (likely in Python) to process those files, this is simple enough to succumb to a Bash-style BFH:

grep attachment_url *xml > attach.txt
sed 's/^.*http/http/' attach.txt | sed 's/<\/wp.*//' > download.txt
wget --no-verbose --wait=5 --random-wait --force-directories --directory-prefix=/where/I/put/WordPress/Backups/Media/ -i download.txt

That fetches 6747 media files = 1.3 GB, tucks them into directories corresponding to their WordPress layout, and maintains their original file dates. I rate-limited the download to an average of 5 s/file in the hope of not being banned as a pest, so the whole backup takes the better part of ten hours.
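For future runs, the grep and sed steps could be folded into one small function. This is only a sketch of the same pipeline, assuming each URL sits on a single line between a plain opening and closing wp:attachment_url tag; the function name extract_urls, the example.com sample line, and the sort -u deduplication are my additions, not part of the original commands.

```shell
#!/usr/bin/env bash
# Sketch: pull media URLs out of WordPress export XML arriving on stdin.
# Assumption: one URL per line, wrapped in <wp:attachment_url>...</wp:attachment_url>.
extract_urls() {
    grep 'attachment_url' \
        | sed -e 's/^.*http/http/' -e 's/<\/wp.*//' \
        | sort -u    # drop duplicate URLs (an addition; the original keeps them)
}

# Hypothetical sample line, with the two leading tabs of the prettyprinted export.
printf '\t\t<wp:attachment_url>https://example.com/2017/03/pic.jpg</wp:attachment_url>\n' \
    | extract_urls
```

Something like `cat *xml | extract_urls > download.txt` would then feed the same wget command as before.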

So I wind up blowing an extra gig of disk space on a neatly arranged set of media files that can (presumably) be readily restored to another WordPress instance, should the occasion arise.

Memo to Self: investigate applying the -r option to the base URL, with the -N option to make it incremental, for future updates.
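A rough sketch of what that memo might turn into, untested against the real blog. BASE_URL and DEST are placeholders, the --no-parent flag is my own guess at a sensible limit, and whether a recursive crawl actually reaches every media file depends on how the site links to them. The script only prints the command unless RUN=1 is set.

```shell
#!/usr/bin/env bash
# Sketch of an incremental fetch; BASE_URL and DEST are placeholders, not real values.
BASE_URL="${BASE_URL:-https://example.com/}"
DEST="${DEST:-/where/I/put/WordPress/Backups/Media/}"

wget_args=(
    --recursive               # -r: follow links starting at the base URL
    --timestamping            # -N: skip files that are not newer than the local copy
    --no-parent               # assumption: stay at or below the base URL
    --no-verbose
    --wait=5 --random-wait    # same politeness settings as the full download
    --directory-prefix="$DEST"
    "$BASE_URL"
)

# Dry run by default: print the command line; set RUN=1 to actually download.
if [ "${RUN:-0}" = 1 ]; then
    wget "${wget_args[@]}"
else
    printf 'wget'; printf ' %q' "${wget_args[@]}"; printf '\n'
fi
```

A less ambitious variant would keep the existing download.txt list and simply add -N to the original wget command, which avoids crawling pages at all.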

3 thoughts on “Blog Backup”

  1. You could replace the duplicates with links, if the wasted GB irks you. Someone probably has a fslint 3-liner to do that.
    For local backups I have been playing with rsnapshot, an rsync wrapper. It produces full backups that space-wise are actually incremental backups: unchanged files get linked to an older backup.

    1. Nah, disk space is essentially free these days. In any event, having the actual files with their WordPress-style names in the proper places makes up for everything else!

      The file server has been running rsync ever since I found out about it, backing the internal drive to an external drive that holds nothing but backups. There’s a new 2 TB disk sitting on the floor, waiting for a Round Tuit that will upgrade the server to a less ancient Ubuntu server version, because the existing drive should be aging out any year now…
