Blog Backup: Incremental Media

The recipe for incrementally copying media files since the previous blog backup works like this:

grep attachment_url *xml > attach.txt
sed 's/^.*http/http/' attach.txt | sed 's/<\/wp.*//' > download.txt
wget -nc -w 2 --no-verbose --random-wait --force-directories --directory-prefix=Media/ -i download.txt
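The two sed passes can be sanity-checked on a fabricated export line; the URL below is a made-up example, not one of the real attachment URLs:

```shell
cd "$(mktemp -d)"   # scratch directory so the real attach.txt stays put

# One line in the shape of the grep output from a WordPress export file
echo '<wp:attachment_url>http://example.com/files/2021/03/photo.jpg</wp:attachment_url>' > attach.txt

# Strip everything up to "http", then strip the trailing </wp:...> tag
sed 's/^.*http/http/' attach.txt | sed 's/<\/wp.*//' > download.txt

cat download.txt
# → http://example.com/files/2021/03/photo.jpg
```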

The -nc flag sets the “no clobber” option, which (paradoxically) simply skips downloading a file that already exists. Without it, wget would fetch the file again and glue a *.1 suffix onto the new copy, which isn’t a desirable outcome. The myriad (thus far, 0.6 myriad) already-copied files generate a massive stream of messages along the lines of File ‘mumble’ already there; not retrieving.

Adding --no-verbose cuts the clutter while still emitting a few comfort messages.
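Another way to silence the chatter entirely would be to drop the already-fetched URLs from the list before handing it to wget. Because --force-directories with --directory-prefix=Media/ stores each URL at Media/host/path, a plain existence check suffices; this is a sketch with made-up URLs, assuming that directory layout:

```shell
cd "$(mktemp -d)"   # scratch directory for the demonstration

# Fabricated download list (URLs are made-up examples)
cat > download.txt <<'EOF'
http://example.com/files/old.jpg
http://example.com/files/new.jpg
EOF

# Pretend old.jpg arrived during a previous run:
# --force-directories maps http://host/path to Media/host/path
mkdir -p Media/example.com/files
touch Media/example.com/files/old.jpg

# Keep only URLs whose target doesn't already exist under Media/
while read -r url; do
    [ -e "Media/${url#*://}" ] || echo "$url"
done < download.txt > fresh.txt

cat fresh.txt
# → http://example.com/files/new.jpg
```

Feeding fresh.txt to wget with -i fresh.txt would then never trip the no-clobber check at all.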

There seems to be no way to recursively fetch only newer media files directly from the WordPress file URL with -r -N; the site redirects the http:// requests to the base URL, which doesn’t know about bare media files and coughs up a “not found” error.
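With -r -N out of the picture, another incremental angle is to keep the download list from the previous backup and fetch only the URLs that are new this time. A sketch with made-up filenames:

```shell
cd "$(mktemp -d)"   # scratch directory for the demonstration

# Download list saved from the previous backup (made-up URLs)
cat > download-prev.txt <<'EOF'
http://example.com/files/a.jpg
http://example.com/files/b.jpg
EOF

# Freshly extracted list: one new attachment since then
cat > download.txt <<'EOF'
http://example.com/files/a.jpg
http://example.com/files/b.jpg
http://example.com/files/c.jpg
EOF

# comm wants sorted input; -13 keeps lines found only in the new list
sort download-prev.txt > prev.sorted
sort download.txt > curr.sorted
comm -13 prev.sorted curr.sorted > new-only.txt

cat new-only.txt
# → http://example.com/files/c.jpg
```

Pointing the same wget incantation at new-only.txt would sidestep both the redirect problem and the flood of “already there” messages.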