While helpful, prior responses fail to concisely, reliably, and repeatably solve the underlying question. In this post, we briefly detail the difficulties with each and then offer a modest httrack-based solution.
Background
Before we get to that, however, consider perusing mpy's well-written response. In their sadly neglected post, mpy rigorously documents the Wayback Machine's obscure (and honestly obfuscatory) archival scheme.
Unsurprisingly, it ain't pretty. Rather than sanely archiving sites into a single directory, the Wayback Machine ephemerally spreads a single site across two or more numerically identified sibling directories. To say that this complicates mirroring would be a substantial understatement.
Understanding the horrible pitfalls presented by this scheme is core to understanding the inadequacy of prior solutions. Let's get on with it, shall we?
Prior Solution 1: wget
The related StackOverflow question "Recover old website off waybackmachine" is probably the worst offender in this regard, recommending wget for Wayback mirroring. Naturally, that recommendation is fundamentally unsound.
In the absence of complex external URL rewriting (e.g., Privoxy), wget cannot be used to reliably mirror Wayback-archived sites. As mpy details under "Problem 2 + Solution," whatever mirroring tool you choose must allow you to non-transitively download only URLs belonging to the target site. By default, most mirroring tools transitively download all URLs belonging to both the target site and sites linked to from that site – which, in the worst case, means "the entire Internet."
A concrete example is in order. When mirroring the example domain kearescue.com, your mirroring tool must:
- Include all URLs matching https://web.archive.org/web/*/http://kearescue.com. These are assets provided by the target site (e.g., https://web.archive.org/web/20140521010450js_/http_/kearescue.com/media/system/js/core.js).
- Exclude all other URLs. These are assets provided by other sites merely linked to from the target site (e.g., https://web.archive.org/web/20140517180436js_/https_/connect.facebook.net/en_US/all.js).
Failing to exclude such URLs typically pulls in all or most of the Internet archived at the time the site was archived, especially for sites embedding externally-hosted assets (e.g., YouTube videos).
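To make the distinction mechanical, here is a minimal sketch of that include/exclude rule applied to the two example URLs above. The grep pattern – keep only URLs containing the target domain as a path component – is purely illustrative and is not part of any tool's configuration:

    # A minimal sketch of the include/exclude rule above, applied to the two
    # example URLs. The pattern (keep only URLs containing the target domain
    # as a path component) is illustrative, not a tool configuration.
    printf '%s\n' \
      'https://web.archive.org/web/20140521010450js_/http_/kearescue.com/media/system/js/core.js' \
      'https://web.archive.org/web/20140517180436js_/https_/connect.facebook.net/en_US/all.js' |
      grep -F '/kearescue.com/'
    # Only the first URL (a kearescue.com asset) survives; the Facebook asset is dropped.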
That would be bad. While wget does provide a command-line --exclude-directories option accepting one or more patterns matching URLs to be excluded, these are not general-purpose regular expressions; they're simplistic globs whose * syntax matches zero or more characters excluding /. Since the URLs to be excluded contain arbitrarily many / characters, wget cannot be used to exclude these URLs and hence cannot be used to mirror Wayback-archived sites. Period. End of unfortunate story.
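To see why, consider a hedged sketch of the sort of invocation one might attempt (the URL and exclusion pattern are illustrative, not a recommendation). Because the glob's * stops at /, deeply nested external Wayback paths slip past the exclusion:

    # A hedged sketch of an attempted wget mirror (illustrative URL and pattern).
    # --exclude-directories patterns are globs; as described above, their '*'
    # does not cross '/', so nested external paths such as
    # /web/20140517180436js_/https_/connect.facebook.net/en_US/all.js
    # are not reliably excluded.
    wget \
      --mirror \
      --page-requisites \
      --convert-links \
      --exclude-directories='/web/*/https_/connect.facebook.net' \
      'https://web.archive.org/web/20140517175612/http://kearescue.com'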
This issue has been on public record since at least 2009. It has yet to be resolved. Next!
Prior Solution 2: ScrapBook
Prinz recommends ScrapBook, a Firefox plugin. A Firefox plugin.
That was probably all you needed to know. While ScrapBook's Filter by String... functionality does address the aforementioned "Problem 2 + Solution," it does not address the subsequent "Problem 3 + Solution" – namely, the problem of extraneous duplicates.
It's questionable whether ScrapBook even adequately addresses the former problem. As mpy admits:
Although Scrapbook failed so far to grab the site completely...
Unreliable and overly simplistic solutions are non-solutions. Next!
Prior Solution 3: wget + Privoxy
mpy then provides a robust solution leveraging both wget and Privoxy. While wget is reasonably simple to configure, Privoxy is anything but reasonable. Or simple.
Due to the imponderable technical hurdle of properly installing, configuring, and using Privoxy, we have yet to confirm mpy's solution. It should work in a scalable, robust manner. Given the barriers to entry, this solution is probably more appropriate to large-scale automation than to the average webmaster attempting to recover small- to medium-scale sites.
Is wget + Privoxy worth a look? Absolutely. But most superusers might be better served by simpler, more readily applicable solutions.
New Solution: httrack
Enter httrack, a command-line utility implementing a superset of wget's mirroring functionality. httrack supports both pattern-based URL exclusion and simplistic site restructuring. The former solves mpy's "Problem 2 + Solution"; the latter, "Problem 3 + Solution."
In the abstract example below, replace:
- ${wayback_url} with the URL of the top-level directory archiving the entirety of your target site (e.g., 'https://web.archive.org/web/20140517175612/http://kearescue.com').
- ${domain_name} with the same domain name present in ${wayback_url}, excluding the prefixing http:// (e.g., 'kearescue.com').
Here we go. Install httrack, open a terminal window, cd to the local directory you'd like your site to be downloaded to, and run the following command:
    httrack \
      ${wayback_url} \
      '-*' \
      '+*/${domain_name}/*' \
      -N1005 \
      --advanced-progressinfo \
      --can-go-up-and-down \
      --display \
      --keep-alive \
      --mirror \
      --robots=0 \
      --user-agent='Mozilla/5.0 (X11;U; Linux i686; en-GB; rv:1.9.1) Gecko/20090624 Ubuntu/9.04 (jaunty) Firefox/3.5' \
      --verbose
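For instance, with the example values above substituted in, the invocation would look as follows. The '-*' filter excludes every URL by default, '+*/kearescue.com/*' then re-includes only URLs containing the target domain as a path component, and -N1005 selects the output structure responsible for the by-filetype layout described below:

    # The same command with the example values substituted (illustrative only).
    httrack \
      'https://web.archive.org/web/20140517175612/http://kearescue.com' \
      '-*' \
      '+*/kearescue.com/*' \
      -N1005 \
      --advanced-progressinfo \
      --can-go-up-and-down \
      --display \
      --keep-alive \
      --mirror \
      --robots=0 \
      --user-agent='Mozilla/5.0 (X11;U; Linux i686; en-GB; rv:1.9.1) Gecko/20090624 Ubuntu/9.04 (jaunty) Firefox/3.5' \
      --verbose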
On completion, the current directory should contain one subdirectory for each filetype mirrored from that URL. This usually includes at least:
- css, containing all mirrored CSS stylesheets.
- html, containing all mirrored HTML pages.
- js, containing all mirrored JavaScript.
- ico, containing one mirrored favicon.
Since httrack internally rewrites all downloaded content to reflect this structure, your site should now be browsable as is without modification. If you prematurely halted the above command and would like to continue downloading, append the --continue option to the exact same command and retry.
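If you prefer browsing the mirror over HTTP rather than via file:// URLs, one optional approach (assuming Python 3 is installed; any static file server works equally well) is:

    # Optional: serve the mirrored directory locally with any static file server.
    cd /path/to/your/mirror      # the directory you ran httrack in (illustrative path)
    python3 -m http.server 8000
    # ...then browse http://localhost:8000/html/ to view the mirrored pages.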
That's it. No external contortions, error-prone URL rewriting, or rule-based proxy servers required.
Enjoy, fellow superusers.