$ wget https://www.example.com/ \
--mirror \
--page-requisites \
--convert-links \
--adjust-extension \
--exclude-directories="feed,*/feed/,*/*/feed,wp-json,search" \
--reject-regex="\/\?(s|replytocom)=*" \
--reject=php,xml
wget downloads files from the Web
If you are unfamiliar with wget, it is a command-line tool for requesting files from the web and saving them locally to your computer. It can follow links in HTML documents, like a web spider, and download the media, CSS, and JavaScript files needed to create a “mirror” or copy of the site.
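For a simple illustration of that basic behaviour (using the same placeholder domain as the command above), a bare invocation downloads a single page and saves it locally:
$ # Fetch one page; wget saves the response as index.html in the current directory
$ wget https://www.example.com/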
The wget tool manages the links between files and converts them to relative links once all files have been downloaded. With relative links, we can open the local HTML files in a browser and browse the local version of the site as usual. It also means the files can be uploaded to a default Apache web server, and the site will work.
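As a quick sketch of what that looks like in practice (assuming wget’s default behaviour of saving the mirror into a directory named after the host), the local copy can be opened straight from disk or served by any static file server:
$ # Open the local copy directly in the default browser (macOS)
$ open www.example.com/index.html
$ # Or serve the directory with a basic static server and browse http://localhost:8000
$ cd www.example.com && python3 -m http.server 8000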
These features make wget useful for archiving sites. For example, when working with a website for a yearly festival, I create a copy of the site after a festival has been held and before the website is updated with information for the next festival, and I make the copy available on a subdomain. The festival team then has a reference to past festival information and can confidently edit their primary site as they prepare for the next festival.
Options to copy an entire site
The three critical options needed to create a static HTML copy of a website with wget are:
--mirror
--page-requisites
--convert-links
--mirror
This makes Wget continue to follow links within the site and download every page it encounters. In other words, it sets an ‘infinite recursion depth’.
--page-requisites
This causes Wget to download the additional files used by the HTML documents, such as images, CSS, and JavaScript, which are needed for the copy to display the same as the original.
--convert-links
This kicks in after all files have been downloaded and converts links within the HTML documents to relative links. It ensures hyperlinks in the local HTML files point to other local HTML files and not to the original online versions, and the same for links to images, CSS, JS, etc. A pared-down command using just these three options is shown after this list.
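For reference, a pared-down command using only the three critical options looks like this (www.example.com stands in for your own site):
$ wget https://www.example.com/ \
    --mirror \
    --page-requisites \
    --convert-links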
Exclude dynamic or unwanted content
Our command here uses options to exclude and reject certain directories and file types.
--exclude-directories
I use this option to exclude RSS feeds and REST API pages. WordPress includes links to these resources in pages by default, and most sites leave this unaltered. I am creating archives for people to browse, so I exclude these machine-readable formats.
--reject-regex
I use this option to bypass links to search result pages and links to reply to comments. By default, wget rightfully treats a URL with query parameters as a new URL and will create a local file for each URL variation it encounters. In this WordPress context, I found this created unwanted duplicate files. A quick way to test which URLs such a pattern matches is shown after this list.
--reject
This option takes a list of file suffixes to bypass. I list PHP and XML files here because, again, I’m opting for an archive that a person can browse, and PHP and XML files are outside that use case.
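If you want to check which URLs a reject pattern will catch before running wget, you can pipe some sample URLs through grep with a similar expression (the URLs below are made up for illustration):
$ # The first two sample URLs match and would be rejected; the third would not
$ printf '%s\n' \
    'https://www.example.com/?s=wget' \
    'https://www.example.com/a-post/?replytocom=42' \
    'https://www.example.com/about/' \
  | grep -E '/\?(s|replytocom)='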
You can safely remove any of these options from your command and still successfully create an archive copy of your site. Without these options, more files will be created, which will increase the size of your archive.
Use caffeinate to avoid macOS going to sleep
The larger the site you are archiving, the longer it will take wget to download all the pages and requisite files. If you start the command and leave your computer unattended, the system may go to sleep and stop wget from completing the archive.
To avoid the system going to sleep while your command is running, you can prefix it with caffeinate, and your system will keep running until the command completes. For example:
$ caffeinate wget [option]... [URL]...
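Putting that together with the mirror options from earlier, a full run might look like this:
$ caffeinate wget https://www.example.com/ \
    --mirror \
    --page-requisites \
    --convert-links \
    --adjust-extension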