$ wget https://www.example.com/ \
    --mirror \
    --page-requisites \
    --convert-links \
    --adjust-extension \
    --exclude-directories="feed,*/feed/,*/*/feed,wp-json,search" \
    --reject-regex="\/\?(s|replytocom)=*" \
    --reject=php,xml
If you are unfamiliar with wget, it is a command-line tool that downloads files from the Web. The wget tool manages the links between files and converts them to relative links once all files have been downloaded. With relative links, we can open the local HTML files in a browser and browse the local version of the site as usual. It also means the files can be uploaded to a default Apache web server, and the site will work.
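Once the download completes, you can open the local copy straight from the directory wget creates, which by default is named after the host. On macOS, for example:

$ open www.example.com/index.html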
These features make wget useful for archiving sites. For example, when working with a website for a yearly festival, I create a copy of the site after a festival has been held and before the website is updated with information for the next festival, and make it available on a subdomain. The festival team now have a reference to past festival information and can confidently edit their primary site as they prepare for the next festival.
Options to copy an entire site
The three critical options needed to create a static HTML copy of a website with wget are:

--mirror
This will make wget continue to follow links within the site and download every page it encounters. In other words, it sets an ‘infinite recursion depth’.

--page-requisites
This tells wget to also download the files each page needs to display properly, such as images, stylesheets, and scripts.

--convert-links
This will kick in after all files have been downloaded and convert links within the HTML documents to relative links. This ensures hyperlinks in the local HTML files link to other local HTML files and not the original online versions, and the same goes for links to images, CSS, JS, etc.
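For example, a stripped-back run that uses only these critical options (with example.com standing in for your own site) could look like this:

$ wget https://www.example.com/ \
    --mirror \
    --page-requisites \
    --convert-links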
Exclude dynamic or unwanted content
Our command here uses options to exclude and reject certain directories and file types.
--exclude-directories
I use this option to exclude RSS feeds and REST API pages. WordPress includes links to these resources in pages by default, and most sites leave this unaltered. I am creating archives for people to browse, so I exclude these machine-readable formats.

--reject-regex
I use this option to bypass links to search result pages and links to reply to comments. By default, wget rightfully treats a URL with query parameters as a new URL and will create a local file for each URL variation it encounters. In this WordPress context, I found this creates unwanted duplicate files.

--reject
This option takes a list of file suffixes to bypass. I list PHP and XML files here because, again, I’m opting for an archive that a person can browse, and PHP and XML files are outside that use case.
You can safely remove any of these options from your command and still create an archive copy of your site. Without them, more files will be created, which will increase the size of your archive.
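As a quick check after the download finishes, you can look for file types you meant to skip and see how large the archive ended up. This assumes wget’s default behaviour of saving everything into a directory named after the host (www.example.com here):

$ find www.example.com -name '*.php' -o -name '*.xml'
$ du -sh www.example.com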
caffeinate to avoid macOS going to sleep
The larger the site you are archiving, the longer it will take wget to download all the pages and requisite files. If you start the command and leave your computer unattended, the system may go to sleep and stop wget from creating the archive.
To avoid the system going to sleep while your command is running, you can prefix it with caffeinate, and your system will keep running while the command runs. For example:
$ caffeinate wget [option]... [URL]...
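Putting it all together, the full archive command from the start of this post can simply be prefixed with caffeinate:

$ caffeinate wget https://www.example.com/ \
    --mirror \
    --page-requisites \
    --convert-links \
    --adjust-extension \
    --exclude-directories="feed,*/feed/,*/*/feed,wp-json,search" \
    --reject-regex="\/\?(s|replytocom)=*" \
    --reject=php,xml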