
github.com/paultraf/makestaticsite
MakeStaticSite (project site https://makestaticsite.sh/) is a set of Bash shell scripts that configure and use Wget to generate a static website from a (typically dynamic) website, with various options to tailor and deploy the output. It aims to improve the performance and security of public-facing websites, whilst allowing continuity in the way they are developed and maintained, without requiring technical know-how on the part of users.
MakeStaticSite provides a convenient means to set up and manage the automated creation and deployment of static (or flat) versions of websites. These include content management systems (such as WordPress and Drupal) that can, for example, be administered locally and then deployed remotely to a hosting provider or Content Distribution Network (CDN).
It delivers a version of the site that preserves content and look and feel, in a static format that is inherently fast and secure. In this mode, MakeStaticSite is not intended as a strict archival tool, as the output is not an exact mirror: for example, it has its own canonical layout and modifies internal links accordingly; further files may be added; RSS feeds are saved and then renamed with .xml extensions; and so on.
Nevertheless, a couple of more recent developments have been concerned with archival. First, MakeStaticSite has been extended to provide native support for the Wayback Machine, focused mainly on the Internet Archive service. Secondly, basic support has been added for generating WARC (Web ARChive) files (leveraging Wget's support), with an option to concatenate multiple archives (one for each run of Wget) into a single compressed .warc.gz file. The result can be played back with tools such as ReplayWeb.page.
The goal is for anyone who has a little familiarity with the command line to be able to use the tool to assist in maintaining their sites. Similarly, a scripting-based approach has been chosen to make the code widely accessible for developers to further fine-tune; a number of refinements are already included that augment the standard use of Wget, such as support for arbitrary attributes and, in the case of WordPress, the use of WP-CLI to prepare sites beforehand.
MakeStaticSite is made available under the AGPL version 3 license. See the COPYING file for more information.
This software should work on version 3.2 of GNU Bash, though version 4+ is recommended.
MakeStaticSite depends on GNU Wget. Other requirements are: rsync for remote deployment; WP-CLI for optimising WordPress sites ahead of running Wget; and HTML Tidy for refining HTML output for better conformance with W3C standards. The latest versions are generally recommended. Otherwise, apart from Internet connectivity, there are few dependencies beyond what the shell already provides.
Please note that the system is not designed for Wget2, though it would be useful to support that in future.
Other refinements include the generation of a robots.txt file under the primary domain to match the outputted files, and parallel retrieval (the wget_threads option).

Many thanks to various developers for sharing their knowledge of shell scripting, particularly on blogs and Q&A websites such as Stack Exchange, and to those who have tested, commented on and otherwise supported MakeStaticSite.
The source distribution is made available as a gzipped tar file. Download the latest version from:
https://makestaticsite.sh/download/makestaticsite_latest.tar.gz
Once downloaded, from the command line run the following to extract it:
tar -xzvf makestaticsite_latest.tar.gz
This will create a makestaticsite
directory. Enter it and then make the scripts executable:
chmod u+x *.sh
.
├── config/ # site configuration files
├── extras/ # additional site files (copied over)
├── lib/ # library files
├── log/ # log files (generated)
├── tmp/ # temporary files (generated)
├── makestaticsite.sh # main script
├── setup.sh # setup script
├── version_history.txt # summary of changes for each version
├── COPYING # software license
└── README.md
Once extracted, to try it out for the first time, enter the makestaticsite directory at the command line and run ./setup.sh on a URL of your choosing:
./setup.sh -u url
This will set up a configuration file with default options, which is then supplied to the main script makestaticsite.sh
to generate the site. The terminal output will provide various information, including the location of the output.
For standard usage, at the command line run ./setup.sh

You will be asked a series of questions (with suggested defaults) about the site you are mirroring, with (for WordPress) options to tweak it beforehand; then the precise wget options to create the mirror, how it should be deployed (locally or on a remote server), whether to create a zip file, and various other options.
Once you have set up a configuration, mysite.cfg for a domain example.com, say, you can proceed to build the static version with:
./makestaticsite.sh -i mysite
It will proceed to generate a static mirror in the following directory:
mirror/mirror_id/example.com
where mirror_id is a site identifier based on mysite; when the archive option is set, it is mysite concatenated with a timestamp.
For other command-line options, run:
./makestaticsite.sh -h
Manual intervention should be minimal — mainly required when Wget encounters errors or when you are using WordPress and opt to add an offline search facility, in which case you will be prompted to go to the WordPress dashboard and create the search index.
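Since a configured run needs no interaction, rebuilds can be scheduled. The following is an illustrative sketch only; the installation path and site name are assumptions, not part of the project:

```
# Hypothetical crontab entry: rebuild the 'mysite' mirror nightly at 02:00
0 2 * * * cd /home/user/makestaticsite && ./makestaticsite.sh -i mysite
```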
MakeStaticSite divides its work into ten phases, which may be regarded as a pipeline.
Accordingly, you can run the script with arguments p and q, specifying start and end phases respectively, such that:
1 <= p <= q <= 10
There are broadly two use cases.
(Case 1) When creating a site for the first time, you can opt to finish at any intermediate phase, up to and including the final one.
./makestaticsite.sh -i mysite -q END_NUM
(where END_NUM is the phase where it stops.) Thus, to carry out just an initial run of Wget without further processing, set END_NUM to 2.
(Case 2) An existing mirror may be modified, perhaps subsequent to a run abbreviated as above. Here, both the start and end phases may be specified:
./makestaticsite.sh -m mirror_id -p START_NUM -q END_NUM
(where the argument -m
expects a mirror ID, START_NUM
is the phase where the script starts processing, and END_NUM
is the phase where it stops.)
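The bounds above can be checked in a small wrapper before handing over to the script; the run_phases helper below is an illustrative sketch, not part of MakeStaticSite:

```shell
# Hypothetical helper: validate phase bounds, then rerun phases p..q on a mirror
run_phases() {
  id=$1; p=$2; q=$3
  if [ "$p" -lt 1 ] || [ "$q" -gt 10 ] || [ "$p" -gt "$q" ]; then
    echo "phases must satisfy 1 <= p <= q <= 10" >&2
    return 1
  fi
  ./makestaticsite.sh -m "$id" -p "$p" -q "$q"
}
```

For example, run_phases mysite20240101 3 4 would redo phases 3 and 4 on an existing mirror.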
The customisation of MakeStaticSite is carried out through two sets of options. We provide just a brief description of each, except for those relating to Wget, since Wget is core to the whole operation.
Configuration options define the target, i.e. the site you are capturing, any authentication requirements, options for Wget, what kinds of refinement to carry out and how to deploy the end result.
The options are stored in .cfg
files in the config
directory. They can be created manually, but it's recommended to use the setup script and then tweak as needed.
Runtime options set the general parameters for running MakeStaticSite on a particular system. These settings, stored in lib/constants.sh
, apply to any configuration file supplied, so are to be treated as universal constants. They can be tweaked on any given run, but it is strongly recommended that a backup be made first.
Wget is at the heart of MakeStaticSite and needs to be precisely configured with multiple command-line arguments to make a faithful snapshot of a site. This is why a warning is given if the version used is not very recent. Also, a single run might not be sufficient to capture everything, particularly orphaned links, so MakeStaticSite provides a separate process (when wget_extra_urls
is set) to gather additional URLs and then Wget is called again for each URL that is discovered.
There are several variables that contribute arguments; some are basic and should be included in every run, whilst others are site-specific.
(1) Configuration options
wget_extra_options (default: -X/wp-json,/wp-admin --reject xmlrpc*) should specify which directories to ignore (-X) and which file extensions not to follow (-R or --reject). A default setup of a CMS such as WordPress typically exposes various APIs for data retrieval, which depend on server-side scripting. These are redundant and should be removed, ideally within the CMS, with these arguments for Wget acting as a fallback.

A couple of other parameters that could be supplied here:
--spider just tests the wget operation without downloading files. This will still report errors (and also create the directory structure).

--limit-rate=100k limits the wget download rate to about 100KB per second.
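Combined with the defaults, a dry run might be configured in a site's .cfg file as follows (the exact value is illustrative):

```shell
# Illustrative .cfg fragment: crawl at a capped rate without downloading files
wget_extra_options="-X/wp-json,/wp-admin --reject xmlrpc* --spider --limit-rate=100k"
```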
Please refer to the Wget manual for details of these and other options.
(2) Runtime options
wget_core_options (default: --mirror --convert-links --adjust-extension --page-requisites) is a fairly standard set of arguments for generating a static mirror (phase 2); --adjust-extension generates files with the .html extension, making the output suitable for offline browsing, which is one of the main goals of the project.
wget_extra_core_options (default: -r -l inf -nc --adjust-extension) is a trimmed-down version of wget_core_options used in phase 3, when Wget is rerun on the assumption that hidden URLs are assets rather than Web pages, and should be left alone to preserve navigation integrity.
wget_reject_clause (default: *login*,*logout*) is added automatically to wget_extra_options (login/logout links are redundant in a static site and should not be followed).
Another option facilitates crawling a remote site:

wget_user_agent (default is empty) can be specified in case a host server is configured to forbid access to content unless it receives a user agent string in a certain format (not usually including Wget). To circumvent this issue, a suitable string can be specified here.

Snippets provide a means to make changes to the web pages generated by Wget. For example, when mirroring a CMS, there may be (links to) login pages that should be hidden.
A snippet is a chunk of HTML to be substituted for another chunk in the original web page on the host ($url). Each one is assigned a numerical ID, zero-padded to three digits, i.e., between 000 and 999. They are stored as files in the snippets/ directory inside MakeStaticSite's top-level directory, with filenames matching their IDs; thus, for a snippet with ID 001 the corresponding file is snippet001.html. A snippet may be used in more than one site, hence they are stored together. To differentiate sets of snippets, a numbering convention may be used, e.g., 1xx for site 1, 2xx for site 2.
To incorporate snippets, the following pair of tags needs to be included in the source HTML of any page where a replacement is to be made (in WordPress, when using the Gutenberg editor, you can insert them by using the <code> block). For ID 001, say, insert the following HTML immediately before the content to be changed:
<!--SNIPPET001BEGIN-->
And insert the other immediately after the content:
<!--SNIPPET001END-->
An index to all the snippets is stored in snippets.data, which lists the path to each file to be modified, followed by a list of snippet identifiers. A simple tag is used to demarcate the set of snippets for a particular site, where the element name corresponds to the local site name.
The following code specifies three snippets for one site and one for another:
<sigalaresearch>
index.html:1
about/website/index.html:2,3
</sigalaresearch>
<ptworld_local>
contact/index.html:4
</ptworld_local>
After a Wget mirror is created, the script will match on <$local_sitename> and work through the lines inside the tag pair, extracting the file path and snippet IDs. It will then create a temporary copy of the file at that path and apply the relevant snippet substitution. Depending on the settings, the revised file may be deployed and/or included in the zip file.

Once a snippet has been applied, the SNIPPET tag is removed; conversely, if a SNIPPET tag is still visible, the snippet has not yet been applied. The latter is the case for content within the mirror/ directory, which contains the 'raw' snapshot before applying snippets; files which have been changed are stored in the subs/ directory.
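The substitution can be pictured with a small sed-based sketch; this illustrates the tag convention only and is not MakeStaticSite's actual implementation (GNU sed is assumed, and apply_snippet is a hypothetical helper):

```shell
# Replace the region between SNIPPET<id>BEGIN and SNIPPET<id>END (tags included)
# with the contents of the snippet file, editing the page in place.
apply_snippet() {
  page=$1; id=$2; snippet=$3
  sed -i "/<!--SNIPPET${id}BEGIN-->/,/<!--SNIPPET${id}END-->/ {
/<!--SNIPPET${id}END-->/ {
r ${snippet}
d
}
d
}" "$page"
}
```

For example, apply_snippet index.html 001 snippets/snippet001.html would swap in the contents of snippet001.html and remove the tags, matching the behaviour described above.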
MakeStaticSite attempts to adjust Wget output to maintain support for RSS feeds. Currently targeted at WordPress, it renames files inside feed/ directories from index.html to index.xml. To properly support this in deployment, add index.xml as the last entry of the DirectoryIndex directive in the .htaccess file at the site's root on the web server.
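The renaming step can be sketched as follows; rename_feeds is a hypothetical helper, not the project's own code:

```shell
# Rename every feed/index.html beneath the given directory to index.xml
rename_feeds() {
  find "$1" -type f -path '*/feed/index.html' | while read -r f; do
    mv "$f" "${f%.html}.xml"
  done
}
```

On the server side, a DirectoryIndex directive ending in index.xml then lets the renamed file be served as the feed directory's index.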
Many improvements could surely be made to the quality of the code, as well as extensions, with i18n being a high priority. Another key requirement is to add support for Wget2, and HTTrack should also be considered. A properly implemented modular architecture would enable enhanced support for a variety of content management systems (CMS). Whilst MakeStaticSite is authored in Bash, versions for other shells should be possible and might not require a great deal of modification.