
html-snapshots
A selector-based html snapshot tool using PhantomJS that sources sitemap.xml, robots.txt, or arbitrary input
Takes html snapshots of your site's crawlable pages when an element you select is rendered.
html-snapshots is a flexible html snapshot library that uses PhantomJS to take html snapshots of your webpages served from your site. A snapshot is only taken when a specified selector is detected visible in the output html. This tool is useful when your site is largely ajax content, or an SPA, and you want your dynamic content indexed by search engines.
html-snapshots gets urls to process from either a robots.txt or sitemap.xml. Alternatively, you can supply an array with completely arbitrary urls, or a line delimited textfile with arbitrary host-relative paths.
The simplest way to install html-snapshots is to use npm. Running npm install html-snapshots will download html-snapshots and all of its dependencies.
This is a node library that just works with gulp as-is.
If you are interested in the grunt task that uses this library, check out grunt-html-snapshots.
Here are some background and other notes regarding this project.
html-snapshots takes snapshots in parallel, each page getting its own PhantomJS process. Each PhantomJS process dies after snapshotting one page. You can limit the number of PhantomJS processes that can ever run at once with the processLimit option. This effectively sets up a process pool for PhantomJS instances. The default processLimit is 4 PhantomJS instances. When a PhantomJS process dies, and another snapshot needs to be taken, a new PhantomJS process is spawned to take the vacant slot. Processes are spawned this way until processLimit processes are running at once; remaining snapshots wait for a slot to open.
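The slot behavior described above can be sketched in plain JavaScript. This is only an illustration of the pooling idea, not the library's implementation: createPool, add, and exit are hypothetical names, and no PhantomJS process is actually spawned.

```javascript
// Hypothetical sketch of a fixed-size slot pool (not html-snapshots internals).
// "Starting" a page stands in for spawning one PhantomJS process.
function createPool(processLimit) {
  var running = 0;   // slots currently occupied
  var queue = [];    // pages waiting for a slot
  var started = [];  // pages whose process has been started, in order

  function fill() {
    // Start waiting pages while a slot is vacant.
    while (running < processLimit && queue.length > 0) {
      running++;
      started.push(queue.shift());
    }
  }

  return {
    add: function (page) { queue.push(page); fill(); },
    exit: function () { running--; fill(); }, // a process died, freeing a slot
    running: function () { return running; },
    started: function () { return started.slice(); }
  };
}

var pool = createPool(2);
pool.add("/");
pool.add("/contact");
pool.add("/about"); // waits: both slots are taken
pool.exit();        // one process dies, so "/about" takes the vacant slot
```

With a limit of 2, the third page is only started after one of the first two processes exits.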
v0.13.2: Node 0.12 (or less)
v0.14.16: Node 4+
v0.15.x: Node 6+
v0.16.x: Node 8+
The library run method no longer returns a boolean value indicating a successful start. Instead, it returns a Promise that resolves to an array of file paths to completed snapshots, or error on failure. The run method's second argument, a completion callback, is now optional and provided for compatibility only. If you supply one, it will be called, but the Promise will also resolve, so it is not needed.
jQuery selectors are no longer supported by default. To restore the previous behavior, set the useJQuery option to true.
The upside is jQuery is no longer required to be loaded by the page being snapshotted. However, if you use jQuery selectors, or selectors not supported by querySelector, the page being snapshotted must load jQuery.
The api is just one run method that returns a Promise.
A method that takes options and an optional callback. Returns a Promise.
Syntax:
var htmlSnapshots = require('html-snapshots');
htmlSnapshots.run(options[, callback])
  .then(function (completed) {
    // `completed` is an array of paths to the completed snapshots.
  })
  .catch(function (errorObject) {
    // `errorObject` is an instance of Error
    // `errorObject.completed` is an array of paths to the snapshots that did successfully complete.
    // `errorObject.notCompleted` is an array of paths to files that DID NOT successfully complete.
  });
The callback is optional because the run method returns a Promise that resolves on completion. If you supply a callback, it will be called, but the Promise will ALSO resolve. Callback usage is deprecated, and is made available for compatibility with older versions.
Signature of the optional callback:
callback (errorObject, arrayOfPathsToCompletedSnapshots)
For the callback, in the error case, the errorObject does not have the new extra properties completed and notCompleted. However, arrayOfPathsToCompletedSnapshots is supplied, and contains the paths to the snapshots that successfully completed.
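Because the Promise rejection carries completed and notCompleted while the callback error does not, a small helper can make failure handling uniform. The sketch below is an assumption about how one might consume the error; summarizeFailure is a hypothetical name and is not part of html-snapshots.

```javascript
// Hypothetical helper: normalize a failure from run() into a plain summary.
// Missing properties (as in the callback case) fall back to empty arrays.
function summarizeFailure(errorObject) {
  return {
    message: errorObject.message,
    completed: (errorObject.completed || []).slice(),
    notCompleted: (errorObject.notCompleted || []).slice()
  };
}

// Example with a hand-built error shaped like the Promise rejection.
var example = new Error("some snapshots failed");
example.completed = ["snapshots/index.html"];
example.notCompleted = ["snapshots/contact/index.html"];
var summary = summarizeFailure(example);
```

In a real run, the same helper could be called from the .catch handler shown above.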
Simple examples to demonstrate the usage of options.
A growing showcase of runnable examples can be found here.
An older (version 0.13.2), more in depth usage example is located in this article that includes explanation and code of a real usage featuring dynamic app routes, ExpressJS, Heroku, and more.
var htmlSnapshots = require('html-snapshots');
htmlSnapshots.run({
  source: "/path/to/robots.txt",
  hostname: "exampledomain.com",
  outputDir: "./snapshots",
  outputDirClean: true,
  selector: "#dynamic-content"
})
  .then(function (completed) {
    // completed is an array of full file paths to the completed snapshots.
  })
  .catch(function (error) {
    // error is an Error instance.
    // error.completed is an array of snapshot file paths that were completed.
    // error.notCompleted is an array of file paths that did NOT complete.
  });
This reads the urls from your robots.txt and produces snapshots in the ./snapshots directory. In this example, a selector named "#dynamic-content" appears in all pages across the site. Once this selector is visible in a page, the html snapshot is taken.
var htmlSnapshots = require('html-snapshots');
htmlSnapshots.run({
  input: "sitemap",
  source: "/path/to/sitemap.xml",
  outputDir: "./snapshots",
  outputDirClean: true,
  selector: {
    "http://mysite.com": "#home-content",
    "__default": "#dynamic-content"
  },
  timeout: {
    "http://mysite.com/superslowpage": 20000,
    "__default": 10000
  }
})
  .then(function (completed) {
    // completed is an array of full file paths to the completed snapshots.
  })
  .catch(function (error) {
    // error is an Error instance.
    // error.completed is an array of snapshot file paths that were completed.
    // error.notCompleted is an array of file paths that did NOT complete.
  });
This reads the urls from your sitemap.xml and produces snapshots in the ./snapshots directory. In this example, a selector named "#dynamic-content" appears in all pages across the site except the home page, where "#home-content" appears (the appearance of a selector in the output triggers the snapshot). Finally, a default timeout of 10000 ms is set on all pages except http://mysite.com/superslowpage, where it waits 20000 ms.
var htmlSnapshots = require('html-snapshots');
htmlSnapshots.run({
  input: "sitemap",
  source: "/path/to/sitemap.xml",
  outputDir: "./snapshots",
  outputDirClean: true,
  outputPath: {
    "http://mysite.com/services/?page=1": "services/page/1",
    "http://mysite.com/services/?page=2": "services/page/2"
  },
  selector: "#dynamic-content"
})
  .then(function (completed) {
    // completed is an array of full file paths to the completed snapshots.
  })
  .catch(function (error) {
    // error is an Error instance.
    // error.completed is an array of snapshot file paths that were completed.
    // error.notCompleted is an array of file paths that did NOT complete.
  });
This example implies there are a couple of pages with query strings in sitemap.xml, and we don't want html-snapshots to create directories with query string characters in the names. We would also have a rewrite rule that reflects this same mapping when _escaped_fragment_ shows up in the querystring of a request so we serve the snapshot from the appropriate directory.
var htmlSnapshots = require('html-snapshots');
htmlSnapshots.run({
  source: "/path/to/robots.txt",
  hostname: "mysite.com",
  outputDir: "./snapshots",
  outputDirClean: true,
  selector: {
    "__default": "#dynamic-content",
    "/jqpage": "A-Selector-Not-Supported-By-querySelector"
  },
  useJQuery: {
    "/jqpage": true,
    "__default": false
  }
})
  .then(function (completed) {
    // completed is an array of full file paths to the completed snapshots.
  })
  .catch(function (error) {
    // error is an Error instance.
    // error.completed is an array of snapshot file paths that were completed.
    // error.notCompleted is an array of file paths that did NOT complete.
  });
This reads the urls from your robots.txt and produces snapshots in the ./snapshots directory. In this example, a selector named "#dynamic-content" appears in all pages across the site except in "/jqpage", where a selector not supported by querySelector is used. Further, "/jqpage" loads jQuery itself (required). All the other pages don't need to use special selectors, so the default is set to false. Notice that since a robots.txt input is used, full URLs are not used to match selectors. Instead, paths (and QueryStrings and any Hashes) are used, just as specified in the robots.txt file itself.
var htmlSnapshots = require('html-snapshots');
htmlSnapshots.run({
  input: "array",
  source: ["http://mysite.com", "http://mysite.com/contact", "http://mysite.com:82/special"],
  outputDir: "./snapshots",
  outputDirClean: true,
  selector: "#dynamic-content"
})
  .then(function (completed) {
    // completed is an array of full file paths to the completed snapshots.
  })
  .catch(function (error) {
    // error is an Error instance.
    // error.completed is an array of snapshot file paths that were completed.
    // error.notCompleted is an array of file paths that did NOT complete.
  });
Generates snapshots for "/", "/contact", and "/special" from mysite.com. "/special" uses port 82. All use the http protocol. Array input can be powerful; check out a simple example or a more complex example.
var assert = require("assert");
var fs = require("fs");
var htmlSnapshots = require("html-snapshots");
htmlSnapshots.run({
  source: "http://localhost/robots.txt",
  hostname: "localhost",
  outputDir: "./snapshots",
  outputDirClean: true,
  selector: "#dynamic-content",
  snapshotScript: {
    script: "removeScripts"
  }
})
  .then(function (completed) {
    completed.forEach(function (snapshotFile) {
      var content = fs.readFileSync(snapshotFile, { encoding: "utf8" });
      assert.equal(false, /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi.test(content));
    });
    // It didn't throw b/c there are no script tags in the html snapshots
    console.log('stripped all script tags as expected');
  })
  .catch(function (error) {
    // error is an Error instance.
    // error.completed is an array of snapshot file paths that were completed.
    // error.notCompleted is an array of file paths that did NOT complete.
  });
Removes all script tags from the output of the html snapshot. Custom filters are also supported; see the customFilter Example in the explanation of the snapshotScript option. Also, check out the concrete example.
Every option has a default value except outputDir.
input
default: "robots"
Specifies the input generator to be used to produce the urls.
Possible values:
"sitemap"
Supply urls from a local or remote sitemap.xml file. Gzipped sitemaps are supported.
"sitemap-index"
Supply urls from a local or remote sitemap-index.xml file. Gzipped sitemap indexes are supported.
"array"
Supply arbitrary urls from a javascript array.
"robots"
Supply urls from a local or remote robots.txt file. Robots.txt files with wildcards are NOT supported - Use "sitemap" instead.
"textfile"
Supply urls from a local line-oriented text file in the style of robots.txt.
source
default: "./robots.txt", "./sitemap.xml", "./sitemap-index.xml", "./line.txt", or [], depending on the input generator.
sitemapPolicy
default: false
Note that for sitemap-index
, only lastmod policy element is available as a policy control.
Not all url elements in a sitemap have to have lastmod and/or changefreq (those tags are optional, unlike loc), but the urls you want to be able to skip (if they are current) must make use of those tags. You can intermix usage of these tags, as long as the requirements are met for making an age determination. If a determination on age cannot be made for any reason, the url is processed normally. For more info on sitemap tags and acceptable values, read the wikipedia page.
sitemapOutputDir
_sitemaps_
outputDir
where sitemaps are stored.
Locally stored sitemaps are used for age determinations with incoming lastmod tags. If this option is falsy, it will prevent sitemap storage and thereby disable sitemapPolicy for sitemaps referenced in a sitemap-index. The examples directory contains sitemap-index and sitemap usage examples.
hostname
"localhost"
port
auth
protocol
"http"
outputDir
outputDirClean
false
outputPath
default: none
Specifies per url overrides to the generated snapshot output path. The default output path for a snapshot file, while rooted at outputDir, is simply an echo of the input path - plus any arguments. Depending on your urls, your _escaped_fragment_ rewrite rule (see below), or the characters allowed in directory names in your environment, it might be necessary to use this option to change the output paths.
The value can be one of these javascript types:
"object"
If the value is an object, it must be a key/value pair object where the key must match the url (or path in the case of robots.txt style) found by the input generator.
"function"
If the value is a function, it is called for every page and passed a single argument that is the url (or path in the case of robots.txt style) found in the input. The value returned for a given page must be a string that can be used on the filesystem for a path.
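As a sketch of the function form, the mapping from the earlier outputPath example ("services/?page=1" to "services/page/1") could be computed instead of listed per url. The name outputPathFor and the base host "http://mysite.com" are assumptions for illustration; returning "/" for the home page is also an assumption, since the source only says the result must be a string usable as a filesystem path.

```javascript
// Hypothetical outputPath function: turn query string arguments into
// path segments so no query string characters end up in directory names.
function outputPathFor(page) {
  // The base is only used when `page` is a host-relative path (robots.txt style).
  var parsed = new URL(page, "http://mysite.com");
  var parts = [parsed.pathname.replace(/^\/+|\/+$/g, "")];
  parsed.searchParams.forEach(function (value, key) {
    parts.push(key, value);
  });
  var result = parts.filter(Boolean).join("/");
  return result || "/"; // assumption: echo "/" for the home page
}

var servicesPage = outputPathFor("http://mysite.com/services/?page=1");
// servicesPage is "services/page/1"
```

The function would then be supplied as the outputPath option in place of the object shown earlier. Values containing characters that are unsafe in directory names would need extra handling that this sketch omits.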
Notes:
selector
default: "body"
Specifies the selector to find in the output before taking the snapshot. The appearance of this selector in the output triggers a snapshot to be taken.
The value can be one of these javascript types:
"string"
If the value is a string, it is used for every page.
"object"
If the value is an object, it is interpreted as key/value pairs where the key must match the url (or path in the case of robots.txt style) found by the input generator. This allows you to specify selectors for individual pages. The reserved key "__default" allows you to specify the default selector so you don't have to specify a selector for every individual page.
"function"
If the value is a function, it is called for every page and passed a single argument that is the url (or path in the case of robots.txt style) found in the input. The function must return a value to use for this option for the page it is given. The value returned for a given page must be a string.
NOTE: By default, selectors must conform to this spec, as they are used by querySelector. If you need selectors not supported by this, you must specify the useJQuery option, and load jQuery in your page.
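A function-valued selector might look like the following sketch. The function name selectorFor and the "/blog/" section with its "article.post" selector are hypothetical; the other two selectors come from the sitemap example earlier in this document.

```javascript
// Hypothetical selector function: choose a selector by page pattern
// instead of listing every page in an object.
function selectorFor(page) {
  if (page === "http://mysite.com") {
    return "#home-content";    // home page
  }
  if (/\/blog\//.test(page)) {
    return "article.post";     // hypothetical section of the site
  }
  return "#dynamic-content";   // everything else
}

var homeSelector = selectorFor("http://mysite.com");
```

Every branch returns a string, as the "function" type requires.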
useJQuery
default: false
Specifies to use jQuery selectors to detect when to snapshot a page. Please note that you cannot use these selectors if the page to be snapshotted does not load jQuery itself. To return to the behavior prior to v0.6.x, set this to true.
The value can be one of these javascript types:
"boolean"
If the value is a boolean, it is used for every page. Note that if it is any scalar type such as "string" or "number", it will be interpreted as a boolean using javascript rules. Coerced string values "true", "yes", and "1" are specifically true, all others are false.
"object"
If the value is an object, it is interpreted as key/value pairs where the key must match the url (or path in the case of robots.txt style) found by the input generator. This allows you to specify the use of jQuery for individual pages. The reserved key "__default" allows you to specify a default jQuery usage so you don't have to specify usage for every individual page.
"function"
If the value is a function, it is called for every page and passed a single argument that is the url (or path in the case of robots.txt style) found in the input. The function must return a value to use for this option for the page it is given. The value returned for a given page must be a boolean.
NOTE: You do not have to use this option if your page uses jQuery. You only need this if your selector is not supported by querySelector. However, if you do use this option, the page being snapshotted must load jQuery itself.
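The boolean coercion described for scalar values can be sketched as below. This is a reading of the stated rule, not the library's code: toBoolean is a hypothetical name, and exact (case-sensitive) string matching is an assumption because the source does not say how case is handled.

```javascript
// Hypothetical sketch of the documented coercion rule:
// the strings "true", "yes", and "1" are true, all other strings are false,
// and other scalars follow ordinary javascript truthiness.
function toBoolean(value) {
  if (typeof value === "string") {
    return value === "true" || value === "yes" || value === "1";
  }
  return Boolean(value);
}

var fromString = toBoolean("yes"); // true
var fromOther = toBoolean("on");   // false: not one of the three true strings
```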
snapshotScript
default: This library's default snapshot script. This script runs in PhantomJS and takes the snapshot when the supplied selector becomes visible.
Specifies the PhantomJS script to run to actually produce the snapshot. The script supplied in this option is run per url (or path) by html-snapshots in a separate PhantomJS process. Applies to all pages.
The value can be one of these javascript types:
"string"
If the value is a string, it must be an absolute path to a custom PhantomJS script you supply. html-snapshots will spawn a separate PhantomJS instance to run your snapshot script and give it the following arguments:
+ system.args[0]: The path to your PhantomJS script.
+ system.args[1]: The output path.
+ system.args[2]: The url to snapshot.
+ system.args[3]: The selector to watch for to signal page completion.
+ system.args[4]: The overall timeout (milliseconds).
+ system.args[5]: The interval (milliseconds) to watch for the selector.
+ system.args[6]: A flag indicating jQuery selectors should be supported.
+ system.args[7]: A flag indicating verbose output is desired.
+ system.args[8]: A custom module to load.
"object"
If an object is supplied, it has the following properties:
+ script
This must be one of the following values:
+ "removeScripts"
This runs the default snapshot script with an output filter that removes all script tags from the html snapshot before it is saved.
+ "customFilter"
This runs the default snapshot script, but allows you to supply any output filter.
+ module
This property is required only if you supplied a value of "customFilter" for the script property. This must be an absolute path to a PhantomJS module you supply. Your module will be required and called as a function to filter the html snapshot output. Your module's function will receive the entire raw html content as a single input string, and must return the filtered html content.
customFilter Example:
// option snippet showing snapshotScript object with "customFilter":
{
  snapshotScript: {
    script: "customFilter",
    module: "/path/to/myFilter.js"
  }
}
// in myFilter.js:
module.exports = function (content) {
  return content.replace(/someregex/g, "somereplacement"); // remove or replace anything
};
A more complete example using custom options is available here.
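For the string form of snapshotScript, a custom script has to read its inputs positionally from system.args in the order listed earlier. The sketch below names those positions; parseSnapshotArgs and the property names are hypothetical, and treating the two flags as the string "true" is an assumption, since the source does not say how flags are encoded. In a PhantomJS script the input would be require("system").args; a plain array stands in for it here.

```javascript
// Hypothetical helper: give names to the positional arguments that
// html-snapshots passes to a custom PhantomJS snapshot script.
function parseSnapshotArgs(args) {
  return {
    scriptPath: args[0],                    // path to this PhantomJS script
    outputFile: args[1],                    // the output path
    url: args[2],                           // the url to snapshot
    selector: args[3],                      // selector signalling page completion
    timeout: parseInt(args[4], 10),         // overall timeout (milliseconds)
    checkInterval: parseInt(args[5], 10),   // selector watch interval (milliseconds)
    useJQuery: args[6] === "true",          // assumption: flag arrives as "true"
    verbose: args[7] === "true",            // assumption: flag arrives as "true"
    filterModule: args[8]                   // custom module to load, if any
  };
}

var parsedArgs = parseSnapshotArgs([
  "/path/to/myScript.js", "./snapshots/index.html", "http://mysite.com",
  "#dynamic-content", "10000", "250", "false", "true", "/path/to/myFilter.js"
]);
```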
verbose
default: false
Specifies to turn on extended console output in the PhantomJS process for debugging purposes. Can be applied to all pages, or just specific page(s). It is recommended to do this one page at a time, as the output can be large, and interleaved with parallel processes. See following explanation of types for how to debug just one page, and also this example.
The value can be one of these javascript types:
"boolean"
If the value is a boolean, it is used for every page. Note that if it is any scalar type such as "string" or "number", it will be interpreted as a boolean using javascript rules. Coerced string values "true", "yes", and "1" are specifically true, all others are false.
"object"
If the value is an object, it is interpreted as key/value pairs where the key must match the url (or path in the case of robots.txt style) found by the input generator. This allows you to specify the use of verbose output for individual pages. The reserved key "__default" allows you to specify the default verbose usage so you don't have to specify usage for every individual page.
"function"
If the value is a function, it is called for every page and passed a single argument that is the url (or path in the case of robots.txt style) found in the input. The function must return a value to use for this option for the page it is given. The value returned for a given page must be a boolean.
timeout
default: 10000 (milliseconds)
Specifies the time to wait for the selector to become visible.
The value can be one of these javascript types:
"number"
If the value is a number, it is used for every page in the website.
"object"
If the value is an object, it is interpreted as key/value pairs where the key must match the url (or path in the case of robots.txt style) found by the input generator. This allows you to specify timeouts for individual pages. The reserved key "__default" allows you to specify the default timeout so you don't have to specify a timeout for every individual page.
"function"
If the value is a function, it is called for every page and passed a single argument that is the url (or path in the case of robots.txt style) found in the input. The function must return a value to use for this option for the page it is given. The value returned for a given page must be a number.
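Several options (selector, useJQuery, verbose, timeout) share the same scalar, object, or function shape. The following sketch shows one way such a value could be resolved for a single page. It illustrates the documented semantics only; resolveOption is a hypothetical name and this is not the library's internal code.

```javascript
// Hypothetical resolver for a per-page option value.
function resolveOption(option, page) {
  if (typeof option === "function") {
    return option(page);                    // called with the url or path
  }
  if (option !== null && typeof option === "object" && !Array.isArray(option)) {
    // The key must match the page exactly, otherwise "__default" applies.
    return Object.prototype.hasOwnProperty.call(option, page)
      ? option[page]
      : option.__default;
  }
  return option;                            // scalar: used for every page
}

var timeouts = {
  "http://mysite.com/superslowpage": 20000,
  "__default": 10000
};
var slow = resolveOption(timeouts, "http://mysite.com/superslowpage"); // 20000
var other = resolveOption(timeouts, "http://mysite.com/contact");      // 10000
```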
processLimit
checkInterval
pollInterval
phantomjsOptions
default: ""
Specifies options to give to PhantomJS. Can specify per page or for all pages. Since PhantomJS instances run per page, it is possible to specify different PhantomJS options per page. Useful for debugging PhantomJS scripts on a specific page. For PhantomJS options syntax, check out the current options. Check out the source for PhantomJS options coming next.
The value can be one of these javascript types:
"string"
If the value is a string, it is a single option string used for every page.
"array"
If the value is an array, it can contain multiple option strings used for every page.
"object"
If the value is an object, it is interpreted as key/value pairs where the key must match the url (or path in the case of robots.txt style) found by the input generator. This allows you to specify PhantomJS options for individual pages. The reserved key "__default" allows you to specify default options so you don't have to specify options for every individual page. The values can be either a string (for a single option), or an array (for multiple options).
"function"
If the value is a function, it is called for every page and passed a single argument that is the url (or path in the case of robots.txt style) found in the input. The function must return a value to use for this option for the page it is given. The value returned for a given page must be either a string or an array.
Multiple Options Examples:
// option snippet showing multiple options for all pages
{
  phantomjsOptions: ["--load-images=false", "--ignore-ssl-errors=true"]
}
// option snippet showing multiple options for one page only (object notation)
{
  phantomjsOptions: {
    // key must exactly match the page as defined in the input (sitemap, array, robots, etc)
    "http://mysite.com/mypage": ["--load-images=false", "--ignore-ssl-errors=true"]
  }
}
An example demonstrating how to debug a PhantomJS script is available here. It also demonstrates per-page option usage.
phantomjs
Here is an example apache rewrite rule for rewriting _escaped_fragment_ requests to the snapshots directory on your server.
<ifModule mod_rewrite.c>
RewriteCond %{QUERY_STRING} ^_escaped_fragment_=(.*)$
RewriteCond %{REQUEST_URI} !^/snapshots [NC]
RewriteRule ^(.*)/?$ /snapshots/$1 [L]
</ifModule>
This rewrites any request for a url that carries _escaped_fragment_ (perhaps found by a bot in your robots.txt or sitemap.xml) to the snapshot output directory. In this example, no translation is done; it simply takes the request as is and serves its corresponding snapshot. So a request for http://mysite.com/?_escaped_fragment_= serves the mysite.com homepage snapshot.
You can also route _escaped_fragment_ requests to your snapshots in ExpressJS with a similar method using connect-modrewrite middleware. Here is an analogous example of a connect-modrewrite rule:
'^(.*)\\?_escaped_fragment_=.*$ /snapshots/$1 [NC L]'
An ExpressJS middleware example using html-snapshots can be found at wpspa/server/middleware/snapshots.js.
Here is the article on how this middleware works with html-snapshots.
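To make the effect of the rewrite rule concrete, the same pattern can be applied by hand to a request url. This is a sketch only: rewriteToSnapshot is a hypothetical helper, not part of html-snapshots or connect-modrewrite, and collapsing the doubled slash is a convenience added here.

```javascript
// Hypothetical helper mirroring the rule
// '^(.*)\\?_escaped_fragment_=.*$ /snapshots/$1 [NC L]'.
function rewriteToSnapshot(requestUrl) {
  var match = /^(.*)\?_escaped_fragment_=.*$/i.exec(requestUrl);
  if (!match) {
    return null; // not an _escaped_fragment_ request: leave it alone
  }
  // $1 keeps its leading slash, so collapse the resulting "//".
  return ("/snapshots/" + match[1]).replace(/\/{2,}/g, "/");
}

var home = rewriteToSnapshot("/?_escaped_fragment_=");           // "/snapshots/"
var contact = rewriteToSnapshot("/contact?_escaped_fragment_="); // "/snapshots/contact"
```

Requests without _escaped_fragment_ return null here, which corresponds to the rule not matching and the request being handled normally.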
This software is free to use under the LocalNerve, LLC MIT license. See the LICENSE file for license text and copyright information.
Third-party open source code used is listed in the package.json file.