
Takes html snapshots of your site's crawlable pages when a selector becomes visible.
Overview
html-snapshots is a flexible html snapshot library that uses PhantomJS to take html snapshots of the webpages served from your site. A snapshot is taken only once a specified selector is detected as visible in the output html. This tool is useful when your site is largely ajax content, or is an SPA, and you want your dynamic content indexed by search engines.
html-snapshots gets the urls to process from either a robots.txt or a sitemap.xml. Alternatively, you can supply an array of completely arbitrary urls, or a line-delimited text file of arbitrary host-relative paths.
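To make the text file input concrete, here is a minimal sketch of how line-delimited, host-relative paths could be combined with a hostname to form full urls. `pathsToUrls` is a hypothetical helper written for this illustration, not part of html-snapshots, and the `http` default is an assumption.

```javascript
// Hypothetical helper (not part of html-snapshots): turns a line-delimited
// list of host-relative paths into absolute urls for the given hostname.
function pathsToUrls(text, hostname, protocol) {
  var scheme = protocol || "http"; // assumed default for this sketch
  return text
    .split(/\r?\n/)
    .map(function (line) { return line.trim(); })
    .filter(function (line) { return line.length > 0; }) // skip blank lines
    .map(function (path) {
      // Make sure every path starts with exactly one leading slash.
      var normalized = path.charAt(0) === "/" ? path : "/" + path;
      return scheme + "://" + hostname + normalized;
    });
}

var urls = pathsToUrls("/\n/contact\n\nservices/web\n", "exampledomain.com");
console.log(urls);
```

The library's own handling of the text file may differ in detail; the point is only that each path is resolved against the single hostname you supply.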
html-snapshots processes all the urls in parallel in their own PhantomJS processes.
Getting Started
This library requires PhantomJS version 1.7.1 or later.
Installation
The simplest way to install html-snapshots is with npm. Running npm install html-snapshots downloads html-snapshots and all of its dependencies.
Grunt Task
If you are interested in the grunt task that uses this library, check out grunt-html-snapshots.
Example Usage
Simple example
var htmlSnapshots = require('html-snapshots');
htmlSnapshots.run({
  source: "path/to/robots.txt",
  hostname: "exampledomain.com",
  outputDir: "./snapshots",
  outputDirClean: true,
  selector: "#dynamic-content"
});
This reads the urls from your robots.txt and produces snapshots in the ./snapshots directory. In this example, a selector named "#dynamic-content" appears in all pages across the site. Once this selector is visible in a page, the html snapshot is taken.
Example - Per page selectors and timeouts
var htmlSnapshots = require('html-snapshots');
htmlSnapshots.run({
  input: "sitemap",
  source: "path/to/sitemap.xml",
  outputDir: "./snapshots",
  outputDirClean: true,
  // Per-page values are keyed by url; "__default" applies to every other page.
  selector: { "http://mysite.com": "#home-content", "__default": "#dynamic-content" },
  timeout: { "http://mysite.com/superslowpage": 6000, "__default": 5000 }
});
This reads the urls from your sitemap.xml and produces snapshots in the ./snapshots directory. In this example, a selector named "#dynamic-content" appears in all pages across the site except the home page, where "#home-content" appears (the appearance of a selector in the output indicates that the page is ready for a snapshot). Also, html-snapshots uses the default timeout (5000 ms) on all pages except http://mysite.com/superslowpage, where it waits 6000 ms.
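The per-page lookup described above can be sketched as follows. `resolvePerPage` is a hypothetical helper for illustration only; it assumes an exact match on the url string, which may not be how html-snapshots matches pages internally.

```javascript
// Hypothetical helper (not part of html-snapshots): resolves an option that
// is either a single value or an object keyed by url with a "__default" entry.
function resolvePerPage(option, url) {
  // A plain value (string, number) applies to every page.
  if (option === null || typeof option !== "object") {
    return option;
  }
  // Assumption: pages are matched on the exact url string.
  if (Object.prototype.hasOwnProperty.call(option, url)) {
    return option[url];
  }
  return option.__default;
}

var selector = { "http://mysite.com": "#home-content", "__default": "#dynamic-content" };
var timeout = { "http://mysite.com/superslowpage": 6000, "__default": 5000 };

console.log(resolvePerPage(selector, "http://mysite.com"));
console.log(resolvePerPage(selector, "http://mysite.com/contact"));
console.log(resolvePerPage(timeout, "http://mysite.com/superslowpage"));
```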
Example - Per page special output paths
var htmlSnapshots = require('html-snapshots');
htmlSnapshots.run({
  input: "sitemap",
  source: "path/to/sitemap.xml",
  outputDir: "./snapshots",
  outputDirClean: true,
  outputPath: {
    "http://mysite.com/services/?page=1": "services/page/1",
    "http://mysite.com/services/?page=2": "services/page/2"
  },
  selector: "#dynamic-content"
});
This example assumes sitemap.xml contains a couple of pages with query strings, and that we don't want html-snapshots to create directories with query string characters in their names. The server also needs a rewrite rule that reflects the same mapping, so that when _escaped_fragment_ shows up in the query string of a request, the snapshot is served from the appropriate directory.
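As a rough illustration of the mapping, the sketch below resolves a url to a snapshot location using an outputPath table like the one in the example. `snapshotFileFor` is hypothetical, and both the fallback to the url's pathname and the index.html file name are assumptions made for this sketch rather than documented library behavior.

```javascript
var path = require("path");

// Hypothetical helper (not part of html-snapshots): picks the output
// location for a url, preferring an explicit outputPath entry.
function snapshotFileFor(url, outputDir, outputPath) {
  var relative;
  if (outputPath && Object.prototype.hasOwnProperty.call(outputPath, url)) {
    relative = outputPath[url]; // explicit per-page override
  } else {
    relative = new URL(url).pathname; // assumed fallback: the url's path
  }
  // Assumption: one index.html per page directory.
  return path.posix.join(outputDir, relative, "index.html");
}

var outputPath = {
  "http://mysite.com/services/?page=1": "services/page/1",
  "http://mysite.com/services/?page=2": "services/page/2"
};

console.log(snapshotFileFor("http://mysite.com/services/?page=1", "./snapshots", outputPath));
console.log(snapshotFileFor("http://mysite.com/contact", "./snapshots", outputPath));
```

Whatever mapping you choose here must be mirrored by the server rewrite rule, otherwise requests carrying the query string will not find their snapshot.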
Example - Array
var htmlSnapshots = require('html-snapshots');
htmlSnapshots.run({
  input: "array",
  source: ["http://mysite.com", "http://mysite.com/contact", "http://mysite.com:82/special"],
  outputDir: "./snapshots",
  outputDirClean: true,
  selector: "#dynamic-content"
});
This generates snapshots for "/", "/contact", and "/special" from mysite.com. "/special" uses port 82. All three use the http protocol.
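Because array entries are complete urls, each one carries its own protocol, host, and port. The sketch below uses the standard URL class to show the parts of the three urls from the example; it only demonstrates url parsing and says nothing about how html-snapshots parses them internally.

```javascript
var sources = [
  "http://mysite.com",
  "http://mysite.com/contact",
  "http://mysite.com:82/special"
];

// Break each url into the parts that distinguish the three pages.
var parts = sources.map(function (source) {
  var parsed = new URL(source);
  return {
    protocol: parsed.protocol, // includes the trailing colon
    hostname: parsed.hostname,
    port: parsed.port,         // empty string when the default port is used
    path: parsed.pathname
  };
});

console.log(parts);
```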
Example - Completion callback, Remote robots.txt
var htmlSnapshots = require('html-snapshots');
htmlSnapshots.run({
  input: "robots",
  source: "http://localhost/robots.txt",
  hostname: "localhost",
  outputDir: "./snapshots",
  outputDirClean: true,
  selector: "#dynamic-content"
}, function(nonError) {
  // Called when html-snapshots has finished processing.
});
This generates snapshots in the ./snapshots directory for the paths found in http://localhost/robots.txt. Those paths are requested against "localhost" to get the actual html output, and "#dynamic-content" is expected to appear in all output. The callback function is called when snapshot processing concludes.
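If you prefer promises to callbacks, the completion callback can be wrapped. This is a minimal sketch: `runAsPromise` is a hypothetical helper, and `fakeLib` is a stand-in with the same run(options, callback) shape so the snippet runs without PhantomJS. The value passed to the callback is forwarded untouched, and the `true` used by the stand-in is arbitrary.

```javascript
// Hypothetical wrapper (not part of html-snapshots): resolves with whatever
// argument the completion callback receives, without interpreting it.
function runAsPromise(lib, options) {
  return new Promise(function (resolve) {
    lib.run(options, function (result) {
      resolve(result);
    });
  });
}

// Stand-in for require('html-snapshots'), for illustration only.
var fakeLib = {
  run: function (options, callback) {
    setTimeout(function () { callback(true); }, 0); // arbitrary value
  }
};

var done = runAsPromise(fakeLib, {
  input: "robots",
  source: "http://localhost/robots.txt",
  hostname: "localhost",
  outputDir: "./snapshots",
  selector: "#dynamic-content"
});

done.then(function (result) {
  console.log("snapshots finished:", result);
});
```

In real use you would pass the object returned by require('html-snapshots') in place of fakeLib.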
Options
Apart from the default settings, there are a number of options that can be specified. Options are passed in an object to the module's run method: htmlSnapshots.run({ optionName: value }).
Example Rewrite Rule
Here is an example apache rewrite rule for rewriting _escaped_fragment_ requests to the snapshots directory on your server.
<ifModule mod_rewrite.c>
RewriteCond %{QUERY_STRING} ^_escaped_fragment_=(.*)$
RewriteCond %{REQUEST_URI} !^/snapshots [NC]
RewriteRule ^(.*)/?$ /snapshots/$1 [L]
</ifModule>
This rewrites any request for a url that carries _escaped_fragment_ in its query string (perhaps a url found by a bot in your robots.txt or sitemap.xml) to the snapshot output directory. In this example no translation is done: the rule takes the request path as is and serves its corresponding snapshot. So a request for http://mysite.com/?_escaped_fragment_= serves the mysite.com homepage snapshot.
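For readers not running Apache, the same decision can be sketched in JavaScript. `rewriteForSnapshot` is a hypothetical function that approximates the two conditions and the rule above; it is not a byte-for-byte port of mod_rewrite, and the placeholder base url exists only so relative request paths can be parsed.

```javascript
// Hypothetical approximation of the rewrite rule (not part of html-snapshots).
// Returns the rewritten path, or null when the request is left untouched.
function rewriteForSnapshot(requestUrl) {
  var parsed = new URL(requestUrl, "http://placeholder.invalid");
  var query = parsed.search.slice(1); // drop the leading "?"

  // RewriteCond %{QUERY_STRING} ^_escaped_fragment_=(.*)$
  if (!/^_escaped_fragment_=(.*)$/.test(query)) {
    return null;
  }
  // RewriteCond %{REQUEST_URI} !^/snapshots [NC]
  if (/^\/snapshots/i.test(parsed.pathname)) {
    return null;
  }
  // RewriteRule ^(.*)/?$ /snapshots/$1 (leading slashes collapsed here)
  return "/snapshots/" + parsed.pathname.replace(/^\/+/, "");
}

console.log(rewriteForSnapshot("http://mysite.com/?_escaped_fragment_="));
console.log(rewriteForSnapshot("/contact?_escaped_fragment_="));
console.log(rewriteForSnapshot("/contact"));
```

Requests already under /snapshots are skipped so that a rewritten request is not rewritten a second time.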