
express-crawler-snapshots
express.js middleware for generating web page html snapshots for web crawlers
The purpose of this express middleware is to pre-render javascript-heavy pages for crawlers that can't execute javascript on their own. It is intended as a drop-in solution with minimal configuration.
It detects search engine crawler requests by inspecting the User-Agent header and proxies them to a phantomjs instance. Phantomjs renders the page fully, including any async javascript, and the resulting static html is proxied back to the crawler.
Please note, if you use html5 history (no hashbangs) in your application, don't add a <meta name="fragment" content="!"> tag for this to work correctly.
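The detection step described above can be sketched roughly as follows. This is an illustration only, not the package's actual code: the function name `looksLikeCrawlerRequest` and the shortened `EXAMPLE_AGENTS` list are hypothetical, and the real default agent list lives in the package source.

```javascript
// Hypothetical, shortened agent list used only for this sketch.
var EXAMPLE_AGENTS = ['googlebot', 'bingbot', 'yandex'];

// Returns true when a request looks like it should receive a static snapshot.
// `req` is assumed to have the shape of an express request:
// { headers: {...}, query: {...} }.
function looksLikeCrawlerRequest(req, agents, snapshotTrigger) {
  var query = req.query || {};
  var headers = req.headers || {};
  var userAgent = String(headers['user-agent'] || '').toLowerCase();
  var has = Object.prototype.hasOwnProperty;

  // 1. Explicit trigger in the query string, e.g. ?snapshot=true
  if (has.call(query, snapshotTrigger)) return true;

  // 2. Query param sent by crawlers that follow the AJAX crawling scheme
  if (has.call(query, '_escaped_fragment_')) return true;

  // 3. User-Agent contains one of the known crawler strings
  return agents.some(function (agent) {
    return userAgent.indexOf(agent.toLowerCase()) !== -1;
  });
}
```

A plain browser request with no trigger param falls through all three checks and is passed on to the normal route handlers.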
Requires Phantomjs 1.3+. The "phantomjs" binary must be available on the system PATH. See http://phantomjs.org/download.html for download & install instructions.
npm install express-crawler-snapshots --save
Just add it as express middleware, before route handlers
var express = require('express');
var crawlerSnapshots = require('express-crawler-snapshots');
var app = express();
//make sure you include the middleware before route handlers
app.use(crawlerSnapshots(/* {options} */));
app.use('/', require('./routes'));
Once that is done, open http://yourapp.com/?snapshot=true and view source to verify that it's working
| Option | Default | Description |
|---|---|---|
| timeout | 10000 | ms, how long to wait for page to load on a phantomjs instance |
| delay | 200 | ms, how long to wait for javascript to settle on the page |
| snapshotTrigger | 'snapshot' | string, query param, which if present, will trigger static page render |
| agents | see source | list of UA strings for crawler bots |
| shouldRender | snapshot trigger found in query params OR user agent matches one of the agents OR _escaped_fragment_ found in query params | function(req, options) { return bool;} |
| protocol | same as request | string, 'http' or 'https' |
| domain | same as request | string. Use this if you want phantomjs to call 'localhost' |
| maxInstances | 1 | max number of phantomjs instances to use |
| logger | console | object that implements 'info', 'warn', 'error' methods. Set to null for silent operation |
| attempts | 1 | number of attempts to render a page, in case phantomjs crashes or times out. Set to > 1 if phantomjs is unstable for you |
| loadImages | true | whether phantomjs should load images. Careful: there's a memory leak with older versions of Qt: https://github.com/ariya/phantomjs/issues/11390 |
| maxPageLoads | 100 | if > 0, will kill the phantomjs instance after x pages are loaded. Useful to work around memory leaks |
| phantomConfig | {} | an object which will be passed as config to PhantomJS |
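As an illustration of how the options above might be combined, here is a minimal sketch. The option names come from the table; the concrete values, the `/api` path exclusion and the `googlebot` check are assumptions made up for this example, and it is also assumed that the `options` argument passed to `shouldRender` carries `snapshotTrigger`.

```javascript
// Custom predicate following the documented signature
// function(req, options) { return bool; }.
// The '/api' prefix and the 'googlebot' substring are hypothetical choices.
function shouldRender(req, options) {
  var path = req.path || '';
  if (path.indexOf('/api') === 0) return false; // never snapshot API routes

  var query = req.query || {};
  var trigger = (options && options.snapshotTrigger) || 'snapshot';
  var ua = String((req.headers || {})['user-agent'] || '').toLowerCase();

  return (trigger in query) || ua.indexOf('googlebot') !== -1;
}

// Example values only; every key is an option listed in the table above.
var snapshotOptions = {
  timeout: 15000,        // ms to wait for the page to load
  delay: 500,            // ms to let javascript settle
  snapshotTrigger: 'snapshot',
  domain: 'localhost',   // have phantomjs call localhost
  maxInstances: 2,
  attempts: 2,           // retry once if phantomjs crashes or times out
  loadImages: false,
  maxPageLoads: 50,
  shouldRender: shouldRender
};

// Would then be passed to the middleware:
// app.use(crawlerSnapshots(snapshotOptions));
```

Replacing `shouldRender` overrides the default decision entirely, so a custom function has to repeat any of the default checks it still wants.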
In some rare cases you might want to kill all phantomjs instances programmatically. For example, an HTTP server won't close if it's serving an app that has this middleware active and some phantomjs instances spawned, because the instances are holding onto open connections.
var crawlerSnapshots = require('express-crawler-snapshots');
crawlerSnapshots.killAllInstances().then(function() {
// done
});
A new phantomjs process is started when a bot request comes in, the number of active phantomjs processes is < maxInstances, and all active processes are currently rendering a page.
If maxInstances is reached, all phantomjs instances are busy and a new request comes in, the request is queued until a phantomjs instance becomes available. The queue operates on a first in, first out basis.
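The pooling and queueing rules above can be sketched as a small generic pool. This is not the package's implementation: `createPool`, `acquire`, `release` and `startWorker` are hypothetical names, and a "worker" here stands in for a phantomjs process.

```javascript
// Minimal pool sketch: at most maxInstances workers exist, and requests that
// cannot be served immediately wait in a first in, first out queue.
function createPool(maxInstances, startWorker) {
  var idle = [];     // workers that exist and are not rendering
  var total = 0;     // number of workers started so far
  var waiting = [];  // queued callbacks, oldest first

  function acquire(callback) {
    // Reuse a free worker if there is one.
    if (idle.length > 0) return callback(idle.pop());
    // All workers are busy: start a new one while below the limit.
    if (total < maxInstances) {
      total += 1;
      return callback(startWorker(total));
    }
    // Limit reached and everything is busy: queue the request.
    waiting.push(callback);
  }

  function release(worker) {
    // Hand the worker to the oldest queued request, if any.
    var next = waiting.shift();
    if (next) next(worker);
    else idle.push(worker);
  }

  return {
    acquire: acquire,
    release: release,
    queued: function () { return waiting.length; }
  };
}
```

With `maxInstances` set to 1, a second and third request wait in `waiting` and are served in arrival order as the single worker is released.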
If a phantomjs process is killed from outside or dies, it is cleaned up gracefully and will be replaced on the next request - feel free to kill them on a whim :)
There's a hard timeout on opening a page and rendering content. If the timeout is reached and the render is still not complete, the phantomjs instance is assumed to be fubar and is forcefully killed.
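One way to picture the hard timeout is a timer that races the render and kills the instance if it fires first. The sketch below only illustrates that idea under assumed names (`renderWithHardTimeout`, `render`, `kill`); how the middleware does this internally is not shown in this document.

```javascript
// Runs render() and rejects if it has not settled within timeoutMs.
// On timeout, kill() is called so the stuck instance can be discarded.
// `render` is assumed to return a promise for the page html.
function renderWithHardTimeout(render, kill, timeoutMs) {
  return new Promise(function (resolve, reject) {
    var timer = setTimeout(function () {
      kill(); // instance is assumed to be broken, get rid of it
      reject(new Error('render timed out after ' + timeoutMs + ' ms'));
    }, timeoutMs);

    render().then(function (html) {
      clearTimeout(timer);
      resolve(html);
    }, function (err) {
      clearTimeout(timer);
      reject(err);
    });
  });
}
```

If the render finishes after the timer has already fired, the late result is ignored because the promise has already been rejected.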
Note that if an error happens while rendering a page, there are no retries by default (attempts is 1) - the middleware produces an error.
npm test
FAQs
The npm package express-crawler-snapshots receives a total of 1 weekly downloads. As such, express-crawler-snapshots popularity was classified as not popular.
We found that express-crawler-snapshots demonstrated a not healthy version release cadence and project activity because the last version was released a year ago. It has 1 open source maintainer collaborating on the project.