Introduction
Download a website to a local directory (including all css, images, js, etc.)


You can try it in the demo app (source)
Installation
npm install website-scraper
Usage
var scraper = require('website-scraper');
var options = {
  urls: ['http://nodejs.org/'],
  directory: '/path/to/save/'
};

// with callback
scraper.scrape(options, function (error, result) {
  /* handle error and result here */
});

// or with promise
scraper.scrape(options).then(function (result) {
  /* handle result here */
});
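
For instance, a minimal sketch of branching on the callback arguments (reusing options from above):

scraper.scrape(options, function (error, result) {
  if (error) {
    // on failure, error is an Error object and result is null
    console.log('scraping failed: ' + error.message);
  } else {
    // on success, error is null and result is an array of saved pages
    console.log('saved ' + result.length + ' page(s)');
  }
});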
API
scrape(options, callback)
Makes requests to urls and saves all files found with sources to directory.
options - object containing the following options:
- urls: array of urls to load and filenames for them (required, see example below)
- directory: path to save loaded files (required)
- defaultFilename: filename for the index page (optional, default: 'index.html')
- sources: array of objects to load; specifies selectors and attribute values to select files for loading (optional, see default value in lib/config/defaults.js)
- subdirectories: array of objects; specifies subdirectories for file extensions. If null, all files will be saved to directory (optional, see example below and the sketch just after this list)
- request: object, custom options for request (optional, see example below)
- recursive: boolean; if true, scraper will follow anchors in html files. Don't forget to set maxDepth to avoid infinite downloading (optional, see example below)
- maxDepth: positive number, maximum allowed depth for dependencies (optional, see example below)
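
As a quick sketch of the null case for subdirectories (the url and path are placeholders), the following saves every loaded file directly into directory, with no extension-based subfolders:

var scraper = require('website-scraper');

scraper.scrape({
  urls: ['http://nodejs.org/'],
  directory: '/path/to/save/',
  subdirectories: null  // no img/js/css folders; all files land in directory
}).then(function (result) {
  console.log(result);
});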
callback - callback function (optional), called with the following parameters:
- error: if error - Error object, if success - null
- result: if error - null, if success - array of objects containing:
  - url: url of loaded page
  - filename: filename where page was saved (relative to directory)
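
For instance, a minimal sketch of reading the result on success (the url and path are placeholders):

var scraper = require('website-scraper');

scraper.scrape({
  urls: ['http://nodejs.org/'],
  directory: '/path/to/save/'
}).then(function (result) {
  // result is an array of {url, filename} objects
  result.forEach(function (page) {
    console.log(page.url + ' was saved to ' + page.filename);
  });
});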
Examples
Example 1
Let's scrape some pages from http://nodejs.org/ with images, css, js files and save them to /path/to/save/.
Imagine we want to load:

- the home page http://nodejs.org/ to index.html
- the about page http://nodejs.org/about to about.html
- the blog http://blog.nodejs.org/ to blog.html

and separate files into directories:
- img for .jpg, .png, .svg (full path /path/to/save/img)
- js for .js (full path /path/to/save/js)
- css for .css (full path /path/to/save/css)
var scraper = require('website-scraper');

scraper.scrape({
  urls: [
    'http://nodejs.org/',  // saved with defaultFilename 'index.html'
    {url: 'http://nodejs.org/about', filename: 'about.html'},
    {url: 'http://blog.nodejs.org/', filename: 'blog.html'}
  ],
  directory: '/path/to/save',
  subdirectories: [
    {directory: 'img', extensions: ['.jpg', '.png', '.svg']},
    {directory: 'js', extensions: ['.js']},
    {directory: 'css', extensions: ['.css']}
  ],
  sources: [
    {selector: 'img', attr: 'src'},
    {selector: 'link[rel="stylesheet"]', attr: 'href'},
    {selector: 'script', attr: 'src'}
  ],
  request: {
    headers: {
      'User-Agent': 'Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 4 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Mobile Safari/535.19'
    }
  }
}).then(function (result) {
  console.log(result);
}).catch(function (err) {
  console.log(err);
});
Example 2. Recursive downloading
var scraper = require('website-scraper');

scraper.scrape({
  urls: ['http://example.com/'],
  directory: '/path/to/save',
  recursive: true,
  maxDepth: 1  // bound the recursion so anchors are followed only one level deep
}).then(console.log).catch(console.log);
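
Note that maxDepth: 1 should stop the scraper after following anchors one level from the start page; omitting maxDepth while recursive is true risks the infinite downloading mentioned above.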