##Introduction
Node.js module for scraping websites together with their images, CSS, JS, and other resources.
##Installation

    npm install website-scraper

##Usage

    var scraper = require('website-scraper');
    var options = {
      urls: ['http://nodejs.org/'],
      directory: '/path/to/save/'
    };

    // with a callback
    scraper.scrape(options, function (error, result) {
      // handle error or use result here
    });

    // or with a promise
    scraper.scrape(options).then(function (result) {
      // use result here
    });

##API
`scrape(options, callback)`

Makes a request to each URL in `urls` and saves all files found with `sources` to `directory`.
options - object containing the following options:
- `urls`: array of URLs to load and filenames for them (required, see the example below)
- `directory`: path to save loaded files (required)
- `log`: boolean indicating whether to write the log to the console (optional, default: false)
- `defaultFilename`: filename for the index page (optional, default: 'index.html')
- `sources`: array of objects to load; specifies selectors and attribute values used to select files for loading (optional, see the default value in `lib/defaults.js`)
- `subdirectories`: array of objects; specifies subdirectories for file extensions. If `null`, all files are saved to `directory` (optional, see the example below)
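To make the `subdirectories` option more concrete, here is a minimal sketch of how an extension-to-directory mapping of this shape could be resolved. The `resolveSavePath` helper is hypothetical and is not part of website-scraper's API; it also assumes that a file whose extension is not listed falls back to `directory`, which the module may handle differently.

```javascript
var path = require('path');

// Hypothetical helper (not part of website-scraper): returns the path a file
// would be saved to, given the `directory` and `subdirectories` options.
function resolveSavePath(directory, subdirectories, filename) {
  var extension = path.extname(filename);
  var match = null;

  // A null `subdirectories` value means every file goes straight to `directory`.
  (subdirectories || []).forEach(function (subdirectory) {
    if (match === null && subdirectory.extensions.indexOf(extension) !== -1) {
      match = subdirectory.directory;
    }
  });

  // Assumption: extensions without a matching entry also go to `directory`.
  // path.posix keeps the output identical on every platform.
  return match === null
    ? path.posix.join(directory, filename)
    : path.posix.join(directory, match, filename);
}

var subdirectories = [
  {directory: 'img', extensions: ['.jpg', '.png', '.svg']},
  {directory: 'js', extensions: ['.js']}
];

console.log(resolveSavePath('/path/to/save', subdirectories, 'logo.png')); // /path/to/save/img/logo.png
console.log(resolveSavePath('/path/to/save', null, 'logo.png'));           // /path/to/save/logo.png
```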
callback - callback function (optional) that receives the following parameters:
- `error`: an `Error` object on failure, `null` on success
- `result`: `null` on failure; on success, an array of objects containing:
  - `url`: URL of the loaded page
  - `filename`: absolute filename where the page was saved
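The callback parameters described above can be handled with a plain Node-style function. The sketch below is only an illustration: `describeResult` is a hypothetical name, and the sample values are made up to match the documented shape rather than taken from a real run.

```javascript
// Hypothetical handler matching the documented callback parameters:
// `error` is an Error object or null, `result` is null or an array of
// objects with `url` and `filename` properties.
function describeResult(error, result) {
  if (error) {
    return ['Scraping failed: ' + error.message];
  }
  return result.map(function (page) {
    return page.url + ' -> ' + page.filename;
  });
}

// Made-up sample shaped like a successful result.
var sampleResult = [
  {url: 'http://nodejs.org/', filename: '/path/to/save/index.html'}
];

console.log(describeResult(null, sampleResult));
console.log(describeResult(new Error('request failed'), null));
```

A function like this could be called from inside the callback passed to `scrape(options, callback)`, for example `console.log(describeResult(error, result))`.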
##Examples
Let's scrape some pages from http://nodejs.org/ with images, CSS, and JS files and save them to `/path/to/save/`.
Imagine we want to load:
- http://nodejs.org/ to index.html
- http://nodejs.org/about to about.html
- http://blog.nodejs.org/ to blog.html

and separate files into directories:
- `img` for .jpg, .png, .svg (full path `/path/to/save/img`)
- `js` for .js (full path `/path/to/save/js`)
- `css` for .css (full path `/path/to/save/css`)

    var scraper = require('website-scraper');

    scraper.scrape({
      // pages to load; a plain string uses the default filename
      urls: [
        'http://nodejs.org/',
        {url: 'http://nodejs.org/about', filename: 'about.html'},
        {url: 'http://blog.nodejs.org/', filename: 'blog.html'}
      ],
      directory: '/path/to/save',
      // target subdirectory for each group of file extensions
      subdirectories: [
        {directory: 'img', extensions: ['.jpg', '.png', '.svg']},
        {directory: 'js', extensions: ['.js']},
        {directory: 'css', extensions: ['.css']}
      ],
      // elements and attributes that point to files to download
      sources: [
        {selector: 'img', attr: 'src'},
        {selector: 'link[rel="stylesheet"]', attr: 'href'},
        {selector: 'script', attr: 'src'}
      ]
    }).then(function (result) {
      console.log(result);
    });

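The `urls` array in the example mixes plain strings with `{url, filename}` objects. As a rough sketch of how the two forms relate, the hypothetical `normalizeUrls` helper below (not part of website-scraper's API) converts both into the same object shape. It assumes that a plain string URL is saved under `defaultFilename`, which is how the option descriptions read but may not match the module's internals exactly.

```javascript
// Hypothetical helper (not part of website-scraper): converts every entry of
// `urls` into an object with `url` and `filename` properties.
function normalizeUrls(urls, defaultFilename) {
  var fallback = defaultFilename || 'index.html'; // documented default

  return urls.map(function (item) {
    if (typeof item === 'string') {
      // Assumption: a plain string URL is saved under the default filename.
      return {url: item, filename: fallback};
    }
    return {url: item.url, filename: item.filename || fallback};
  });
}

var normalized = normalizeUrls([
  'http://nodejs.org/',
  {url: 'http://nodejs.org/about', filename: 'about.html'}
]);

console.log(normalized);
```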
##Dependencies
- cheerio
- request
- bluebird
- fs-extra
- underscore