nodejs-web-scraper is a simple tool for scraping/crawling server-side rendered pages.
It supports features like recursive scraping (pages that "open" other pages), file download and handling, automatic retries of failed requests, concurrency limitation, pagination, request delay, and more. Tested on Node 10 and 12 (Windows 7, Linux Mint).
The API uses cheerio-advanced-selectors; see that package's documentation for reference.
For any questions or suggestions, please open a GitHub issue or contact me via https://nodejs-web-scraper.ibrod83.com/about
Installation
$ npm install nodejs-web-scraper
Basic examples
Collect articles from a news site
Let's say we want to get every article (from every category) from a news site. We want each item to contain the title,
story and image link (or links).
const { Scraper, Root, DownloadContent, OpenLinks, CollectContent } = require('nodejs-web-scraper');
const fs = require('fs');

(async () => {

    const config = {
        baseSiteUrl: `https://www.some-news-site.com/`,
        startUrl: `https://www.some-news-site.com/`,
        filePath: './images/',
        concurrency: 10,
        maxRetries: 3,
        logPath: './logs/'
    };

    const scraper = new Scraper(config);

    const root = new Root();
    const category = new OpenLinks('.category', { name: 'category' });
    const article = new OpenLinks('article a', { name: 'article' });
    const image = new DownloadContent('img', { name: 'image' });
    const title = new CollectContent('h1', { name: 'title' });
    const story = new CollectContent('section.content', { name: 'story' });

    root.addOperation(category);
    category.addOperation(article);
    article.addOperation(image);
    article.addOperation(title);
    article.addOperation(story);

    await scraper.scrape(root);

    const articles = article.getData();
    const stories = story.getData();
    fs.writeFile('./articles.json', JSON.stringify(articles), () => { });
    fs.writeFile('./stories.json', JSON.stringify(stories), () => { });
})();
This basically means: "Go to https://www.some-news-site.com; Open every category; Then open every article on each category page; Then collect the title, story and image href, and download all images on that page".
Get data of every page as a dictionary
An alternative, perhaps friendlier way to collect the data from a page is to use the "getPageObject" hook.
const { Scraper, Root, OpenLinks, CollectContent, DownloadContent } = require('nodejs-web-scraper');
const fs = require('fs');

(async () => {

    const pages = [];

    const getPageObject = (pageObject, address) => {
        pages.push(pageObject);
    };

    const config = {
        baseSiteUrl: `https://www.profesia.sk`,
        startUrl: `https://www.profesia.sk/praca/`,
        filePath: './images/',
        logPath: './logs/'
    };

    const scraper = new Scraper(config);
    const root = new Root();
    const jobAds = new OpenLinks('.list-row h2 a', { name: 'Ad page', getPageObject });
    const phones = new CollectContent('.details-desc a.tel', { name: 'phone' });
    const titles = new CollectContent('h1', { name: 'title' });

    root.addOperation(jobAds);
    jobAds.addOperation(titles);
    jobAds.addOperation(phones);

    await scraper.scrape(root);

    fs.writeFile('./pages.json', JSON.stringify(pages), () => { });
})();
Let's describe in words what's going on here: "Go to https://www.profesia.sk/praca/; Open every job ad; Collect the title and phone of each ad; Pass each page's data object to getPageObject". At the end, all collected page objects are written to pages.json.
Download all images from a page
A simple task: download all images from a page (including base64 images).
const { Scraper, Root, DownloadContent } = require('nodejs-web-scraper');

(async () => {

    const config = {
        baseSiteUrl: `https://spectator.sme.sk`,
        startUrl: `https://spectator.sme.sk/`,
        filePath: './images/',
        cloneFiles: true,
    };

    const scraper = new Scraper(config);
    const root = new Root();
    const images = new DownloadContent('img');

    root.addOperation(images);

    await scraper.scrape(root);
})();
When done, you will have an "images" folder with all downloaded files.
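If you also want a programmatic record of what was downloaded, the operation's getData() method returns the downloaded file names and their related data. A small sketch extending the example above:

const { Scraper, Root, DownloadContent } = require('nodejs-web-scraper');
const fs = require('fs');

(async () => {
    const scraper = new Scraper({
        baseSiteUrl: `https://spectator.sme.sk`,
        startUrl: `https://spectator.sme.sk/`,
        filePath: './images/',
        cloneFiles: true,
    });
    const root = new Root();
    const images = new DownloadContent('img');
    root.addOperation(images);

    await scraper.scrape(root);

    // getData() returns the names of the downloaded files and their related data.
    fs.writeFile('./images.json', JSON.stringify(images.getData()), () => { });
})();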
Use multiple selectors
If you need to select elements from different possible classes ("or" operator), just pass comma-separated classes.
This is part of the jQuery specification (which Cheerio implements), and has nothing to do with the scraper.
const { Scraper, Root, CollectContent } = require('nodejs-web-scraper');

(async () => {

    const config = {
        baseSiteUrl: `https://spectator.sme.sk`,
        startUrl: `https://spectator.sme.sk/`,
    };

    function getElementContent(elementContentString) {
        // Do something with the text content of each matched element.
    }

    const scraper = new Scraper(config);
    const root = new Root();
    const title = new CollectContent('.first_class, .second_class', { getElementContent });

    root.addOperation(title);

    await scraper.scrape(root);
})();
Advanced Examples
Get every job ad from a job-offering site. Each job object will contain a title, a phone and image hrefs. Since the site is paginated, we use the pagination feature.
const { Scraper, Root, OpenLinks, CollectContent, DownloadContent } = require('nodejs-web-scraper');
const fs = require('fs');

(async () => {

    const pages = [];

    const getPageObject = (pageObject, address) => {
        pages.push(pageObject);
    };

    const config = {
        baseSiteUrl: `https://www.profesia.sk`,
        startUrl: `https://www.profesia.sk/praca/`,
        filePath: './images/',
        logPath: './logs/'
    };

    const scraper = new Scraper(config);
    const root = new Root({ pagination: { queryString: 'page_num', begin: 1, end: 10 } });
    const jobAds = new OpenLinks('.list-row h2 a', { name: 'Ad page', getPageObject });
    const phones = new CollectContent('.details-desc a.tel', { name: 'phone' });
    const images = new DownloadContent('img', { name: 'images' });
    const titles = new CollectContent('h1', { name: 'title' });

    root.addOperation(jobAds);
    jobAds.addOperation(titles);
    jobAds.addOperation(phones);
    jobAds.addOperation(images);

    await scraper.scrape(root);

    fs.writeFile('./pages.json', JSON.stringify(pages), () => { });
})();
Let's describe again, in words, what's going on here: "Go to https://www.profesia.sk/praca/; Then paginate the root page, from 1 to 10; Then, on each pagination page, open every job ad; Then collect the title, phone and images of each ad."
Get an entire HTML file
const sanitize = require('sanitize-filename');
const fs = require('fs');
const { Scraper, Root, OpenLinks } = require('nodejs-web-scraper');

(async () => {

    const config = {
        baseSiteUrl: `https://www.profesia.sk`,
        startUrl: `https://www.profesia.sk/praca/`,
        removeStyleAndScriptTags: false // Keep style and script tags, since we want the complete HTML files.
    };

    let directoryExists;

    const getPageHtml = (html, pageAddress) => {
        if (!directoryExists) {
            fs.mkdirSync('./html');
            directoryExists = true;
        }
        const name = sanitize(pageAddress);
        fs.writeFile(`./html/${name}.html`, html, () => { });
    };

    const scraper = new Scraper(config);
    const root = new Root({ pagination: { queryString: 'page_num', begin: 1, end: 100 } });
    const jobAds = new OpenLinks('.list-row h2 a', { getPageHtml });

    root.addOperation(jobAds);

    await scraper.scrape(root);
})();
Description: "Go to https://www.profesia.sk/praca/; Paginate 100 pages from the root; Open every job ad; Save every job ad page as an html file;
Downloading a file that is not an image
const { Scraper, Root, DownloadContent, CollectContent } = require('nodejs-web-scraper');

(async () => {

    const config = {
        baseSiteUrl: `https://www.some-content-site.com`,
        startUrl: `https://www.some-content-site.com/videos`,
        filePath: './videos/',
        logPath: './logs/'
    };

    const scraper = new Scraper(config);
    const root = new Root();
    const video = new DownloadContent('a.video', { contentType: 'file' }); // The "file" contentType downloads whatever the links point to, not just images.
    const description = new CollectContent('h1');

    root.addOperation(video);
    root.addOperation(description);

    await scraper.scrape(root);

    console.log(description.getData());
})();
Description: "Go to https://www.some-content-site.com; Download every video; Collect each h1; At the end, get the entire data from the "description" object;
getElementContent and getPageResponse hooks
const { Scraper, Root, OpenLinks, CollectContent } = require('nodejs-web-scraper');

(async () => {

    const getPageResponse = async (response) => {
        // Do something with the page response before its children are scraped.
    };

    const myDivs = [];

    const getElementContent = (content, pageAddress) => {
        myDivs.push(`myDiv content from page ${pageAddress} is ${content}...`);
    };

    const config = {
        baseSiteUrl: `https://www.nice-site`,
        startUrl: `https://www.nice-site/some-section`,
    };

    const scraper = new Scraper(config);
    const root = new Root();
    const articles = new OpenLinks('article a');
    const posts = new OpenLinks('.post a', { getPageResponse });
    const myDiv = new CollectContent('.myDiv', { getElementContent });

    root.addOperation(articles);
    articles.addOperation(myDiv);
    root.addOperation(posts);
    posts.addOperation(myDiv);

    await scraper.scrape(root);
})();
Description: "Go to https://www.nice-site/some-section; Open every article link; Collect each .myDiv; Call getElementContent()".
"Also, from https://www.nice-site/some-section, open every post; Before scraping the children(myDiv object), call getPageResponse(); CollCollect each .myDiv".
Add additional conditions
In some cases, using the cheerio-advanced-selectors isn't enough to properly filter the DOM nodes. This is where the "condition" hook comes in. Both OpenLinks and DownloadContent can register a function with this hook, allowing you to decide whether a given DOM node should be scraped by returning true or false.
const { Scraper, Root, OpenLinks } = require('nodejs-web-scraper');

(async () => {

    const config = {
        baseSiteUrl: `https://www.nice-site`,
        startUrl: `https://www.nice-site/some-section`,
    };

    const condition = (cheerioNode) => {
        const text = cheerioNode.text().trim();
        if (text === 'some text i am looking for') {
            return true;
        }
    };

    const scraper = new Scraper(config);
    const root = new Root();
    const linksToOpen = new OpenLinks('.some-css-class-that-is-just-not-enough', { condition });

    root.addOperation(linksToOpen);

    await scraper.scrape(root);
})();
Scraping an auth protected site
Please refer to this guide: https://nodejs-web-scraper.ibrod83.com/blog/2020/05/23/crawling-subscription-sites/
API
class Scraper(config)
The main nodejs-web-scraper object. Starts the entire scraping process via Scraper.scrape(Root). Holds the configuration and global state.
These are the available options for the scraper, with their default values:
const config = {
    baseSiteUrl: '',
    startUrl: '',
    logPath: null,
    cloneFiles: true,
    removeStyleAndScriptTags: true,
    concurrency: 3,
    maxRetries: 5,
    delay: 200,
    timeout: 6000,
    filePath: null,
    auth: null,
    headers: null,
    proxy: null,
    showConsoleLogs: true
}
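For reference, here is a minimal sketch of overriding a few of these defaults when constructing a Scraper. The concrete values (the example.com URLs, the custom User-Agent header, the limits) are just illustrative assumptions, not recommendations of the library:

const { Scraper } = require('nodejs-web-scraper');

// Any option you omit keeps its default from the list above.
const scraper = new Scraper({
    baseSiteUrl: 'https://example.com',           // placeholder site
    startUrl: 'https://example.com/articles/',
    logPath: './logs/',                           // enables the automatic logs described below
    filePath: './downloads/',                     // where DownloadContent operations save files
    concurrency: 5,
    maxRetries: 3,
    delay: 500,                                   // pause between requests
    headers: { 'User-Agent': 'my-scraper-bot' }   // assumed to be a plain request-headers object
});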
Public methods:
Name | Description |
--- | --- |
async scrape(Root) | After all objects have been created and assembled, you begin the process by calling this method, passing the root object |
class Root([config])
Root is responsible for fetching the first page and then scraping its children. It can also be paginated, hence the optional config. For instance:
const root = new Root({ pagination: { queryString: 'page', begin: 1, end: 100 } });
The optional config takes these properties:
{
    pagination: {},
    getPageObject: (pageObject, address) => {},
    getPageHtml: (htmlString, pageAddress) => {},
    getPageData: (cleanData) => {},
    getPageResponse: (response) => {},
    getException: (error) => {}
}
Public methods:
Name | Description |
--- | --- |
addOperation(Operation) | Add a child operation (OpenLinks, DownloadContent, CollectContent) |
getData() | Gets all data collected by this operation. In the case of root, it will just be the entire scraping tree. |
getErrors() | In the case of root, it will show all errors in every operation. |
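For example, after scraping completes you can read the collected tree and errors straight from the root. A small self-contained sketch (the URLs and selector are placeholders):

const { Scraper, Root, CollectContent } = require('nodejs-web-scraper');

(async () => {
    const scraper = new Scraper({ baseSiteUrl: 'https://example.com', startUrl: 'https://example.com' });
    const root = new Root();
    const headings = new CollectContent('h2', { name: 'heading' });
    root.addOperation(headings);

    await scraper.scrape(root);

    console.log(root.getData());   // the entire scraping tree
    console.log(root.getErrors()); // all errors, from every operation
})();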
class OpenLinks(querySelector,[config])
Responsible for "opening links" in a given page. Basically it just creates a nodelist of anchor elements, fetches their html, and continues the process of scraping, in those pages - according to the user-defined scraping tree.
The optional config can have these properties:
{
    name: 'some name',
    pagination: {},
    condition: (cheerioNode) => {},
    getPageObject: (pageObject, address) => {},
    getPageHtml: (htmlString, pageAddress) => {},
    getElementList: (elementList) => {},
    getPageData: (cleanData) => {},
    getPageResponse: (response) => {},
    getException: (error) => {},
    slice: [start, end]
}
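As an illustration, here is a hedged sketch of the slice and getElementList options. It assumes slice works like Array.prototype.slice over the list of matched anchors (so [0, 5] opens only the first five links); the selector is a placeholder:

const { OpenLinks } = require('nodejs-web-scraper');

const firstFiveArticles = new OpenLinks('article a', {
    name: 'article',
    slice: [0, 5], // assumed to limit the operation to the first five matched links
    getElementList: (elementList) => {
        // Called with the matched elements before their pages are opened.
        console.log(`Matched ${elementList.length} links`);
    }
});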
Public methods:
Name | Description |
--- | --- |
addOperation(Operation) | Add a scraping "operation" (OpenLinks, DownloadContent, CollectContent) |
getData() | Will get the data from all pages processed by this operation |
getErrors() | Gets all errors encountered by this operation. |
class CollectContent(querySelector,[config])
Responsible for simply collecting text/html from a given page.
The optional config can receive these properties:
{
    name: 'some name',
    contentType: 'text',
    shouldTrim: true,
    getElementList: (elementList, pageAddress) => {},
    getElementContent: (elementContentString, pageAddress) => {},
    getAllItems: (items, address) => {},
    slice: [start, end]
}
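As an illustration, a hedged sketch that collects markup instead of plain text. It assumes contentType: 'html' collects each element's HTML rather than its text, and that shouldTrim: false skips whitespace trimming; the selector is a placeholder:

const { CollectContent } = require('nodejs-web-scraper');

const storyHtml = new CollectContent('section.content', {
    name: 'storyHtml',
    contentType: 'html',   // assumed to collect HTML instead of the default 'text'
    shouldTrim: false,     // assumed to keep surrounding whitespace as-is
    getElementContent: (elementContentString, pageAddress) => {
        console.log(`Collected ${elementContentString.length} characters from ${pageAddress}`);
    }
});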
Public methods:
Name | Description |
--- | --- |
getData() | Gets all data collected by this operation. |
class DownloadContent(querySelector,[config])
Responsible for downloading files/images from a given page.
The optional config can receive these properties:
{
    name: 'some name',
    contentType: 'image',
    alternativeSrc: ['first-alternative', 'second-alternative'],
    condition: (cheerioNode) => {},
    getElementList: (elementList) => {},
    getException: (error) => {},
    filePath: './somePath',
    slice: [start, end]
}
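For illustration, a hedged sketch for lazy-loaded images. It assumes alternativeSrc names fallback attributes (such as data-src) to read when the regular src is missing, and that a per-operation filePath overrides the global one; the attribute names and the condition are just examples:

const { DownloadContent } = require('nodejs-web-scraper');

const lazyImages = new DownloadContent('img', {
    name: 'lazyImages',
    contentType: 'image',
    alternativeSrc: ['data-src', 'data-lazy-src'], // assumed fallback attributes for lazy-loaded images
    filePath: './lazy-images/',                    // assumed to override the global filePath for this operation
    condition: (cheerioNode) => {
        // Example condition: skip 1px tracking pixels.
        return cheerioNode.attr('width') !== '1';
    }
});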
Public methods:
Name | Description |
--- | --- |
getData() | Gets all file names that were downloaded, and their relevant data |
getErrors() | Gets all errors encountered by this operation. |
Pagination
nodejs-web-scraper covers most scenarios of pagination (assuming the site is server-side rendered, of course). Pagination is configured through the "pagination" property, either on the Root or on an OpenLinks operation. For example, to open 1000 product pages whose page number is passed as a query string:
const productPages = new OpenLinks('a.product', { pagination: { queryString: 'page_num', begin: 1, end: 1000 } });
The pagination object also accepts an offset option:
{ pagination: { queryString: 'page_num', begin: 1, end: 100, offset: 10 } }
If the site paginates through the route rather than a query string, use routingString instead:
{ pagination: { routingString: '/', begin: 1, end: 100 } }
Error Handling
nodejs-web-scraper will automatically retry every failed request (except 404, 400, 403 and invalid images). The number of retries depends on the global config option "maxRetries", which you pass to the Scraper. If a request still fails after all retries, it will be skipped. After the entire scraping process is complete, all "final" errors will be printed as JSON into a file called "finalErrors.json" (assuming you provided a logPath).
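If you want to inspect failures in code as well, every operation (and the root) exposes getErrors(). A small sketch, with placeholder URLs and selector:

const { Scraper, Root, OpenLinks } = require('nodejs-web-scraper');

(async () => {
    const scraper = new Scraper({
        baseSiteUrl: 'https://example.com',
        startUrl: 'https://example.com',
        maxRetries: 3,
        logPath: './logs/' // also produces finalErrors.json when the process completes
    });
    const root = new Root();
    const articles = new OpenLinks('article a', { name: 'article' });
    root.addOperation(articles);

    await scraper.scrape(root);

    console.log(articles.getErrors()); // final errors of this operation only
    console.log(root.getErrors());     // final errors from every operation in the tree
})();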
Automatic logs
If a logPath was provided, the scraper will create a log for each operation object you create, plus the following: "log.json" (a summary of the entire scraping tree) and "finalErrors.json" (an array of all FINAL errors encountered). I really recommend using this feature, alongside your own hooks and data handling.
Concurrency
The program uses rather complex concurrency management. Because memory consumption can get very high in certain scenarios, I've force-limited the concurrency of pagination and "nested" OpenLinks operations. It should still be very quick. As a general note, I recommend limiting the concurrency to 10 at most. The config.delay option is also a key factor.
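In practice this just means keeping the relevant config options modest when constructing the Scraper; the numbers below are illustrative, not values prescribed by the library:

const { Scraper } = require('nodejs-web-scraper');

const scraper = new Scraper({
    baseSiteUrl: 'https://example.com', // placeholder
    startUrl: 'https://example.com',
    concurrency: 10, // keep this at 10 or below, as recommended above
    delay: 500       // pause between requests (the default is 200)
});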
License
Copyright 2020 ibrod83
Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.
THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
Disclaimer
The author, ibrod83, doesn't condone the usage of the program or a part of it, for any illegal activity, and will not be held responsible for actions taken by the user. Please use it with discretion, and in accordance with international/your local law.