domwaiter
A well-behaved URL scraper that brings you delicious DOM objects
Do you have a large collection of URLs you want to scrape? Scraping one page at a time is too slow. Scraping all the pages at once could put too much stress on the target website, and it could also crash your Node.js process due to excess memory usage. That's where this package comes in: its built-in rate limiter lets you collect those pages quickly (and respectfully), and its event-emitting API keeps memory usage low.
Features
- Uses Promises so it's async/await friendly
- Event-emitting API to keep a low memory footprint
- Supports fetching JSON too (instead of HTML DOM)
- Rate limiting powered by bottleneck
- DOM parsing powered by cheerio (optional; can be disabled)
- HTTP requests powered by got
Installation
npm install domwaiter
Usage
const domwaiter = require('domwaiter')
const pages = [
  { url: 'https://help.github.com/en', language: 'English' },
  { url: 'https://help.github.com/ja', language: 'Japanese' },
  { url: 'https://help.github.com/cn', language: 'Chinese' }
]
domwaiter(pages)
  .on('page', (page) => {
    console.log(page.language, page.$('title').text())
  })
  .on('error', (err) => {
    console.error(err)
  })
  .on('done', () => {
    console.log('Done!')
  })
API
This module exports a single function domwaiter:
domwaiter(pages, [opts])
pages Array (required) - Each item in the array must have a url property with a fully-qualified HTTP(S) URL. These objects can optionally have other properties, which will be included in the emitted page events. See Events below.
opts Object (optional)
parseDOM Boolean - Defaults to true. Set to false if you don't need the parsed page.$ DOM object. Disabling DOM parsing will boost performance.
json Boolean - Defaults to false. Set to true if you're fetching JSON instead of HTML. If true, a json property will be present on each emitted page object (and the $ and body properties will NOT be present).
maxConcurrent Number - How many jobs can be executing at the same time. Defaults to 5. This option is passed to the underlying bottleneck instance.
minTime Number - How long to wait after launching a job before launching another one. Defaults to 500 (milliseconds). This option is passed to the underlying bottleneck instance.
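To get a feel for how these arguments fit together, the sketch below prepares a pages array and estimates the shortest possible run time implied by minTime. The helper names toPages and minLaunchMs are hypothetical and not part of domwaiter. The estimate assumes only what is described above (jobs launch at least minTime milliseconds apart) and ignores response times and maxConcurrent queuing, so treat it as a lower bound rather than a prediction.

```javascript
// Hypothetical helpers for illustration; neither is part of domwaiter.

// Turn plain URL strings into page objects, keeping only fully-qualified
// HTTP(S) URLs. Extra properties (here, `index`) ride along on each page.
function toPages (urls) {
  const pages = []
  urls.forEach((url, index) => {
    try {
      const { protocol } = new URL(url) // throws on relative or malformed URLs
      if (protocol === 'http:' || protocol === 'https:') {
        pages.push({ url, index })
      }
    } catch (err) {
      // Not a fully-qualified URL, so skip it
    }
  })
  return pages
}

// Lower bound on total run time: with N jobs launched at least `minTime`
// milliseconds apart, the last job cannot start before (N - 1) * minTime.
function minLaunchMs (pageCount, minTime = 500) {
  return Math.max(0, pageCount - 1) * minTime
}

const pages = toPages([
  'https://help.github.com/en',
  'https://help.github.com/ja',
  '/relative/path',
  'ftp://example.com/file'
])

// Options as documented above; the values are arbitrary examples.
const opts = { parseDOM: false, maxConcurrent: 2, minTime: 1000 }

console.log(pages.length) // 2
console.log(minLaunchMs(pages.length, opts.minTime)) // 1000
```

The resulting pages and opts could then be passed along as domwaiter(pages, opts). With the default minTime of 500, a list of 100 pages needs at least 49.5 seconds just to launch every request.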
Events
The domwaiter function returns an event emitter which emits the following events:
beforePageLoad - Emitted with the page object before the page is requested, for any optional prehandling you want to do, e.g. setting up a request timer.
page - Emitted after the page has been requested and the response is parsed. The handler receives an object which is a shallow clone of the original page object you provided, but with two added properties:
body: the raw HTTP response body text
$: The body parsed into a jQuery-like cheerio DOM object.
error - Emitted when an error occurs fetching a URL.
done - Emitted when all the pages have been fetched.
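As one way to use these events together, here is a minimal sketch of the request timer mentioned under beforePageLoad. The trackTimings helper is hypothetical and not part of domwaiter. Because the page event delivers a shallow clone rather than the original object, the sketch matches pages by url, which assumes each URL appears only once in the list. A plain Node.js EventEmitter stands in for the emitter that domwaiter returns so the example runs without network access.

```javascript
const { EventEmitter } = require('events')

// Hypothetical helper, not part of domwaiter: records how long each page
// took between its `beforePageLoad` and `page` events.
function trackTimings (emitter) {
  const started = new Map() // url -> start timestamp in ms
  const timings = new Map() // url -> elapsed ms

  emitter.on('beforePageLoad', (page) => {
    started.set(page.url, Date.now())
  })

  emitter.on('page', (page) => {
    // The emitted page is a shallow clone, so match on `url`, not identity.
    const start = started.get(page.url)
    if (start !== undefined) {
      timings.set(page.url, Date.now() - start)
      started.delete(page.url)
    }
  })

  return timings
}

// Stand-in emitter for illustration. With the real package you would pass
// the emitter returned by domwaiter(pages) instead.
const fake = new EventEmitter()
const timings = trackTimings(fake)

const original = { url: 'https://help.github.com/en', language: 'English' }
fake.emit('beforePageLoad', original)
fake.emit('page', { ...original, body: '<title>Help</title>' })
fake.emit('done')

console.log(timings.has(original.url)) // true
```

Keeping only a number per URL, instead of the page objects themselves, stays in line with the low-memory intent of the event-based API.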
Tests
npm install
npm test
Dependencies
- bottleneck: Distributed task scheduler and rate limiter
- cheerio: Tiny, fast, and elegant implementation of core jQuery designed specifically for the server
- got: Human-friendly and powerful HTTP request library for Node.js
Dev Dependencies
- jest: Delightful JavaScript Testing.
- nock: HTTP server mocking and expectations library for Node.js
- standard: JavaScript Standard Style