# CrawlKit
A crawler based on PhantomJS. Allows discovery of dynamic content and supports custom scrapers. For all your ajaxy crawling & scraping needs.
- Parallel crawling/scraping via Phantom pooling (see the sketch below)
- Custom-defined link discovery
- Custom-defined runners (scrape, test, validate, etc.)
- Can follow redirects (and because it's based on PhantomJS, JavaScript redirects are followed as well as `<meta>` redirects)
- Streaming
- Resilient to PhantomJS crashes
- Ignores page errors
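For instance, the size of the PhantomJS pool and the per-page timeout are controlled via properties on the crawler instance. A minimal sketch — the `concurrency` and `timeout` properties are taken from the API docs, so double-check them against the version you have installed:

```js
const CrawlKit = require('crawlkit');

const crawler = new CrawlKit('http://your/page');
crawler.concurrency = 4; // number of PhantomJS instances crawling in parallel (example value)
crawler.timeout = 30000; // per-page timeout in milliseconds (example value)
```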
## Install

```sh
npm install crawlkit --save
```
## Usage
```js
const CrawlKit = require('crawlkit');
const anchorFinder = require('crawlkit/finders/genericAnchors');

const crawler = new CrawlKit('http://your/page');

// Discover links via the bundled generic anchor finder.
crawler.setFinder({
    getRunnable: () => anchorFinder,
});

crawler.crawl()
    .then((results) => {
        console.log(JSON.stringify(results, null, 2));
    }, (err) => console.error(err));
```
Also, have a look at the samples.
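Besides finders, you can plug in custom runners to scrape, test, or validate each crawled page. A minimal sketch of a hypothetical runner — the `addRunner` method and the runner interface (`getCompanionFiles`/`getRunnable`, with results reported via `window.callPhantom`) are taken from the API docs, so verify them against the version you have installed:

```js
const CrawlKit = require('crawlkit');

const crawler = new CrawlKit('http://your/page');

// Hypothetical runner that extracts the document title of every crawled page.
crawler.addRunner('title', {
    // No extra files need to be injected into the page for this runner.
    getCompanionFiles: () => [],
    // The returned function is evaluated inside the page by PhantomJS and
    // must report its result (or an error) back via window.callPhantom.
    getRunnable: () => function titleRunner() {
        window.callPhantom(null, document.title);
    },
});

crawler.crawl()
    .then((results) => console.log(JSON.stringify(results, null, 2)));
```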
## API
See the API docs (published) or the docs on doclets.io (live).
## Debugging
CrawlKit uses the `debug` module for logging. In short, you can set `DEBUG="*"` as an environment variable before starting your app to get all the logs. A saner configuration, especially if your page is big, is probably `DEBUG="*:info,*:error,-crawlkit:pool*"`.
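For example, to start your app with that configuration (the entry point `app.js` here is hypothetical):

```sh
DEBUG="*:info,*:error,-crawlkit:pool*" node app.js
```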
## Contributing
Please contribute away :)

Please add tests for new functionality and adapt them for changes.

Commit messages need to follow the conventional changelog format so that semantic-release picks up the semver versions properly. It is probably easiest to install commitizen via `npm install -g commitizen` and commit your changes via `git cz`.
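A conventional-changelog commit message takes the form `type(scope): subject`; for example (scope and subject made up for illustration):

```
feat(finder): add a configurable link filter
```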
## Available runners
## Products using CrawlKit