
node-crawling-framework
Node.js crawling & scraping framework heavily inspired by Scrapy (Python)
Current stage: alpha (work in progress)
node-crawling-framework is a crawling & scraping framework for Node.js, heavily inspired by Scrapy.
A Node.js job server (a scrapyd equivalent based on BullJs) is also in the works.
The core components are working: Crawler, Scraper, Spider, item processors (pipeline), DownloadManager, and downloader.
Modular and easily extensible architecture through middlewares and class inheritance (see the sketch after this list):

- DownloadManager: delay and concurrency limit settings
- RequestDownloader: downloader based on the request package
- Downloader middlewares
- Spiders
- Spider middlewares
- Item processor middlewares
- Logger: configurable logger (default: console)
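For example, class inheritance makes it easy to share crawling behaviour between spiders. A minimal sketch, where `PaginatedSpider`, `parseItems`, and `nextSelector` are hypothetical names built only on the `BaseSpider` API shown in the Quotesbot example below:

```js
const { BaseSpider } = require('node-crawling-framework');

// Hypothetical reusable base class: concrete spiders inherit the
// pagination logic and only implement parseItems().
class PaginatedSpider extends BaseSpider {
  *parse(response) {
    yield* this.parseItems(response);
    // Follow the "next page" link; the selector is supplied by the subclass.
    yield response.scrapeRequest({ selector: this.nextSelector });
  }
}

class AuthorSpider extends PaginatedSpider {
  constructor() {
    super();
    this.startUrls = ['http://quotes.toscrape.com'];
    this.nextSelector = '.next > a';
  }

  *parseItems(response) {
    for (const quote of response.scrape('div.quote')) {
      yield { author: quote.scrape('small.author').text() };
    }
  }
}

module.exports = AuthorSpider;
```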
See the Quotesbot example:
```js
// spiders/CssSpider.js (spiders are looked up in ${cwd} or ${cwd}/spiders)
const { BaseSpider } = require('node-crawling-framework');

class CssSpider extends BaseSpider {
  constructor() {
    super();
    this.startUrls = ['http://quotes.toscrape.com'];
  }

  // parse() is a generator: yield plain objects to emit items,
  // yield requests to schedule further crawling.
  *parse(response) {
    const quotes = response.scrape('div.quote');
    for (const quote of quotes) {
      yield {
        text: quote.scrape('span.text').text(),
        author: quote.scrape('small.author').text(),
        tags: quote.scrape('div.tags > a.tag').text()
      };
    }
    // Follow the pagination link to the next page.
    yield response.scrapeRequest({ selector: '.next > a' });
  }
}

module.exports = CssSpider;
```
Configuration:

```js
// config.js
module.exports = {
  settings: {
    maxDownloadConcurency: 1, // maximum download concurrency, default: 1
    filterDuplicateRequests: true, // filter already scraped requests, default: true
    delay: 100, // delay in ms between requests, default: 0
    maxConcurrentScraping: 500, // maximum concurrent scraping, default: 500
    maxConcurrentItemsProcessingPerResponse: 100, // maximum concurrent item processing per response, default: 100
    autoCloseOnIdle: true // auto-close the crawler when crawling is finished, default: true
  },
  logger: null, // logger, must implement the console interface, default: console
  spider: {
    type: '', // spider to use for crawling; looked up in ${cwd} or ${cwd}/spiders, can also be a class definition object
    options: {}, // spider constructor args
    middlewares: {
      scrapeUtils: {}, // add utility methods to the response, e.g. response.scrape()
      filterDomains: {} // prevent requests to unwanted domains from being scheduled
    }
  },
  itemProcessor: {
    middlewares: {
      jsonLineFileExporter: {}, // write scraped items to a JSON file, one line = one JSON object (easier to parse afterwards, smaller memory footprint)
      logger: {} // log scraped items through the crawler logger
    }
  },
  downloader: {
    type: 'RequestDownloader', // downloader to use, can also be a class definition object
    options: {}, // downloader constructor args
    middlewares: {
      stats: {}, // report request stats, e.g. number of requests/errors
      retry: {}, // retry failed requests
      cookie: {} // store cookies between requests
    }
  }
};
```
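The `logger` slot accepts any object implementing the console interface. A minimal sketch of a file-backed logger; the `fileLogger` name and the exact method set (`log`, `info`, `warn`, `error`) are assumptions, since the config only states "must implement console interface":

```js
const fs = require('fs');

// Hypothetical logger that appends every message to crawl.log
// instead of writing to stdout.
const fileLogger = {
  log:   (...args) => fs.appendFileSync('crawl.log', args.join(' ') + '\n'),
  info:  (...args) => fileLogger.log('[info]', ...args),
  warn:  (...args) => fileLogger.log('[warn]', ...args),
  error: (...args) => fileLogger.log('[error]', ...args)
};

// In config.js: replace `logger: null` with `logger: fileLogger`.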
Running the crawler:

```js
// e.g. index.js
const { createCrawler } = require('node-crawling-framework');
const config = require('./config');

// The second argument selects the spider by name.
const crawler = createCrawler(config, 'CssSpider');

crawler.crawl().then(() => {
  console.log('✨ Crawling done');
});
```
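Per the `spider.type` comment in the config above, a class definition object can be supplied instead of a name. A minimal sketch, assuming `createCrawler` falls back to `config.spider.type` when no spider name is passed:

```js
const { createCrawler } = require('node-crawling-framework');
const CssSpider = require('./spiders/CssSpider');
const config = require('./config');

// Pass the spider class itself instead of a name string,
// skipping the ${cwd}/spiders lookup.
config.spider.type = CssSpider;

createCrawler(config)
  .crawl()
  .then(() => console.log('✨ Crawling done'));
```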