Security News
tea.xyz Spam Plagues npm and RubyGems Package Registries
Tea.xyz, a crypto project aimed at rewarding open source contributions, is once again facing backlash due to an influx of spam packages flooding public package registries.
node-crawling-framework
Readme
Current stage: alpha (work in progress)
"node-crawling-framework" is a crawling & scraping framework for NodeJs heavily inspired by Scrapy.
A node job server is also in motion (kinda scrapyd equivalent based on BullJs).
The core is working: Crawler, Scraper, Spider, item processors (pipeline), DownloadManager, downloader.
Modular and easily extendable architecture through middlewares and class inheritance:
DownloadManager: delay and concurrency limit settings,
RequestDownloader: downloader based on the request package,
Downloader middlewares:
Spiders:
Spider middlewares:
Item processor middlewares:
Logger: configurable logger (default: console)
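To illustrate the middleware idea behind this architecture (this is a generic sketch, not the framework's actual API, which is not documented here; the processRequest method name is an illustrative assumption), a download pipeline can be modeled as a chain of objects that each get a chance to transform or delay a request before it is downloaded:

```javascript
// Minimal middleware-chain sketch, independent of node-crawling-framework.
// Each middleware receives the request and returns it (possibly modified).

class DelayMiddleware {
  constructor(ms) { this.ms = ms; }
  async processRequest(request) {
    // Wait before letting the request continue down the chain.
    await new Promise(resolve => setTimeout(resolve, this.ms));
    return request;
  }
}

class StatsMiddleware {
  constructor() { this.count = 0; }
  async processRequest(request) {
    // Count every request that passes through.
    this.count += 1;
    return request;
  }
}

async function runThroughMiddlewares(request, middlewares) {
  for (const mw of middlewares) {
    request = await mw.processRequest(request);
  }
  return request;
}
```

New behavior (retry, cookies, stats, …) is then added by registering another middleware rather than modifying the downloader itself.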
See Quotesbot for a full example:
const { BaseSpider } = require('node-crawling-framework');
class CssSpider extends BaseSpider {
constructor() {
super();
this.startUrls = ['http://quotes.toscrape.com'];
}
*parse(response) {
const quotes = response.scrape('div.quote');
for (let quote of quotes) {
yield {
text: quote.scrape('span.text').text(),
author: quote.scrape('small.author').text(),
tags: quote.scrape('div.tags > a.tag').text()
};
}
yield response.scrapeRequest({ selector: '.next > a' });
}
}
module.exports = CssSpider;
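Note that parse is a generator: each yielded plain object is treated as a scraped item, while yielded requests (like scrapeRequest above) are scheduled for further crawling. That pattern can be seen in isolation (a standalone sketch, independent of the framework; the isRequest marker is an illustrative assumption, not the framework's mechanism):

```javascript
// One generator can yield both items and follow-up "requests";
// the consumer decides what to do with each kind of value.
function* parse() {
  yield { text: 'an item' };
  yield { isRequest: true, url: 'http://example.com/page/2' };
}

const items = [];
const requests = [];
for (const value of parse()) {
  (value.isRequest ? requests : items).push(value);
}
```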
Example configuration (config.js):
module.exports = {
settings: {
maxDownloadConcurency: 1, // maximum download concurrency, default: 1
filterDuplicateRequests: true, // filter already scraped requests, default: true
delay: 100, // delay in ms between requests, default: 0
maxConcurrentScraping: 500, // maximum concurrent scraping, default: 500
maxConcurrentItemsProcessingPerResponse: 100, // maximum concurrent item processing per response, default: 100
autoCloseOnIdle: true // auto close crawler when crawling is finished, default:true
},
logger: null, // logger, must implement console interface, default: console
spider: {
type: '', // spider to use for crawling, search spider in ${cwd} or ${cwd}/spiders, can also be a class definition object
options: {}, // spider constructor args
middlewares: {
scrapeUtils: {}, // add utils methods to the response, ex: "response.scrape()"
filterDomains: {} // avoid unwanted domain requests from being scheduled
}
},
itemProcessor: {
middlewares: {
jsonLineFileExporter: {}, // write scraped items to a JSON file, one line = one JSON object (easier to parse afterwards, smaller memory footprint)
logger: {} // log scraped items through the crawler logger
}
},
downloader: {
type: 'RequestDownloader', // downloader to use, can also be a class definition object
options: {}, // downloader constructor args
middlewares: {
stats: {}, // give some stats about requests, ex: number of requests/errors
retry: {}, // retry on failed requests
cookie: {} // store cookie between requests
}
}
};
Running the crawler:
const { createCrawler } = require('node-crawling-framework');
const config = require('./config');
const crawler = createCrawler(config, 'CssSpider');
crawler.crawl().then(() => {
console.log('✨ Crawling done');
});
FAQs
Node.js crawling & scraping framework heavily inspired by Scrapy (Python)
The npm package node-crawling-framework receives a total of 0 weekly downloads. As such, node-crawling-framework's popularity was classified as not popular.
We found that node-crawling-framework has an unhealthy version release cadence and project activity because the last version was released a year ago. It has one open source maintainer collaborating on the project.