node-crawling-framework

Current stage: alpha (work in progress)

"node-crawling-framework" is a crawling & scraping framework for NodeJs heavily inspired by Scrapy.

A Node job server is also in the works (a rough scrapyd equivalent based on BullJs).

Features (not yet fully tested or finalized)

The core is working: Crawler, Scraper, Spider, item processors (pipeline), DownloadManager, downloader.

  • Modular and easily extendable architecture through middlewares and class inheritance:

    • add your own middlewares for spiders, item processors, and downloaders (see the sketch after this list),
    • extend framework spiders and get some features for free.
  • DownloadManager: delay and concurrency limit settings,

  • RequestDownloader: a downloader based on the request package,

  • Downloader middlewares:

    • cookie: handle cookie storage between requests,
    • defaultHeaders: add default headers to each request,
    • retry: retry requests on error,
    • stats: collect stats during the crawl (request and error counts, ...)
  • Spiders:

    • BaseSpider: every spider must inherit from this one,
    • Sitemap: parse sitemaps and feed the spider with the URLs found,
    • Elasticsearch: feed the spider with URLs from Elasticsearch
  • Spider middlewares:

    • cheerio: cheerio helper on the response to get a cheerio object,
    • scrapeUtils: cheerio plus some helpers to facilitate scraping (methods: scrape, scrapeUrl, scrapeRequest, ...),
    • filterDomains: filter out requests to non-authorized domains
  • Item processor middlewares:

    • printConsole: log items to the console,
    • jsonLineFileExporter: write scraped items to a JSON file, one line = one JSON object (easier to parse afterwards, smaller memory footprint),
    • logger: log items through the logger,
    • elasticsearchExporter: export items to Elasticsearch
  • Logger: configurable logger (default: console)
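
Since the middleware contract isn't documented yet (see the TODO list), here is only a minimal sketch of a custom item-processor middleware; the class shape and the processItem hook are assumptions, not confirmed API:

// Minimal sketch of a custom item-processor middleware.
// NOTE: the `processItem` hook and its signature are assumptions;
// check the framework sources for the actual middleware contract.
class UppercaseAuthor {
  processItem(item) {
    // Transform each scraped item before the next middleware sees it.
    if (item.author) {
      item.author = item.author.toUpperCase();
    }
    return item;
  }
}

module.exports = UppercaseAuthor;

Built-in middlewares such as jsonLineFileExporter are enabled by name in the crawler configuration (see below); a custom middleware would presumably be registered the same way.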

Project example

See Quotesbot

Spider example

const { BaseSpider } = require('node-crawling-framework');

class CssSpider extends BaseSpider {
  constructor() {
    super();
    // URLs the crawl starts from.
    this.startUrls = ['http://quotes.toscrape.com'];
  }

  // `parse` is a generator: yield scraped items and/or follow-up requests.
  *parse(response) {
    const quotes = response.scrape('div.quote');
    for (const quote of quotes) {
      yield {
        text: quote.scrape('span.text').text(),
        author: quote.scrape('small.author').text(),
        tags: quote.scrape('div.tags > a.tag').text()
      };
    }
    // Follow the "next page" link.
    yield response.scrapeRequest({ selector: '.next > a' });
  }
}

module.exports = CssSpider;
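
As in Scrapy, yielded plain objects are treated as scraped items and run through the item-processor middlewares, while yielded requests (such as the one built by scrapeRequest) are scheduled for download and parsed in turn; that is how this spider follows the pagination.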

Crawler configuration example

module.exports = {
  settings: {
    maxDownloadConcurency: 1, // maximum download concurrency, default: 1
    filterDuplicateRequests: true, // filter already scraped requests, default: true
    delay: 100, // delay in ms between requests, default: 0
    maxConcurrentScraping: 500, // maximum concurrent scraping, default: 500
    maxConcurrentItemsProcessingPerResponse: 100, // maximum concurrent item processing per response, default: 100
    autoCloseOnIdle: true // auto close the crawler when crawling is finished, default: true
  },
  logger: null, // logger, must implement console interface, default: console
  spider: {
    type: '', // spider to use for crawling, search spider in ${cwd} or ${cwd}/spiders, can also be a class definition object
    options: {}, // spider constructor args
    middlewares: {
      scrapeUtils: {}, // add utils methods to the response, ex: "response.scrape()"
      filterDomains: {} // avoid unwanted domain requests from being scheduled
    }
  },
  itemProcessor: {
    middlewares: {
      jsonLineFileExporter: {}, // write scraped items to a JSON file, one line = one JSON object (easier to parse afterwards, smaller memory footprint)
      logger: {} // log scraped items through the crawler logger
    }
  },
  downloader: {
    type: 'RequestDownloader', // downloader to use, can also be a class definition object
    options: {}, // downloader constructor args
    middlewares: {
      stats: {}, // give some stats about requests, ex: number of requests/errors
      retry: {}, // retry on failed requests
      cookie: {} // store cookie between requests
    }
  }
};
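
As the comments note, spider.type (and downloader.type) can also be a class definition object instead of a name string. A minimal variant, assuming the spider from the earlier example is saved as ./spiders/CssSpider.js:

const CssSpider = require('./spiders/CssSpider');

module.exports = {
  // ...settings, logger, itemProcessor, and downloader as above...
  spider: {
    type: CssSpider, // pass the class itself instead of a name string
    options: {},
    middlewares: {
      scrapeUtils: {},
      filterDomains: {}
    }
  }
};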

Crawler instantiation example

const { createCrawler } = require('node-crawling-framework');

const config = require('./config');
const crawler = createCrawler(config, 'CssSpider');

crawler.crawl().then(() => {
  console.log('✨  Crawling done');
});
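
crawl() returns a promise; assuming it rejects on fatal errors, failures can be handled with a standard catch:

crawler
  .crawl()
  .then(() => console.log('✨  Crawling done'))
  .catch(err => console.error('Crawling failed:', err));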

TODO list

  • Add unit tests
  • Add documentation
  • Add MongoDB feeder/exporter
  • Make some benchmarks?
  • Finish formRequest scraping (add clickable elements)
  • Utils: add date parsing (moment wrapper), datapager helper?
  • Add multi-spider support?
  • Add the crawling queue to the settings / allow overriding the queue (could enable a shared Redis queue for distributed crawling)
  • Allow overriding/setting/configuring the DownloadManager (could enable proxy pool handling, for example)
  • Puppeteer downloader:
    • be compatible with the header and cookie middlewares
  • Split plugins/middlewares into packages
  • Command-line tool, "ncf-cli":
    • scaffolding: create a project (with a wizard), a spider, or any middleware
    • crawl: launch a crawl
    • deploy: deploy to node-job-server
  • Find a solution for DNS caching
  • Middleware to respect "robots.txt"
  • Limit max response size
  • Auto throttle
