crawlit

A Node.js crawler that supports custom crawl rules for specific sites through third-party plugins.

  • Version: 0.1.4
  • Maintainers: 1

crawler

A Node.js crawler that supports custom plugins for implementing site-specific crawl rules. An example plugin for crawling Discuz 2.x sites is included.

###Finished Features

  • Crawl a site.
  • Filter: include or exclude URL paths of the site.
  • Plugins: Discuz 2.0 attachments, Discuz 2.0 filter.
  • Queue and crawl status.
  • Update mode.
  • Supports wget-style cookie configuration. You can export a site's cookies with Cookie exporter.
  • Uses jsdom and jQuery to extract the needed resources from a crawled page.
  • GBK to UTF-8 conversion.
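The include/exclude filter listed above can be pictured as a prefix check on the URL path. The following is only a minimal sketch of that idea, not crawlit's actual implementation: the function name `shouldCrawl`, the `include`/`exclude` rule shape, and the example URLs are all assumptions made for illustration.

```javascript
// Hypothetical helper, NOT part of crawlit's API.
// Decides whether a URL should be crawled, based on path prefixes.
function shouldCrawl(rawUrl, rules) {
  var include = rules.include || [];
  var exclude = rules.exclude || [];
  var path = new URL(rawUrl).pathname;

  // An exclude match always wins over an include match.
  var excluded = exclude.some(function (prefix) {
    return path.indexOf(prefix) === 0;
  });
  if (excluded) return false;

  // With no include rules, everything that is not excluded is allowed.
  if (include.length === 0) return true;
  return include.some(function (prefix) {
    return path.indexOf(prefix) === 0;
  });
}

// Assumed example rules: crawl the forum, but skip its admin area.
var rules = { include: ['/forum/'], exclude: ['/forum/admin/'] };
console.log(shouldCrawl('http://example.com/forum/thread-1.html', rules));   // true
console.log(shouldCrawl('http://example.com/forum/admin/index.php', rules)); // false
console.log(shouldCrawl('http://example.com/blog/post-1', rules));           // false
```

Letting exclude rules take precedence is a design choice in this sketch; it keeps a narrow exclusion usable inside a broader included path.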

##Feature List (reference: http://obmem.info/?p=753)

  • Support request.pipe; crawl the whole site in stream.pipe mode.
  • Basic site crawling.
  • Proxy support.
  • Login required? Cookie auth; update and save cookie data.
    • Form login?
    • Cookie support.
    • Browser User-Agent setting.
    • Multi-proxy support.
  • Monitor: disk usage(?), total page count, crawled count, crawling count, speed, memory usage, failed list.
  • CP: monitor viewer; start/pause/stop crawler; failed/retry; change config.
  • gzip/deflate ('accept-encoding' header): 5 times speedup.
  • Multi-workers/async.
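The gzip/deflate item above relies on the standard HTTP mechanism: the client sends an 'accept-encoding' header, and the server may reply with a compressed body labeled by 'content-encoding'. The sketch below shows that decoding step with Node's built-in zlib module. It is an illustration under assumptions, not crawlit code: `requestHeaders` and `decodeBody` are hypothetical names, and the demo compresses data locally instead of fetching a page.

```javascript
var zlib = require('zlib');

// Hypothetical names, NOT part of crawlit's API.
// Request headers advertising that compressed responses are accepted.
var requestHeaders = { 'accept-encoding': 'gzip, deflate' };

// Decompress a response body (a Buffer) according to its
// Content-Encoding header value.
function decodeBody(body, contentEncoding) {
  var encoding = (contentEncoding || '').toLowerCase();
  if (encoding === 'gzip') return zlib.gunzipSync(body);
  if (encoding === 'deflate') return zlib.inflateSync(body);
  return body; // no encoding or an unknown one: return unchanged
}

// Round-trip demo with locally compressed data (no network involved).
var html = Buffer.from('<html><body>hello</body></html>');
var gzipped = zlib.gzipSync(html);
console.log(decodeBody(gzipped, 'gzip').toString() === html.toString()); // true
```

Some servers send raw deflate data without the zlib wrapper; this sketch does not handle that case, and a real crawler would likely fall back to `zlib.inflateRawSync` there.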

##Install

npm install crawlit

##Usage

Basic usage:

// Load the base config (this example relies on it defining a global `config`)
require('./config/config.js');
// To override settings, put them in your own `./config/config.local.js`

// Settings can also be overridden in code
config.crawlOption.working_root_path = 'run/crawler';
config.crawlOption.resourceParser = require('./lib/plugins/discuz');


var crawlIt = require('crawlit').domCrawler;
crawlIt.init({update: false});
// Start crawling
crawlIt.crawl(config.crawlOption.page);
// Add other crawl interfaces here
###For more examples, see QiCai Crawl Example

##MIT

Package last updated on 08 Apr 2014
