# crawler
A Node.js crawler that supports custom plugins to implement site-specific crawl rules.
Includes an example plugin for crawling Discuz! 2.x forums.
## Finished Features
- Crawl a whole site.
- Filters: include/exclude URL paths of the site (see the config sketch after this list).
- Plugins: Discuz! 2.0 attachments, Discuz! 2.0 filter.
- Queue and crawl status tracking.
- Update mode.
- Supports wget-style cookie config; you can export a site's cookies with a Cookie exporter.
- Uses jsdom and jQuery to extract the needed resources from each crawled page.
- GBK to UTF-8 conversion.
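The exact option names are not documented in this README. The sketch below shows a possible local override config: `working_root_path`, `resourceParser` and `page` come from the Usage section further down, while the `include`/`exclude` filter keys and the `cookieFile` path are hypothetical illustrations of the filter and cookie features, not crawlit's confirmed API.

```js
// config/config.local.js -- a sketch of a local override config.
// The filter and cookie keys below are hypothetical and may not match
// crawlit's real option names.
var config = require('./config.js');

config.crawlOption.page = 'http://bbs.example.com/index.php';          // start page (example URL)
config.crawlOption.working_root_path = 'run/crawler';                  // where crawled files are stored
config.crawlOption.resourceParser = require('../lib/plugins/discuz');  // site-specific plugin

// Hypothetical include/exclude URL path filters:
config.crawlOption.include = [/viewthread\.php/, /forumdisplay\.php/];
config.crawlOption.exclude = [/logging\.php/, /member\.php/];

// Hypothetical path to a wget-style cookie file exported with a Cookie exporter:
config.crawlOption.cookieFile = 'run/cookies.txt';

module.exports = config;
```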
## Feature List

Reference: http://obmem.info/?p=753
- Support request.pipe; crawl the whole site in stream pipe mode (see the sketch after this list).
- Basic site crawling.
- Proxy support.
- Login required? Cookie auth; update and save cookie data.
- Form login?
- Cookie support.
- Browser User-Agent setting.
- Multi-proxy support.
- Monitor: disk usage, total page count, crawled count, crawling count, speed, memory usage, failed list.
- Control panel: monitor viewer; start/pause/stop the crawler; retry failed pages; change config.
- gzip/deflate via the `Accept-Encoding` header: roughly 5x speedup.
- Multiple workers / async.
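Several of these items (streaming via pipe, proxy, cookies, User-Agent, gzip/deflate) concern the underlying HTTP layer rather than crawlit's own API. A minimal standalone sketch, assuming the `request` module as the HTTP client, shows what they look like at that layer:

```js
// Standalone sketch using the `request` module (not crawlit's API):
// stream one page to disk with gzip, a proxy, a User-Agent and a cookie jar.
var fs = require('fs');
var request = require('request');

request({
  url: 'http://bbs.example.com/index.php',   // example URL
  gzip: true,                                // sends Accept-Encoding: gzip, deflate and decompresses the response
  proxy: 'http://127.0.0.1:8087',            // hypothetical proxy address
  headers: { 'User-Agent': 'Mozilla/5.0 (compatible; crawler)' },
  jar: true                                  // keep cookies across requests
})
  .on('error', function (err) { console.error('request failed:', err); })
  .pipe(fs.createWriteStream('run/index.html'));
```

With `gzip: true`, the response is requested compressed and decoded before being piped, which is where the bandwidth savings behind the "5x speedup" item come from.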
## Install

```
npm install crawlit
```
## Usage

Basic usage:

```js
// Load the basic config
var config = require('./config/config.js');

// Override settings in your own config file `./config/config.local.js`,
// or override them directly:
config.crawlOption.working_root_path = 'run/crawler';
config.crawlOption.resourceParser = require('./lib/plugins/discuz');

var crawlIt = require('crawlit').domCrawler;
crawlIt.init({update: false});

// Start crawling
crawlIt.crawl(config.crawlOption.page);

// TODO: add other crawl interfaces
```
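The plugin (`resourceParser`) interface is not documented in this README. The sketch below is a hypothetical illustration of what a Discuz!-style plugin might export, assuming the crawler hands each page to the plugin as a jsdom window with jQuery attached (as the Finished Features list implies); the `parse` and `filter` names are assumptions, not crawlit's confirmed API.

```js
// lib/plugins/my-discuz.js -- hypothetical plugin sketch;
// crawlit's real plugin interface may differ.
module.exports = {
  // Return the resource URLs (threads, attachments, ...) to queue from a page.
  parse: function (window, pageUrl) {
    var $ = window.$;           // jQuery attached to the jsdom window
    var resources = [];
    // For a Discuz! 2.x forum: collect thread and attachment links.
    $('a[href*="viewthread.php"], a[href*="attachment.php"]').each(function () {
      resources.push($(this).attr('href'));
    });
    return resources;
  },

  // Optionally reject URLs that should not be crawled.
  filter: function (url) {
    return url.indexOf('logging.php') === -1;
  }
};
```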
### More Examples

See the QiCai Crawl Example.
## License

MIT