Comparing version 0.1.0 to 0.2.0
{ | ||
"name": "krawler", | ||
"version": "0.1.0", | ||
"version": "0.2.0", | ||
"description": "Fast and lightweight web crawler with built-in cheerio, xml and json parser.", | ||
@@ -8,2 +8,3 @@ "keywords": [ | ||
"javascript", | ||
"crawler", | ||
"crawling", | ||
@@ -16,3 +17,5 @@ "spider", | ||
"xml", | ||
"json" | ||
"json", | ||
"promise", | ||
"event" | ||
], | ||
@@ -51,3 +54,3 @@ "maintainers": [ | ||
"scripts": { | ||
"test": "mocha test/test.js" | ||
"test": "mocha test/index" | ||
}, | ||
@@ -60,3 +63,3 @@ "engines": [ | ||
}, | ||
"main": "./lib/krawler" | ||
"main": "./lib/index" | ||
} |
# node Krawler [![Build Status](https://travis-ci.org/ondrs/node-krawler.png?branch=master)](https://travis-ci.org/ondrs/node-krawler) | ||
Fast and lightweight web crawler with built-in cheerio, xml and json parser. | ||
Fast and lightweight promise/event based web krawler with built-in cheerio, xml and json parser. | ||
And of course ... the best :) | ||
@@ -14,15 +14,19 @@ | ||
```javascript | ||
var crawler = new Krawler; | ||
var Krawler = require('krawler') | ||
crawler | ||
.queue('http://ondraplsek.cz') | ||
var urls = [ | ||
'http://ondraplsek.cz' | ||
]; | ||
var krawler = new Krawler; | ||
krawler | ||
.queue(urls) | ||
.on('data', function($, url, response) { | ||
// $ - cheerio instance | ||
// url of the current webpage | ||
// response object from mikeal/request | ||
}) | ||
.on('err', function(err, url) { | ||
// there has ben an 'err' on 'url' | ||
.on('error', function(err, url) { | ||
// there has been an 'error' on 'url' | ||
}) | ||
@@ -34,9 +38,71 @@ .on('end', function() { | ||
Krawler provides three types of built-in parsers: | ||
- cheerio (default) | ||
- xml | ||
- json | ||
## Options | ||
Krawler provides the following API: | ||
```javascript | ||
var krawler = new Krawler({ | ||
maxConnections: 10, // maximum number of simultaneously open connections, default 10 | ||
parser: 'cheerio', // web page parser, default 'cheerio' | ||
// other options are 'xml', 'json' or false (no parser is used and raw data is returned) | ||
forceUTF8: false, // whether Krawler should convert the source string to UTF-8, default false | ||
}); | ||
``` | ||
mikeal/request is used for fetching web pages so any desired option from this package can be passed to Krawler's constructor. | ||
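One way to picture a single constructor object carrying both Krawler settings and request settings is the sketch below. It is only an illustration and not Krawler's actual implementation: `splitOptions` and `KRAWLER_DEFAULTS` are hypothetical names, the defaults are copied from the listing above, and `timeout` and `headers` stand in for any option the request package accepts.

```javascript
// Hypothetical helper, not part of Krawler: separates Krawler's own
// settings from everything else, which would be forwarded to request.
var KRAWLER_DEFAULTS = {
  maxConnections: 10,
  parser: 'cheerio',
  forceUTF8: false
};

function splitOptions(options) {
  options = options || {};
  var krawlerOptions = {};
  var requestOptions = {};

  // Start from the documented defaults.
  Object.keys(KRAWLER_DEFAULTS).forEach(function(key) {
    krawlerOptions[key] = KRAWLER_DEFAULTS[key];
  });

  // Known keys override a default; unknown keys are treated as request options.
  Object.keys(options).forEach(function(key) {
    if (Object.prototype.hasOwnProperty.call(KRAWLER_DEFAULTS, key)) {
      krawlerOptions[key] = options[key];
    } else {
      requestOptions[key] = options[key];
    }
  });

  return { krawler: krawlerOptions, request: requestOptions };
}

var split = splitOptions({
  parser: 'json',
  timeout: 5000,
  headers: { 'User-Agent': 'my-crawler' }
});

console.log(split.krawler);
console.log(split.request);
```

Under these assumptions, `parser` overrides its default while `timeout` and `headers` end up in the part intended for request.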
## Advanced Example | ||
```javascript | ||
var urls = [ | ||
'https://graph.facebook.com/nodejs', | ||
'https://graph.facebook.com/facebook', | ||
'https://graph.facebook.com/cocacola', | ||
'https://graph.facebook.com/google', | ||
'https://graph.facebook.com/microsoft', | ||
]; | ||
var krawler = new Krawler({ | ||
maxConnections: 5, | ||
parser: 'json', | ||
forceUTF8: true | ||
}); | ||
krawler | ||
.queue(urls) | ||
.on('data', function(json, url, response) { | ||
// do something with json... | ||
}) | ||
.on('error', function(err, url) { | ||
// there has been an 'error' on 'url' | ||
}) | ||
.on('end', function() { | ||
// all URLs have been fetched | ||
}); | ||
``` | ||
## Promises | ||
If your program flow is based on promises you can easily attach Krawler to your promise chain. | ||
The fetchUrl() method returns a Q promise. When the promise is fulfilled, the callback function is called with a result object. | ||
The result object has two properties: | ||
* data - parsed/raw content of the web page, based on the parser setting | ||
* response - response object from mikeal/request | ||
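To make the shape of that result object concrete, here is a small self-contained sketch. It does not call Krawler: `fakeFetchUrl` is a hypothetical stand-in for `fetchUrl()`, it uses the native `Promise` rather than Q, and the `data` and `response` values are invented for illustration (only `statusCode` is a real property of the response object).

```javascript
// Stand-in for krawler.fetchUrl(): resolves with the { data, response }
// shape described above. The values are made up for illustration.
function fakeFetchUrl(url) {
  return Promise.resolve({
    data: { name: 'nodejs' },        // what a 'json' parser might yield
    response: { statusCode: 200 }    // subset of the real response object
  });
}

var pending = fakeFetchUrl('https://graph.facebook.com/nodejs')
  .then(function(result) {
    // Check the HTTP status before trusting the parsed data.
    if (result.response.statusCode !== 200) {
      throw new Error('Unexpected status ' + result.response.statusCode);
    }
    return result.data.name;
  });

pending.then(function(name) {
  console.log(name);
});
```

Keeping `data` and `response` together in one object lets a later step in the chain inspect the status code or headers without a second request.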
```javascript | ||
var krawler = new Krawler; | ||
findUrl() | ||
.then(function(url) { | ||
return krawler.fetchUrl(url); | ||
}) | ||
.then(function(result) { | ||
// in this case result.data is a cheerio instance | ||
return processData(result.data); | ||
}) | ||
// and so on ... | ||