crawler-ninja
A plugin-based web crawler made for SEO. Still in beta: please wait or contribute.
This crawler aims to build custom solutions for crawling and scraping sites. For example, it can help to audit a site, find expired domains, build a corpus, scrape text, find netlinking spots, retrieve site rankings, check whether web pages are correctly indexed, ...
It is all a matter of plugins! :-) We plan to build generic and simple plugins, but you are free to create your own.
The best environment to run Crawler Ninja is a Linux server.
Help and forks are welcome! Otherwise, please wait: this is a work in progress.
$ npm install crawler-ninja --save
###How to use an existing plugin?
var crawler = require("crawler-ninja");
var logger = require("crawler-ninja/plugins/log-plugin");
var c = new crawler.Crawler();
var log = new logger.Plugin(c);
c.on("end", function() {
var end = new Date();
console.log("End of crawl !, done in : " + (end - start));
});
var start = new Date();
c.queue({url : "http://www.mysite.com/"});
This script logs every crawled page to the console by using the log-plugin component.
The Crawler component emits several kinds of events that plugins can listen to (see below). When the crawl ends, the 'end' event is emitted.
###Create a new plugin
The following script shows the event callbacks that you can implement to create a new plugin.
You do not have to implement every crawler event. You can also reduce the scope of the crawl with the crawl options (see the option references section below).
// useful lib for managing URIs
var URI = require('crawler-ninja/lib/uri');
function Plugin(crawler) {
this.crawler = crawler;
/**
* Emitted when the crawler encounters an error
*
* @param error : the usual error object
* @param result : the result of the request (contains uri, headers, ...)
*/
this.crawler.on("error", function(error, result) {
});
/**
* Emitted when the crawler crawls a resource (html, js, css, pdf, ...)
*
* @param result : the result of the crawled resource
* @param $ : the jQuery-like object for accessing the HTML tags. It is null if the resource is not HTML.
* See the cheerio project : https://github.com/cheeriojs/cheerio
*/
this.crawler.on("crawl", function(result,$) {
});
/**
* Emitted when the crawler finds a link in a page
*
* @param page : the page that contains the link
* @param link : the link uri
* @param anchor : the anchor text
* @param isDoFollow : true if the link is dofollow
*/
this.crawler.on("crawlLink", function(page, link, anchor, isDoFollow) {
});
/**
* Emitted when the crawler finds an image
*
* @param page : the page that contains the image
* @param link : the image uri
* @param alt : the alt text
*/
this.crawler.on("crawlImage", function(page, link, alt) {
});
/**
* Emitted when the crawler finds a redirect (3xx)
*
* @param from : the url redirected from
* @param to : the url redirected to
* @param statusCode : the exact status code : 301, 302, ...
*/
this.crawler.on("crawlRedirect", function(from, to, statusCode) {
});
}
module.exports.Plugin = Plugin;
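As an illustration, here is a minimal sketch of a concrete plugin built on the events above. It counts crawled resources, links, redirects and errors. The names StatsPlugin and stats are hypothetical (they are not part of crawler-ninja), and a plain Node.js EventEmitter stands in for the crawler so that the sketch runs on its own. With the real crawler, you would pass the Crawler instance instead.

```javascript
var EventEmitter = require("events").EventEmitter;

// Hypothetical plugin : counts what the crawler reports through its events.
function StatsPlugin(crawler) {
    var self = this;
    this.crawler = crawler;
    this.stats = { resources : 0, doFollowLinks : 0, noFollowLinks : 0, redirects : 0, errors : 0 };

    this.crawler.on("error", function(error, result) {
        self.stats.errors++;
    });

    this.crawler.on("crawl", function(result, $) {
        self.stats.resources++;
    });

    this.crawler.on("crawlLink", function(page, link, anchor, isDoFollow) {
        if (isDoFollow) {
            self.stats.doFollowLinks++;
        } else {
            self.stats.noFollowLinks++;
        }
    });

    this.crawler.on("crawlRedirect", function(from, to, statusCode) {
        self.stats.redirects++;
    });
}

// Stand-in for the crawler : any object that emits the same events works here.
var fakeCrawler = new EventEmitter();
var plugin = new StatsPlugin(fakeCrawler);

fakeCrawler.emit("crawl", { uri : "http://www.mysite.com/" }, null);
fakeCrawler.emit("crawlLink", "http://www.mysite.com/", "http://www.mysite.com/a", "A", true);
fakeCrawler.emit("crawlLink", "http://www.mysite.com/", "http://www.other.com/", "Other", false);
fakeCrawler.emit("crawlRedirect", "http://www.mysite.com/old", "http://www.mysite.com/new", 301);

console.log(plugin.stats);
```

Keeping the counters on the plugin instance lets an "end" listener read them once the crawl is over.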
You can pass these options to the Crawler() constructor like this :
var c = new crawler.Crawler({
externalLinks : true,
scripts : false,
images : false
});
You can also use the options of mikeal's request module; they are passed directly to the request() method.
You can pass these options to the Crawler() constructor if you want them to be global, or as items in the queue() calls if you want them to apply only to that item (overriding the global options).
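The precedence between global and per-item options can be pictured as a shallow merge in which per-item values win. The sketch below only illustrates that rule. mergeOptions is a hypothetical helper, not a function of crawler-ninja, and the crawler's actual merging code may differ.

```javascript
// Hypothetical helper illustrating the precedence rule only :
// values given in queue() override values given to the constructor.
function mergeOptions(globalOptions, itemOptions) {
    var merged = {};
    var key;
    for (key in globalOptions) {
        if (Object.prototype.hasOwnProperty.call(globalOptions, key)) {
            merged[key] = globalOptions[key];
        }
    }
    for (key in itemOptions) {
        if (Object.prototype.hasOwnProperty.call(itemOptions, key)) {
            merged[key] = itemOptions[key];
        }
    }
    return merged;
}

// Options that would be given to the Crawler() constructor (timeout is a request option)
var globalOptions = { externalLinks : true, images : false, timeout : 20000 };
// Options that would be given to one queue() call
var itemOptions = { url : "http://www.mysite.com/", images : true };

var effective = mergeOptions(globalOptions, itemOptions);
console.log(effective);
```

In this example, images is true for the queued item only, while externalLinks and timeout keep their global values.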
If the predefined options are not sufficient, you can customize which kinds of links to crawl by implementing a callback function in the crawler config object. This is a convenient way to limit the crawl scope according to your needs. The following example crawls only dofollow links.
var c = new crawler.Crawler({
// add here predefined options you want to override
/**
* this callback is called for each link found in an html page
* @param : the uri of the page that contains the link
* @param : the uri of the link to check
* @param : the anchor text of the link
* @param : true if the link is dofollow
* @return : true if the crawler can crawl the link on this html page
*/
canCrawl : function(htmlPage, link, anchor, isDoFollow) {
return isDoFollow;
}
});
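As a further illustration, the callback can combine several conditions. The sketch below accepts only dofollow links that stay on the same host as the page. It assumes that htmlPage and link are passed as absolute URI strings, which the description above suggests but does not guarantee. hostOf and sameHostDoFollow are hypothetical names.

```javascript
// Returns the host name of an absolute URI, or null if it cannot be parsed.
function hostOf(uri) {
    try {
        return new URL(uri).hostname;
    } catch (e) {
        return null;
    }
}

// Assumption : htmlPage and link are absolute URI strings.
function sameHostDoFollow(htmlPage, link, anchor, isDoFollow) {
    var pageHost = hostOf(htmlPage);
    return isDoFollow === true && pageHost !== null && pageHost === hostOf(link);
}

var config = {
    canCrawl : sameHostDoFollow
};

var internal = config.canCrawl("http://www.mysite.com/", "http://www.mysite.com/page", "Page", true);
var external = config.canCrawl("http://www.mysite.com/", "http://www.other.com/", "Other", true);
var noFollow = config.canCrawl("http://www.mysite.com/", "http://www.mysite.com/page", "Page", false);

console.log(internal, external, noFollow);
```

The config object would then be passed to new crawler.Crawler(config), as in the example above.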
Crawler Ninja can be configured to execute each HTTP request through a proxy. It uses the npm package simple-proxies.
You have to install it in your project with the following command :
$ npm install simple-proxies --save
Here is a code sample that loads proxies from a file :
var proxyLoader = require("simple-proxies/lib/proxyfileloader");
var crawler = require("crawler-ninja");
var logger = require("crawler-ninja/plugins/log-plugin");
var proxyFile = "proxies.txt";
// Load proxies
var config = proxyLoader.config()
.setProxyFile(proxyFile)
.setCheckProxies(false)
.setRemoveInvalidProxies(false);
proxyLoader.loadProxyFile(config, function(error, proxyList) {
if (error) {
console.log(error);
}
else {
crawl(proxyList);
}
});
function crawl(proxyList){
var c = new crawler.Crawler({
externalLinks : true,
images : false,
scripts : false,
links : false, //link tags used for css, canonical, ...
followRedirect : true,
proxyList : proxyList
});
var log = new logger.Plugin(c);
c.on("end", function() {
var end = new Date();
console.log("Well done Sir !, done in : " + (end - start));
});
var start = new Date();
c.queue({url : "http://www.site.com"});
}
The current crawl logger is based on Bunyan.
You can query the log file after the crawl (see the Bunyan documentation for more information) in order to filter errors or other records.
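Bunyan writes one JSON record per line, each with a numeric level field (error is 50). As a minimal sketch, the log can therefore be filtered with a few lines of JavaScript. filterLog is a hypothetical helper, and the sample records and their url and statusCode fields are made up for illustration. The location of crawler.log in your project is an assumption.

```javascript
// Bunyan numeric levels : trace=10, debug=20, info=30, warn=40, error=50, fatal=60
var ERROR_LEVEL = 50;

// Parses newline-delimited JSON log text and keeps the records at or above minLevel.
function filterLog(text, minLevel) {
    return text.split("\n")
        .filter(function(line) { return line.trim() !== ""; })
        .map(function(line) { return JSON.parse(line); })
        .filter(function(record) { return record.level >= minLevel; });
}

// Sample records for illustration. With a real crawl, read the file instead :
// var text = require("fs").readFileSync("crawler.log", "utf8");
var text = [
    '{"name":"crawler","level":30,"msg":"crawl","url":"http://www.mysite.com/","statusCode":200}',
    '{"name":"crawler","level":50,"msg":"timeout","url":"http://www.mysite.com/slow"}'
].join("\n");

var errors = filterLog(text, ERROR_LEVEL);
console.log(errors.length + " error record(s)");
```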
You can also use the default logger or create a new one in your own plugin.
Use default loggers
var log = require("crawler-ninja/lib/logger.js").Logger;
log.info("log info"); // Log into crawler.log
log.debug("log debug"); // Log into crawler.log
log.error("log error"); // Log into crawler.log & errors.log
log.info({statusCode : 200, url: "http://www.google.com" }); // Log a JSON object
Create a new logger for your plugin
// Log into ./logs/myplugin.log
var log = require("crawler-ninja/lib/logger.js");
var myLog = log.createLogger("myLoggerName", "./logs/myplugin.log");
myLog.log({url:"http://www.google.com", pageRank : 10});
Feel free to read the code of log-plugin for more information on how to log from your own plugin.
More features and flexibility will be added in upcoming releases.
Not all sites can withstand an intensive crawl. The crawler provides two ways to control the crawl rate :
Implicit setting
Without any change to the crawler config, the crawler decreases the crawl rate after 5 timeout errors on a host : it forces a delay of 200ms between requests. If 5 more timeout errors occur, it uses a delay of 350ms, and after that a delay of 500ms between all requests for this host. If the timeouts persist, the crawler cancels the crawl on that host.
You can change the default values of this implicit setting (5 timeout errors and rates of 200, 350 and 500ms). Here is an example :
var crawler = require("crawler-ninja");
var logger = require("crawler-ninja/plugins/log-plugin");
var c = new crawler.Crawler({
// new values for the implicit setting
maxErrors : 5,
errorRates : [300, 600, 900]
});
var log = new logger.Plugin(c);
c.on("end", function() {
var end = new Date();
console.log("End of crawl !, done in : " + (end - start));
});
var start = new Date();
c.queue({url : "http://www.mysite.com/"});
Note that a higher value for maxErrors can decrease the number of analyzed pages. You can set maxErrors to -1 in order to deactivate the implicit setting.
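The step-down behaviour of the implicit setting can be modelled with a small state machine per host. This is only a sketch of the rule as described in this section, not the crawler's actual code. HostThrottle is a hypothetical name, and the sketch assumes that each step, including the final cancellation, is triggered by a further maxErrors timeouts.

```javascript
// Hypothetical model of the implicit setting for a single host.
function HostThrottle(maxErrors, errorRates) {
    this.maxErrors = maxErrors;
    this.errorRates = errorRates;
    this.timeouts = 0;       // timeouts seen since the last rate change
    this.step = -1;          // -1 means that no rate limit is applied yet
    this.cancelled = false;  // true when the crawl is given up on this host
}

// Delay in ms between two requests for this host (0 = no limit).
HostThrottle.prototype.currentRate = function() {
    return this.step < 0 ? 0 : this.errorRates[this.step];
};

// To be called for each timeout error on this host.
HostThrottle.prototype.onTimeout = function() {
    // maxErrors = -1 deactivates the implicit setting
    if (this.maxErrors === -1 || this.cancelled) {
        return;
    }
    this.timeouts++;
    if (this.timeouts < this.maxErrors) {
        return;
    }
    this.timeouts = 0;
    if (this.step + 1 < this.errorRates.length) {
        this.step++;            // move to the next, slower rate
    } else {
        this.cancelled = true;  // all rates were tried
    }
};

var throttle = new HostThrottle(5, [200, 350, 500]);
var i;

for (i = 0; i < 5; i++) { throttle.onTimeout(); }
console.log(throttle.currentRate());  // 200

for (i = 0; i < 10; i++) { throttle.onTimeout(); }
console.log(throttle.currentRate());  // 500

for (i = 0; i < 5; i++) { throttle.onTimeout(); }
console.log(throttle.cancelled);      // true
```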
Explicit setting
In this configuration, you apply the same crawl rate to all requests on all hosts.
var crawler = require("crawler-ninja");
var logger = require("crawler-ninja/plugins/log-plugin");
var c = new crawler.Crawler({
rateLimits : 200 //200ms between each request
});
var log = new logger.Plugin(c);
c.on("end", function() {
var end = new Date();
console.log("End of crawl !, done in : " + (end - start));
});
var start = new Date();
c.queue({url : "http://www.mysite.com/"});
If both settings are applied to the same crawl, the crawler switches to the implicit setting once the "maxErrors" number of timeout errors has been reached.