crawler-ninja
A plugin-based web crawler made for SEO. Please wait or contribute: it is still in beta.
This crawler aims to build custom solutions for crawling and scraping sites. For example, it can help to audit a site, find expired domains, build a corpus, scrape text, find netlinking spots, retrieve site rankings, check whether web pages are correctly indexed, ...
This is just a matter of plugins! :-) We plan to build generic and simple plugins, but you are free to create your own.
The best environment to run Crawler Ninja is a Linux server.
Help and forks are welcome! Otherwise, please wait: this is a work in progress.
$ npm install crawler-ninja --save
###How to use an existing plugin ?
var crawler = require("crawler-ninja");
var cs = require("crawler-ninja/plugins/console-plugin");
var c = new crawler.Crawler();
var consolePlugin = new cs.Plugin();
c.registerPlugin(consolePlugin);
c.on("end", function() {
var end = new Date();
console.log("End of crawl !");
});
c.queue({url : "http://www.mysite.com/"});
This script logs all crawled pages to the console by using the console-plugin component.
The Crawler calls the plugin functions depending on the kind of resource being crawled (HTML pages, CSS, scripts, links, redirections, ...). When the crawl ends, the event 'end' is emitted.
###Create a new plugin
The following script shows the event callbacks that you can implement to create a new plugin.
It is not mandatory to implement all plugin functions. You can also reduce the scope of the crawl by using the different crawl options (see the option references section below).
function Plugin() {
}
/**
* Function triggers when an Http error occurs for request made by the crawler
*
* @param the http error
* @param the http resource object (contains the uri of the resource)
* @param callback(error)
*/
Plugin.prototype.error = function (error, result, callback) {
}
/**
* Function triggers when an html resource is crawled
*
* @param result : the result of the resource crawl
* @param the jQuery-like object for accessing the HTML tags. Null if the resource
* is not HTML
* @param callback(error)
*/
Plugin.prototype.crawl = function(result, $, callback) {
}
/**
* Function triggers when the crawler found a link on a page
*
* @param the page url that contains the link
* @param the link found in the page
* @param the link anchor text
* @param true if the link is dofollow
* @param callback(error)
*/
Plugin.prototype.crawlLink = function(page, link, anchor, isDoFollow, callback) {
}
/**
* Function triggers when the crawler found an image on a page
*
* @param the page url that contains the image
* @param the image link found in the page
* @param the image alt
* @param callback(error)
*
*/
Plugin.prototype.crawlImage = function(page, link, alt, callback) {
}
/**
* Function triggers when the crawler found an HTTP redirect
* @param the from url
* @param the to url
* @param the redirect code (301, 302, ...)
* @param callback(error)
*
*/
Plugin.prototype.crawlRedirect = function(from, to, statusCode, callback) {
}
/**
* Function triggers when a link is not crawled (depending on the crawler setting)
*
* @param the page url that contains the link
* @param the link found in the page
* @param the link anchor text
* @param true if the link is dofollow
* @param callback(error)
*
*/
Plugin.prototype.unCrawl = function(page, link, anchor, isDoFollow, callback) {
}
module.exports.Plugin = Plugin;
You can pass options to the Crawler() constructor, for example:
var c = new crawler.Crawler({
externalDomains : true,
scripts : false,
images : false
});
You can also use mikeal's request options; they are passed directly to the request() method.
You can pass these options to the Crawler() constructor if you want them to be global, or as items in the queue() calls if you want them to be specific to that item (overwriting the global options).
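The precedence between global and per-item options can be pictured with the small sketch below. It is only a model of the rule stated above, not the crawler's internal code, and the helper name `effectiveOptions` is hypothetical.

```javascript
// Illustrative model only: this is not the crawler's internal code.
// Options given to queue() take precedence over those given to the constructor.
function effectiveOptions(globalOptions, itemOptions) {
    return Object.assign({}, globalOptions, itemOptions);
}

var globalOptions = { externalDomains : true, scripts : false, images : false };
var itemOptions = { url : "http://www.mysite.com/", images : true };

// images is true for this item only; the global options object is left untouched.
console.log(effectiveOptions(globalOptions, itemOptions));
```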
If the predefined options are not sufficient, you can customize which kinds of links to crawl by implementing a callback function in the crawler config object. This is a nice way to limit the crawl scope according to your needs. The following example crawls only dofollow links.
var c = new crawler.Crawler({
// add here predefined options you want to override
/**
* this callback is called for each link found in an html page
* @param : the uri of the page that contains the link
* @param : the uri of the link to check
* @param : the anchor text of the link
* @param : true if the link is dofollow
* @return : true if the crawler can crawl the link on this html page
*/
canCrawl : function(htmlPage, link, anchor, isDoFollow) {
return isDoFollow;
}
});
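Another possible callback, sketched here under the assumption that both the page and the link are given as absolute URLs, keeps the crawl on the host of the page that contains the link. The function name `sameHostDoFollow` is made up for this example.

```javascript
// Hypothetical canCrawl callback: follow only dofollow links that stay on the
// same host as the page. Assumption: both arguments are absolute URLs.
function sameHostDoFollow(htmlPage, link, anchor, isDoFollow) {
    if (!isDoFollow) {
        return false;
    }
    try {
        return new URL(link).host === new URL(htmlPage).host;
    } catch (e) {
        // The URL cannot be parsed: do not crawl it.
        return false;
    }
}

// It would be passed to the constructor as:
// new crawler.Crawler({ canCrawl : sameHostDoFollow });
```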
Crawler Ninja can be configured to execute each HTTP request through a proxy. It uses the npm package simple-proxies.
You have to install it in your project with the command:
$ npm install simple-proxies --save
Here is a code sample that uses proxies from a file :
var proxyLoader = require("simple-proxies/lib/proxyfileloader");
var crawler = require("crawler-ninja");
var proxyFile = "proxies.txt";
// Load proxies
var config = proxyLoader.config()
.setProxyFile(proxyFile)
.setCheckProxies(false)
.setRemoveInvalidProxies(false);
proxyLoader.loadProxyFile(config, function(error, proxyList) {
if (error) {
console.log(error);
}
else {
crawl(proxyList);
}
});
function crawl(proxyList){
var c = new crawler.Crawler({
proxyList : proxyList
});
// Register desired plugins here
c.on("end", function() {
var end = new Date();
console.log("Well done Sir !, done in : " + (end - start));
});
var start = new Date();
c.queue({url : "http://www.site.com"});
}
The current crawl logger is based on Bunyan.
You can query the log file after the crawl (see the Bunyan doc for more information) in order to filter errors or other info.
You can also use the current logger module in your own Plugin.
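As a rough sketch of such a query: Bunyan writes one JSON record per line, with a numeric `level` field (50 for errors, 60 for fatal), so the errors can be filtered with a few lines of Node.js. The function name `errorRecords` and the sample records are made up for this example.

```javascript
// Bunyan writes one JSON record per line; the numeric "level" field is 50 for
// errors and 60 for fatal records.
function errorRecords(logText) {
    return logText
        .split("\n")
        .filter(function (line) { return line.trim() !== ""; })
        .map(function (line) { return JSON.parse(line); })
        .filter(function (record) { return record.level >= 50; });
}

// Sample data standing in for a real log file (the values are made up).
var sample = [
    '{"name":"crawler","level":30,"url":"http://www.mysite.com/","msg":"crawled"}',
    '{"name":"crawler","level":50,"url":"http://www.mysite.com/missing","msg":"404"}'
].join("\n");

console.log(errorRecords(sample).map(function (r) { return r.url; }));
```

With a real crawl, the text could come from `require("fs").readFileSync("crawler.log", "utf8")`; the exact path is an assumption and depends on where the crawler writes its log files.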
Use default loggers
You have to install the logger module into your own project :
npm install crawler-ninja-logger --save
Then, in your own Plugin code :
var log = require("crawler-ninja-logger").Logger;
log.info("log info"); // Log into crawler.log
log.debug("log debug"); // Log into crawler.log
log.error("log error"); // Log into crawler.log & errors.log
log.info({statusCode : 200, url: "http://www.google.com" }) // log a json
The crawler logs with the following structure
log.info({"url" : "url", "step" : "step", "message" : "message", "options" : "options"});
Create a new logger for your plugin
// Log into crawler.log
var log = require("crawler-ninja-logger");
var myLog = log.createLogger("myLoggerName", {path : "./log-file-name.log"});
myLog.info({url:"http://www.google.com", pageRank : 10});
Please feel free to read the code in log-plugin to get more info on how to log from your own plugin.
More features and flexibility will be added in the upcoming releases.
Not all sites can support an intensive crawl. This crawler provides two solutions to control the crawl rate:
Implicit setting
Without any change to the crawler config, the crawler decreases the crawl rate after 5 timeout errors on a host. It then forces a delay of 200ms between requests. If 5 more timeout errors occur, it uses a delay of 350ms, and after that a delay of 500ms between all requests for this host. If the timeouts persist, the crawler cancels the crawl on that host.
You can change the default values for this implicit setting (5 timeout errors & rates = 200, 350, 500ms). Here is an example :
var crawler = require("crawler-ninja");
var c = new crawler.Crawler({
// new values for the implicit setting
maxErrors : 5,
errorRates : [300, 600, 900]
});
// Register your plugins here
c.on("end", function() {
var end = new Date();
console.log("End of crawl !, done in : " + (end - start));
});
var start = new Date();
c.queue({url : "http://www.mysite.com/"});
Note that a higher value for maxErrors can decrease the number of analyzed pages. You can set maxErrors to -1 in order to deactivate the implicit setting.
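The back-off schedule of the implicit setting can be modelled with the sketch below. It only illustrates the behavior described in this section and is not the crawler's own code. The function name `implicitRate` is hypothetical, and the point at which the crawl is cancelled (one more batch of maxErrors timeouts at the slowest rate) is an assumption, since the description only says that the crawl is cancelled if the timeouts persist.

```javascript
// Illustrative model of the implicit setting, not the crawler's own code.
// Assumption: the crawl on a host is cancelled once maxErrors more timeouts
// occur at the slowest rate.
function implicitRate(timeouts, maxErrors, errorRates) {
    if (maxErrors === -1) {
        return { delay : 0, cancelled : false }; // implicit setting deactivated
    }
    var level = Math.floor(timeouts / maxErrors);
    if (level === 0) {
        return { delay : 0, cancelled : false }; // no forced delay yet
    }
    if (level > errorRates.length) {
        return { delay : null, cancelled : true }; // all rates exhausted
    }
    return { delay : errorRates[level - 1], cancelled : false };
}

// Print the forced delay (in ms) for a growing number of timeouts on one host.
var timeoutCounts = [0, 5, 10, 15, 20];
timeoutCounts.forEach(function (n) {
    console.log(n, implicitRate(n, 5, [200, 350, 500]));
});
```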
Explicit setting
In this configuration, you apply the same crawl rate to all requests on all hosts (even for successful requests).
var crawler = require("crawler-ninja");
var logger = require("crawler-ninja/plugins/log-plugin");
var c = new crawler.Crawler({
rateLimits : 200 //200ms between each request
});
var log = new logger.Plugin(c);
c.on("end", function() {
var end = new Date();
console.log("End of crawl !, done in : " + (end - start));
});
var start = new Date();
c.queue({url : "http://www.mysite.com/"});
If both settings are applied to one crawl, the crawler forces the implicit setting once maxErrors timeout errors have occurred.