crawler-ninja
A plugin-based web crawler made for SEO. Still in beta: please wait or contribute.
This crawler aims to help you build custom solutions for crawling and scraping sites. For example, it can help to audit a site, find expired domains, build a corpus, scrape text, find netlinking spots, retrieve site rankings, or check whether web pages are correctly indexed.
It is all a matter of plugins! :-) We plan to build simple, generic plugins, but you are free to create your own.
The best environment to run Crawler Ninja is a Linux server.
Help and forks are welcome! Otherwise, please wait: this is a work in progress.
$ npm install crawler-ninja --save
On macOS, if you run into an issue such as "Agreeing to the Xcode/iOS license requires admin privileges, please re-run as root via sudo", run the following command in the terminal :
$ sudo xcodebuild -license
Then accept the license and rerun : $ npm install crawler-ninja --save (sudo is not required, in theory)
var crawler = require("crawler-ninja");
var cs = require("crawler-ninja/plugins/console-plugin");
var options = {
scripts : false,
links : false,
images : false
}
var consolePlugin = new cs.Plugin();
crawler.init(options, function(){console.log("End of the crawl")});
crawler.registerPlugin(consolePlugin);
crawler.queue({url : "http://www.mysite.com"});
This script logs all crawled pages to the console, thanks to the console-plugin component. You can register as many plugins as you want for each crawl by using the function registerPlugin.
The crawler calls plugin functions depending on the kind of object being crawled (HTML pages, CSS, scripts, links, redirections, ...). When the crawl ends, the end callback is called (the second argument of the function init).
You can also reduce the scope of the crawl by using the different crawl options (see the option references section below).
The following code shows the functions that you can implement to create a new plugin.
It is not mandatory to implement all plugin functions.
function Plugin() {
}
/**
* Function triggers when an Http error occurs for request made by the crawler
*
* @param the http error
* @param the http resource object (contains the uri of the resource)
* @param callback(error)
*/
Plugin.prototype.error = function (error, result, callback) {
}
/**
* Function triggers when an html resource is crawled
*
* @param result : the result of the resource crawl
* @param the jQuery-like object for accessing the HTML tags. Null if the resource
* is not HTML
* @param callback(error)
*/
Plugin.prototype.crawl = function(result, $, callback) {
}
/**
* Function triggers when the crawler found a link on a page
*
* @param the page url that contains the link
* @param the link found in the page
* @param the link anchor text
* @param true if the link is dofollow
* @param callback(error)
*/
Plugin.prototype.crawlLink = function(page, link, anchor, isDoFollow, callback) {
}
/**
* Function triggers when the crawler found an image on a page
*
* @param the page url that contains the image
* @param the image link found in the page
* @param the image alt
* @param callback(error)
*
*/
Plugin.prototype.crawlImage = function(page, link, alt, callback) {
}
/**
* Function triggers when the crawler found an HTTP redirect
* @param the from url
* @param the to url
* @param the redirect code (301, 302, ...)
* @param callback(error)
*
*/
Plugin.prototype.crawlRedirect = function(from, to, statusCode, callback) {
}
/**
* Function triggers when a link is not crawled (depending on the crawler setting)
*
* @param the page url that contains the link
* @param the link found in the page
* @param the link anchor text
* @param true if the link is dofollow
* @param callback(error)
*
*/
Plugin.prototype.unCrawl = function(page, link, anchor, isDoFollow, endCallback) {
}
module.exports.Plugin = Plugin;
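As an illustration, here is a minimal sketch of a plugin that counts the links found during a crawl. The name `LinkCounterPlugin` and its `counts` field are hypothetical. The sketch also assumes that each plugin function has to invoke its callback so that the crawl can continue, as the `callback(error)` parameters above suggest.

```javascript
// Hypothetical plugin: counts dofollow and nofollow links.
function LinkCounterPlugin() {
  this.counts = { dofollow: 0, nofollow: 0 };
}

// Called for each link found on a page (same signature as crawlLink above).
LinkCounterPlugin.prototype.crawlLink = function (page, link, anchor, isDoFollow, callback) {
  if (isDoFollow) {
    this.counts.dofollow++;
  } else {
    this.counts.nofollow++;
  }
  // Assumption: calling back without an error lets the crawl continue.
  callback();
};

var linkCounter = new LinkCounterPlugin();
```

Such an instance would be registered like the console plugin, with `crawler.registerPlugin(linkCounter)`, and `linkCounter.counts` could be read in the end callback.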
You can change or override the default crawl options by using the init method.
crawler.init({ scripts : false, links : false,images : false, ... }, function(){console.log("End of the crawl")});
You can also use mikeal's request options; they are passed directly to the request() method.
You can pass these options to the init() function if you want them to be global or as items in the queue() calls if you want them to be specific to that item (overwriting global options).
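The precedence described above can be pictured with a small sketch. It is an illustration only, not the crawler's actual merge code: `resolveOptions` is a hypothetical helper, and the option values are made up.

```javascript
// Illustration only: per-item options take precedence over global ones.
function resolveOptions(globalOptions, itemOptions) {
  // Later sources win in Object.assign, so item values override global values.
  return Object.assign({}, globalOptions, itemOptions);
}

// Global options, as they would be passed to init().
var globalOptions = { scripts: false, images: false, maxConnections: 10 };
// Item-specific options, as they would be passed to queue().
var item = { url: "http://www.mysite.com", images: true };

var effective = resolveOptions(globalOptions, item);
```

Here `effective.images` is `true` for this item only, while `globalOptions` is left unchanged for every other queued item.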
If the predefined options are not sufficient, you can customize which kinds of links to crawl by implementing a callback function in the crawler config object. This is a convenient way to limit the crawl scope according to your needs. The following options crawl only dofollow links.
var options = {
// add here predefined options you want to override
/**
* this callback is called for each link found in an html page
* @param : the uri of the page that contains the link
* @param : the uri of the link to check
* @param : the anchor text of the link
* @param : true if the link is dofollow
* @return : true if the crawler can crawl the link on this html page
*/
canCrawl : function(htmlPage, link, anchor, isDoFollow) {
return isDoFollow;
}
};
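A slightly stricter variant could also keep the crawl on a single host. The sketch below is hypothetical: `sameHostDoFollow` is not part of the crawler, and it assumes that the page URI is passed as an absolute URL string. It relies on the standard `URL` class available in Node.js.

```javascript
// Hypothetical canCrawl callback: follow only dofollow links that stay
// on the same host as the page that contains them.
function sameHostDoFollow(htmlPage, link, anchor, isDoFollow) {
  if (!isDoFollow) {
    return false;
  }
  try {
    // Resolve relative links against the page URL before comparing hosts.
    return new URL(link, htmlPage).host === new URL(htmlPage).host;
  } catch (e) {
    // URLs that cannot be parsed are skipped.
    return false;
  }
}

var options = {
  canCrawl : sameHostDoFollow
};
```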
Crawler Ninja can be configured to execute each HTTP request through proxies. It uses the module simple-proxies.
You have to install it in your project with the command :
$ npm install simple-proxies --save
Here is a code sample that uses proxies from a file :
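var proxyLoader = require("simple-proxies/lib/proxyfileloader");
var crawler = require("crawler-ninja");
// Console plugin used in the crawl function below
var cs = require("crawler-ninja/plugins/console-plugin");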
var proxyFile = "proxies.txt";
// Load proxies
var config = proxyLoader.config()
.setProxyFile(proxyFile)
.setCheckProxies(false)
.setRemoveInvalidProxies(false);
proxyLoader.loadProxyFile(config, function(error, proxyList) {
if (error) {
console.log(error);
}
else {
crawl(proxyList);
}
});
function crawl(proxyList){
var options = {
skipDuplicates: true,
externalDomains: false,
scripts : false,
links : false,
images : false,
maxConnections : 10
}
var consolePlugin = new cs.Plugin();
crawler.init(options, done, proxyList);
crawler.registerPlugin(consolePlugin);
crawler.queue({url : "http://www.mysite.com"});
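}
// End callback, called when the crawl is finished
function done() {
console.log("End of the crawl");
}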
The current crawl logger is based on Bunyan. It logs all crawl actions and errors in the file "./logs/crawler.log". You can query the log file after the crawl in order to filter errors or other info (see the Bunyan documentation for more information).
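As a rough sketch of such a query: Bunyan writes one JSON record per line, with a numeric `level` field (50 for errors). The function name `filterErrors` and the sample records below are made up for the illustration; only the `level` and `msg` fields are standard Bunyan fields.

```javascript
// Bunyan's numeric level for "error" records.
var BUNYAN_ERROR = 50;

// Keep only the records logged at error level or above.
function filterErrors(logText) {
  return logText
    .split("\n")
    .filter(function (line) { return line.trim() !== ""; })
    .map(function (line) { return JSON.parse(line); })
    .filter(function (record) { return record.level >= BUNYAN_ERROR; });
}

// Made-up sample content, standing in for ./logs/crawler.log.
var sample = [
  '{"level":30,"msg":"crawl","url":"http://www.mysite.com"}',
  '{"level":50,"msg":"timeout","url":"http://www.mysite.com/slow"}'
].join("\n");

var errors = filterErrors(sample);
console.log(errors.length); // 1
```

With a real log file, the text could be read with `fs.readFileSync("./logs/crawler.log", "utf8")` before being passed to the function.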
By default, the logger uses the level INFO. You can change this level within the init function :
crawler.init(options, done, proxyList, "debug");
The previous code initializes the crawler with the debug level. If you don't use proxies, set the proxyList argument to null.
To log from your own plugin, install the logger module into your plugin project :
$ npm install crawler-ninja-logger --save
Then, in your own Plugin code :
var log = require("crawler-ninja-logger").Logger;
log.info("log info"); // Log into crawler.log
log.debug("log debug"); // Log into crawler.log
log.error("log error"); // Log into crawler.log & errors.log
log.info({statusCode : 200, url: "http://www.google.com" }); // Log a JSON object
The crawler logs with the following structure :
log.info({"url" : "url", "step" : "step", "message" : "message", "options" : "options"});
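To keep plugin logs consistent with this structure, a small helper can build the record. This is only a sketch: `buildLogEntry` is a hypothetical function, and defaulting `options` to an empty object is an assumption.

```javascript
// Hypothetical helper that builds a record with the four fields shown above.
function buildLogEntry(url, step, message, options) {
  return { url: url, step: step, message: message, options: options || {} };
}

var entry = buildLogEntry("http://www.mysite.com", "my-plugin", "page analysed", { depth: 2 });
// In a plugin, the record would then be passed to log.info(entry).
```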
Depending on your needs, you can create additional log files.
// Log into a separate file (my-log-file-name.log)
var log = require("crawler-ninja-logger");
var myLog = log.createLogger("myLoggerName", {path : "./my-log-file-name.log"});
myLog.info({url:"http://www.google.com", pageRank : 10});
Please feel free to read the code in log-plugin to get more info on how to log from your own plugin.
Not all sites can support an intensive crawl. You can specify the crawl rate in the crawler config. The crawler applies the same crawl rate to all requests on all hosts (even for successful requests).
var options = {
rateLimits : 200 //200ms between each request
};
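To get a feel for the effect of this setting, the sketch below computes the earliest start time of each request when a fixed delay separates consecutive requests. It is an illustration only: `scheduleStartTimes` is a hypothetical function, and it assumes strictly sequential spacing, which may differ from the crawler's internal scheduling.

```javascript
// Illustration only: earliest start time (in ms) of each request when a
// fixed delay separates consecutive requests.
function scheduleStartTimes(requestCount, rateLimitMs, startMs) {
  var times = [];
  for (var i = 0; i < requestCount; i++) {
    times.push(startMs + i * rateLimitMs);
  }
  return times;
}

var times = scheduleStartTimes(4, 200, 0);
console.log(times); // [ 0, 200, 400, 600 ]
```

Under the same assumption, a crawl of 100 pages with a 200 ms rate limit cannot start its last request before 19.8 seconds have passed.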
We will certainly create external modules for the upcoming releases.