crawler-ninja
A plugin-based web crawler made for SEO. Please wait or contribute ... still in beta.
This crawler aims to help SEOs build custom solutions for crawling and scraping sites. For example, it can help to audit a site, find expired domains, build a corpus, scrape text, find netlinking spots, retrieve site rankings, check whether web pages are correctly indexed, ...
This is just a matter of plugins! :-) We plan to build generic and simple plugins, but you are free to create your own.
The best environment to run Crawler Ninja is a Linux server.
Help & forks welcome! Or please wait ... work in progress!
    $ npm install crawler-ninja --save

### How to use an existing plugin
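    var crawler = require("crawler-ninja");
    var logger = require("crawler-ninja/plugins/log-plugin");

    var c = new crawler.Crawler();
    // Attach the log plugin to the crawler
    var log = new logger.Plugin(c);

    // Record the start time before registering the handler that reads it
    var start = new Date();

    c.on("end", function() {
        var end = new Date();
        // Subtracting two dates gives the elapsed time in milliseconds
        console.log("End of crawl !, done in : " + (end - start));
    });

    c.queue({url : "http://www.mysite.com/"});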
This script logs all crawled pages to the console by using the log-plugin component.
The Crawler component emits different kinds of events that plugins can use (see below). When the crawl ends, the 'end' event is emitted.
### Create a new plugin
The following script shows the event callbacks that you can implement to create a new plugin.
It is not mandatory to implement all crawler events. You can also reduce the scope of the crawl by using the different crawl options (see the section "Option references" below).
    // useful lib for managing URIs
    var URI = require('crawler/lib/uri');

    function Plugin(crawler) {

        this.crawler = crawler;

        /**
         * Emitted when the crawler encounters an error
         *
         * @param the usual error object
         * @param the result of the request (contains uri, headers, ...)
         */
        this.crawler.on("error", function(error, result) {

        });

        /**
         * Emitted when the crawler crawls a resource (html, js, css, pdf, ...)
         *
         * @param result : the result of the crawled resource
         * @param the jQuery-like object for accessing the HTML tags. It is null if the resource is not HTML.
         *        See the project cheerio : https://github.com/cheeriojs/cheerio
         */
        this.crawler.on("crawl", function(result, $) {

        });

        /**
         * Emitted when the crawler finds a link in a page
         *
         * @param the page that contains the link
         * @param the link uri
         * @param the anchor text
         * @param true if the link is dofollow
         */
        this.crawler.on("crawlLink", function(page, link, anchor, isDoFollow) {

        });

        /**
         * Emitted when the crawler finds an image
         *
         * @param the page that contains the image
         * @param the image uri
         * @param the alt text
         */
        this.crawler.on("crawlImage", function(page, link, alt) {

        });

        /**
         * Emitted when the crawler finds a redirect (3xx)
         *
         * @param the from url
         * @param the to url
         * @param statusCode : the exact status code : 301, 302, ...
         */
        this.crawler.on("crawlRedirect", function(from, to, statusCode) {

        });
    }

    module.exports.Plugin = Plugin;
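As an illustration of the skeleton above, here is a minimal sketch of a plugin that counts what the crawler reports. It only relies on the event names and callback arguments listed above. `LinkStatsPlugin` is a hypothetical name, and a plain Node.js `EventEmitter` stands in for the crawler (an assumption made so the sketch runs without crawling anything); with the real crawler you would pass the `Crawler` instance instead.

```javascript
// Sketch only: an EventEmitter stands in for the crawler (assumption),
// so the plugin can be exercised without any network access.
var EventEmitter = require("events").EventEmitter;

// Hypothetical plugin: counts pages, links, redirects and errors.
function LinkStatsPlugin(crawler) {
    var self = this;
    this.crawler = crawler;
    this.stats = { pages: 0, doFollow: 0, noFollow: 0, redirects: 0, errors: 0 };

    this.crawler.on("crawl", function(result, $) {
        self.stats.pages++;
    });

    this.crawler.on("crawlLink", function(page, link, anchor, isDoFollow) {
        if (isDoFollow) {
            self.stats.doFollow++;
        } else {
            self.stats.noFollow++;
        }
    });

    this.crawler.on("crawlRedirect", function(from, to, statusCode) {
        self.stats.redirects++;
    });

    this.crawler.on("error", function(error, result) {
        self.stats.errors++;
    });
}

// Simulate a few crawler events with the stand-in emitter
var fakeCrawler = new EventEmitter();
var linkStats = new LinkStatsPlugin(fakeCrawler);

fakeCrawler.emit("crawl", { uri: "http://www.mysite.com/" }, null);
fakeCrawler.emit("crawlLink", "http://www.mysite.com/", "http://www.mysite.com/a", "A", true);
fakeCrawler.emit("crawlLink", "http://www.mysite.com/", "http://other.example/", "B", false);
fakeCrawler.emit("crawlRedirect", "http://www.mysite.com/old", "http://www.mysite.com/new", 301);

console.log(linkStats.stats);
```

Keeping the counters on the plugin instance means the figures can be read in an "end" handler once the crawl is finished.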
### Option references
You can pass options to the Crawler() constructor like this:

    var c = new crawler.Crawler({
        externalLinks : true,
        scripts : false,
        images : false
    });

You can pass these options to the Crawler() constructor if you want them to be global, or as items in the queue() calls if you want them to be specific to that item (overriding the global options).

Pool options:

* `priorityRange`: Number, range of acceptable priorities starting from 0 (Default 10)
* `priority`: Number, priority of this request (Default 5)

Server-side DOM options:

* `jQuery`: true, false or ConfObject (Default true); see Working with Cheerio or JSDOM

Charset encoding:

* `forceUTF8`: Boolean, if true will try to detect the page charset and convert it to UTF8 if necessary. Never worry about encoding anymore! (Default false)
* `incomingEncoding`: String, with `forceUTF8: true` to set the encoding manually (Default null), for example `incomingEncoding : 'windows-1255'`

Cache:

* `cache`: Boolean, if true stores requests in memory (Default false)

Other:

* `userAgent`: String, defaults to "node-crawler/[version]"
* `referer`: String, if truthy sets the HTTP referer header
* `rateLimits`: Number of milliseconds to delay between each request (Default 0). Note that this option will force the crawler to use only one connection (for now).

This options list is a strict superset of mikeal's request options and will be directly passed to the request() method.

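The precedence described above (per-item options override global options, which override the defaults) can be pictured with a short sketch. `resolveOptions` is a hypothetical helper written for this illustration; it is not part of crawler-ninja and does not claim to mirror how the crawler merges options internally. The default values are the ones listed above.

```javascript
// Hypothetical helper, for illustration only: later sources win,
// so item options override global options, which override defaults.
function resolveOptions(defaults, globalOptions, itemOptions) {
    return Object.assign({}, defaults, globalOptions, itemOptions);
}

// Defaults taken from the option list above
var defaults = {
    priorityRange: 10,
    priority: 5,
    jQuery: true,
    forceUTF8: false,
    cache: false,
    rateLimits: 0
};

// What would be passed to the Crawler() constructor
var globalOptions = { forceUTF8: true, rateLimits: 500 };

// What would be passed to a queue() call
var item = { url: "http://www.mysite.com/", priority: 1, rateLimits: 0 };

var effective = resolveOptions(defaults, globalOptions, item);
console.log(effective);
```

Here the item keeps the global `forceUTF8` setting but replaces `priority` and `rateLimits` with its own values.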
If the predefined options are not sufficient, you can customize which kinds of links to crawl by implementing a callback function in the crawler config object. This is a convenient way to limit the crawl scope to your needs. The following example crawls only dofollow links.

    var c = new crawler.Crawler({
        // add here the predefined options you want to override

        /**
         * This callback is called for each link found in an HTML page
         * @param : the uri of the page that contains the link
         * @param : the uri of the link to check
         * @param : the anchor text of the link
         * @param : true if the link is dofollow
         * @return : true if the crawler can crawl the link on this HTML page
         */
        canCrawl : function(htmlPage, link, anchor, isDoFollow) {
            return isDoFollow;
        }
    });

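As a further illustration of the same callback, the sketch below only accepts dofollow links that stay on the host of the page they were found on. It assumes that the page and link arguments are absolute URL strings (the documentation above only says they are URIs); `sameHostDoFollow` is a hypothetical name.

```javascript
var URL = require("url").URL;

// Hypothetical canCrawl callback: crawl only dofollow links on the same host.
// Assumption: htmlPage and link are absolute URL strings.
function sameHostDoFollow(htmlPage, link, anchor, isDoFollow) {
    if (!isDoFollow) {
        return false;
    }
    try {
        return new URL(link).hostname === new URL(htmlPage).hostname;
    } catch (e) {
        // Unparsable input (for example a relative URI) is skipped
        return false;
    }
}

console.log(sameHostDoFollow("http://www.mysite.com/a", "http://www.mysite.com/b", "internal", true));
console.log(sameHostDoFollow("http://www.mysite.com/a", "http://www.other.com/", "external", true));

// With the real crawler, the function would be passed as the canCrawl option:
// var c = new crawler.Crawler({ canCrawl : sameHostDoFollow });
```

Because the callback is a plain function of its arguments, it can be checked on sample URLs before starting a crawl.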
Crawler Ninja can be configured to execute each HTTP request through a proxy. It uses the npm package simple-proxies.
You have to install it in your project with the command:

    $ npm install simple-proxies --save

Here is a code sample that uses proxies from a file:

    var proxyLoader = require("simple-proxies/lib/proxyfileloader");
    var crawler = require("crawler-ninja");
    var logger = require("crawler-ninja/plugins/log-plugin");

    var proxyFile = "proxies.txt";

    // Load proxies
    var config = proxyLoader.config()
        .setProxyFile(proxyFile)
        .setCheckProxies(false)
        .setRemoveInvalidProxies(false);

    proxyLoader.loadProxyFile(config, function(error, proxyList) {
        if (error) {
            console.log(error);
        }
        else {
            crawl(proxyList);
        }
    });

    function crawl(proxyList) {
        var c = new crawler.Crawler({
            externalLinks : true,
            images : false,
            scripts : false,
            links : false, // link tags used for css, canonical, ...
            followRedirect : true,
            proxyList : proxyList
        });

        var log = new logger.Plugin(c);

        c.on("end", function() {
            var end = new Date();
            console.log("Well done Sir !, done in : " + (end - start));
        });

        var start = new Date();
        c.queue({url : "http://www.site.com"});
    }

The current crawl logger is based on winston.
You can use the current logger or create a new one in your own plugin.

Use the default loggers:

    var log = require("../lib/logger.js").Logger;

    log.info("log info");   // Log into crawler.log
    log.debug("log debug"); // Log into crawler.log
    log.error("log error"); // Log into crawler.log & errors.log

Create a new logger for your plugin:

    // Log into ./logs/myplugin.log
    var log = require("../lib/logger.js");

    var myLog = log.createLogger("myLoggerName", "./logs/myplugin.log", true); // true = json output

More features and flexibility will be added in upcoming releases.
0.1.0
0.1.1
0.1.2
0.1.3
0.1.4
0.1.5
0.1.6