simplecrawler
Comparing version 1.0.1 to 1.0.2
```diff
 {
   "name": "simplecrawler",
   "description": "Very straightforward, event driven web crawler. Features a flexible queue interface and a basic cache mechanism with extensible backend.",
-  "version": "1.0.1",
+  "version": "1.0.2",
   "homepage": "https://github.com/cgiffard/node-simplecrawler",
@@ -6,0 +6,0 @@
   "author": "Christopher Giffard <christopher.giffard@cgiffard.com>",
```

@@ -706,31 +706,77 @@ # Simple web crawler for node.js
A: Logging in to a site is usually fairly simple and only requires an
exchange of credentials over HTTP as well as the storing of a cookie so that
the client's session can be maintained between requests to the server.
Simplecrawler doesn't have a built-in method for this entire procedure, but
it does have an internal cookie jar that can be used to store the cookie
that's returned from a manual HTTP request.
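The cookie jar accepts raw `Set-Cookie` headers via `crawler.cookies.addFromHeaders()`, as the example further down shows. As a minimal, dependency-free sketch of what that involves, here's how a session cookie's name and value can be pulled out of a `Set-Cookie` header (the header string below is a hypothetical example, not one produced by simplecrawler):

```javascript
// A hypothetical Set-Cookie header, as it would appear in
// response.headers["set-cookie"]
var setCookie = "PHPSESSID=a1b2c3d4; Path=/; HttpOnly";

// The name=value pair is the first semicolon-separated part;
// everything after it (Path, HttpOnly, ...) is cookie attributes
var pair = setCookie.split(";")[0].split("=");
var cookieName = pair[0].trim();
// Rejoin in case the value itself contains "="
var cookieValue = pair.slice(1).join("=");

console.log(cookieName + " = " + cookieValue); // PHPSESSID = a1b2c3d4
```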
A: Logging in to a site is usually fairly simple and most login procedures
look alike. We've included an example that covers a lot of situations, but
sadly, there isn't one true solution for how to deal with logins, so there's
no guarantee that this code works right out of the box.
Here's an example of how to perform a manual login HTTP request with the
[request](https://npmjs.com/package/request) module and then store the
returned cookie in simplecrawler's cookie jar. What we do here is:

1. fetch the login page,
2. store the session cookie assigned to us by the server,
3. extract any CSRF tokens or similar parameters required when logging in,
4. submit the login credentials.
The example as it appeared in 1.0.1:

```js
var Crawler = require("simplecrawler"),
    request = require("request");

var crawler = new Crawler("https://example.com/");

request.post("https://example.com/login", {
    form: {
        username: "iamauser",
        password: "supersecurepw"
    }
}, function (error, response, body) {
    crawler.cookies.addFromHeaders(response.headers["set-cookie"]);
    crawler.start();
});

crawler.on("fetchcomplete", function (queueItem, responseBuffer, response) {
    console.log("Fetched", queueItem.url);
});
```

The expanded example in 1.0.2:

```js
var Crawler = require("simplecrawler"),
    url = require("url"),
    cheerio = require("cheerio"),
    request = require("request");

var initialURL = "https://example.com/";
var crawler = new Crawler(initialURL);

request("https://example.com/login", {
    // The jar option isn't necessary for simplecrawler integration, but it's
    // the easiest way to have request remember the session cookie between this
    // request and the next
    jar: true
}, function (error, response, body) {
    // Start by saving the cookies. We'll likely be assigned a session cookie
    // straight off the bat, and then the server will remember the fact that
    // this session is logged in as user "iamauser" after we've successfully
    // logged in
    crawler.cookies.addFromHeaders(response.headers["set-cookie"]);

    // We want to get the names and values of all relevant inputs on the page,
    // so that any CSRF tokens or similar things are included in the POST
    // request
    var $ = cheerio.load(body),
        formDefaults = {},
        // You should adapt these selectors so that they target the
        // appropriate form and inputs
        formAction = $("#login").attr("action"),
        loginInputs = $("input");

    // We loop over the input elements and extract their names and values so
    // that we can include them in the login POST request
    loginInputs.each(function(i, input) {
        var inputName = $(input).attr("name"),
            inputValue = $(input).val();

        formDefaults[inputName] = inputValue;
    });

    // Time for the login request!
    request.post(url.resolve(initialURL, formAction), {
        // We can't be sure that all of the input fields have a correct default
        // value. Maybe the user has to tick a checkbox or something similar in
        // order to log in. This is something you have to find out manually by
        // logging in to the site in your browser and inspecting in the
        // network panel of your favorite dev tools what parameters are
        // included in the request.
        form: Object.assign(formDefaults, {
            username: "iamauser",
            password: "supersecretpw"
        }),
        // We want to include the saved cookies from the last request in this
        // one as well
        jar: true
    }, function (error, response, body) {
        // That should do it! We're now ready to start the crawler
        crawler.start();
    });
});

crawler.on("fetchcomplete", function (queueItem, responseBuffer, response) {
    console.log("Fetched", queueItem.url, responseBuffer.toString());
});
```