
simplecrawler

Package Overview

Dependencies: 4
Maintainers: 2
Versions: 70
Comparing version 1.0.1 to 1.0.2


package.json

```diff
 {
   "name": "simplecrawler",
   "description": "Very straightforward, event driven web crawler. Features a flexible queue interface and a basic cache mechanism with extensible backend.",
-  "version": "1.0.1",
+  "version": "1.0.2",
   "homepage": "https://github.com/cgiffard/node-simplecrawler",
@@ -6,0 +6,0 @@ "author": "Christopher Giffard <christopher.giffard@cgiffard.com>",
```

````diff
@@ -706,31 +706,77 @@ # Simple web crawler for node.js
 
-A: Logging in to a site is usually fairly simple and only requires an
-exchange of credentials over HTTP as well as the storing of a cookie that
-allows the client's session to be maintained between requests to the
-server. Simplecrawler doesn't have a built-in method for this entire
-procedure, but it does have an internal cookie jar that can be used to
-store the cookie that's returned from a manual HTTP request.
+A: Logging in to a site is usually fairly simple and most login procedures
+look alike. We've included an example that covers a lot of situations, but
+sadly, there isn't one true solution for how to deal with logins, so
+there's no guarantee that this code works right off the bat.
 
 Here's an example of how to perform a manual login HTTP request with the
 [request](https://npmjs.com/package/request) module and then store the
 returned cookie in simplecrawler's cookie jar.
 
+What we do here is:
+1. fetch the login page,
+2. store the session cookie assigned to us by the server,
+3. extract any CSRF tokens or similar parameters required when logging in,
+4. submit the login credentials.
+
 ```js
 var Crawler = require("simplecrawler"),
+    url = require("url"),
+    cheerio = require("cheerio"),
     request = require("request");
 
-var crawler = new Crawler("https://example.com/");
+var initialURL = "https://example.com/";
 
-request.post("https://example.com/login", {
-    form: {
-        username: "iamauser",
-        password: "supersecurepw"
-    }
+var crawler = new Crawler(initialURL);
+
+request("https://example.com/login", {
+    // The jar option isn't necessary for simplecrawler integration, but it's
+    // the easiest way to have request remember the session cookie between this
+    // request and the next
+    jar: true
 }, function (error, response, body) {
+    // Start by saving the cookies. We'll likely be assigned a session cookie
+    // straight off the bat, and then the server will remember the fact that
+    // this session is logged in as user "iamauser" after we've successfully
+    // logged in
     crawler.cookies.addFromHeaders(response.headers["set-cookie"]);
-    crawler.start();
+
+    // We want to get the names and values of all relevant inputs on the page,
+    // so that any CSRF tokens or similar things are included in the POST
+    // request
+    var $ = cheerio.load(body),
+        formDefaults = {},
+        // You should adapt these selectors so that they target the
+        // appropriate form and inputs
+        formAction = $("#login").attr("action"),
+        loginInputs = $("input");
+
+    // We loop over the input elements and extract their names and values so
+    // that we can include them in the login POST request
+    loginInputs.each(function(i, input) {
+        var inputName = $(input).attr("name"),
+            inputValue = $(input).val();
+
+        formDefaults[inputName] = inputValue;
+    });
+
+    // Time for the login request!
+    request.post(url.resolve(initialURL, formAction), {
+        // We can't be sure that all of the input fields have a correct default
+        // value. Maybe the user has to tick a checkbox or something similar in
+        // order to log in. This is something you have to find out manually by
+        // logging in to the site in your browser and inspecting in the network
+        // panel of your favorite dev tools what parameters are included in the
+        // request.
+        form: Object.assign(formDefaults, {
+            username: "iamauser",
+            password: "supersecretpw"
+        }),
+        // We want to include the saved cookies from the last request in this
+        // one as well
+        jar: true
+    }, function (error, response, body) {
+        // That should do it! We're now ready to start the crawler
+        crawler.start();
+    });
 });
 
 crawler.on("fetchcomplete", function (queueItem, responseBuffer, response) {
-    console.log("Fetched", queueItem.url);
+    console.log("Fetched", queueItem.url, responseBuffer.toString());
 });
 ```
@@ -737,0 +783,0 @@ ```
````
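
If the crawl starts but every fetched page turns out to be the login screen, the session cookie probably didn't stick. A quick way to check is to listen for the crawler's failure events before calling `crawler.start()`. The sketch below is an illustrative addition, not part of this package diff: it uses simplecrawler's standard `fetch404`, `fetcherror` and `complete` events, and reuses the same placeholder URL as the example above.

```js
var Crawler = require("simplecrawler");

// Stand-in for the crawler configured in the login example above
var crawler = new Crawler("https://example.com/");

// Fires when the server answers a queue item with a 404
crawler.on("fetch404", function (queueItem, response) {
    console.log("Got", response.statusCode, "for", queueItem.url);
});

// Fires for other failure statuses; 401/403 responses here are a strong
// hint that the session cookie wasn't accepted
crawler.on("fetcherror", function (queueItem, response) {
    console.log("Server responded with", response.statusCode, "for", queueItem.url);
});

// Fires once the queue has been exhausted
crawler.on("complete", function () {
    console.log("Crawl finished");
});

crawler.start();
```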

Sorry, the diff of this file is too big to display
