# Itscrap

Itscrap is a Node.js library that aims to help you scrape web listings very easily.

## Installation

Go to your project folder and run:

```sh
npm install --save itscrap
```

## Introduction

Itscrap is designed to scrape lists of web pages sequentially. It does this with a scraping loop, which works as follows (a code sketch of the loop follows the list):

  1. Get the next page to be scraped.
  2. Scrape it (extract the data).
  3. Process the extracted data.
  4. Execute an optional post-processing function.
  5. If the user didn't stop the scraper, go back to step 1.
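
To make the control flow concrete, here is a minimal sketch of what such a loop might look like. This is illustrative pseudocode, not the library's actual source; `fetchAsCheerio()` is a hypothetical helper standing in for the fetch-and-parse step, and `stopped` is an assumed internal flag set by `stop()`.

```js
// Illustrative sketch of the scraping loop described above -- not Itscrap's actual source.
async function runLoop(scraper) {
	scraper.params = scraper.getInitialParams();
	while (!scraper.stopped) {                      // stop() is assumed to set this flag
		const url = scraper.getNextUrl();           // 1. get the next page's URL
		const $ = await fetchAsCheerio(url);        // 2. fetch it, wrap it in cheerio (hypothetical helper)...
		const data = scraper.scrap($, url);         //    ...and extract the data
		scraper.process(data);                      // 3. process the extracted data
		if (scraper.afterEach) scraper.afterEach(); // 4. optional post-processing
	}                                               // 5. repeat until stop() is called
}
```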

## Usage

Let's scrape a listing of flats from Craigslist. The first thing we need to do is create our scraper class and make it inherit from Itscrap.

```js
const Itscrap = require('itscrap');

class CraigslistScraper extends Itscrap { }
```

Before instantiating and running this scraper, we need to define some logic in it.

When we run our scraper, the first thing it will do is call the getNextUrl() method, so let's define it.

```js
class CraigslistScraper extends Itscrap {
	getNextUrl() {
		return `https://newyork.craigslist.org/search/aap?s=${this.params.offset}`;
	}
}
```

This method must return the URL of the page to be scraped next. Here it is just the flats listing page URL from Craigslist, but note that we are reading an "offset" value from the this.params object; with an offset of 0, for example, the first request goes to https://newyork.craigslist.org/search/aap?s=0.

Itscrap allows you to set initial values for these params with the getInitialParams() method, so let's initialize our listing offset to 0.

```js
class CraigslistScraper extends Itscrap {
	getInitialParams() {
		return { offset: 0 };
	}

	getNextUrl() { ... }
}
```

After fetching the page returned by getNextUrl(), Itscrap converts it to a cheerio instance to facilitate DOM manipulation. It then calls the scrap() method with the cheerio instance and the URL of the page being scraped.
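
If you haven't used cheerio before, here is a minimal standalone example of the jQuery-like API that the `$` argument exposes (the HTML string here is made up purely for illustration):

```js
const cheerio = require('cheerio');

// Load an HTML string into a cheerio instance.
const $ = cheerio.load('<div class="result-row"><span class="result-title">Cozy studio</span></div>');

// Query it with CSS selectors and read content, just like jQuery.
console.log($('.result-title').text()); // => "Cozy studio"
```

With that in mind, let's define our scrap() method: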

```js
class CraigslistScraper extends Itscrap {
	getInitialParams() { ... }
	getNextUrl() { ... }

	scrap($, url) {
		console.log('Scraping', url);

		// Get all the items in the list.
		const listItems = $('.result-row');
		const flatsData = [];

		// If the list is empty, stop the scraper.
		if (listItems.length === 0) this.stop();

		listItems.each(function() {
			// Wrap the current element in a cheerio instance.
			const el = $(this);

			flatsData.push({
				title: el.find('.result-title').text(),
				price: el.find('.result-price').text()
			});
		});

		return flatsData;
	}
}
```

Now we have a scrap() method that extracts the title and price from each item in the listing. Also, once it has run out of results (i.e. it reached the end of the listing), it calls this.stop() to stop the scraper.

However, although this scraper extracts data from Craigslist, it does nothing with it yet. We now need to implement the process(data) method, which receives the return value of scrap() as its first argument.

```js
class CraigslistScraper extends Itscrap {
	getInitialParams() { ... }
	getNextUrl() { ... }
	scrap($, url) { ... }

	process(data) {
		data.forEach(flat => console.log(`${flat.price} - ${flat.title}`));
	}
}
```

Here we told the scraper to print the extracted data to the console after it finishes scraping the current page. That may not seem very useful, but you could instead store the data in a database, or extract the URL of each flat's detail page and push it into a queue to be scraped by another scraper.
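
For example, a process() implementation that persists each page's results to disk could look like the following sketch (the flats.jsonl file name and the JSON-lines format are arbitrary choices for illustration, not part of Itscrap):

```js
const fs = require('fs');

class CraigslistScraper extends Itscrap {
	// ...other methods as above...

	process(data) {
		// Append each flat as one JSON line, so partial results survive a crash mid-run.
		for (const flat of data) {
			fs.appendFileSync('flats.jsonl', JSON.stringify(flat) + '\n');
		}
	}
}
```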

Lastly, before running the scraper we need to update the this.params.offset value; otherwise it will scrape the same page indefinitely. We can do that in the afterEach() method, which is executed after the process() method for each scraped page.

```js
class CraigslistScraper extends Itscrap {
	getInitialParams() { ... }
	getNextUrl() { ... }
	scrap($, url) { ... }
	process(data) { ... }

	afterEach() {
		// Note: Craigslist's pagination size is 120 results per page.
		this.params.offset += 120;
	}
}
```

And that's it. All you have to do now is instantiate the scraper and run it.

```js
const scraper = new CraigslistScraper();
scraper.run();
```

## Debugging

If, for some reason, you need to see what's going on at a deeper level, you can increase the logging level with setLogLevel(). There are currently three levels (0, 1, and 2), with 0 being no logging and 2 the most verbose.

```js
const scraper = new CraigslistScraper();
scraper.setLogLevel(2);
scraper.run();
```
