scraping-ninja-toolkit

All the goodies you'll ever need to scrape the web

Version: 1.0.0-beta.2 (latest)

Source: npm

Version published: 7 years ago

Weekly downloads: 6

Maintainers: 1

Created: 7 years ago

Documentation

In-browser Playground

You can try the library on CodeSandbox. The playground uses a CORS proxy fetcher, so you can grab content from any website from inside your browser.

CodeSandbox: https://codesandbox.io/s/pkyv3n2xym
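
Browsers block most cross-origin requests, which is why the playground routes them through a proxy. As a rough illustration of the idea only, the sketch below rewrites a target URL so that the request goes to a proxy instead. The proxy address, the `url` query parameter, and the `viaProxy` helper are all assumptions made for this example; they are not the playground's actual configuration or part of the library.

```javascript
// Assumption: a proxy that accepts the target address in a "url" query
// parameter. The address below is a placeholder, not the proxy used by
// the playground.
const PROXY_BASE = 'https://cors-proxy.example/fetch';

// Hypothetical helper: build the proxied address for a target URL.
function viaProxy(targetUrl, proxyBase = PROXY_BASE) {
  const proxied = new URL(proxyBase);
  // searchParams.set percent-encodes the target so it survives as one value.
  proxied.searchParams.set('url', targetUrl);
  return proxied.toString();
}

console.log(viaProxy('http://quotes.toscrape.com'));
```

The browser then requests the proxied address, and the proxy fetches the real page server-side and returns it with permissive CORS headers.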

Installation

yarn add scraping-ninja-toolkit
# or
npm i scraping-ninja-toolkit

Features

• All-in-one package
• Node.js and browser compatibility
• Blazingly fast
• Extensible

Overview

The library is built around two main components:

• the fetcher lets you grab content from any URL,
• the scraper lets you extract data from web pages.

There are also some additional tools, such as an enhanced axios client.

Quick Example

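const { fetcher } = require('scraping-ninja-toolkit');

// await is not allowed at the top level of a CommonJS module,
// so the calls are wrapped in an async function
(async () => {
  // Fetch the given URL and return a page scraper
  const page = await fetcher.get('http://quotes.toscrape.com');

  // Scrape an object
  const quote = page.scrape('.quote', {
    author: '.author@text',
    text: '.text@text'
  });

  console.log(quote);
})();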

<!-- quote -->
{
  "author": "Albert Einstein",
  "text": "“The world as we have created it is a process of our thinking.”"
}
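
The selector strings in the example above appear to follow a `selector@attribute` convention, where the part after `@` names what to read from the matched element (`text` here, `href` in the later example). The sketch below shows one way such a string could be split. `parseSelector` is a hypothetical helper written for illustration and is not part of the library's API; defaulting to `text` when no `@` is present is also an assumption.

```javascript
// Hypothetical helper, not part of scraping-ninja-toolkit: splits a
// "selector@attribute" string into its two parts.
function parseSelector(expression) {
  // Split on the last "@" so the attribute name is always the final part.
  const at = expression.lastIndexOf('@');
  if (at === -1) {
    // Assumption: fall back to the element's text when no "@" is given.
    return { selector: expression.trim(), attribute: 'text' };
  }
  return {
    selector: expression.slice(0, at).trim(),
    attribute: expression.slice(at + 1).trim()
  };
}

console.log(parseSelector('.author@text'));
// → { selector: '.author', attribute: 'text' }
console.log(parseSelector('a[itemprop="mainEntityOfPage"]@href'));
// → { selector: 'a[itemprop="mainEntityOfPage"]', attribute: 'href' }
```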

Advanced real-world example

const { fetcher } = require('scraping-ninja-toolkit');
const fs = require('fs');

(async () => {
  // Get categories urls
  const categories = await fetcher
    .get('https://coursehunters.net')
    .links('.menu-aside__a');

  // For each category
  // => frontend
  // => backend ...
  const results = await fetcher.getAll(categories).map(
    async (fetchNode, index) => {
      // Get all courses from the category in a flat array
      // https://coursehunters.net/frontend?page=1 => 10 courses
      // https://coursehunters.net/frontend?page=2 => 10 courses
      // ....
      //
      // allCourses => [{
      //   title: 'Modern HTML & CSS From The Beginning',
      //   url: 'https://coursehunters.net/course/sovremennyy-html-i-css-s-samogo-nachala'
      // }, ... ]
      const allCourses = await fetchNode
        .paginate('.pagination__a[rel="next"]')
        .flatMap(p =>
          p.scrapeAll('article', {
            title: '.standard-course-block__original@text',
            url: 'a[itemprop="mainEntityOfPage"]@href'
          })
        );

      // For each course, scrape its chapters
      // with up to 50 requests running concurrently
      // and filter out "undefined" values (courses without chapters)
      const courses = await fetcher
        .getAll(allCourses.map(c => c.url))
        .map(
          async p => {
            console.log(`Scraping url: ${p.location}`);

            const chapters = p.scrapeAll('.lessons-list__li', {
              name: 'span[itemprop="name"]@text',
              url: 'link[itemprop="url"]@href'
            });
            if (chapters && chapters.length && chapters[0].url) {
              const course = allCourses.find(c => c.url === p.location);
              course.chapters = chapters;
              return course;
            }
          },
          { concurrency: 50 }
        )
        .filter(c => c);

      return {
        category: categories[index].split('/').pop(),
        courses: courses
      };
    },
    { resolvePromise: false, concurrency: 6 }
  );

  fs.writeFileSync('courses.json', JSON.stringify(results, null, 2), 'utf8');
})();
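
Two ideas carry most of the example above: following a "next" link until there is none, and mapping over many URLs with a cap on how many requests run at once. The sketch below shows both in plain JavaScript against an in-memory stand-in for a site. `collectPages`, `mapWithConcurrency`, and `fakeSite` are hypothetical names invented for this illustration; they do not use or reproduce the library's `paginate`, `getAll`, or `map` implementations.

```javascript
// Hypothetical helpers for illustration; they do not use or reproduce
// the scraping-ninja-toolkit API.

// Follow "next" links until a page has none, collecting every page.
// loadPage(url) must resolve to an object shaped like { next, ... }.
async function collectPages(firstUrl, loadPage) {
  const pages = [];
  const seen = new Set(); // guards against pagination loops
  let url = firstUrl;
  while (url && !seen.has(url)) {
    seen.add(url);
    const page = await loadPage(url);
    pages.push(page);
    url = page.next;
  }
  return pages;
}

// Map over items with at most `concurrency` calls in flight,
// keeping results in input order.
async function mapWithConcurrency(items, mapper, concurrency) {
  const results = new Array(items.length);
  let nextIndex = 0;
  // Each worker repeatedly claims the next unprocessed index.
  async function worker() {
    while (nextIndex < items.length) {
      const i = nextIndex++;
      results[i] = await mapper(items[i], i);
    }
  }
  const workerCount = Math.max(1, Math.min(concurrency, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// In-memory stand-in for a paginated site (made-up data, no network).
const fakeSite = {
  '/frontend?page=1': { items: ['a', 'b'], next: '/frontend?page=2' },
  '/frontend?page=2': { items: ['c'], next: null }
};

async function demo() {
  const pages = await collectPages('/frontend?page=1', async url => fakeSite[url]);
  const items = pages.flatMap(p => p.items);
  const upper = await mapWithConcurrency(items, async s => s.toUpperCase(), 2);
  return upper.filter(Boolean); // drop empty results, like .filter(c => c)
}

demo().then(result => console.log(result)); // → [ 'A', 'B', 'C' ]
```

A worker-pool loop like this keeps the number of in-flight requests bounded, which matters when a crawl fans out to hundreds of course pages.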

Credits

• FB55: his work is the core of this library.

• Matt Mueller and cheerio contributors: a good portion of the code and concepts are copied or derived from the cheerio and x-ray scraper libraries.

License

Package last updated on 23 Jan 2019
