Security News
tea.xyz Spam Plagues npm and RubyGems Package Registries
Tea.xyz, a crypto project aimed at rewarding open source contributions, is once again facing backlash due to an influx of spam packages flooding public package registries.
scraping-ninja-toolkit
All the goodies you'll ever need to scrape the web
You can try the library on CodeSandbox. The demo uses a CORS proxy fetcher so that you can grab content from any website inside your browser.
yarn add scraping-ninja-toolkit
# or
npm i scraping-ninja-toolkit
The library is built around two main components:
• fetcher: lets you grab content from any URL.
• scraper: lets you extract data from web pages.
There are also some additional tools, such as an enhanced axios client.
const { fetcher } = require('scraping-ninja-toolkit');
(async () => {
  // Fetch the given URL and return a page scraper
  const page = await fetcher.get('http://quotes.toscrape.com');
  // Scrape an object: each value is a "selector@attribute" expression
  const quote = page.scrape('.quote', {
    author: '.author@text',
    text: '.text@text'
  });
  console.log(quote);
})();
// quote
{
  "author": "Albert Einstein",
  "text": "“The world as we have created it is a process of our thinking.”"
}
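The schema values in the scrape call above follow a "selector@attribute" convention: a CSS selector, then `@`, then what to read from the matched element. As a minimal sketch of how such an expression could be split, the helper below is purely illustrative. The name `parseSelector` and the fallback to `text` when no `@` is present are assumptions, not part of the scraping-ninja-toolkit API.

```javascript
// Hypothetical helper for illustration only; not taken from the library.
// Splits a "selector@attribute" expression at the last "@".
function parseSelector(expression) {
  const at = expression.lastIndexOf('@');
  if (at === -1) {
    // Assumption: with no "@", read the element's text content.
    return { selector: expression, attribute: 'text' };
  }
  return {
    selector: expression.slice(0, at),
    attribute: expression.slice(at + 1)
  };
}

console.log(parseSelector('.author@text'));
console.log(parseSelector('a[itemprop="mainEntityOfPage"]@href'));
```

Splitting at the last `@` keeps attribute selectors such as `a[itemprop="mainEntityOfPage"]` intact in the selector part.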
const { fetcher } = require('scraping-ninja-toolkit');
const fs = require('fs');
(async () => {
  // Get category URLs
  const categories = await fetcher
    .get('https://coursehunters.net')
    .links('.menu-aside__a');
  // For each category
  // => frontend
  // => backend ...
  const results = await fetcher.getAll(categories).map(
    async (fetchNode, index) => {
      // Get all courses from the category in a flat array
      // https://coursehunters.net/frontend?page=1 => 10 courses
      // https://coursehunters.net/frontend?page=2 => 10 courses
      // ....
      //
      // allCourses => [{
      //   title: 'Modern HTML & CSS From The Beginning',
      //   url: 'https://coursehunters.net/course/sovremennyy-html-i-css-s-samogo-nachala'
      // }, ... ]
      const allCourses = await fetchNode
        .paginate('.pagination__a[rel="next"]')
        .flatMap(p =>
          p.scrapeAll('article', {
            title: '.standard-course-block__original@text',
            url: 'a[itemprop="mainEntityOfPage"]@href'
          })
        );
      // For each course, scrape its chapters
      // with a concurrency of 50 queries at the same time
      // and filter out "undefined" values (courses without chapters)
      const courses = await fetcher
        .getAll(allCourses.map(c => c.url))
        .map(
          async p => {
            console.log(`Scraping url: ${p.location}`);
            const chapters = p.scrapeAll('.lessons-list__li', {
              name: 'span[itemprop="name"]@text',
              url: 'link[itemprop="url"]@href'
            });
            if (chapters && chapters.length && chapters[0].url) {
              const course = allCourses.find(c => c.url === p.location);
              course.chapters = chapters;
              return course;
            }
          },
          { concurrency: 50 }
        )
        .filter(c => c);
      return {
        category: categories[index].split('/').pop(),
        courses: courses
      };
    },
    { resolvePromise: false, concurrency: 6 }
  );
  fs.writeFileSync('courses.json', JSON.stringify(results, null, 2), 'utf8');
})();
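The example above passes `{ concurrency: 50 }` and `{ concurrency: 6 }` to cap how many pages are processed at once. The README does not show how the library enforces that cap, so the following is only a sketch of one common approach: a fixed pool of workers pulling from a shared index. The names `mapWithConcurrency` and `demo` are hypothetical and do not come from scraping-ninja-toolkit.

```javascript
// Hypothetical sketch of a concurrency-limited map; not the library's code.
async function mapWithConcurrency(items, mapper, concurrency) {
  const results = new Array(items.length);
  let next = 0;

  // Each worker repeatedly claims the next unprocessed index.
  async function worker() {
    while (next < items.length) {
      const index = next++;
      results[index] = await mapper(items[index], index);
    }
  }

  // Never start more workers than there are items, and always at least one.
  const workerCount = Math.max(1, Math.min(concurrency, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Demo: double five numbers with at most two mappers running at once,
// recording the highest number of mappers active at the same time.
async function demo() {
  let active = 0;
  let peak = 0;
  const doubled = await mapWithConcurrency(
    [1, 2, 3, 4, 5],
    async n => {
      active++;
      peak = Math.max(peak, active);
      await new Promise(resolve => setTimeout(resolve, 10));
      active--;
      return n * 2;
    },
    2
  );
  return { doubled, peak };
}

demo().then(result => console.log(result));
```

Results are stored by index, so the output order matches the input order even when mappers finish out of order. Dropping empty entries afterwards, as the example does with `.filter(c => c)`, would be a separate step.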
• FB55: his work is the core of this library.
• Matt Mueller and cheerio contributors: a good portion of the code and concepts is copied or derived from the cheerio and x-ray scraper libraries.
MIT © 2019 Jimmy Laurent
FAQs
The npm package scraping-ninja-toolkit receives a total of 3 weekly downloads. As such, its popularity is classified as not popular.
We found that scraping-ninja-toolkit has an unhealthy version release cadence and level of project activity because the last version was released a year ago. It has 1 open source maintainer collaborating on the project.