
A Node.js scraper for humans.
Sponsored with :heart: by:
Serpapi.com is a platform that allows you to scrape Google and other search engines with a fast, easy, and complete API.
Capsolver.com is an AI-powered service that automatically solves various types of captchas, including reCAPTCHA v2/v3, hCaptcha, FunCaptcha, DataDome, AWS Captcha, Geetest, Cloudflare Captcha/Challenge 5s, and Imperva/Incapsula, among others. For developers, Capsolver offers API integration options, detailed in its documentation, for building captcha solving into applications. It also provides browser extensions for Chrome and Firefox, making it easy to use the service directly in a browser, and offers different pricing packages to accommodate varying needs.
```sh
# Using npm
npm install --save scrape-it

# Using yarn
yarn add scrape-it
```
:bulb: ProTip: You can install the CLI version of this module by running `npm install --global scrape-it-cli` (or `yarn global add scrape-it-cli`).
Here are some frequently asked questions and their answers.

How to parse ajax pages?

scrape-it has only a simple request module for making requests. That means you cannot directly parse ajax pages with it, but in general you will have these scenarios:

- The ajax response gives you HTML back. Instead of calling the main page, pass scrape-it the ajax url (e.g. `example.com/api/that-endpoint`) and you will be able to parse the response.
- The page content is loaded dynamically. In that case, render the page first (e.g. in a headless browser) and use the `.scrapeHTML` method from scrape-it once you get the HTML loaded on the page.

How to crawl pages/websites with this module?

There is no fancy way to crawl pages with scrape-it. For simple scenarios, you can parse the list of urls from the initial page and then, using Promises, parse each page, as sketched below. Also, you can use a different crawler to download the website and then use the `.scrapeHTML` method to scrape the local files.
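For example, a minimal sketch of the Promise-based approach (the site url and selectors here are hypothetical):

```js
const scrapeIt = require("scrape-it")

// Scrape the listing page for article urls, then scrape each article page.
scrapeIt("https://example.com", {
    articles: {
        listItem: ".article"
        , data: {
            url: {
                selector: "a.article-title"
                , attr: "href"
            }
        }
    }
}).then(({ data }) => Promise.all(
    data.articles.map(({ url }) =>
        scrapeIt(`https://example.com${url}`, {
            title: ".article-title"
        })
    )
)).then(pages => {
    pages.forEach(({ data }) => console.log(data.title))
})
```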
How to parse local files?

Use the `.scrapeHTML` method to parse HTML read from local files using `fs.readFile`, as in the sketch below.
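A minimal sketch, assuming `cheerio` is installed alongside `scrape-it` and `page.html` is a hypothetical local file:

```js
const fs = require("fs")
const cheerio = require("cheerio")
const scrapeIt = require("scrape-it")

// Read the HTML from disk, load it into Cheerio, then scrape it in place.
fs.readFile("page.html", "utf8", (err, html) => {
    if (err) { throw err }
    const $ = cheerio.load(html)
    const data = scrapeIt.scrapeHTML($, {
        title: ".header h1"
    })
    console.log(data)
})
```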
Example:

```js
const scrapeIt = require("scrape-it")
// Promise interface
scrapeIt("https://ionicabizau.net", {
title: ".header h1"
, desc: ".header h2"
, avatar: {
selector: ".header img"
, attr: "src"
}
}).then(({ data, status }) => {
console.log(`Status Code: ${status}`)
console.log(data)
});
// Async-Await
(async () => {
const { data } = await scrapeIt("https://ionicabizau.net", {
// Fetch the articles
articles: {
listItem: ".article"
, data: {
// Get the article date and convert it into a Date object
createdAt: {
selector: ".date"
, convert: x => new Date(x)
}
// Get the title
, title: "a.article-title"
// Nested list
, tags: {
listItem: ".tags > span"
}
// Get the content
, content: {
selector: ".article-content"
, how: "html"
}
// Get attribute value of root listItem by omitting the selector
, classes: {
attr: "class"
}
}
}
// Fetch the blog pages
, pages: {
listItem: "li.page"
, name: "pages"
, data: {
title: "a"
, url: {
selector: "a"
, attr: "href"
}
}
}
// Fetch some other data from the page
, title: ".header h1"
, desc: ".header h2"
, avatar: {
selector: ".header img"
, attr: "src"
}
})
console.log(data)
// { articles:
// [ { createdAt: Mon Mar 14 2016 00:00:00 GMT+0200 (EET),
// title: 'Pi Day, Raspberry Pi and Command Line',
// tags: [Object],
// content: '<p>Everyone knows (or should know)...a" alt=""></p>\n',
// classes: [Object] },
// { createdAt: Thu Feb 18 2016 00:00:00 GMT+0200 (EET),
// title: 'How I ported Memory Blocks to modern web',
// tags: [Object],
// content: '<p>Playing computer games is a lot of fun. ...',
// classes: [Object] },
// { createdAt: Mon Nov 02 2015 00:00:00 GMT+0200 (EET),
// title: 'How to convert JSON to Markdown using json2md',
// tags: [Object],
// content: '<p>I love and ...',
// classes: [Object] } ],
// pages:
// [ { title: 'Blog', url: '/' },
// { title: 'About', url: '/about' },
// { title: 'FAQ', url: '/faq' },
// { title: 'Training', url: '/training' },
// { title: 'Contact', url: '/contact' } ],
//    title: 'Ionică Bizău',
// desc: 'Web Developer, Linux geek and Musician',
// avatar: '/images/logo.png' }
})()
```
scrapeIt(url, opts, cb)

A scraping module for humans.

- `url`: The page url or request options.
- `opts`: The options passed to the `scrapeHTML` method.
- `cb`: The callback function.

The result contains:

- `data` (Object): The scraped data.
- `$` (Function): The Cheerio function. This may be handy to do some other manipulation on the DOM, if needed.
- `response` (Object): The response object.
- `body` (String): The raw body as a string.
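For reference, a minimal sketch of the callback interface; the Node-style `(err, result)` signature shown here is an assumption:

```js
const scrapeIt = require("scrape-it")

// Callback interface (assumed Node-style signature).
scrapeIt("https://ionicabizau.net", {
    title: ".header h1"
}, (err, { data }) => {
    console.log(err || data)
})
```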
scrapeIt.scrapeHTML($, opts)

Scrapes the data in the provided element.

For the format of the selectors, please refer to the Selectors section of the Cheerio library documentation.
- `$` (Cheerio): The input element.
- `opts` (Object): An object containing the scraping information.
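As a quick sketch, you can load any HTML string with `cheerio` and scrape it in place (the HTML snippet below is hypothetical):

```js
const cheerio = require("cheerio")
const scrapeIt = require("scrape-it")

// Load an HTML string into Cheerio, then scrape it.
const $ = cheerio.load(`
    <div class="header">
        <h1>Hello World</h1>
    </div>
`)
const data = scrapeIt.scrapeHTML($, {
    title: ".header h1"
})
console.log(data) // => { title: "Hello World" }
```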
If you want to scrape a list, you have to use the `listItem` selector:

- `listItem` (String): The list item selector.
- `data` (Object): The fields to include in the list objects:
  - `<fieldName>` (Object|String): The selector or an object containing:
    - `selector` (String): The selector.
    - `convert` (Function): An optional function to change the value.
    - `how` (Function|String): A function or function name to access the value.
    - `attr` (String): If provided, the value will be taken based on the attribute name.
    - `trimValue` (Boolean): If `false`, the value will not be trimmed (default: `true`).
    - `closest` (String): If provided, returns the first ancestor of the given element.
    - `eq` (Number): If provided, it will select the nth element.
    - `texteq` (Number): If provided, it will select the nth direct text child. Deep text child selection is not possible yet. Overwrites the `how` key.
    - `listItem` (Object): An object keeping the recursive schema of the `listItem` object. This can be used to create nested lists.

Example:
```js
{
articles: {
listItem: ".article"
, data: {
createdAt: {
selector: ".date"
, convert: x => new Date(x)
}
, title: "a.article-title"
, tags: {
listItem: ".tags > span"
}
, content: {
selector: ".article-content"
, how: "html"
}
, traverseOtherNode: {
selector: ".upperNode"
, closest: "div"
, convert: x => x.length
}
}
}
}
```
If you want to collect specific data from the page, just use the same schema used for the `data` field.
Example:
```js
{
title: ".header h1"
, desc: ".header h2"
, avatar: {
selector: ".header img"
, attr: "src"
}
}
```
There are a few ways to get help.
Have an idea? Found a bug? See how to contribute.
I open-source almost everything I can, and I try to reply to everyone needing help using these projects. Obviously, this takes time. You can integrate and use these projects in your applications for free! You can even change the source code and redistribute (even resell it).
However, if you get some profit from this or just want to encourage me to continue creating stuff, there are a few ways you can do it:
- Starring and sharing the projects you like :rocket:
- Buy me a book: I love books! I will remember you after years if you buy me one. :grin: :book:
- PayPal: You can make one-time donations via PayPal. I'll probably buy a ~~coffee~~ tea. :tea:
- Monthly donation: Set up a recurring monthly donation and you will get interesting news about what I'm doing (things that I don't share with everyone).
- Bitcoin: You can send me bitcoins at this address: 1P9BRsmazNQcuyTxEqveUsnf5CERdq35V6
Thanks! :heart:
If you are using this library in one of your projects, add it to this list. :sparkles:
3abn
@alexjorgef/bandcamp-scraper
@ben-wormald/bandcamp-scraper
@bogochunas/package-shopify-crawler
@lukekarrys/ebp
@markab.io/node-api
@thetrg/gibson
@tryghost/mg-webscraper
@web-master/node-web-scraper
@zougui/furaffinity
airport-cluj
apixpress
bandcamp-scraper
beervana-scraper
bible-scraper
blankningsregistret
blockchain-notifier
brave-search-scraper
camaleon
carirs
cevo-lookup
cnn-market
codementor
codinglove-scraper
covidau
degusta-scrapper
dncli
egg-crawler
fa.js
flamescraper
fmgo-marketdata
gatsby-source-bandcamp
growapi
helyesiras
jishon
jobs-fetcher
leximaven
macoolka-net-scrape
macoolka-network
mersul-microbuzelor
mersul-trenurilor
mit-ocw-scraper
mix-dl
node-red-contrib-getdata-website
node-red-contrib-scrape-it
nurlresolver
paklek-cli
parn
picarto-lib
rayko-tools
rs-api
sahibinden
sahibindenServer
salesforcerelease-parser
scrape-it-cli
scrape-vinmonopolet
scrapemyferry
scrapos-worker
sgdq-collector
simple-ai-alpha
spon-market
startpage-quick-search
steam-workshop-scraper
trump-cabinet-picks
u-pull-it-ne-parts-finder
ubersetzung
ui-studentsearch
university-news-notifier
uniwue-lernplaetze-scraper
vandalen.rhyme.js
wikitools
yu-ncov-scrape-dxy