
Security News
AGENTS.md Gains Traction as an Open Format for AI Coding Agents
AGENTS.md is a fast-growing open format giving AI coding agents a shared, predictable way to understand project setup, style, and workflows.
headless-chrome-crawler-x
Advanced tools
Distributed crawler powered by Headless Chrome
Crawlers based on simple requests to HTML files are generally fast. However, it sometimes ends up capturing empty bodies, especially when the websites are built on such modern frontend frameworks as AngularJS, React and Vue.js.
Powered by Headless Chrome, the crawler provides simple APIs to crawl these dynamic websites with the following features:
yarn add headless-chrome-crawler
# or "npm i headless-chrome-crawler"
Note: headless-chrome-crawler contains Puppeteer. During installation, it automatically downloads a recent version of Chromium. To skip the download, see Environment variables.
const HCCrawler = require('headless-chrome-crawler');
(async () => {
const crawler = await HCCrawler.launch({
// Function to be evaluated in browsers
evaluatePage: (() => ({
title: $('title').text(),
})),
// Function to be called with evaluated results from browsers
onSuccess: (result => {
console.log(result);
}),
});
// Queue a request
await crawler.queue('https://example.com/');
// Queue multiple requests
await crawler.queue(['https://example.net/', 'https://example.org/']);
// Queue a request with custom options
await crawler.queue({
url: 'https://example.com/',
// Emulate a tablet device
device: 'Nexus 7',
// Enable screenshot by passing options
screenshot: {
path: './tmp/example-com.png'
},
});
await crawler.onIdle(); // Resolved when no queue is left
await crawler.close(); // Close the crawler
})();
See here for the full examples list. The examples can be run from the root folder as follows:
NODE_PATH=../ node examples/priority-queue.js
See here for the API reference.
See here for the debugging tips.
There are roughly two types of crawlers. One is static and the other is dynamic.
The static crawlers are based on simple requests to HTML files. They are generally fast, but fail scraping the contents when the HTML dynamically changes on browsers.
Dynamic crawlers based on PhantomJS and Selenium work magically on such dynamic applications. However, PhantomJS's maintainer has stepped down and recommended to switch to Headless Chrome, which is fast and stable. Selenium is still a well-maintained cross browser platform which runs on Chrome, Safari, IE and so on. However, crawlers do not need such cross browsers support.
This crawler is dynamic and based on Headless Chrome.
This crawler is built on top of Puppeteer.
Puppeteer provides low to mid level APIs to manupulate Headless Chrome, so you can build your own crawler with it. This way you have more controls on what features to implement in order to satisfy your needs.
However, most crawlers requires such common features as following links, obeying robots.txt and etc. This crawler is a general solution for most crawling purposes. If you want to quickly start crawling with Headless Chrome, this crawler is for you.
FAQs
Distributed web crawler powered by Headless Chrome
The npm package headless-chrome-crawler-x receives a total of 3 weekly downloads. As such, headless-chrome-crawler-x popularity was classified as not popular.
We found that headless-chrome-crawler-x demonstrated a not healthy version release cadence and project activity because the last version was released a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Security News
AGENTS.md is a fast-growing open format giving AI coding agents a shared, predictable way to understand project setup, style, and workflows.
Security News
/Research
Malicious npm package impersonates Nodemailer and drains wallets by hijacking crypto transactions across multiple blockchains.
Security News
This episode explores the hard problem of reachability analysis, from static analysis limits to handling dynamic languages and massive dependency trees.