
website-scrap-engine
Configurable website scraper library in TypeScript. Consumers provide a DownloadOptions config (which includes a ProcessingLifeCycle) and instantiate a downloader to recursively scrape websites to local disk.
Features:
- Single-thread and multi-thread (worker_threads) downloaders
- CSS url() extraction and rewriting
- srcset, Open Graph meta tags, inline styles, and SVG xlink:href support
- Retry-After header support
- file:// source support for re-processing previously saved sites
- Dedicated logger categories (skip, retry, error, notFound, etc.)

Installation:
npm install website-scrap-engine
Requires Node.js >= 18.17.0.
The downloader takes a path (or file:// URL) to a module that default-exports a DownloadOptions object. This pattern allows worker threads to independently load the same configuration.
Step 1: Create an options module (e.g. my-options.js)
import {lifeCycle, options, resource} from 'website-scrap-engine';
const {defaultLifeCycle} = lifeCycle;
const {defaultDownloadOptions} = options;
const {ResourceType} = resource;
const lc = defaultLifeCycle();
// Example: skip binary resources deeper than depth 2
lc.processBeforeDownload.push((res) => {
  if (res.depth > 2 && res.type === ResourceType.Binary) return;
  return res;
});
export default defaultDownloadOptions({
  ...lc,
  localRoot: '/path/to/save',
  maxDepth: 3,
  initialUrl: ['https://example.com'],
});
Step 2: Create and run the downloader
import path from 'path';
import {downloader} from 'website-scrap-engine';
const {SingleThreadDownloader} = downloader;
const d = new SingleThreadDownloader(
  'file://' + path.resolve('my-options.js')
);
d.start();
d.onIdle().then(() => d.dispose());
For CPU-intensive workloads, use MultiThreadDownloader instead (see Multi-Thread Processing).
You can also pass override options as the second argument to the downloader constructor, which are merged into the options module's export:
new SingleThreadDownloader('file://' + path.resolve('my-options.js'), {
  localRoot: '/different/path',
  concurrency: 8,
});
The library provides adapter functions in lifeCycle.adapter for common customization patterns:
| Adapter | Stage | Description |
|---|---|---|
| skipProcess(fn) | linkRedirect | Skip URLs matching a predicate |
| dropResource(fn) | processBeforeDownload | Mark matching resources as discard-only (rewrite the link but don't download) |
| preProcess(fn) | processBeforeDownload | Inspect or modify resources before download |
| requestRedirect(fn) | processBeforeDownload | Rewrite the download URL |
| redirectFilter(fn) | processAfterDownload | Rewrite or discard redirect URLs |
| processHtml(fn) | processAfterDownload | Transform the parsed HTML (cheerio $) |
| processHtmlAsync(fn) | processAfterDownload | Async version of processHtml |
import {lifeCycle, resource} from 'website-scrap-engine';
const {ResourceType} = resource;
const lc = lifeCycle.defaultLifeCycle();
// Skip all URLs containing "/api/"
lc.linkRedirect.push(lifeCycle.adapter.skipProcess(
  (url) => url.includes('/api/')
));
// Drop PNG images from download but still rewrite their links
lc.processBeforeDownload.push(lifeCycle.adapter.dropResource(
  (res) => res.type === ResourceType.Binary && res.url.endsWith('.png')
));
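The processHtml adapter from the table above can be appended the same way. A sketch, continuing the snippet above; the exact return contract of the callback is an assumption based on the cheerio `$` signature shown in the table:

```javascript
// Strip <script> tags from every saved page (illustrative).
lc.processAfterDownload.push(lifeCycle.adapter.processHtml(
  ($) => {
    $('script').remove();
    return $;
  }
));
```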
Resources are processed through a sequential pipeline of hook arrays. Each stage is an array of functions executed in order. Returning void/undefined from any function discards the resource from that stage onward.
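The discard semantics can be illustrated with a minimal, self-contained sketch. This is not the library's actual executor; the function and resource shapes are illustrative only:

```javascript
// Each stage is an array of (possibly async) functions run in order;
// returning undefined discards the resource from that point onward.
async function runStage(stage, res) {
  for (const fn of stage) {
    res = await fn(res);
    if (res === undefined) return undefined; // discarded
  }
  return res;
}

// Example stage: skip non-HTTP URLs, then tag surviving resources.
const stage = [
  (res) => (res.url.startsWith('http') ? res : undefined),
  (res) => ({...res, seen: true}),
];

runStage(stage, {url: 'https://example.com'})
  .then((r) => console.log(r)); // logs the tagged resource
runStage(stage, {url: 'mailto:a@b.c'})
  .then((r) => console.log(r)); // logs undefined (discarded)
```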
init (once per downloader/worker startup)
|
v
URL
|
v
1. linkRedirect -----> skip or redirect URLs before processing
|
v
2. detectResourceType -> determine type (Html, Css, Binary, Svg, SiteMap, etc.)
|
v
3. createResource ----> build a Resource with save paths and relative replacement paths
|
v
4. processBeforeDownload -> filter/modify resources; link replacement in parent happens after this
|
v
5. download ----------> fetch resource via HTTP (loop ends early once body is set)
|
v
6. processAfterDownload -> parse content, discover child resources via submit() callback
|
v
7. saveToDisk --------> write to local filesystem
|
v
dispose (once per downloader shutdown / worker exit)
Consumers extend the pipeline by prepending or appending functions to any stage array via defaultLifeCycle(). See Usage for examples.
| Stage | Default handlers |
|---|---|
| linkRedirect | skipLinks - filters out non-HTTP URI schemes (mailto, javascript, data, etc.) |
| detectResourceType | detectResourceType - infers type from element/context |
| createResource | createResource - builds Resource with URL resolution, save path, and replace path |
| download | downloadResource, downloadStreamingResource, readOrCopyLocalResource |
| processAfterDownload | processRedirectedUrl, processHtml, processHtmlMetaRefresh, processSvg, processCss, processSiteMap |
| saveToDisk | saveHtmlToDisk, saveResourceToDisk |
Defined in ResourceType enum:
| Type | Encoding | Description |
|---|---|---|
| Binary | null | Not parsed; saved as-is |
| Html | utf8 | Parsed with cheerio; links discovered and rewritten |
| Css | utf8 | CSS url() references extracted and rewritten |
| CssInline | utf8 | Inline <style> blocks and style attributes |
| SiteMap | utf8 | URLs discovered but not rewritten |
| Svg | utf8 | Parsed with cheerio (same as HTML) |
| StreamingBinary | null | Streamed directly to disk; for large files |
The scraper discovers linked resources from HTML using configurable source definitions. The defaults cover:
- img[src], img[srcset], picture source[srcset]
- link[rel="stylesheet"], <style> blocks, [style] attributes
- script[src]
- a[href], frame[src], iframe[src]
- video[src], video[poster], audio[src], source[src], track[src]
- *[xlink:href], *[href]
- meta[property="og:image"], og:audio, og:video and their variants
- embed[src], object[data], input[src], [background], link[rel*="icon"], link[rel*="preload"]

Override via options.sources with an array of {selector, attr, type} definitions.
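An options.sources override might look like the following sketch; the {selector, attr, type} field names come from the description above, but the surrounding shape is an assumption:

```javascript
import {options, resource} from 'website-scrap-engine';
const {defaultDownloadOptions} = options;
const {ResourceType} = resource;

export default defaultDownloadOptions({
  // Only follow stylesheets and images (illustrative subset).
  sources: [
    {selector: 'link[rel="stylesheet"]', attr: 'href', type: ResourceType.Css},
    {selector: 'img', attr: 'src', type: ResourceType.Binary},
  ],
  // ...plus the usual options (localRoot, initialUrl, etc.)
});
```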
- Resource (src/resource.ts): central data object carrying URL, save path, replacement path, body, and metadata. RawResource is the serializable subset used for cross-thread communication.
- PipelineExecutor (interface in src/life-cycle/pipeline-executor.ts, implementation in src/downloader/pipeline-executor-impl.ts): orchestrates life cycle execution; createAndProcessResource() runs stages 1-4 in one call.
- AbstractDownloader (src/downloader/main.ts): base class with PQueue-based concurrency, URL deduplication, and the download loop.
- SingleThreadDownloader (src/downloader/single.ts): runs all pipeline stages in the main thread.
- MultiThreadDownloader (src/downloader/multi.ts): downloads in the main thread, sends resources to a worker pool for post-processing.

Use multi-thread processing when post-download work (HTML/CSS parsing, link discovery) is CPU-intensive.
Main thread: downloads resources and dispatches them to the worker pool.
Worker threads: post-process downloaded resources and return discovered child resources as serialized RawResource[].
Worker count defaults to Math.min(concurrency, workerCount). The worker pool uses a 2-pass water-fill algorithm to balance tasks across workers by load.
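Instantiation mirrors the single-thread example from Usage. A sketch, assuming workerCount is accepted as an override option (the option name is inferred from the default described above):

```javascript
import path from 'path';
import {downloader} from 'website-scrap-engine';
const {MultiThreadDownloader} = downloader;

const d = new MultiThreadDownloader(
  'file://' + path.resolve('my-options.js'),
  {workerCount: 4}  // assumed option name
);
d.start();
d.onIdle().then(() => d.dispose());
```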
The library uses log4js with dedicated logger categories:
| Logger | Purpose |
|---|---|
| skip | Resources filtered/discarded at any pipeline stage |
| skipExternal | External resources skipped by scope |
| retry | HTTP retry attempts with backoff details |
| error | Download and processing errors |
| notFound | 404 responses |
| request / response | HTTP request/response logging |
| complete | Successfully processed resources |
| mkdir | Directory creation |
| adjustConcurrency | Runtime concurrency changes |
Configure logging via options.configureLogger and options.logSubDir.
License: ISC