
Open-source web scraping library with headless browser support, pagination, and data extraction
OpenScrape is a fully open-source web scraping library that mimics the core features of commercial scraping APIs. Built with TypeScript for Node.js 18+, it provides headless browser rendering, automatic pagination detection, clean data extraction, and both CLI and REST API interfaces.
npm install openscrape
Or install globally for CLI usage:
npm install -g openscrape
Important: After installation, you need to install Playwright browsers:
npx playwright install chromium
This downloads the Chromium browser required for headless rendering.
You can run OpenScrape in a container with no local Node or Playwright install.
Build the image:
docker build -t openscrape .
Run the API server (default; port 3000):
docker run -p 3000:3000 --init openscrape
Or with Docker Compose:
docker compose up --build
Then scrape via the API:
curl -X POST http://localhost:3000/crawl \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/article"}'
Use the CLI inside the container (crawl a URL, write output to a mounted volume):
docker run --rm -v "$(pwd)/out:/out" openscrape crawl https://example.com/article -o /out/article.json
Batch scrape (mount a file with URLs and an output directory):
docker run --rm -v "$(pwd)/urls.txt:/app/urls.txt" -v "$(pwd)/scraped:/out" openscrape batch /app/urls.txt --output-dir /out --format markdown
Custom command (override the default serve):
docker run --rm openscrape crawl https://example.com -o /tmp/out.json --format json
The image includes Chromium and its dependencies; the default command is serve --port 3000 --host 0.0.0.0. Use --init to avoid zombie processes. For large workloads, you may need to increase memory for the container.
Scrape a single URL:
openscrape crawl https://example.com/article --output article.json
Scrape multiple URLs from a file:
openscrape batch urls.txt --output-dir ./scraped --format markdown
Start the API server:
openscrape serve --port 3000
import { OpenScrape } from 'openscrape';
const scraper = new OpenScrape();
// Scrape a single URL
const data = await scraper.scrape({
url: 'https://example.com/article',
render: true,
format: 'json',
extractImages: true,
});
console.log(data.title);
console.log(data.content);
console.log(data.markdown);
await scraper.close();
Start the server:
openscrape serve
Scrape a URL:
curl -X POST http://localhost:3000/crawl \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/article"}'
Check job status:
curl http://localhost:3000/status/{jobId}
When you run openscrape serve, the server also exposes a WebSocket endpoint at path /ws. Connect to receive real-time events for crawl jobs.
Endpoint: ws://localhost:3000/ws (or wss:// in production with TLS)
Subscribe to a job: Send a JSON message:
{ "type": "subscribe", "jobId": "<jobId>" }
Unsubscribe:
{ "type": "unsubscribe", "jobId": "<jobId>" }
Server events (you receive JSON):
| Event | When |
|---|---|
| job:created | A new crawl job was created |
| job:processing | Scraping has started |
| job:completed | Scraping finished; job.result has the data |
| job:failed | Scraping failed; job.error has the message |
Example message:
{
"event": "job:completed",
"jobId": "abc-123",
"job": {
"id": "abc-123",
"url": "https://example.com/article",
"status": "completed",
"result": { "url": "...", "title": "...", "content": "...", "markdown": "..." },
"createdAt": "2025-01-15T12:00:00.000Z",
"completedAt": "2025-01-15T12:00:05.000Z"
},
"timestamp": "2025-01-15T12:00:05.000Z"
}
Minimal client example (Node):
const WebSocket = require('ws');
const ws = new WebSocket('ws://localhost:3000/ws');
ws.on('open', () => {
  // First POST /crawl to get jobId, then:
  ws.send(JSON.stringify({ type: 'subscribe', jobId: 'YOUR_JOB_ID' }));
});
ws.on('message', (data) => {
  // ws delivers message data as a Buffer; convert it to a string before parsing
  const msg = JSON.parse(data.toString());
  console.log(msg.event, msg.job?.status, msg.job?.result?.title);
});
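The event table and example message above suggest a small typed handler. The following is a minimal sketch, assuming the message shape shown in the example; the `JobMessage` type and the `describeJobMessage` helper are hypothetical names for illustration and are not exported by the package.

```typescript
// Event names and fields follow the table and example message above;
// the exact typing is an assumption, not the package's published types.
type JobEvent = 'job:created' | 'job:processing' | 'job:completed' | 'job:failed';

interface JobMessage {
  event: JobEvent;
  jobId: string;
  job: {
    id: string;
    url: string;
    status: string;
    result?: { title?: string };
    error?: string;
  };
  timestamp: string;
}

// Hypothetical helper: turn one raw WebSocket frame into a log line.
function describeJobMessage(raw: string): string {
  const msg = JSON.parse(raw) as JobMessage;
  switch (msg.event) {
    case 'job:completed':
      return `${msg.jobId} completed: ${msg.job.result?.title ?? '(no title)'}`;
    case 'job:failed':
      return `${msg.jobId} failed: ${msg.job.error ?? 'unknown error'}`;
    default:
      // job:created and job:processing carry no result yet
      return `${msg.jobId} ${msg.event}`;
  }
}

const sample = JSON.stringify({
  event: 'job:completed',
  jobId: 'abc-123',
  job: {
    id: 'abc-123',
    url: 'https://example.com/article',
    status: 'completed',
    result: { title: 'Article Title' },
  },
  timestamp: '2025-01-15T12:00:05.000Z',
});
console.log(describeJobMessage(sample)); // abc-123 completed: Article Title
```

In a real client, the string passed to `describeJobMessage` would be the text of each frame received on the /ws connection.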
interface ScrapeOptions {
url: string; // URL to scrape (required)
render?: boolean; // Enable JS rendering (default: true)
waitTime?: number; // Wait time after load in ms (default: 2000)
maxDepth?: number; // Max pagination depth (default: 10)
nextSelector?: string; // Custom CSS selector for next link
paginationCallback?: Function; // Custom pagination detection
format?: 'json' | 'markdown' | 'html' | 'text' | 'csv' | 'yaml'; // Output format (default: 'json')
extractionSchema?: object; // Custom extraction schema
autoDetectSchema?: boolean; // Auto-detect schema from page (opt-in)
schemaSamples?: string[]; // Sample URLs for schema detection (optional)
llmExtract?: boolean; // Use local LLM to extract structured JSON
llmEndpoint?: string; // Ollama or LM Studio endpoint URL
llmModel?: string; // Model name (default: 'llama2')
userAgent?: string; // Custom user agent
proxy?: string | string[]; // Override proxy for this request (single URL or list for rotation)
timeout?: number; // Request timeout in ms (default: 30000)
extractImages?: boolean; // Extract images (default: true)
extractMedia?: boolean; // Extract embedded media (default: false)
downloadMedia?: boolean; // Download images to a local folder (default: false)
mediaOutputDir?: string; // Folder for downloads (default: ./media)
base64EmbedImages?: boolean; // Embed small images as base64 in JSON (default: false)
base64EmbedMaxBytes?: number; // Max size for embedding in bytes (default: 51200)
}
You can save images (and other assets) locally and optionally embed small images as base64 in JSON.
Download media to a folder (organized by site and path):
openscrape crawl https://example.com/article --output article.json --download-media --media-dir ./media
Folder structure: mediaOutputDir / hostname / path_slug / image_0.jpg, e.g. ./media/example.com/article/image_0.jpg.
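To make the folder layout concrete, here is a minimal sketch that derives a local path from a page URL. The slug rule (path segments joined with "_", with "index" for the site root) is an assumption chosen to reproduce the example above; the package's actual slug logic is not documented here, and `mediaPathFor` is a hypothetical helper.

```typescript
// Assumption: the path slug is the URL path with slashes replaced by "_",
// and "index" is used for the site root. The real slug rules may differ.
function mediaPathFor(mediaDir: string, pageUrl: string, index: number, ext: string): string {
  const { hostname, pathname } = new URL(pageUrl);
  const slug = pathname.split('/').filter(Boolean).join('_') || 'index';
  const dir = mediaDir.replace(/\/+$/, ''); // drop trailing slashes
  return `${dir}/${hostname}/${slug}/image_${index}.${ext}`;
}

console.log(mediaPathFor('./media', 'https://example.com/article', 0, 'jpg'));
// ./media/example.com/article/image_0.jpg
```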
Base64-embed small images in JSON (for self-contained output or small thumbnails):
openscrape crawl https://example.com/article --output article.json --embed-images --embed-images-max-size 51200
Output includes mediaEmbedded: [{ url, dataUrl, mimeType }] with data:image/...;base64,... URLs.
Programmatic usage:
const data = await scraper.scrape({
url: 'https://example.com/article',
downloadMedia: true,
mediaOutputDir: './media',
base64EmbedImages: true,
base64EmbedMaxBytes: 51200,
});
// data.images → original URLs
// data.mediaDownloads → [{ url, localPath, mimeType }]
// data.mediaEmbedded → [{ url, dataUrl, mimeType }]
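The size cutoff behind base64EmbedMaxBytes can be illustrated with a short sketch. It assumes the limit is inclusive and applies to the raw byte length before encoding; `embedIfSmall` is a hypothetical helper, not part of the package.

```typescript
import { Buffer } from 'node:buffer';

interface EmbeddedMedia {
  url: string;
  dataUrl: string;
  mimeType: string;
}

// Returns an embedded entry when the image fits within maxBytes, otherwise null.
// Assumption: the limit is inclusive and measured on the raw (pre-base64) bytes.
function embedIfSmall(
  url: string,
  bytes: Uint8Array,
  mimeType: string,
  maxBytes = 51200,
): EmbeddedMedia | null {
  if (bytes.byteLength > maxBytes) return null;
  const dataUrl = `data:${mimeType};base64,${Buffer.from(bytes).toString('base64')}`;
  return { url, dataUrl, mimeType };
}

const tiny = embedIfSmall('https://example.com/a.png', new Uint8Array([1, 2, 3]), 'image/png');
console.log(tiny?.dataUrl); // data:image/png;base64,AQID
```

Base64 output is roughly a third larger than the input, so a 51200-byte image adds about 68 KB to the JSON.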
Besides json and markdown, OpenScrape can output:
| Format | Use case |
|---|---|
| html | Cleaned HTML (no scripts/nav); good for archiving or re-rendering. |
| text | Plain text only; good for search indexes or NLP. |
| csv | List/table-like pages: first <table> as rows; otherwise one row with url, title, author, content. |
| yaml | Full structured data (url, title, author, content, images, etc.) in YAML. |
Examples:
openscrape crawl https://example.com/article -o page.html --format html
openscrape crawl https://example.com/table -o data.csv --format csv
openscrape crawl https://example.com/article -o meta.yaml --format yaml
Programmatic usage: the scraper always returns full ScrapedData; use the formatters for string output:
import { OpenScrape, toHtml, toText, toCsv, toYaml } from 'openscrape';
const scraper = new OpenScrape();
const data = await scraper.scrape({ url: 'https://example.com/article' });
await scraper.close();
const htmlString = toHtml(data);
const textString = toText(data);
const csvString = toCsv(data);
const yamlString = toYaml(data);
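For pages without a table, the csv format is described above as a single row with url, title, author, and content. A minimal sketch of that fallback follows, assuming a header row and RFC 4180-style quoting; `fallbackCsv` and `csvField` are hypothetical helpers and may not match what the package's toCsv produces.

```typescript
// RFC 4180-style quoting: wrap fields that contain commas, quotes, or newlines,
// and double any embedded quotes.
function csvField(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

interface PageSummary {
  url: string;
  title?: string;
  author?: string;
  content?: string;
}

// Assumption: the fallback emits a header row followed by one data row.
function fallbackCsv(page: PageSummary): string {
  const header = ['url', 'title', 'author', 'content'];
  const row = [page.url, page.title ?? '', page.author ?? '', page.content ?? ''];
  return [header.join(','), row.map(csvField).join(',')].join('\n');
}

console.log(fallbackCsv({
  url: 'https://example.com/a',
  title: 'Hello, "World"',
  author: 'Ann',
  content: 'Line',
}));
// url,title,author,content
// https://example.com/a,"Hello, ""World""",Ann,Line
```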
You can send the cleaned HTML or Markdown to a local LLM and get structured JSON (title, author, publishDate, content, metadata). Useful when pages have irregular structure.
Requirements: A local endpoint such as Ollama or LM Studio.
CLI:
# Ollama (default endpoint http://localhost:11434)
openscrape crawl https://example.com/article -o out.json --llm-extract --llm-model llama2
# Custom Ollama or LM Studio endpoint
openscrape crawl https://example.com/article -o out.json --llm-extract \
--llm-endpoint http://localhost:1234/v1 --llm-model my-model
Programmatic:
const data = await scraper.scrape({
url: 'https://example.com/article',
llmExtract: true,
llmEndpoint: 'http://localhost:11434', // Ollama
llmModel: 'llama2',
});
// data is merged with LLM-extracted fields; on error, data.metadata.llmError is set
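Ollama and LM Studio expose different HTTP APIs: Ollama has a generate endpoint (/api/generate), while LM Studio serves an OpenAI-style chat completions endpoint under /v1. The sketch below shows one way a client could pick the request shape from the endpoint URL. The "/v1 suffix means OpenAI-compatible" rule and the `buildLlmRequest` helper are assumptions for illustration, not the package's implementation.

```typescript
interface LlmRequest {
  url: string;
  body: Record<string, unknown>;
}

// Assumption: an endpoint whose path ends in /v1 is treated as OpenAI-compatible
// (for example LM Studio); anything else is treated as Ollama.
function buildLlmRequest(endpoint: string, model: string, prompt: string): LlmRequest {
  const base = endpoint.replace(/\/+$/, ''); // drop trailing slashes
  if (base.endsWith('/v1')) {
    return {
      url: `${base}/chat/completions`,
      body: { model, messages: [{ role: 'user', content: prompt }] },
    };
  }
  // Ollama's generate API; stream: false asks for a single JSON response.
  return { url: `${base}/api/generate`, body: { model, prompt, stream: false } };
}

console.log(buildLlmRequest('http://localhost:11434', 'llama2', 'Extract the title').url);
// http://localhost:11434/api/generate
console.log(buildLlmRequest('http://localhost:1234/v1', 'my-model', 'Extract the title').url);
// http://localhost:1234/v1/chat/completions
```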
Ollama (default http://localhost:11434); the client calls /api/generate.
LM Studio (e.g. http://localhost:1234/v1); the client calls /v1/chat/completions.
With autoDetectSchema: true, OpenScrape infers an extraction schema from the page (e.g. title from <title> or og:title, content from article or .content). Use it when you don’t have a custom schema.
CLI:
openscrape crawl https://example.com/article -o out.json --auto-detect-schema
Programmatic:
const data = await scraper.scrape({
url: 'https://example.com/article',
autoDetectSchema: true,
});
You can also use the schema detector directly:
import { detectSchemaFromHtml } from 'openscrape';
const { schema, confidence, suggestions } = detectSchemaFromHtml(htmlString);
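As a rough illustration of the kind of inference described for titles (from <title> or og:title), here is a minimal sketch. It assumes og:title takes precedence and uses regular expressions purely for brevity; the library's real detector, its precedence rules, and its parser are not documented here, and `guessTitle` is a hypothetical helper.

```typescript
// Assumption: og:title wins over <title>. Regexes are for illustration only;
// a real detector would use an HTML parser.
function guessTitle(html: string): string | null {
  const og = html.match(/<meta[^>]+property=["']og:title["'][^>]*content=["']([^"']*)["']/i);
  if (og) return og[1].trim();
  const title = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
  return title ? title[1].trim() : null;
}

const page = '<head><title> Plain Title </title><meta property="og:title" content="OG Title"></head>';
console.log(guessTitle(page)); // OG Title
```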
const schema = {
title: '.article-title',
author: '.author-name',
publishDate: '.publish-date',
content: '.article-body',
custom: [
{
name: 'category',
selector: '.category',
},
{
name: 'views',
selector: '.views',
transform: (value: string) => parseInt(value, 10),
},
],
};
const data = await scraper.scrape({
url: 'https://example.com/article',
extractionSchema: schema,
});
const scraper = new OpenScrape({
maxRequestsPerSecond: 5,
maxConcurrency: 3,
});
Use a single proxy or a list for round-robin rotation. Supports auth (http://user:pass@host:port), SOCKS5 (socks5://host:port), and residential proxy lists.
Per-request override: pass options.proxy for that request.
Formats:
http://host:port or https://host:port
http://user:pass@host:port (auth)
socks5://host:port or socks5://user:pass@host:port
CLI:
# Single proxy
openscrape crawl https://example.com --proxy http://user:pass@proxy.example.com:8080 -o out.json
# Rotating list (comma-separated)
openscrape crawl https://example.com --proxy "http://p1:8080,http://p2:8080,socks5://p3:1080" -o out.json
# Batch with proxy list
openscrape batch urls.txt --proxy "http://user:pass@residential.example.com:8080" --output-dir ./out
Programmatic:
// Single proxy at construction
const scraper = new OpenScrape({
  proxy: 'http://user:pass@proxy.example.com:8080',
  maxConcurrency: 3,
});
// Or pass an array for rotation (a separate instance, so it needs its own name)
const rotatingScraper = new OpenScrape({
  proxy: ['http://p1:8080', 'socks5://p2:1080', 'http://user:pass@p3:8080'],
});
// Per-scrape override
const data = await scraper.scrape({
  url: 'https://example.com',
  proxy: 'socks5://localhost:1080',
});
Low-level: use parseProxyString(), normalizeProxyInput(), and ProxyPool from the package for custom rotation logic.
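For readers who want to see what round-robin rotation over the formats above amounts to, here is a minimal standalone sketch. `parseProxy` and `RoundRobinProxies` are hypothetical names and are not the package's parseProxyString() or ProxyPool, whose signatures are not shown in this document.

```typescript
interface ProxyConfig {
  protocol: string;
  host: string;
  port: number;
  username?: string;
  password?: string;
}

// Hypothetical parser built on the WHATWG URL class.
// Note: URL drops a scheme's default port (e.g. 80 for http), so port is 0 in that case.
function parseProxy(proxy: string): ProxyConfig {
  const u = new URL(proxy);
  const config: ProxyConfig = {
    protocol: u.protocol.replace(':', ''),
    host: u.hostname,
    port: Number(u.port),
  };
  if (u.username) {
    config.username = decodeURIComponent(u.username);
    config.password = decodeURIComponent(u.password);
  }
  return config;
}

// Round-robin rotation: each pick() returns the next proxy and wraps at the end.
class RoundRobinProxies {
  private readonly proxies: ProxyConfig[];
  private nextIndex = 0;

  constructor(proxies: ProxyConfig[]) {
    if (proxies.length === 0) throw new Error('at least one proxy is required');
    this.proxies = proxies;
  }

  pick(): ProxyConfig {
    const proxy = this.proxies[this.nextIndex];
    this.nextIndex = (this.nextIndex + 1) % this.proxies.length;
    return proxy;
  }
}

const pool = new RoundRobinProxies(
  ['http://p1:8080', 'socks5://user:pass@p2:1080'].map(parseProxy),
);
console.log(pool.pick().host, pool.pick().host, pool.pick().host); // p1 p2 p1
```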
crawl <URL>
Scrape a single URL and save to file.
Options:
-o, --output <path> - Output file path (default: output.json)
--no-render - Disable JavaScript rendering
--format <format> - Output format: json, markdown, html, text, csv, or yaml (default: json)
--wait-time <ms> - Wait time after page load (default: 2000)
--max-depth <number> - Maximum pagination depth (default: 10)
--next-selector <selector> - CSS selector for next link
--timeout <ms> - Request timeout (default: 30000)
--user-agent <ua> - Custom user agent string
--llm-extract - Use local LLM (Ollama/LM Studio) to extract structured data
--llm-endpoint <url> - LLM endpoint (e.g. http://localhost:11434 for Ollama)
--llm-model <name> - Model name for LLM extraction (default: llama2)
--auto-detect-schema - Auto-detect extraction schema from the page
--proxy <url> - Proxy URL or comma-separated list for rotation (http://user:pass@host:port, socks5://host:port)
Example:
openscrape crawl https://example.com/article \
--output article.md \
--format markdown \
--max-depth 5
batch <file>
Scrape multiple URLs from a file (one URL per line).
Options:
-o, --output-dir <path> - Output directory (default: ./output)
--no-render - Disable JavaScript rendering
--format <format> - Output format: json, markdown, html, text, csv, or yaml (default: json)
--wait-time <ms> - Wait time after page load (default: 2000)
--max-depth <number> - Maximum pagination depth (default: 10)
--timeout <ms> - Request timeout (default: 30000)
--max-concurrency <number> - Maximum concurrent requests (default: 3)
--llm-extract - Use local LLM to extract structured data per URL
--llm-endpoint <url> - LLM endpoint URL
--llm-model <name> - Model name (default: llama2)
--auto-detect-schema - Auto-detect extraction schema from each page
--proxy <url> - Proxy URL or comma-separated list for rotation
Example:
openscrape batch urls.txt \
--output-dir ./scraped \
--format markdown \
--max-concurrency 5
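The batch command reads one URL per line. If you generate or clean that file yourself, a pre-filter like the following sketch can help. Skipping blank lines, "#" comments, and non-HTTP(S) entries is an assumption made here for convenience; how the CLI itself treats such lines is not documented above, and `parseUrlList` is a hypothetical helper.

```typescript
// Assumption: blank lines, lines starting with "#", and anything that is not a
// valid http(s) URL are dropped. The CLI's own handling may differ.
function parseUrlList(text: string): string[] {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line !== '' && !line.startsWith('#'))
    .filter((line) => {
      try {
        const { protocol } = new URL(line);
        return protocol === 'http:' || protocol === 'https:';
      } catch {
        return false; // not parseable as a URL
      }
    });
}

const urls = parseUrlList('https://a.example/1\n\n# note\nnot a url\n  https://b.example/2  \n');
console.log(urls); // [ 'https://a.example/1', 'https://b.example/2' ]
```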
serve
Start the REST API server.
Options:
-p, --port <number> - Port number (default: 3000)
--host <host> - Host address (default: 0.0.0.0)
Example:
openscrape serve --port 8080
POST /crawl
Scrape a URL asynchronously.
Request:
{
"url": "https://example.com/article",
"options": {
"render": true,
"format": "json",
"maxDepth": 5
}
}
Response:
{
"jobId": "uuid-here",
"status": "pending",
"url": "https://example.com/article"
}
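From TypeScript, the request above can be assembled with a small helper and sent with fetch. This is a sketch under the assumption that the server accepts exactly the { url, options } body shown; `CrawlOptions` lists only the three option fields from the example, and `buildCrawlRequest` is a hypothetical name.

```typescript
// Only the option fields that appear in the request example above.
interface CrawlOptions {
  render?: boolean;
  format?: string;
  maxDepth?: number;
}

// Builds the init object for fetch(); the body mirrors the documented request.
function buildCrawlRequest(url: string, options: CrawlOptions = {}) {
  return {
    method: 'POST' as const,
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url, options }),
  };
}

const crawlInit = buildCrawlRequest('https://example.com/article', { render: true, maxDepth: 5 });
console.log(crawlInit.body);
// {"url":"https://example.com/article","options":{"render":true,"maxDepth":5}}

// Usage against a locally running server (not executed here):
// const res = await fetch('http://localhost:3000/crawl', crawlInit);
// const { jobId } = await res.json();
```

The returned jobId is what you pass to GET /status/:jobId or to a WebSocket subscribe message.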
GET /status/:jobId
Get the status and result of a crawl job.
Response:
{
"id": "uuid-here",
"status": "completed",
"url": "https://example.com/article",
"createdAt": "2024-01-01T00:00:00.000Z",
"completedAt": "2024-01-01T00:00:05.000Z",
"result": {
"url": "https://example.com/article",
"title": "Article Title",
"content": "...",
"markdown": "...",
"timestamp": "2024-01-01T00:00:05.000Z"
}
}
GET /jobs
List all crawl jobs.
GET /health
Health check endpoint.
GET /about
Credits and repository info. Returns: { name, version, by, repository } (e.g. by: John F. Gonzales, repository: https://github.com/RantsRoamer/OpenScrape).
git clone https://github.com/RantsRoamer/OpenScrape.git
cd OpenScrape
npm install
npm run build
npm test
npm run lint
This project is licensed under the MIT License - see the LICENSE file for details.