openscrape

Open-source web scraping library with headless browser support, pagination, and data extraction

Source: npm
Version: 1.0.8 (latest)
Maintainers: 1

OpenScrape

License: MIT · Node.js 18+

OpenScrape is a fully open-source web scraping library that mimics the core features of commercial scraping APIs. Built with TypeScript for Node.js 18+, it provides headless browser rendering, automatic pagination detection, clean data extraction, and both CLI and REST API interfaces.

Features

  • 🚀 Headless Browser Rendering - Full JavaScript rendering using Playwright
  • 📄 Pagination & Navigation - Automatic detection of "next" links and "load more" buttons
  • 🧹 Data Extraction & Normalization - Clean markdown or JSON output with noise removal
  • Rate Limiting & Concurrency - Safe request throttling with exponential backoff
  • 🖥️ CLI Interface - Easy-to-use command-line tools
  • 🌐 REST API - HTTP endpoints for programmatic access
  • 📡 WebSocket - Real-time job status updates over WebSocket
  • 📁 Media handling - Download images to an organized folder; optional base64-embed small images in JSON
  • 🔧 Extensible - Custom extraction schemas and pagination callbacks

Installation

npm install openscrape

Or install globally for CLI usage:

npm install -g openscrape

Important: After installation, you need to install Playwright browsers:

npx playwright install chromium

This downloads the Chromium browser required for headless rendering.

Docker

You can run OpenScrape in a container with no local Node or Playwright install.

Build the image:

docker build -t openscrape .

Run the API server (default; port 3000):

docker run -p 3000:3000 --init openscrape

Or with Docker Compose:

docker compose up --build

Then scrape via the API:

curl -X POST http://localhost:3000/crawl \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/article"}'

Use the CLI inside the container (crawl a URL, write output to a mounted volume):

docker run --rm -v "$(pwd)/out:/out" openscrape crawl https://example.com/article -o /out/article.json

Batch scrape (mount a file with URLs and an output directory):

docker run --rm -v "$(pwd)/urls.txt:/app/urls.txt" -v "$(pwd)/scraped:/out" openscrape batch /app/urls.txt --output-dir /out --format markdown

Custom command (override the default serve):

docker run --rm openscrape crawl https://example.com -o /tmp/out.json --format json

The image includes Chromium and its dependencies; the default command is serve --port 3000 --host 0.0.0.0. Use --init to avoid zombie processes. For large workloads, you may need to increase memory for the container.

Quick Start

CLI Usage

Scrape a single URL:

openscrape crawl https://example.com/article --output article.json

Scrape multiple URLs from a file:

openscrape batch urls.txt --output-dir ./scraped --format markdown

Start the API server:

openscrape serve --port 3000

Programmatic Usage

import { OpenScrape } from 'openscrape';

const scraper = new OpenScrape();

// Scrape a single URL
const data = await scraper.scrape({
  url: 'https://example.com/article',
  render: true,
  format: 'json',
  extractImages: true,
});

console.log(data.title);
console.log(data.content);
console.log(data.markdown);

await scraper.close();

REST API

Start the server:

openscrape serve

Scrape a URL:

curl -X POST http://localhost:3000/crawl \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/article"}'

Check job status:

curl http://localhost:3000/status/{jobId}

WebSocket (real-time updates)

When you run openscrape serve, the server also exposes a WebSocket endpoint at path /ws. Connect to receive real-time events for crawl jobs.

Endpoint: ws://localhost:3000/ws (or wss:// in production with TLS)

Subscribe to a job: Send a JSON message:

{ "type": "subscribe", "jobId": "<jobId>" }

Unsubscribe:

{ "type": "unsubscribe", "jobId": "<jobId>" }

Server events (you receive JSON):

| Event | When |
| --- | --- |
| job:created | A new crawl job was created |
| job:processing | Scraping has started |
| job:completed | Scraping finished; job.result has the data |
| job:failed | Scraping failed; job.error has the message |

Example message:

{
  "event": "job:completed",
  "jobId": "abc-123",
  "job": {
    "id": "abc-123",
    "url": "https://example.com/article",
    "status": "completed",
    "result": { "url": "...", "title": "...", "content": "...", "markdown": "..." },
    "createdAt": "2025-01-15T12:00:00.000Z",
    "completedAt": "2025-01-15T12:00:05.000Z"
  },
  "timestamp": "2025-01-15T12:00:05.000Z"
}

Minimal client example (Node):

const WebSocket = require('ws');
const ws = new WebSocket('ws://localhost:3000/ws');

ws.on('open', () => {
  // First POST /crawl to get jobId, then:
  ws.send(JSON.stringify({ type: 'subscribe', jobId: 'YOUR_JOB_ID' }));
});
ws.on('message', (data) => {
  // ws may deliver the payload as a Buffer; convert it to a string before parsing
  const msg = JSON.parse(data.toString());
  console.log(msg.event, msg.job?.status, msg.job?.result?.title);
});
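
The server events listed above can be validated before use, so that a malformed or unexpected message does not crash the client. The following is a minimal sketch: `parseJobEvent` and the `JobEvent` interface are hypothetical names, not exports of openscrape, and the field shapes are inferred from the example message.

```typescript
// Event names come from the events table; field shapes follow the example message.
type JobEventName = "job:created" | "job:processing" | "job:completed" | "job:failed";

interface JobEvent {
  event: JobEventName;
  jobId: string;
  job: { id: string; url: string; status: string; result?: unknown; error?: string };
  timestamp: string;
}

const EVENT_NAMES: readonly string[] = [
  "job:created",
  "job:processing",
  "job:completed",
  "job:failed",
];

// Hypothetical helper (not part of openscrape): returns a typed event,
// or null when the payload is not valid JSON or not a known job event.
function parseJobEvent(raw: string): JobEvent | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof value !== "object" || value === null) return null;
  const msg = value as Record<string, unknown>;
  if (typeof msg.event !== "string" || EVENT_NAMES.indexOf(msg.event) === -1) return null;
  if (typeof msg.jobId !== "string") return null;
  if (typeof msg.job !== "object" || msg.job === null) return null;
  return value as JobEvent;
}
```

In the client above, the message handler could call `parseJobEvent(data.toString())` and ignore `null` results.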

Configuration

Scrape Options

interface ScrapeOptions {
  url: string;                    // URL to scrape (required)
  render?: boolean;                // Enable JS rendering (default: true)
  waitTime?: number;              // Wait time after load in ms (default: 2000)
  maxDepth?: number;              // Max pagination depth (default: 10)
  nextSelector?: string;          // Custom CSS selector for next link
  paginationCallback?: Function;  // Custom pagination detection
  format?: 'json' | 'markdown' | 'html' | 'text' | 'csv' | 'yaml';  // Output format (default: 'json')
  extractionSchema?: object;      // Custom extraction schema
  autoDetectSchema?: boolean;     // Auto-detect schema from page (opt-in)
  schemaSamples?: string[];       // Sample URLs for schema detection (optional)
  llmExtract?: boolean;           // Use local LLM to extract structured JSON
  llmEndpoint?: string;           // Ollama or LM Studio endpoint URL
  llmModel?: string;              // Model name (default: 'llama2')
  userAgent?: string;             // Custom user agent
  proxy?: string | string[];      // Override proxy for this request (single URL or list for rotation)
  timeout?: number;               // Request timeout in ms (default: 30000)
  extractImages?: boolean;        // Extract images (default: true)
  extractMedia?: boolean;         // Extract embedded media (default: false)
  downloadMedia?: boolean;       // Download images to a local folder (default: false)
  mediaOutputDir?: string;       // Folder for downloads (default: ./media)
  base64EmbedImages?: boolean;   // Embed small images as base64 in JSON (default: false)
  base64EmbedMaxBytes?: number;  // Max size for embedding in bytes (default: 51200)
}

Media & asset handling

You can save images (and other assets) locally and optionally embed small images as base64 in JSON.

Download media to a folder (organized by site and path):

openscrape crawl https://example.com/article --output article.json --download-media --media-dir ./media

Folder structure: mediaOutputDir / hostname / path_slug / image_0.jpg, e.g. ./media/example.com/article/image_0.jpg.
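
To make the folder layout concrete, here is a small sketch that builds a path of that shape from a page URL. `mediaPathFor` is a hypothetical helper, and the slug rule (unsafe characters collapsed to `_`, `index` for the site root) is an assumption; the package may derive the slug differently.

```typescript
// Hypothetical helper mirroring the documented layout:
// mediaOutputDir / hostname / path_slug / image_N.ext
// The slug rule below is an assumption, not the package's documented behavior.
function mediaPathFor(mediaDir: string, pageUrl: string, index: number, ext: string): string {
  const { hostname, pathname } = new URL(pageUrl);
  const slug =
    pathname
      .replace(/^\/+|\/+$/g, "") // trim leading and trailing slashes
      .replace(/[^a-zA-Z0-9-]+/g, "_") || // collapse unsafe characters
    "index"; // fallback for the site root
  const dir = mediaDir.replace(/\/+$/, ""); // avoid a double slash
  return `${dir}/${hostname}/${slug}/image_${index}.${ext}`;
}

console.log(mediaPathFor("./media", "https://example.com/article", 0, "jpg"));
// → ./media/example.com/article/image_0.jpg
```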

Base64-embed small images in JSON (for self-contained output or small thumbnails):

openscrape crawl https://example.com/article --output article.json --embed-images --embed-images-max-size 51200

  • Only images under the size limit (default 50KB) are embedded.
  • Result includes mediaEmbedded: [{ url, dataUrl, mimeType }] with data:image/...;base64,... URLs.
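
The size check and the data URL format described above can be sketched as follows. `toDataUrlIfSmall` is a hypothetical helper, not the package's implementation, and treating the limit as exclusive ("under" the limit) is an assumption.

```typescript
// Hypothetical helper: returns a data: URL for images strictly under maxBytes,
// otherwise null. The default matches the documented 51200-byte limit.
function toDataUrlIfSmall(
  bytes: Uint8Array,
  mimeType: string,
  maxBytes = 51200,
): string | null {
  if (bytes.length >= maxBytes) return null;
  // Build a binary string byte by byte so btoa can encode it.
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return `data:${mimeType};base64,${btoa(binary)}`;
}

console.log(toDataUrlIfSmall(new Uint8Array([72, 105]), "image/png"));
// → data:image/png;base64,SGk=
```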

Programmatic usage:

const data = await scraper.scrape({
  url: 'https://example.com/article',
  downloadMedia: true,
  mediaOutputDir: './media',
  base64EmbedImages: true,
  base64EmbedMaxBytes: 51200,
});
// data.images       → original URLs
// data.mediaDownloads → [{ url, localPath, mimeType }]
// data.mediaEmbedded  → [{ url, dataUrl, mimeType }]

Output formats

Besides json and markdown, OpenScrape can output:

| Format | Use case |
| --- | --- |
| html | Cleaned HTML (no scripts/nav); good for archiving or re-rendering. |
| text | Plain text only; good for search indexes or NLP. |
| csv | List/table-like pages: first <table> as rows; otherwise one row with url, title, author, content. |
| yaml | Full structured data (url, title, author, content, images, etc.) in YAML. |

Examples:

openscrape crawl https://example.com/article -o page.html --format html
openscrape crawl https://example.com/table -o data.csv --format csv
openscrape crawl https://example.com/article -o meta.yaml --format yaml

Programmatic usage: the scraper always returns full ScrapedData; use the formatters for string output:

import { OpenScrape, toHtml, toText, toCsv, toYaml } from 'openscrape';

const scraper = new OpenScrape();
const data = await scraper.scrape({ url: 'https://example.com/article' });
await scraper.close();

const htmlString = toHtml(data);
const textString = toText(data);
const csvString = toCsv(data);
const yamlString = toYaml(data);
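
For intuition about the csv fallback (one row with url, title, author, content), the sketch below shows how such a row could be quoted safely. It is not the package's `toCsv()`; `csvField`, `articleToCsv`, and the choice to emit a header row are assumptions made for illustration.

```typescript
// RFC 4180-style quoting: wrap a field in quotes when it contains a comma,
// a double quote, or a line break, and double any embedded quotes.
function csvField(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

interface ArticleRow {
  url: string;
  title?: string;
  author?: string;
  content?: string;
}

// Hypothetical sketch of the single-row fallback; emitting a header is an assumption.
function articleToCsv(row: ArticleRow): string {
  const header = ["url", "title", "author", "content"];
  const values = [row.url, row.title ?? "", row.author ?? "", row.content ?? ""];
  return `${header.join(",")}\r\n${values.map(csvField).join(",")}\r\n`;
}
```

Quoting matters most for the content column, since article text routinely contains commas and line breaks.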

LLM-based extraction (Ollama / LM Studio)

You can send the cleaned HTML or Markdown to a local LLM and get structured JSON (title, author, publishDate, content, metadata). Useful when pages have irregular structure.

Requirements: A local endpoint such as Ollama or LM Studio.

CLI:

# Ollama (default endpoint http://localhost:11434)
openscrape crawl https://example.com/article -o out.json --llm-extract --llm-model llama2

# Custom Ollama or LM Studio endpoint
openscrape crawl https://example.com/article -o out.json --llm-extract \
  --llm-endpoint http://localhost:1234/v1 --llm-model my-model

Programmatic:

const data = await scraper.scrape({
  url: 'https://example.com/article',
  llmExtract: true,
  llmEndpoint: 'http://localhost:11434',  // Ollama
  llmModel: 'llama2',
});
// data is merged with LLM-extracted fields; on error, data.metadata.llmError is set

  • Ollama: use base URL (e.g. http://localhost:11434); the client calls /api/generate.
  • LM Studio: use the chat completions URL (e.g. http://localhost:1234/v1); the client calls /v1/chat/completions.
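
The two endpoint styles above lead to different request URLs and bodies. The sketch below shows one way to build them. `buildLlmRequest` is a hypothetical helper, and choosing the API style from a trailing `/v1` is an assumption made here; the package may decide differently. The body shapes follow the public Ollama generate API and the OpenAI-compatible chat completions format.

```typescript
interface LlmRequest {
  url: string;
  body: Record<string, unknown>;
}

// Hypothetical helper. Detecting the API style from a trailing "/v1" is an
// assumption of this sketch, not documented openscrape behavior.
function buildLlmRequest(endpoint: string, model: string, prompt: string): LlmRequest {
  const base = endpoint.replace(/\/+$/, ""); // drop trailing slashes
  if (/\/v1$/.test(base)) {
    // OpenAI-compatible chat completions (e.g. LM Studio)
    return {
      url: `${base}/chat/completions`,
      body: { model, messages: [{ role: "user", content: prompt }] },
    };
  }
  // Ollama generate API; stream: false requests a single JSON response
  return {
    url: `${base}/api/generate`,
    body: { model, prompt, stream: false },
  };
}

console.log(buildLlmRequest("http://localhost:11434", "llama2", "Extract the title").url);
// → http://localhost:11434/api/generate
```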

Auto-detect schema (opt-in)

With autoDetectSchema: true, OpenScrape infers an extraction schema from the page (e.g. title from <title> or og:title, content from article or .content). Use it when you don’t have a custom schema.

CLI:

openscrape crawl https://example.com/article -o out.json --auto-detect-schema

Programmatic:

const data = await scraper.scrape({
  url: 'https://example.com/article',
  autoDetectSchema: true,
});

You can also use the schema detector directly:

import { detectSchemaFromHtml } from 'openscrape';

const { schema, confidence, suggestions } = detectSchemaFromHtml(htmlString);

Custom Extraction Schema

const schema = {
  title: '.article-title',
  author: '.author-name',
  publishDate: '.publish-date',
  content: '.article-body',
  custom: [
    {
      name: 'category',
      selector: '.category',
    },
    {
      name: 'views',
      selector: '.views',
      transform: (value: string) => parseInt(value, 10),
    },
  ],
};

const data = await scraper.scrape({
  url: 'https://example.com/article',
  extractionSchema: schema,
});
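
One caveat about the `views` transform above: `parseInt` stops at the first non-digit character, so a value such as `"1,234 views"` parses as `1`. A more tolerant transform could look like the sketch below. `parseCount` is a hypothetical helper and would still misread abbreviated values such as `"1.2k"`.

```typescript
// Hypothetical transform for count-like fields: keep only the digits, then parse.
// Returns null when the text contains no digits at all.
function parseCount(value: string): number | null {
  const digits = value.replace(/[^0-9]/g, "");
  return digits.length > 0 ? parseInt(digits, 10) : null;
}

console.log(parseInt("1,234 views", 10)); // → 1
console.log(parseCount("1,234 views")); // → 1234
```

It can be passed as `transform: parseCount` in a custom field definition.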

Rate Limiting

const scraper = new OpenScrape({
  maxRequestsPerSecond: 5,
  maxConcurrency: 3,
});
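
To show what these settings imply, here is a small sketch of the arithmetic behind request throttling and the exponential backoff mentioned in the feature list. The function names, the 500 ms base delay, and the 30 s cap are assumptions for illustration; the package's actual values are not documented here.

```typescript
// Minimum gap between request starts for a given requests-per-second cap.
function minIntervalMs(maxRequestsPerSecond: number): number {
  return 1000 / maxRequestsPerSecond;
}

// Exponential backoff: the delay doubles with each failed attempt, up to a cap.
// baseMs and capMs are illustrative defaults, not openscrape's.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

console.log(minIntervalMs(5)); // → 200
console.log([0, 1, 2, 3].map((n) => backoffDelayMs(n))); // → [ 500, 1000, 2000, 4000 ]
```

With `maxRequestsPerSecond: 5`, requests start at least 200 ms apart, while `maxConcurrency: 3` separately caps how many are in flight at once.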

Proxy support (rotating & residential)

Use a single proxy or a list for round-robin rotation. Supports auth (http://user:pass@host:port), SOCKS5 (socks5://host:port), and residential proxy lists.

  • Constructor: set a default proxy for all scrapes (single URL or array).
  • Per-scrape: override with options.proxy for that request.
  • Retries: on 403, 429, or timeout, the next proxy in the list is tried automatically.

Formats:

  • http://host:port or https://host:port
  • http://user:pass@host:port (auth)
  • socks5://host:port or socks5://user:pass@host:port

CLI:

# Single proxy
openscrape crawl https://example.com --proxy http://user:pass@proxy.example.com:8080 -o out.json

# Rotating list (comma-separated)
openscrape crawl https://example.com --proxy "http://p1:8080,http://p2:8080,socks5://p3:1080" -o out.json

# Batch with proxy list
openscrape batch urls.txt --proxy "http://user:pass@residential.example.com:8080" --output-dir ./out

Programmatic:

// Single proxy or rotating list at construction
const scraper = new OpenScrape({
  proxy: 'http://user:pass@proxy.example.com:8080',
  maxConcurrency: 3,
});

// Or pass an array for rotation (named differently so both examples can coexist)
const rotatingScraper = new OpenScrape({
  proxy: ['http://p1:8080', 'socks5://p2:1080', 'http://user:pass@p3:8080'],
});

// Per-scrape override
const data = await scraper.scrape({
  url: 'https://example.com',
  proxy: 'socks5://localhost:1080',
});

Low-level: use parseProxyString(), normalizeProxyInput(), and ProxyPool from the package for custom rotation logic.
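
For intuition about the round-robin rotation described in this section, the sketch below parses proxy strings in the listed formats and cycles through a list. `parseProxy`, `splitProxyList`, and `RoundRobin` are hypothetical names written for this example; the package's own `parseProxyString()` and `ProxyPool` may behave differently.

```typescript
interface ParsedProxy {
  protocol: string;
  host: string;
  port: number;
  username?: string;
  password?: string;
}

// Hypothetical parser built on the standard URL class.
// Note: URL reports an empty port for scheme defaults (e.g. 80 for http), which becomes 0 here.
function parseProxy(value: string): ParsedProxy {
  const u = new URL(value);
  const parsed: ParsedProxy = {
    protocol: u.protocol.replace(/:$/, ""),
    host: u.hostname,
    port: Number(u.port),
  };
  if (u.username) parsed.username = decodeURIComponent(u.username);
  if (u.password) parsed.password = decodeURIComponent(u.password);
  return parsed;
}

// Splits a comma-separated list such as the one passed to --proxy.
function splitProxyList(arg: string): string[] {
  return arg
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Minimal round-robin cursor over a fixed list.
class RoundRobin<T> {
  private readonly items: readonly T[];
  private index = 0;

  constructor(items: readonly T[]) {
    if (items.length === 0) throw new Error("at least one item is required");
    this.items = items;
  }

  next(): T {
    const item = this.items[this.index];
    this.index = (this.index + 1) % this.items.length;
    return item;
  }
}

const proxies = new RoundRobin(
  splitProxyList("http://p1:8080,http://p2:8080,socks5://p3:1080"),
);
console.log(parseProxy(proxies.next()).host); // → p1
```

A retry loop would call `next()` after a 403, 429, or timeout to move on to the following proxy.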

CLI Commands

crawl <URL>

Scrape a single URL and save to file.

Options:

  • -o, --output <path> - Output file path (default: output.json)
  • --no-render - Disable JavaScript rendering
  • --format <format> - Output format: json, markdown, html, text, csv, or yaml (default: json)
  • --wait-time <ms> - Wait time after page load (default: 2000)
  • --max-depth <number> - Maximum pagination depth (default: 10)
  • --next-selector <selector> - CSS selector for next link
  • --timeout <ms> - Request timeout (default: 30000)
  • --user-agent <ua> - Custom user agent string
  • --llm-extract - Use local LLM (Ollama/LM Studio) to extract structured data
  • --llm-endpoint <url> - LLM endpoint (e.g. http://localhost:11434 for Ollama)
  • --llm-model <name> - Model name for LLM extraction (default: llama2)
  • --auto-detect-schema - Auto-detect extraction schema from the page
  • --proxy <url> - Proxy URL or comma-separated list for rotation (http://user:pass@host:port, socks5://host:port)

Example:

openscrape crawl https://example.com/article \
  --output article.md \
  --format markdown \
  --max-depth 5

batch <file>

Scrape multiple URLs from a file (one URL per line).

Options:

  • -o, --output-dir <path> - Output directory (default: ./output)
  • --no-render - Disable JavaScript rendering
  • --format <format> - Output format: json, markdown, html, text, csv, or yaml (default: json)
  • --wait-time <ms> - Wait time after page load (default: 2000)
  • --max-depth <number> - Maximum pagination depth (default: 10)
  • --timeout <ms> - Request timeout (default: 30000)
  • --max-concurrency <number> - Maximum concurrent requests (default: 3)
  • --llm-extract - Use local LLM to extract structured data per URL
  • --llm-endpoint <url> - LLM endpoint URL
  • --llm-model <name> - Model name (default: llama2)
  • --auto-detect-schema - Auto-detect extraction schema from each page
  • --proxy <url> - Proxy URL or comma-separated list for rotation

Example:

openscrape batch urls.txt \
  --output-dir ./scraped \
  --format markdown \
  --max-concurrency 5

serve

Start the REST API server.

Options:

  • -p, --port <number> - Port number (default: 3000)
  • --host <host> - Host address (default: 0.0.0.0)

Example:

openscrape serve --port 8080

REST API Endpoints

POST /crawl

Scrape a URL asynchronously.

Request:

{
  "url": "https://example.com/article",
  "options": {
    "render": true,
    "format": "json",
    "maxDepth": 5
  }
}

Response:

{
  "jobId": "uuid-here",
  "status": "pending",
  "url": "https://example.com/article"
}

GET /status/:jobId

Get the status and result of a crawl job.

Response:

{
  "id": "uuid-here",
  "status": "completed",
  "url": "https://example.com/article",
  "createdAt": "2024-01-01T00:00:00.000Z",
  "completedAt": "2024-01-01T00:00:05.000Z",
  "result": {
    "url": "https://example.com/article",
    "title": "Article Title",
    "content": "...",
    "markdown": "...",
    "timestamp": "2024-01-01T00:00:05.000Z"
  }
}
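
Because POST /crawl is asynchronous, a client that does not use the WebSocket typically polls GET /status/:jobId until the job finishes. The sketch below shows such a loop. `pollJob` and `isTerminal` are hypothetical helpers; the `pending` and `completed` statuses appear in the responses above, while `processing` and `failed` are assumed from the WebSocket event names.

```typescript
// "pending" and "completed" appear in the documented responses;
// "processing" and "failed" are assumed from the WebSocket event names.
type JobStatus = "pending" | "processing" | "completed" | "failed";

interface Job {
  id: string;
  status: JobStatus;
  result?: unknown;
  error?: string;
}

function isTerminal(status: JobStatus): boolean {
  return status === "completed" || status === "failed";
}

// Hypothetical helper: getJob is injected so the loop works with any HTTP client
// and can be exercised without a running server.
async function pollJob(
  getJob: () => Promise<Job>,
  intervalMs = 1000,
  maxAttempts = 60,
): Promise<Job> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const job = await getJob();
    if (isTerminal(job.status)) return job;
    // Wait before asking again.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`job did not finish after ${maxAttempts} attempts`);
}
```

On Node.js 18+, `getJob` could wrap the built-in `fetch`, requesting `http://localhost:3000/status/<jobId>` and returning the parsed JSON body.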

GET /jobs

List all crawl jobs.

GET /health

Health check endpoint.

GET /about

Credits and repository info. Returns: { name, version, by, repository } (e.g. by: John F. Gonzales, repository: https://github.com/RantsRoamer/OpenScrape).

Development

Prerequisites

  • Node.js 18+
  • npm or yarn

Setup

git clone https://github.com/RantsRoamer/OpenScrape.git
cd OpenScrape
npm install

Build

npm run build

Test

npm test

Lint

npm run lint

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Inspired by Firecrawl API
  • Built with Playwright
  • Uses Turndown for HTML to Markdown conversion

Roadmap

Keywords

web-scraping

Package last updated on 14 Feb 2026
