crawlex

Stealth crawler with Chrome-perfect TLS/H2 fingerprint, render pool, hooks, persistent queue

Version 1.0.4 (latest) · Source: npm · Weekly downloads: 180 (-17.81%) · Maintainers: 1

🕸️ crawlex

The stealth crawler that actually looks like Chrome.

TLS, HTTP/2, JS fingerprint: every byte indistinguishable from real Chrome 149.
Rust core • Node SDK • Lua hooks • cross-platform binaries.


pnpm add -g crawlex && crawlex pages run --seed https://example.com --method render

Quickstart · Features · Examples · Docs · Why crawlex

⚡ Why crawlex

Standard crawlers fail on the first Cloudflare wall. crawlex arrives the way real Chrome arrives: every fingerprint surface is identical, not approximated.

| Layer | What we match (exactly, not approximately) |
| --- | --- |
| 🔐 TLS ClientHello | Extension order, ALPS, GREASE values, permute_extensions, X25519MLKEM768, signature algorithms; verified against the tls.peet.ws and ja4db.com oracles |
| 🚦 HTTP/2 frame | Pseudo-header order :method :authority :scheme :path, SETTINGS frame parameters, WINDOW_UPDATE pattern; passes Akamai BMP signature checks |
| 🎭 JS fingerprint | 29-section stealth shim: navigator, chrome.*, permissions, plugins, screen, timezone, battery, WebGL (vendor / params / extensions), canvas (zero-preserving noise), AudioContext (FFT + offline render), Function.prototype.toString proxy, WebGPU, performance.memory, sensors, iframe, requestAnimationFrame throttle, performance.now() 100µs grain, mediaDevices, fonts, WebRTC SDP/ICE/getStats scrub |
| 🤖 Behavior | Mouse jitter, scroll cadence, dwell time, idle drift; coherent motion:: profiles per persona |
| 📦 Catalog | 30 Chrome stable × 30 Chromium × 20 Firefox × Edge × Safari fingerprints. Era-fallback resolution: ask for chrome-149-linux, get the closest captured profile |
| 🛠️ Worker scope | Same shim auto-attached to dedicated / shared / service workers via CDP Target.setAutoAttach (Camoufox port) |

→ Validated against BrowserScan, CreepJS, Sannysoft, tls.peet.ws, ja4db.com.

🚀 Install

# npm: bundled binary download via postinstall
pnpm add -g crawlex

# Rust: from source
cargo install crawlex

# Direct binary (linux x86_64/arm64, macOS x86_64/arm64, windows x86_64)
# https://github.com/forattini-dev/crawlex/releases/latest

⚠️ Production crawls run locally, never in CI. Datacenter IPs (GitHub Actions, AWS, Azure) are flagged instantly by every modern WAF.

🏃 Quickstart

# Stealth render with persona, sitemap discovery, NDJSON event stream
crawlex pages run \
  --seed https://target.com \
  --method render \
  --persona atlas \
  --max-depth 3 \
  --screenshot \
  --emit ndjson > events.ndjson

# Live tail what just happened
jq -c 'select(.event == "fetch.completed" or .event == "render.completed")' events.ndjson

Three integration paths, your pick:

CLI (one-shot crawls, scripted pipelines):

crawlex pages run \
  --seed https://... \
  --method render \
  --persona pixel \
  --emit ndjson

Node SDK (production services with hook logic):

import { crawl, defineHooks } from 'crawlex';

for await (const ev of crawl({
  seeds: ['https://...'],
  args: { method: 'render' },
})) { ... }

Embedded Rust (in-process embedding, zero IPC):

use crawlex::{Crawler, Config};
let crawler = Crawler::new(
    Config::builder().build()?
)?;
crawler.run().await?;

🎨 Examples

1. Hunt a SaaS product page with vitals + screenshot

import { crawl } from 'crawlex';

for await (const ev of crawl({
  seeds: ['https://stripe.com/pricing'],
  args: {
    method: 'render',
    persona: 'atlas',                 // macOS Apple M1, Retina, en-US
    screenshot: true,
    screenshotMode: 'fullpage',
    storage: 'filesystem',
    storagePath: './out',
    waitStrategy: '{"NetworkIdle":{"idle_ms":1500}}',
  },
})) {
  if (!('event' in ev)) continue;
  switch (ev.event) {
    case 'render.completed':
      console.log(`✅ ${ev.url} | LCP=${ev.data.vitals.largest_contentful_paint_ms}ms | CLS=${ev.data.vitals.cumulative_layout_shift}`);
      break;
    case 'artifact.saved':
      if (ev.data.kind === 'screenshot.full_page')
        console.log(`📸 → out/${ev.data.path}  (${(ev.data.size/1024).toFixed(0)}kB)`);
      break;
    case 'challenge.detected':
      console.log(`🚧 ${ev.data.vendor} (${ev.data.level}) on ${ev.url}`);
      break;
  }
}

2. Crawl an entire domain with proxy rotation + retry policy

import { crawl, defineHooks } from 'crawlex';

const hooks = defineHooks({
  // Rate-limit retry: 429/503 → re-enqueue (up to retry_max)
  async onAfterFirstByte(ctx) {
    if (ctx.response_status === 429 || ctx.response_status === 503) return 'retry';
    return 'continue';
  },
  // Inject the canonical sitemap.xml for every host we touch
  async onDiscovery(ctx) {
    const host = new URL(ctx.url).host;
    return {
      decision: 'continue',
      patch: { capturedUrls: [...ctx.captured_urls, `https://${host}/sitemap.xml`] },
    };
  },
  // Tag the crawl with custom metadata that lands in user_data
  async onJobStart(ctx) {
    return {
      decision: 'continue',
      patch: { userData: { ...ctx.user_data, run_owner: 'qa-bot' } },
    };
  },
});

for await (const ev of crawl({
  seeds: ['https://target.com'],
  args: {
    method: 'auto',                   // policy engine picks http vs render
    maxConcurrentHttp: 8,
    maxConcurrentRender: 2,
    maxDepth: 5,
    crtsh: true,                      // certificate-transparency seeding
    storage: 'sqlite',
    storagePath: './crawl.db',
    queue: 'sqlite',
    queuePath: './crawl.db',
    proxies: ['http://user:pass@proxy1:8080', 'http://user:pass@proxy2:8080'],
    proxyStrategy: 'health-weighted',
    proxyStickyPerHost: true,
  },
  hooks,
  signal: AbortSignal.timeout(30 * 60_000),
})) {
  if (!('event' in ev)) continue;
  if (ev.event === 'job.failed') console.error(`✗ ${ev.url}: ${ev.data.error}`);
  if (ev.event === 'run.completed') console.log('done.');
}

3. Embedded library with custom Rust hooks

use crawlex::{Config, Crawler, queue::FetchMethod};
use crawlex::hooks::{HookDecision, HookRegistry};
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

#[tokio::main]
async fn main() -> crawlex::Result<()> {
    let hooks = HookRegistry::new();
    let pages_seen = Arc::new(AtomicUsize::new(0));

    // Closure-captured counter: observe without intervening
    let counter = pages_seen.clone();
    hooks.on_response_body(move |_ctx| {
        let c = counter.clone();
        Box::pin(async move {
            c.fetch_add(1, Ordering::Relaxed);
            Ok(HookDecision::Continue)
        })
    });

    // Path-level deny list: short-circuit before fetch
    hooks.on_before_each_request(|ctx| {
        let url = ctx.url.clone();
        Box::pin(async move {
            if url.path().starts_with("/admin/") { return Ok(HookDecision::Skip); }
            Ok(HookDecision::Continue)
        })
    });

    let config = Config::builder()
        .max_concurrent_http(16)
        .build()?;

    let crawler = Crawler::new(config)?.with_hooks(hooks);
    crawler.seed_with(
        vec!["https://target.com".parse().unwrap()],
        FetchMethod::HttpSpoof,
    ).await?;
    crawler.run().await?;

    println!("Crawled {} pages", pages_seen.load(Ordering::Relaxed));
    Ok(())
}

→ Full runnable example: examples/embedded_with_hooks.rs

4. Pin a specific browser fingerprint from the catalog

# Browse 80+ ready-to-use fingerprints
crawlex stealth catalog list
crawlex stealth catalog list --filter chrome
crawlex stealth catalog show chrome-149-linux

# Pin a precise version + OS
crawlex pages run --seed https://target.com \
  --profile chrome-149-linux

# Era fallback: chromium-122 not captured? falls back to closest era + warns
crawlex pages run --seed https://target.com \
  --profile chromium-122-linux

# Mobile persona (touch viewport, sec-ch-ua-mobile: ?1)
crawlex pages run --seed https://target.com \
  --method render --persona pixel
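
The era fallback used above (asking for chromium-122-linux and getting the closest captured profile) can be pictured as a nearest-version lookup. The sketch below is an assumption about how such a resolver might behave, not crawlex's actual algorithm: `parseProfile`, `resolveProfile`, the `browser-major-os` name parsing, and the sample catalog are all hypothetical.

```typescript
// Hypothetical sketch of "era fallback"; not crawlex's real resolver.
interface ProfileId {
  browser: string;
  major: number;
  os: string;
}

// Parse names shaped like "chrome-149-linux" (assumed naming scheme).
function parseProfile(name: string): ProfileId | null {
  const m = /^([a-z]+)-(\d+)-([a-z]+)$/.exec(name);
  return m ? { browser: m[1], major: Number(m[2]), os: m[3] } : null;
}

interface Resolution {
  profile: string;
  exact: boolean; // false means a fallback happened and a warning is due
}

// Return the requested profile if it is in the catalog; otherwise pick the
// entry with the same browser and OS whose major version is numerically
// closest. Ties keep the entry that appears first in the catalog.
function resolveProfile(requested: string, catalog: string[]): Resolution | null {
  if (catalog.indexOf(requested) !== -1) return { profile: requested, exact: true };
  const want = parseProfile(requested);
  if (want === null) return null;
  let best: { name: string; distance: number } | null = null;
  for (const name of catalog) {
    const have = parseProfile(name);
    if (have === null || have.browser !== want.browser || have.os !== want.os) continue;
    const distance = Math.abs(have.major - want.major);
    if (best === null || distance < best.distance) best = { name, distance };
  }
  return best === null ? null : { profile: best.name, exact: false };
}

// Made-up catalog for illustration only.
const sampleCatalog = ['chrome-149-linux', 'chromium-120-linux', 'chromium-126-linux'];
const resolved = resolveProfile('chromium-122-linux', sampleCatalog);
```

With this sample catalog, the request resolves to chromium-120-linux with `exact` set to false, which is the point at which a caller would log the fallback warning.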

5. Inspect what your stealth stack actually emits

# Print active IdentityBundle + TLS profile summary
crawlex stealth inspect --profile chrome-149-linux

# Verify ALPN/cipher/JA4 against built-in expectations
crawlex stealth test

# Compare against tls.peet.ws / ja4db.com via the live oracle
crawlex stealth catalog show chrome-149-linux --json

🎯 Features

🥷 Stealth core

  • 🔐 Chrome 149 TLS via BoringSSL fork
  • 🚦 H2 pseudo-header order patch
  • 🎭 29-section JS shim: full leak inventory covered
  • 🤖 Worker scope shim (dedicated / shared / SW)
  • 📦 80+ browser fingerprints from curl-impersonate + ja4db + tls.peet
  • 🌍 5 personas: tux, office, gamer, atlas, pixel
  • 🎬 Coherent motion:: profiles (mouse / scroll / dwell)
  • 🕸️ WebRTC scrub (SDP, ICE, getStats; public-interface only)

🔍 Discovery

  • 🗺️ Sitemap recursion + robots.txt parsing
  • 🔎 Certificate transparency (crt.sh)
  • 🌐 DNS records + RDAP + Wayback CDX
  • 📜 PWA manifest + service worker probes
  • 📂 .well-known/* enumeration
  • 🔬 Tech fingerprinting (Wappalyzer-class)
  • 🔌 JS endpoint extraction from runtime
  • 🛡️ security.txt parser
  • 🧬 Asset-ref classification (JS / CSS / image / API / nav)
  • 🔓 TCP port scan (opt-in, network-active)

🛡️ Antibot policy engine

  • 🚧 Detect: Cloudflare, DataDome, PerimeterX, Akamai BMP, Imperva, hCaptcha, reCAPTCHA, Turnstile
  • 📊 Vendor telemetry observer (passive: sees outbound calls to known endpoints)
  • 🔄 Policy decisions: keep / drop / retry / scope-demote / proxy-rotate / give-up
  • 🎯 4 captcha solver adapters: in-house reCAPTCHA v3, 2captcha, anticaptcha, VLM
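
To make the policy decisions listed above concrete, here is a minimal sketch of a post-fetch decision written as a pure function. It is illustrative only: `decidePostFetch`, the `FetchOutcome` shape, and every rule in it are assumptions chosen for the example, not crawlex's actual policy. The `scope-demote` decision is left unmodeled.

```typescript
// Illustrative only: the rules below are assumptions, not crawlex's policy.
type PolicyDecision =
  | 'keep'
  | 'drop'
  | 'retry'
  | 'scope-demote'
  | 'proxy-rotate'
  | 'give-up';

// Hypothetical input shape for this sketch.
interface FetchOutcome {
  status: number;
  challengeVendor?: string; // e.g. 'cloudflare_turnstile' when a challenge was seen
  attempts: number; // attempts already made for this job
  retryMax: number; // retry budget
}

function decidePostFetch(o: FetchOutcome): PolicyDecision {
  const rateLimited = o.status === 429 || o.status === 503;
  const challenged = o.challengeVendor !== undefined;

  // Nothing went wrong: keep the response, or drop pages that are gone.
  if (!rateLimited && !challenged) {
    return o.status === 404 || o.status === 410 ? 'drop' : 'keep';
  }
  // Something went wrong and the retry budget is spent.
  if (o.attempts >= o.retryMax) return 'give-up';
  // A challenge suggests the exit IP is burned; a rate limit just needs time.
  return challenged ? 'proxy-rotate' : 'retry';
}
```

Keeping the function free of I/O means the whole decision table can be covered by plain unit tests.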

⚙️ Pipeline

  • 🎯 Render pool: Chromium auto-fetch + isolated user-data dirs
  • 🔁 Persistent queue: in-memory / SQLite / Redis backends
  • 💾 Storage: filesystem / SQLite / memory, opt-in per concern (artifact, state, challenge, telemetry, intel)
  • 🔄 Proxy rotator: health checks + sticky sessions + per-host affinity
  • 📊 Web Vitals + per-fetch network breakdown (DNS / TCP / TLS / TTFB / download)
  • 🎬 ScriptSpec runner: declarative Plan execution with assertions
  • 🔧 Frontier with dedupe + rate-limit + retry policies
  • 📐 Wait strategies: Load, DOMContentLoaded, NetworkIdle, Selector, Fixed

📡 Observability

  • 📜 NDJSON event stream: versioned envelope (v: 1)
  • 🎬 19 event kinds covering full lifecycle
  • 🔬 Embedded WebVitals summary on render.completed
  • ⏱️ Per-request timings on fetch.completed (ALPN, cipher, TLS version)
  • 📸 Artifact descriptors with on-disk path on the wire
  • 🪝 Hooks: 12 lifecycle points × 3 languages (Rust / JS / Lua)
  • 📊 Prometheus metrics endpoint

🔌 Integrations

  • 📦 npm + crates.io + GitHub Releases
  • 🦀 Rust library: embed Crawler directly
  • 📘 TypeScript types: strict, full envelope coverage
  • 🔌 SDK crawl() async iterator
  • 📚 docsify docs site (GitHub Pages)
  • 🧪 386+ lib tests, 27 fpjs compliance, TLS catalog roundtrip suite
  • 🔁 Optional Lua hooks (mlua)
  • 🪶 Two binaries: crawlex (full) + crawlex-mini (HTTP-only, no Chromium)

📡 NDJSON event stream

Every run emits one JSON envelope per line on stdout. Versioned, stable, 19 kinds:

{"v":1,"event":"run.started","ts":"2026-04-26T19:42:00.000Z","run_id":42,"data":{"policy_profile":"strict","max_concurrent_http":8,"max_concurrent_render":2}}
{"v":1,"event":"job.started","run_id":42,"url":"https://target.com/","data":{"job_id":"j_001","method":"render","depth":0,"priority":0,"attempts":0}}
{"v":1,"event":"fetch.completed","run_id":42,"url":"https://target.com/","data":{"final_url":"https://target.com/","status":200,"bytes":98234,"body_truncated":false,"dns_ms":12,"tcp_connect_ms":18,"tls_handshake_ms":24,"ttfb_ms":142,"download_ms":83,"total_ms":280,"alpn":"h2","tls_version":"TLSv1.3","cipher":"TLS_AES_128_GCM_SHA256"}}
{"v":1,"event":"render.completed","run_id":42,"session_id":"sess_abc","url":"https://target.com/","data":{"final_url":"https://target.com/","status":200,"manifest":true,"service_workers":1,"is_spa":true,"vitals":{"ttfb_ms":142,"first_contentful_paint_ms":380.5,"largest_contentful_paint_ms":920.1,"cumulative_layout_shift":0.03,"total_blocking_time_ms":50.0,"dom_nodes":1842,"js_heap_used_bytes":12345678,"resource_count":45,"total_transfer_bytes":982341}}}
{"v":1,"event":"artifact.saved","run_id":42,"url":"https://target.com/","data":{"kind":"screenshot.full_page","mime":"image/png","size":1234567,"sha256":"a1b2c3...","path":"artifacts/sess_abc/1714123456_screenshot_full_page_a1b2c3d4.png"}}
{"v":1,"event":"challenge.detected","run_id":42,"url":"https://protected.com/","data":{"vendor":"cloudflare_turnstile","level":"widget_present"}}
{"v":1,"event":"decision.made","run_id":42,"url":"https://protected.com/","why":"render:js-challenge","data":{"decision":"retry","reason":{"code":"render:js-challenge"}}}
{"v":1,"event":"run.completed","run_id":42}

Discriminator key: event (snake_case). TypeScript narrows via switch (ev.event) { … }. Fallback for malformed lines: { kind: 'raw', line } so consumers can log/recover.
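
For consumers that read a saved events.ndjson file directly instead of going through the SDK, the same parse-with-fallback idea can be sketched in a few lines. This is a minimal sketch, not the SDK's implementation: the `Envelope` fields are taken from the sample lines above, and `parseLine`, `parseStream`, and `countByEvent` are hypothetical helper names.

```typescript
// Envelope fields follow the sample lines above; the helpers are hypothetical.
interface Envelope {
  v: number;
  event: string;
  run_id?: number;
  url?: string;
  data?: Record<string, unknown>;
}

interface RawLine {
  kind: 'raw';
  line: string;
}

type Parsed = Envelope | RawLine;

// Parse one line; anything that is not a JSON object with a string `event`
// is returned as a raw fallback so the caller can log it and move on.
function parseLine(line: string): Parsed {
  try {
    const value: unknown = JSON.parse(line);
    if (
      typeof value === 'object' &&
      value !== null &&
      typeof (value as { event?: unknown }).event === 'string'
    ) {
      return value as Envelope;
    }
  } catch (_err) {
    // Not valid JSON: fall through to the raw fallback.
  }
  return { kind: 'raw', line };
}

// Split a whole NDJSON document, skipping blank lines.
function parseStream(text: string): Parsed[] {
  return text
    .split('\n')
    .filter((l) => l.trim() !== '')
    .map((l) => parseLine(l));
}

// Tally envelopes per event kind; malformed lines are counted under 'raw'.
function countByEvent(items: Parsed[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const item of items) {
    const key = 'event' in item ? item.event : 'raw';
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}
```

The `'event' in item` check is the same narrowing trick the SDK examples use before switching on `ev.event`.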

🪝 Hooks: 12 lifecycle points × 3 languages

before_each_request → after_dns → after_tls → after_first_byte → on_response_body
   → after_load → after_idle → on_discovery → on_job_start → on_job_end
   → on_error → on_robots_decision

| Language | API | Best for |
| --- | --- | --- |
| Rust | hooks.on_after_first_byte(closure), with full &mut HookContext access | Embedded library, latency-critical paths |
| JS / TS | defineHooks({...}) via SDK (IPC bridge, async closures) | Production crawls, business logic |
| Lua | --hook-script foo.lua, with page-driving helpers (page_click, page_eval) | Ad-hoc scripts, no build step |

All three modes return the same decision: continue / skip / retry / abort. Hooks can mutate ctx.captured_urls, inject extra URLs, write to user_data to communicate with downstream hooks, or override robots_allowed.
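
Because every hook resolves to one of these four decisions, the decision logic can live in a small pure function and be unit-tested without starting a crawl. The sketch below is a minimal illustration under assumptions: `isDecision` and `decideAfterFirstByte` are hypothetical helper names, and skipping on 401/403 is an invented rule, not documented crawlex behavior.

```typescript
type Decision = 'continue' | 'skip' | 'retry' | 'abort';

const DECISIONS: Decision[] = ['continue', 'skip', 'retry', 'abort'];

// Type guard for values arriving from loosely typed code (JSON, Lua, IPC).
function isDecision(value: unknown): value is Decision {
  return typeof value === 'string' && DECISIONS.indexOf(value as Decision) !== -1;
}

// Pure status-to-decision mapping. The 429/503 rule mirrors the retry hook
// in the proxy-rotation example; the 401/403 rule is an assumption made
// for this sketch.
function decideAfterFirstByte(status: number): Decision {
  if (status === 429 || status === 503) return 'retry';
  if (status === 401 || status === 403) return 'skip';
  return 'continue';
}

// Inside defineHooks this could be wired up roughly as:
//   async onAfterFirstByte(ctx) {
//     return decideAfterFirstByte(ctx.response_status);
//   }
```

Separating the mapping from the async hook keeps the IPC-facing code trivial and leaves the rules easy to test.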

🎭 Personas: coherent identity bundles

Each persona is a complete bundle (UA + Sec-CH-UA + screen + viewport + DPR + GPU + fonts + media-device counts + TLS profile + motion timings), so every signal matches. No mismatched UA + WebGL combo gives you away.

| Codename | OS | GPU | Locale | Form factor |
| --- | --- | --- | --- | --- |
| 🐧 tux | Linux | Intel UHD 630 | en-US | desktop 1920×1080 |
| 🏢 office | Windows 10 | Intel UHD 620 | en-US | laptop 1920×1080 (DPR 1.25) |
| 🎮 gamer | Windows 10 | NVIDIA GTX 1060 | pt-BR | desktop 1920×1080 |
| 🍎 atlas | macOS | Apple M1 | en-US | retina 1440×900 (DPR 2.0) |
| 📱 pixel | Android 14 | Adreno 640 | pt-BR | mobile 412×823 (DPR 2.625) |

crawlex pages run --seed https://target.com --persona atlas    # macOS
crawlex pages run --seed https://target.com --persona pixel    # mobile
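
When the persona has to be chosen programmatically (for example, one per target locale), the persona table can be mirrored as plain data. This is a sketch under assumptions: the values are copied from the table, while `PERSONAS`, `findPersonas`, and `personaArgs` are hypothetical helpers that are not part of the crawlex SDK.

```typescript
type FormFactor = 'desktop' | 'laptop' | 'retina' | 'mobile';

interface Persona {
  codename: string;
  os: string;
  gpu: string;
  locale: string;
  formFactor: FormFactor;
  width: number;
  height: number;
  dpr?: number; // left undefined where the table gives no DPR
}

// Values copied from the persona table; the structure itself is hypothetical.
const PERSONAS: Persona[] = [
  { codename: 'tux', os: 'Linux', gpu: 'Intel UHD 630', locale: 'en-US', formFactor: 'desktop', width: 1920, height: 1080 },
  { codename: 'office', os: 'Windows 10', gpu: 'Intel UHD 620', locale: 'en-US', formFactor: 'laptop', width: 1920, height: 1080, dpr: 1.25 },
  { codename: 'gamer', os: 'Windows 10', gpu: 'NVIDIA GTX 1060', locale: 'pt-BR', formFactor: 'desktop', width: 1920, height: 1080 },
  { codename: 'atlas', os: 'macOS', gpu: 'Apple M1', locale: 'en-US', formFactor: 'retina', width: 1440, height: 900, dpr: 2.0 },
  { codename: 'pixel', os: 'Android 14', gpu: 'Adreno 640', locale: 'pt-BR', formFactor: 'mobile', width: 412, height: 823, dpr: 2.625 },
];

interface PersonaFilter {
  os?: string;
  locale?: string;
  formFactor?: FormFactor;
}

// Return every persona matching all the fields present in the filter.
function findPersonas(filter: PersonaFilter): Persona[] {
  return PERSONAS.filter(
    (p) =>
      (filter.os === undefined || p.os === filter.os) &&
      (filter.locale === undefined || p.locale === filter.locale) &&
      (filter.formFactor === undefined || p.formFactor === filter.formFactor)
  );
}

// Build the CLI arguments for a chosen persona.
function personaArgs(p: Persona): string[] {
  return ['--persona', p.codename];
}
```

Filtering by locale `pt-BR`, for instance, yields the gamer and pixel personas, and `personaArgs` turns either one into the `--persona` flag shown above.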

🏗️ Architecture

flowchart LR
  S[Seeds] --> Q[Frontier<br/>+ dedupe + rate-limit]
  Q --> P[Policy Engine]
  P -->|http| F[ImpersonateClient<br/>BoringSSL + h2 patched]
  P -->|render| R[RenderPool<br/>Chromium + stealth shim]
  F --> X[Extractor<br/>+ Asset Refs]
  R --> X
  X --> D[Discovery<br/>Pipeline]
  X --> ST[Storage<br/>5 traits]
  D --> Q
  P --> EV[NDJSON Events<br/>19 kinds]
  R --> H1[Rust Hooks]
  R --> H2[JS Bridge]
  R --> H3[Lua Scripts]

Module map:

  • impersonate/: TLS catalog + BoringSSL connector + ALPS + GREASE
  • render/: Chromium pool + 29-section stealth shim + motion engine + ScriptSpec runner
  • discovery/: 17-stage pipeline (DNS, RDAP, sitemap, robots, crtsh, wayback, well-known, …)
  • policy/: pure engine with decide_pre_fetch, decide_post_fetch, decide_post_error, decide_post_challenge
  • antibot/: vendor classifier + 4 captcha solver adapters
  • storage/: 5 concern-oriented traits (artifact / state / challenge / telemetry / intel)
  • events/: NDJSON envelope + sink (stdout / null / memory)
  • hooks/: registry + JS bridge + Lua host

🛠️ Tech stack

| Layer | Implementation |
| --- | --- |
| TLS | boring-sys: BoringSSL fork with ALPS / permute_extensions / X25519MLKEM768 |
| HTTP/2 | Vendored h2 crate with pseudo-header order patch (vendor/h2) |
| CDP | chromiumoxide-derived, embedded behind cdp-backend feature |
| Async | tokio multi-thread |
| Storage | rusqlite (SQLite WAL), DashMap (memory), filesystem layout |
| Discovery | hickory-resolver (DNS), reqwest (RDAP), texting_robots (robots.txt) |
| Lua | mlua 0.10 (optional, lua-hooks feature) |
| SDK | Node 20+, CommonJS, zero runtime deps |

Two binaries ship from one source tree:

  • crawlex: full build with HTTP impersonation + Chromium rendering + stealth shim + persistent queue
  • crawlex-mini: HTTP-only worker, no Chromium dependency, same CLI surface (browser-only flags return Error::RenderDisabled)

📊 Versus the alternatives

| | crawlex | Playwright stealth | Puppeteer + plugins | curl-impersonate |
| --- | --- | --- | --- | --- |
| TLS-perfect ClientHello | ✅ BoringSSL | ⚠️ relies on Chromium | ⚠️ relies on Chromium | ✅ |
| H2 pseudo-header order | ✅ patched h2 | ⚠️ Chromium default | ⚠️ Chromium default | ❌ |
| 29-section JS leak coverage | ✅ | ⚠️ partial | ⚠️ via plugins | ❌ no JS |
| Worker-scope stealth | ✅ auto-attach | ⚠️ manual | ⚠️ manual | ❌ |
| HTTP-only path (no browser) | ✅ crawlex-mini | ❌ | ❌ | ✅ |
| Persistent queue + resume | ✅ SQLite/Redis | ❌ external | ❌ external | ❌ |
| Discovery pipeline | ✅ 17 stages | ❌ | ❌ | ❌ |
| Streaming NDJSON events | ✅ versioned | ❌ | ❌ | ❌ |
| Rust embedding | ✅ | ❌ | ❌ | ⚠️ libcurl |
| Single binary | ✅ | ❌ | ❌ | ✅ |

📚 Documentation

🤝 Contributing

git clone https://github.com/forattini-dev/crawlex
cd crawlex

# Unit tests + offline shim compliance
cargo test --lib                    # 386+ tests
cargo test --test fpjs_compliance   # 27 cases
cargo test --test tls_catalog_coverage --test tls_catalog_roundtrip

# SDK tests
pnpm test                           # 21 node:test cases

# Quality gates
cargo fmt --check
cargo clippy --all-features -- -D warnings
cargo publish --dry-run --locked

# Live integration tests (require system Chromium)
cargo test --all-features --test stealth_runtime_live -- --ignored
cargo test --all-features --test worker_shim_live -- --ignored

CI runs all of the above on every PR. Contributions welcome: issues, feature requests, and PRs all reviewed.

📄 License

Dual-licensed under MIT OR Apache-2.0 at your option. SPDX: MIT OR Apache-2.0.

Third-party attribution: see NOTICE.

Built for crawlers who refuse to be detected.

Docs · Releases · Issues · Discussions

Keywords

crawler

FAQs

Package last updated on 27 Apr 2026
