Go Scraping Framework with Native Camoufox Anti-Detection

Foxhound v0.0.27

High-performance Go scraping framework with native Camoufox anti-detection, dual-mode fetching, and 13-layer middleware.

Highlights

  • Dual-mode fetching: TLS-impersonating HTTP client (~5-50ms) + Camoufox browser (~500ms-5s), with automatic escalation on block detection
  • Consistent identity profiles: UA + TLS fingerprint + header order + OS + hardware + screen + locale + Accept-Language all match — randomness without consistency causes instant blocks; canonicalAcceptLanguage() pins BCP-47 strings to valid values at wire level
  • 13-layer middleware chain: concurrency, metrics, rate limit, robots.txt, delta-fetch, dedup, autothrottle, cookies, referer, blocked detector, redirect, depth limit, retry
  • Trail API: fluent navigation builder with Fill, InfiniteScroll, Evaluate (custom JS), XHR/fetch capture, and optional steps
  • Structured data extraction: JSON-LD, OpenGraph, NextData, NuxtData extractors + contact deobfuscation (CloudFlare cfemail)
  • NopeCHA auto-download: CAPTCHA-solving extension fetched and configured automatically at runtime
  • 9 export formats: JSON, JSONL, CSV, Markdown, Text, XML, SQLite, PostgreSQL, Webhook
  • Parsing engine: HTML table extraction (colspan/rowspan), JS preloaded data (Next.js/Nuxt/Redux), directory listings (JSON-LD/Microdata/DOM), pagination detection, and auto-detection with Readability-style article scoring
  • Adaptive parsing: CSS pseudo-selectors (::text, ::attr), similarity matching, auto-selector generation + sitemap/RSS/Atom parsing
  • Streaming API: Hunt.Stream(ctx) for real-time item processing via Go channels
  • Checkpoint/resume: auto-save hunt state every N items
  • Stateful Session: foxhound.NewSession(...) wraps fetcher + cookie jar + identity + proxy for single-call ad-hoc scraping, with cookies persisted across calls
  • Multi-session campaigns: Hunt.AddSession(name, cfg) + Job.SessionID route individual jobs through distinct fetchers / identities / proxies inside one Hunt
  • Development mode: Hunt.WithDevelopmentMode(dir) caches responses on disk after the first run and replays them on subsequent runs for zero-network iteration
  • Verified Cloudflare solve: fetch.WithSolveCloudflare(timeout) polls cookie + DOM + token signals before declaring success and exposes Response.CloudflareSolved
  • Domain & resource blocking: Hunt.WithBlockedDomains(...) / Hunt.WithDisableResources(...) abort ad, tracker, image, and font requests at the browser layer
  • Trail XHR capture: Trail.CaptureXHR(pattern) attaches URL regexps to every produced job so matching XHR/fetch response bodies land in Response.Captures
  • TLS fingerprint customisation (build tag tls): fetch.WithIdentity auto-applies the curated Firefox JA3 from fetch/presets; fetch.WithJA3, fetch.WithJA3Pool, fetch.WithHTTP2Fingerprint, fetch.WithHTTP3Fingerprint available for advanced overrides
  • Build-mode safety: StealthFetcher.IsImpersonating() + startup log so consumers fail-fast when built without -tags tls
  • 19 packages, 1200+ tests
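
The automatic escalation in the first highlight can be pictured as a wrapper that tries the cheap HTTP path first and pays for the browser only when the response looks blocked. The following is a minimal standard-library sketch, not foxhound's implementation: `Fetcher`, `FetcherFunc`, `Response`, `Escalating`, and `looksBlocked` are hypothetical names, and treating 403/429/503 as "blocked" is an assumed heuristic.

```go
package main

import (
	"errors"
	"fmt"
)

// Response is a hypothetical, minimal stand-in for a fetched page.
type Response struct {
	Status int
	Body   string
	Via    string // which fetcher produced the response
}

// Fetcher is a hypothetical interface; foxhound's real one may differ.
type Fetcher interface {
	Fetch(url string) (*Response, error)
}

// FetcherFunc adapts a plain function to the Fetcher interface.
type FetcherFunc func(url string) (*Response, error)

func (f FetcherFunc) Fetch(url string) (*Response, error) { return f(url) }

// looksBlocked is an assumed heuristic: treat 403, 429 and 503 as blocks.
func looksBlocked(r *Response) bool {
	return r.Status == 403 || r.Status == 429 || r.Status == 503
}

// Escalating tries the fast fetcher first and falls back to the slow
// (browser) fetcher when the fast path errors or looks blocked.
type Escalating struct {
	Fast, Slow Fetcher
}

func (e Escalating) Fetch(url string) (*Response, error) {
	resp, err := e.Fast.Fetch(url)
	if err == nil && !looksBlocked(resp) {
		return resp, nil
	}
	return e.Slow.Fetch(url)
}

func main() {
	fast := FetcherFunc(func(url string) (*Response, error) {
		if url == "" {
			return nil, errors.New("empty url")
		}
		if url == "https://blocked.example/" {
			return &Response{Status: 403, Via: "http"}, nil
		}
		return &Response{Status: 200, Body: "ok", Via: "http"}, nil
	})
	slow := FetcherFunc(func(url string) (*Response, error) {
		return &Response{Status: 200, Body: "ok", Via: "browser"}, nil
	})

	f := Escalating{Fast: fast, Slow: slow}
	for _, u := range []string{"https://open.example/", "https://blocked.example/"} {
		resp, err := f.Fetch(u)
		if err != nil {
			fmt.Println(u, "error:", err)
			continue
		}
		fmt.Println(u, "->", resp.Via)
	}
}
```

The design point is that the slow path is only reached on failure, so unprotected pages keep the latency of the HTTP client.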

Key Capabilities

| Area | What you get |
| --- | --- |
| Performance | CSS parsing in ~8ms for 5K elements. Multi-core goroutines with per-domain concurrency control |
| Anti-detection | Real Camoufox binary (C++ fingerprint spoofing), human behavior simulation (log-normal timing, Bezier mouse, scroll rhythm), NopeCHA auto-download |
| Block avoidance | 9 vendor patterns (including Cloudflare, Akamai, DataDome, PerimeterX) with auto-retry + reCAPTCHA checkbox click + Turnstile handler |
| Identity | 60+ device profiles with consistent UA + TLS + headers + OS + GPU + screen + locale + geo matching |
| Trail API | Fill forms (JobStepFill), infinite scroll with container + stop condition, Evaluate custom JS, XHR/fetch capture, optional steps, persistent cookies |
| Parsing | CSS + XPath + regex + JSON + structured schema + adaptive selectors + similarity matching + pseudo-selectors + sitemap/RSS/Atom |
| Structured data | JSON-LD, OpenGraph, NextData, NuxtData extractors + CloudFlare cfemail deobfuscation |
| Export | 9 formats: JSON, JSONL, CSV, Markdown (table/list/cards), Text, XML, SQLite, PostgreSQL, Webhook + field-level pipeline transforms |
| Proxy | Pool rotation, health checking, cooldown, geo-targeted selection matching identity locale |
| Queue | Memory, Redis (distributed), SQLite (persistent); checkpoint/resume across restarts |
| Monitoring | Prometheus metrics + webhook alerting with error/block rate thresholds |
| Scaling | docker compose --scale foxhound=4 with shared Redis queue |

Quick Start

git clone https://github.com/sadewadee/foxhound.git
cd foxhound
go build -tags playwright -o foxhound ./cmd/foxhound/
foxhound init myproject && cd myproject
go mod tidy
foxhound run --config config.yaml

Google Maps — Scroll feed, collect businesses, extract contacts

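// Generate a consistent identity (UA + TLS + headers + OS all match)
id := identity.Generate(identity.WithBrowser(identity.BrowserFirefox))
profile := behavior.CarefulProfile().Jitter() // ±15% per-session parameter variance

browser, err := fetch.NewCamoufox(
    fetch.WithBrowserIdentity(id),
    fetch.WithBehaviorProfile(profile),
    fetch.WithStorageState("session.json"), // persist session across runs
)
if err != nil {
    log.Fatalf("start Camoufox: %v", err)
}
defer browser.Close()

// Static (TLS-impersonating) fetcher sharing the same identity.
// The original snippet left `static` undefined; fetch.NewStealth with
// fetch.WithIdentity (see "TLS Fingerprint Customisation") is assumed here.
static := fetch.NewStealth(fetch.WithIdentity(id))

// SmartFetcher with Bayesian domain learning: auto-escalates to the browser when blocked
scorer := fetch.NewDomainScorer(fetch.SocialMediaScoreConfig())
smart := fetch.NewSmart(static, browser, fetch.WithDomainScorer(scorer))
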
// Trail: search → scroll feed → collect all business URLs
trail := engine.NewTrail("maps-search").
    Navigate("https://www.google.com/maps").
    Fill("input#searchboxinput", "restaurant in bali").
    Click("button#searchbox-searchbutton").
    WaitOptional("div[role='feed']", 10*time.Second).
    InfiniteScrollInUntil("div[role='feed']", "div.Nv2PK", 50, 200).
    Evaluate(`() => document.querySelectorAll('.Nv2PK').length`)

h := engine.NewHunt(engine.HuntConfig{
    Name:            "maps",
    Walkers:         3,
    Seeds:           trail.ToJobs(),
    Fetcher:         middleware.Chain(
        middleware.NewCircuitBreaker(middleware.DefaultCircuitBreakerConfig()),
        middleware.NewAutoThrottle(middleware.AutoThrottleConfig{
            TargetConcurrency: 1, MinDelay: 2 * time.Second, MaxDelay: 15 * time.Second,
        }),
    ).Wrap(smart),
    Queue:           queue.NewReliable(queue.NewMemory(1000), queue.DefaultReliableConfig()),
    BehaviorProfile: profile,
    Processor: foxhound.ProcessorFunc(func(ctx context.Context, resp *foxhound.Response) (*foxhound.Result, error) {
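        // Auto-detect page type and extract accordingly.
        // Check the error so a failed extraction cannot dereference a nil result.
        result, err := parse.AutoExtract(resp)
        if err != nil {
            return nil, err
        }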
        if result.Type == parse.ContentListing {
            var items []*foxhound.Item
            for _, l := range result.Listings {
                items = append(items, l.AsItem())
            }
            return &foxhound.Result{Items: items}, nil
        }
        // Fallback: extract contacts from business website
        item := foxhound.NewItem()
        item.Set("url", resp.URL)
        item.Set("emails", parse.ExtractEmails(resp))
        item.Set("phones", parse.ExtractPhones(resp))
        return &foxhound.Result{Items: []*foxhound.Item{item}}, nil
    }),
    Writers: []foxhound.Writer{jsonlWriter}, // jsonlWriter: any foxhound.Writer, e.g. built with export.NewJSON(path, export.JSONLines)
})
h.Run(context.Background())
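
The snippet above mentions "Bayesian domain learning" without showing what that could mean. One common reading is a Beta-Bernoulli model per domain: keep counts of blocked and successful fetches, and escalate to the browser once the posterior block probability crosses a threshold. The sketch below is an illustration under that assumption only; `Scorer`, its methods, the priors, and the 0.6 threshold are hypothetical and are not taken from foxhound's `fetch.NewDomainScorer`.

```go
package main

import "fmt"

// counts holds observed outcomes for one domain.
type counts struct{ blocked, ok float64 }

// Scorer keeps a Beta(priorBlocked, priorOK) belief per domain.
// Priors must be positive so an unseen domain has a defined estimate.
type Scorer struct {
	priorBlocked, priorOK float64
	threshold             float64
	domains               map[string]*counts
}

func NewScorer(priorBlocked, priorOK, threshold float64) *Scorer {
	return &Scorer{
		priorBlocked: priorBlocked,
		priorOK:      priorOK,
		threshold:    threshold,
		domains:      map[string]*counts{},
	}
}

// Observe records the outcome of one static fetch for the domain.
func (s *Scorer) Observe(domain string, blocked bool) {
	c := s.domains[domain]
	if c == nil {
		c = &counts{}
		s.domains[domain] = c
	}
	if blocked {
		c.blocked++
	} else {
		c.ok++
	}
}

// BlockProbability returns the posterior mean of the block rate.
func (s *Scorer) BlockProbability(domain string) float64 {
	b, k := s.priorBlocked, s.priorOK
	if c := s.domains[domain]; c != nil {
		b += c.blocked
		k += c.ok
	}
	return b / (b + k)
}

// UseBrowser reports whether the domain should skip the static fetcher.
func (s *Scorer) UseBrowser(domain string) bool {
	return s.BlockProbability(domain) > s.threshold
}

func main() {
	s := NewScorer(1, 1, 0.6) // uniform prior, assumed threshold
	for i := 0; i < 3; i++ {
		s.Observe("maps.example", true)
	}
	s.Observe("blog.example", false)

	for _, d := range []string{"maps.example", "blog.example", "new.example"} {
		fmt.Printf("%s p(block)=%.2f browser=%v\n", d, s.BlockProbability(d), s.UseBrowser(d))
	}
}
```

With a uniform prior, a domain needs a few consistent blocks before it is routed to the browser, which keeps a single transient 403 from permanently escalating it.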

Trail API — Login + Search + Infinite Scroll + JS Extract

// Login trail (reusable across sessions with WithStorageState)
login := engine.Login("ig-login",
    "https://www.instagram.com/accounts/login/",
    "input[name='username']", "input[name='password']", "button[type='submit']",
    os.Getenv("IG_USER"), os.Getenv("IG_PASS"),
)

// Feed scraping trail
feed := engine.NewTrail("ig-feed").
    Navigate("https://www.instagram.com/explore/").
    WaitOptional("article", 10*time.Second).
    InfiniteScrollUntil("article", 100, 500).
    Evaluate(`() => {
        const posts = document.querySelectorAll('a[href*="/p/"]');
        return Array.from(posts).map(a => a.href);
    }`)

Auto-Detection — Let foxhound figure out the page type

result, _ := parse.AutoExtract(resp)
switch result.Type {
case parse.ContentArticle:
    fmt.Println(result.Article.Title, result.Article.WordCount, "words")
case parse.ContentListing:
    for _, listing := range result.Listings {
        fmt.Println(listing.Name, listing.Phone, listing.Rating)
    }
case parse.ContentProduct:
    fmt.Println("Product page detected")
}

// Extract preloaded JS data (Next.js, Nuxt, Redux, Apollo)
data, _ := parse.ExtractPreloadedData(resp)
fmt.Println("Framework:", data.Framework) // "nextjs", "nuxt", "react"...

// Detect pagination and follow next pages
links := parse.DetectPagination(resp) // multi-signal scoring (50pt threshold)
for _, link := range links {
    fmt.Println(link.Direction, link.URL, "score:", link.Score)
}
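
The "multi-signal scoring (50pt threshold)" comment above suggests that each candidate link collects points from independent hints and counts as a pagination link once the total reaches 50. The sketch below illustrates that idea with the standard library only. The signals and their weights are assumptions chosen for the example, and `link`, `score`, and `isNextPage` are hypothetical names; only the threshold of 50 comes from the text.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// link is a hypothetical, flattened view of an <a> element.
type link struct {
	Href, Text, Rel, Class string
}

// threshold mirrors the 50-point cut-off mentioned in the README.
const threshold = 50

// pageParam matches URL fragments such as ?page=2, &p=3 or /page/4.
var pageParam = regexp.MustCompile(`[?&/](?:page|p)[=/]\d+`)

// score adds up assumed weights for each signal that fires.
func score(l link) int {
	s := 0
	if strings.EqualFold(l.Rel, "next") {
		s += 50 // explicit rel="next" is decisive on its own
	}
	text := strings.ToLower(strings.TrimSpace(l.Text))
	if text == "next" || text == "›" || text == "»" || strings.HasPrefix(text, "next ") {
		s += 30
	}
	if pageParam.MatchString(l.Href) {
		s += 25
	}
	class := strings.ToLower(l.Class)
	if strings.Contains(class, "pagin") || strings.Contains(class, "next") {
		s += 20
	}
	return s
}

// isNextPage applies the threshold to the combined score.
func isNextPage(l link) bool { return score(l) >= threshold }

func main() {
	candidates := []link{
		{Href: "/items?page=2", Text: "Next"},
		{Href: "/items?page=7", Text: "7", Class: "pagination-item"},
		{Href: "/about", Text: "About us"},
	}
	for _, l := range candidates {
		fmt.Printf("%-16s score=%d next=%v\n", l.Href, score(l), isNextPage(l))
	}
}
```

No single weak hint (link text, URL shape, class name) is enough by itself; two of them together, or one strong hint, cross the threshold.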

Anti-fragility / Adaptive Selectors

Most scrapers break the moment a target site renames a CSS class. Foxhound's adaptive selectors learn an element signature (tag, classes, text prefix, parent, depth, position) on the first successful match, then fall back to similarity matching when the primary CSS selector stops working — so a class rename, a wrapper-div change, or a sibling reordering does not break extraction.

Enable adaptive mode on a Hunt with WithAdaptive(savePath) (pass an empty string for in-memory only, or a JSON path to persist learned signatures across runs), then use the adaptive helpers on Response:

hunt := engine.NewHunt(engine.HuntConfig{
    Name:      "shop",
    Fetcher:   fetcher,
    Queue:     q,
    Processor: foxhound.ProcessorFunc(func(ctx context.Context, resp *foxhound.Response) (*foxhound.Result, error) {
        // Inline: register and extract in one call. The signature is
        // learned automatically and persisted by the Hunt.
        title := resp.CSSAdaptive("h1.product-title", "title").Text()
        price := resp.CSSAdaptive(".price", "price").Text()

        // On future runs, even if .product-title gets renamed to
        // .item-name, similarity matching will recover the element.
        // Use Adaptive(name) for selectors registered earlier (e.g.
        // via Trail.Adaptive or a previous CSSAdaptive call).
        _ = resp.Adaptive("title")

        item := foxhound.NewItem()
        item.Set("title", title)
        item.Set("price", price)
        return &foxhound.Result{Items: []*foxhound.Item{item}}, nil
    }),
}).WithAdaptive("./adaptive_signatures.json")

You can also declare adaptive selectors at the Trail level:

trail := engine.NewTrail("books").
    Navigate("https://books.toscrape.com/").
    Adaptive("book_title", ".product_pod h3 a").
    Adaptive("book_price", ".product_pod .price_color")

See examples/adaptive/ for a complete runnable example demonstrating an adaptive selector surviving a CSS class rename.
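
To make the fallback concrete, here is a small standard-library sketch of similarity matching over the signature fields listed at the top of this section (tag, classes, text prefix, parent, depth, position). It is an illustration, not foxhound's scoring function: the `Signature` type, the weights, and the `bestMatch` helper are hypothetical.

```go
package main

import "fmt"

// Signature is a hypothetical record of what an element looked like
// when the primary selector last matched.
type Signature struct {
	Tag        string
	Classes    []string
	TextPrefix string
	Parent     string
	Depth      int
	Position   int
}

// jaccard returns |A ∩ B| / |A ∪ B| for two class lists.
func jaccard(a, b []string) float64 {
	if len(a) == 0 && len(b) == 0 {
		return 1
	}
	inA := map[string]bool{}
	for _, x := range a {
		inA[x] = true
	}
	inter, union := 0, len(inA)
	seen := map[string]bool{}
	for _, x := range b {
		if seen[x] {
			continue
		}
		seen[x] = true
		if inA[x] {
			inter++
		} else {
			union++
		}
	}
	return float64(inter) / float64(union)
}

// closeness maps an integer distance to (0, 1], with 1 meaning equal.
func closeness(a, b int) float64 {
	d := a - b
	if d < 0 {
		d = -d
	}
	return 1 / float64(1+d)
}

func eq(a, b string) float64 {
	if a == b {
		return 1
	}
	return 0
}

// similarity combines the fields with assumed weights that sum to 1.
func similarity(learned, cand Signature) float64 {
	return 0.25*eq(learned.Tag, cand.Tag) +
		0.25*jaccard(learned.Classes, cand.Classes) +
		0.20*eq(learned.TextPrefix, cand.TextPrefix) +
		0.10*eq(learned.Parent, cand.Parent) +
		0.10*closeness(learned.Depth, cand.Depth) +
		0.10*closeness(learned.Position, cand.Position)
}

// bestMatch returns the index of the most similar candidate, or -1
// when nothing reaches minScore.
func bestMatch(learned Signature, cands []Signature, minScore float64) int {
	best, bestScore := -1, minScore
	for i, c := range cands {
		if s := similarity(learned, c); s >= bestScore {
			best, bestScore = i, s
		}
	}
	return best
}

func main() {
	learned := Signature{"h1", []string{"product-title"}, "Blue Yoga", "div", 5, 0}
	cands := []Signature{
		{"span", []string{"price"}, "$19", "div", 6, 1},
		{"h1", []string{"item-name"}, "Blue Yoga", "div", 5, 0}, // class renamed
	}
	i := bestMatch(learned, cands, 0.5)
	fmt.Println("best candidate index:", i)
}
```

In this example the renamed element loses only the class-overlap term, so it still clearly outscores the unrelated price element.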

TLS Fingerprint Customisation

fetch.NewStealth ships in two flavours selected by build tag:

  • default build (no tag): Go crypto/tls ClientHello — well-known JA3, trivially detected. Use for tests, CI, or non-bot-protected targets only.
  • -tags tls build: full JA3 / Akamai HTTP/2 / HTTP/3 impersonation via azuretls-client. Use for production scraping.

The same fetch.NewStealth API exists in both, but the underlying TLS layer is completely different. Confirm at startup:

f := fetch.NewStealth(fetch.WithIdentity(profile))
if !f.IsImpersonating() {
    log.Fatal("built without -tags tls; refusing to start in production")
}

Or check the binary directly:

go tool nm /path/to/binary | grep -q azuretls && echo "✅ TLS impersonation active" || echo "❌ Built without -tags tls"

TLS fingerprint comes from the identity

WithIdentity is the only thing you need for fingerprint consistency. It sets the azuretls browser family to match the profile ("firefox" for a Firefox profile — foxhound's primary target since Camoufox is Firefox-based) and lets azuretls's built-in GetLastFirefoxVersion produce the ClientHello at request time:

import "github.com/sadewadee/foxhound/fetch"

f := fetch.NewStealth(fetch.WithIdentity(profile))

The HTTP/2 layer is left to azuretls's browser-aware initHTTP2(browser) so TLS, headers, and HTTP/2 all agree on Firefox. Manual WithHTTP2Fingerprint is supported for power users but logs a startup warning when paired with WithJA3 (see issue #41).

Verified against https://www.bing.com/search and https://duckduckgo.com/ through a datacenter proxy: both return 200 with WithIdentity alone.

TLS certificate verification (v0.0.20)

NewStealth now sets InsecureSkipVerify=true by default. This disables azuretls's built-in DefaultPinManager, which performs an extra TLS handshake per new host to capture SPKI fingerprints and then fails on subsequent requests if a different CDN edge serves a different certificate. Multi-edge targets (Bing, Google, Cloudflare) rotate certificates continuously, making the default PinManager behaviour incompatible with sustained scraping.

foxhound's threat model is bot detection avoidance, not MITM prevention. The default is safe for scraping public sites over a controlled proxy path.

To re-enable full certificate chain, hostname, and pin verification:

f := fetch.NewStealth(
    fetch.WithIdentity(profile),
    fetch.WithStrictTLSVerify(),   // re-enables chain + hostname + pin checks
)

The startup log includes tls_verify=true when strict mode is active and tls_verify=false otherwise (the default).

Pin or rotate JA3 (advanced)

Capture your own Firefox JA3 from tls.peet.ws when the curated preset lags real Firefox:

f := fetch.NewStealth(
    fetch.WithIdentity(profile),
    fetch.WithJA3(myCapturedJA3),       // overrides the auto-applied preset
)

For per-recycle rotation, supply a pool of multiple Firefox captures:

pool := []string{ja3FromYesterday, ja3FromLastWeek, presets.FirefoxLatest().JA3}
f := fetch.NewStealth(
    fetch.WithIdentity(profile),
    fetch.WithJA3Pool(pool),
)

Without -tags tls these options compile but log an error at startup — the underlying net/http transport cannot customise the TLS ClientHello.
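
The README does not say how a JA3 pool picks its next entry. As one plausible reading of "per-recycle rotation", the sketch below hands out fingerprints round-robin and is safe to call from several goroutines. `Pool`, `NewPool`, and `Next` are hypothetical names for illustration only, and the strings are placeholders rather than real JA3 values.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Pool hands out fingerprints in round-robin order.
type Pool struct {
	mu    sync.Mutex
	items []string
	next  int
}

// NewPool copies the input so later changes by the caller have no effect.
func NewPool(items []string) (*Pool, error) {
	if len(items) == 0 {
		return nil, errors.New("ja3 pool: need at least one fingerprint")
	}
	return &Pool{items: append([]string(nil), items...)}, nil
}

// Next returns the next fingerprint; call it once per client recycle.
func (p *Pool) Next() string {
	p.mu.Lock()
	defer p.mu.Unlock()
	v := p.items[p.next]
	p.next = (p.next + 1) % len(p.items)
	return v
}

func main() {
	// Placeholder strings, not real JA3 fingerprints.
	pool, err := NewPool([]string{"ja3-capture-A", "ja3-capture-B", "ja3-capture-C"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i := 0; i < 4; i++ {
		fmt.Println("recycle", i, "uses", pool.Next())
	}
}
```

Round-robin keeps usage even across captures; a random pick would work equally well if predictability of the sequence is a concern.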

Locale policy for English-content scraping (v0.0.25)

By default, foxhound matches the identity locale to the proxy exit IP (anti-detection principle #6). When scraping English-language content through a proxy in a non-English-speaking country, the locale-query mismatch is itself a detection signal. Use LocalePolicyEnglishDefault to force en-US while keeping timezone and geo coordinates proxy-matched:

id := identity.Generate(
    identity.WithCountry("RU"),   // timezone=Europe/Moscow, geo=Moscow
    identity.WithLocalePolicy(identity.LocalePolicyEnglishDefault), // locale=en-US
)
// Accept-Language: en-US,en;q=0.5  (regardless of proxy country)
// navigator.language = "en-US"
// Timezone = "Europe/Moscow"       (unchanged — physical location is coherent)

An explicit WithLocale(locale, langs...) call always takes precedence over any policy.
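
The precedence described above (explicit locale first, then the policy, then the proxy country) can be sketched in a few lines of standard-library Go. This is an illustration rather than foxhound's code: `Options`, `resolveLocale`, `acceptLanguage`, and the small country table are hypothetical, and the header builder is simplified to the two-entry form shown in the comment above.

```go
package main

import (
	"fmt"
	"strings"
)

// Options is a hypothetical, flattened view of the identity options.
type Options struct {
	Country        string // proxy exit country, e.g. "RU"
	EnglishDefault bool   // stands in for LocalePolicyEnglishDefault
	ExplicitLocale string // stands in for WithLocale(...)
}

// countryLocale is a tiny assumed mapping used only for this example.
var countryLocale = map[string]string{
	"RU": "ru-RU",
	"DE": "de-DE",
	"ID": "id-ID",
	"US": "en-US",
}

// resolveLocale applies the precedence: explicit > policy > country.
func resolveLocale(o Options) string {
	if o.ExplicitLocale != "" {
		return o.ExplicitLocale
	}
	if o.EnglishDefault {
		return "en-US"
	}
	if l, ok := countryLocale[o.Country]; ok {
		return l
	}
	return "en-US" // assumed fallback for unknown countries
}

// acceptLanguage builds a simplified header value: the full tag, then
// the bare language with q=0.5.
func acceptLanguage(locale string) string {
	lang, _, hasRegion := strings.Cut(locale, "-")
	if !hasRegion {
		return locale
	}
	return locale + "," + lang + ";q=0.5"
}

func main() {
	cases := []Options{
		{Country: "RU"},
		{Country: "RU", EnglishDefault: true},
		{Country: "RU", EnglishDefault: true, ExplicitLocale: "de-DE"},
	}
	for _, o := range cases {
		l := resolveLocale(o)
		fmt.Printf("%+v -> locale=%s Accept-Language=%s\n", o, l, acceptLanguage(l))
	}
}
```

Note that only the locale changes between the cases; timezone and geo would still follow `Country`, which is what keeps the identity physically coherent.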

Real Scraping Results

| Target | Mode | Items | Block Avoidance | Notes |
| --- | --- | --- | --- | --- |
| Google Maps (10 queries) | Camoufox + proxy | 100 places | 100% | 1,297 items/hour, 0 CAPTCHAs |
| Alibaba (yoga mat) | Camoufox + proxy | 10 products | 100% | Prices + suppliers extracted |
| bot.sannysoft.com | Camoufox | | 29/30 PASS | webdriver NOT detected |
| CreepJS | Camoufox | | Trust: HIGH | Fingerprint consistent |

Benchmarks

Measured on hachibi (AMD Ryzen 7 5700G, Docker container, 2 cores / 4GB RAM, Ubuntu 24.04).

CSS Selection — 5,000 elements

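| Library | Language | Time | vs Foxhound |
| --- | --- | --- | --- |
| Foxhound CSS | Go | 13.6ms | 1.0x |
| Raw goquery | Go | 13.0ms | 0.96x |
| stdlib html | Go | 17.7ms | 1.3x slower |
| Raw lxml | Python/C | 195.8ms | 14.4x slower |
| BeautifulSoup | Python | 245.6ms | 18.1x slower |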

Foxhound Internal Benchmarks (5,000 elements)

| Method | Time | Memory | Allocs | Notes |
| --- | --- | --- | --- | --- |
| Foxhound CSS | 13.6ms | 6.5 MB | 100K | ~5% overhead vs raw goquery (13.0ms) |
| Foxhound Adaptive | 17.3ms | 6.2 MB | 95K | Zero overhead when selector works |
| Foxhound Schema | 31.3ms | 13.3 MB | 320K | 3 fields per item |
| Foxhound TextExtract | 22.5ms | 10.0 MB | 270K | 3 fields per item |
| FindByText | 24.6ms | 12.1 MB | 165K | Full DOM text search |
| Regex extract | 6.7ms | 1.1 MB | 15K | Pattern matching on body |
| Similarity score | 96ns | 0 B | 0 | Zero allocation |
| Item.ToJSON | 1.2µs | 432 B | 10 | |
| Item.ToMarkdown | 716ns | 376 B | 8 | |

Scaling by Document Size

| Benchmark | 1K elements | 5K elements | 10K elements | Scaling |
| --- | --- | --- | --- | --- |
| Foxhound CSS | 2.3ms | 13.6ms | 29.6ms | ~linear |
| Regex extract | 1.5ms | 6.7ms | 15.7ms | ~linear |
| stdlib html | 3.1ms | 17.7ms | 31.4ms | ~linear |

# Run yourself
go test -bench=. -benchmem ./benchmarks/

# Run in Docker with resource limits
docker run --cpus=2 --memory=4g foxhound-benchmark:latest \
  go test -bench=. -benchmem ./benchmarks/

Documentation

| File | Contents |
| --- | --- |
| docs/getting-started.md | Install, first scrape, running modes |
| docs/configuration.md | Full config.yaml reference |
| docs/cli.md | All CLI commands and flags |
| docs/api.md | Go types, interfaces, Hunt/Stream API |
| docs/anti-detection.md | Identity system, TLS, behavior simulation |
| docs/parsing.md | Table, preload, directory, pagination, auto-detection parsers |
| docs/middleware.md | All 13 middleware, chain order |
| docs/pipeline.md | Pipeline stages and all 9 export formats |
| docs/proxy.md | Proxy pool, rotation, providers, geo matching |
| docs/browser.md | Camoufox setup, options, human simulation |
| docs/examples.md | E-commerce, Maps, adaptive parsing, streaming |
| docs/deployment.md | Docker, scaling, environment variables |

Export Formats

| Format | Constructor | Notes |
| --- | --- | --- |
| JSON array | export.NewJSON(path, export.JSONArray) | Single file, full array |
| JSON Lines | export.NewJSON(path, export.JSONLines) | One object per line, streaming-friendly |
| CSV | export.NewCSV(path, cols...) | Fixed or auto-inferred columns |
| Markdown table | export.NewMarkdown(path, export.MarkdownTable) | GFM pipe table |
| Markdown list | export.NewMarkdown(path, export.MarkdownList) | Bullet list, first field bolded |
| Markdown cards | export.NewMarkdown(path, export.MarkdownCards) | H2 heading + bullet fields |
| Plain text lines | export.NewText(path, export.TextLines) | key=value per line |
| Plain text pretty | export.NewText(path, export.TextPretty) | Labelled blocks with separators |
| XML | export.NewXML(path, root, item) | Configurable root/item element names |
| SQLite | export.NewSQLite(dbPath, table) | Auto-creates and extends schema |
| PostgreSQL | export.NewPostgres(dsn, table) | Upsert support, batch inserts |
| Webhook | export.NewWebhook(url) | HTTP POST, optional batch size |
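
To show why JSON Lines is described as streaming-friendly, here is a standard-library sketch of the format itself: each item is encoded and flushed as one line, so a consumer can process line N without waiting for the rest. It does not use foxhound's `export` package; `Item`, `writeJSONLines`, and `toJSONLines` are hypothetical names.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
)

// Item is a hypothetical stand-in for a scraped record.
type Item map[string]any

// writeJSONLines writes one JSON object per line. json.Encoder.Encode
// appends a newline after every value, which is exactly the JSONL shape.
func writeJSONLines(w io.Writer, items []Item) error {
	enc := json.NewEncoder(w)
	for _, it := range items {
		if err := enc.Encode(it); err != nil {
			return err
		}
	}
	return nil
}

// toJSONLines is a convenience wrapper that renders to a string.
func toJSONLines(items []Item) (string, error) {
	var buf bytes.Buffer
	if err := writeJSONLines(&buf, items); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := toJSONLines([]Item{
		{"name": "Warung A", "rating": 4.5},
		{"name": "Warung B", "rating": 4},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

A JSON array, by contrast, is only valid once the closing bracket is written, which is why it suits single-file exports rather than long-running hunts.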

Architecture

Job → rate limit → dedup → behavior timing → header enrichment
  → Smart Fetcher (static TLS or Camoufox browser)
    → Block detection (9 vendor patterns) → retry with backoff
  → Parser (CSS / XPath / JSON / Regex / Adaptive / Similarity)
  → User Process() → Result{Items, NextJobs}
  → Pipeline (validate, clean, dedup) → Writers (9 formats)
  → Queue (memory / Redis / SQLite)
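
The request-side stages of this pipeline follow the usual middleware pattern: each layer wraps the next fetcher and may short-circuit, delay, or retry the call. The sketch below shows that composition with two layers, dedup and retry with exponential backoff, using only the standard library. It is not foxhound's `middleware` package: `Fetcher`, `Middleware`, `Chain`, `Dedup`, and `Retry` are hypothetical, and `Chain` here is applied as a function rather than through a `Wrap` method.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Fetcher and Middleware are hypothetical, simplified types.
type Fetcher func(url string) (string, error)
type Middleware func(next Fetcher) Fetcher

// Chain composes middlewares so the first one listed runs outermost.
func Chain(mws ...Middleware) Middleware {
	return func(next Fetcher) Fetcher {
		for i := len(mws) - 1; i >= 0; i-- {
			next = mws[i](next)
		}
		return next
	}
}

// ErrDuplicate is returned when a URL has already been fetched.
var ErrDuplicate = errors.New("duplicate url")

// Dedup rejects URLs it has seen before. Not safe for concurrent use;
// a real implementation would guard the map with a mutex.
func Dedup() Middleware {
	seen := map[string]bool{}
	return func(next Fetcher) Fetcher {
		return func(url string) (string, error) {
			if seen[url] {
				return "", ErrDuplicate
			}
			seen[url] = true
			return next(url)
		}
	}
}

// Retry calls next up to attempts times, doubling the delay each time.
func Retry(attempts int, base time.Duration) Middleware {
	return func(next Fetcher) Fetcher {
		return func(url string) (string, error) {
			var err error
			delay := base
			for i := 0; i < attempts; i++ {
				var body string
				if body, err = next(url); err == nil {
					return body, nil
				}
				if i < attempts-1 {
					time.Sleep(delay)
					delay *= 2
				}
			}
			return "", err
		}
	}
}

func main() {
	calls := 0
	flaky := func(url string) (string, error) {
		calls++
		if calls < 3 {
			return "", errors.New("temporary failure")
		}
		return "body of " + url, nil
	}

	fetch := Chain(Dedup(), Retry(3, 10*time.Millisecond))(flaky)

	body, err := fetch("https://example.com/")
	fmt.Println(body, err, "calls:", calls)

	_, err = fetch("https://example.com/")
	fmt.Println("second call:", err)
}
```

Order matters: with dedup outermost, retries of the same URL stay inside one logical request and are not mistaken for duplicates.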

License

MIT

Package last updated on 28 May 2026
