
github.com/sadewadee/foxhound
Go Scraping Framework with Native Camoufox Anti-Detection
High-performance Go scraping framework with native Camoufox anti-detection, dual-mode fetching, and 13-layer middleware.
- canonicalAcceptLanguage() pins BCP-47 strings to valid values at wire level
- ::text, ::attr), similarity matching, auto-selector generation + sitemap/RSS/Atom parsing
- Hunt.Stream(ctx) for real-time item processing via Go channels
- foxhound.NewSession(...) wraps fetcher + cookie jar + identity + proxy for single-call ad-hoc scraping, with cookies persisted across calls
- Hunt.AddSession(name, cfg) + Job.SessionID route individual jobs through distinct fetchers / identities / proxies inside one Hunt
- Hunt.WithDevelopmentMode(dir) caches responses on disk after the first run and replays them on subsequent runs for zero-network iteration
- fetch.WithSolveCloudflare(timeout) polls cookie + DOM + token signals before declaring success and exposes Response.CloudflareSolved
- Hunt.WithBlockedDomains(...) / Hunt.WithDisableResources(...) abort ad, tracker, image, and font requests at the browser layer
- Trail.CaptureXHR(pattern) attaches URL regexps to every produced job so matching XHR/fetch response bodies land in Response.Captures
- tls): fetch.WithIdentity auto-applies the curated Firefox JA3 from fetch/presets; fetch.WithJA3, fetch.WithJA3Pool, fetch.WithHTTP2Fingerprint, fetch.WithHTTP3Fingerprint available for advanced overrides
- StealthFetcher.IsImpersonating() + startup log so consumers fail-fast when built without -tags tls

| Area | What you get |
|---|---|
| Performance | CSS parsing in ~8ms for 5K elements. Multi-core goroutines with per-domain concurrency control |
| Anti-detection | Real Camoufox binary (C++ fingerprint spoofing), human behavior simulation (log-normal timing, Bezier mouse, scroll rhythm), NopeCHA auto-download |
| Block avoidance | 9 vendor patterns (Cloudflare, Akamai, DataDome, PerimeterX) with auto-retry + reCAPTCHA checkbox click + Turnstile handler |
| Identity | 60+ device profiles with consistent UA + TLS + headers + OS + GPU + screen + locale + geo matching |
| Trail API | Fill forms (JobStepFill), infinite scroll with container + stop condition, Evaluate custom JS, XHR/fetch capture, optional steps, persistent cookies |
| Parsing | CSS + XPath + regex + JSON + structured schema + adaptive selectors + similarity matching + pseudo-selectors + sitemap/RSS/Atom |
| Structured data | JSON-LD, OpenGraph, NextData, NuxtData extractors + Cloudflare cfemail deobfuscation |
| Export | 9 formats: JSON, JSONL, CSV, Markdown (table/list/cards), Text, XML, SQLite, PostgreSQL, Webhook + field-level pipeline transforms |
| Proxy | Pool rotation, health checking, cooldown, geo-targeted selection matching identity locale |
| Queue | Memory, Redis (distributed), SQLite (persistent) — checkpoint/resume across restarts |
| Monitoring | Prometheus metrics + webhook alerting with error/block rate thresholds |
| Scaling | docker compose --scale foxhound=4 with shared Redis queue |
git clone https://github.com/sadewadee/foxhound.git
cd foxhound
go build -tags playwright -o foxhound ./cmd/foxhound/
foxhound init myproject && cd myproject
go mod tidy
foxhound run --config config.yaml
// Generate a consistent identity (UA + TLS + headers + OS all match)
id := identity.Generate(identity.WithBrowser(identity.BrowserFirefox))
profile := behavior.CarefulProfile().Jitter() // ±15% per-session parameter variance
browser, _ := fetch.NewCamoufox(
	fetch.WithBrowserIdentity(id),
	fetch.WithBehaviorProfile(profile),
	fetch.WithStorageState("session.json"), // persist session across runs
)
defer browser.Close()
// SmartFetcher with Bayesian domain learning — auto-escalates to browser when blocked
scorer := fetch.NewDomainScorer(fetch.SocialMediaScoreConfig())
smart := fetch.NewSmart(static, browser, fetch.WithDomainScorer(scorer))
// Trail: search → scroll feed → collect all business URLs
trail := engine.NewTrail("maps-search").
	Navigate("https://www.google.com/maps").
	Fill("input#searchboxinput", "restaurant in bali").
	Click("button#searchbox-searchbutton").
	WaitOptional("div[role='feed']", 10*time.Second).
	InfiniteScrollInUntil("div[role='feed']", "div.Nv2PK", 50, 200).
	Evaluate(`() => document.querySelectorAll('.Nv2PK').length`)
h := engine.NewHunt(engine.HuntConfig{
	Name: "maps",
	Walkers: 3,
	Seeds: trail.ToJobs(),
	Fetcher: middleware.Chain(
		middleware.NewCircuitBreaker(middleware.DefaultCircuitBreakerConfig()),
		middleware.NewAutoThrottle(middleware.AutoThrottleConfig{
			TargetConcurrency: 1, MinDelay: 2 * time.Second, MaxDelay: 15 * time.Second,
		}),
	).Wrap(smart),
	Queue: queue.NewReliable(queue.NewMemory(1000), queue.DefaultReliableConfig()),
	BehaviorProfile: profile,
	Processor: foxhound.ProcessorFunc(func(ctx context.Context, resp *foxhound.Response) (*foxhound.Result, error) {
		// Auto-detect page type and extract accordingly
		result, _ := parse.AutoExtract(resp)
		if result.Type == parse.ContentListing {
			var items []*foxhound.Item
			for _, l := range result.Listings {
				items = append(items, l.AsItem())
			}
			return &foxhound.Result{Items: items}, nil
		}
		// Fallback: extract contacts from business website
		item := foxhound.NewItem()
		item.Set("url", resp.URL)
		item.Set("emails", parse.ExtractEmails(resp))
		item.Set("phones", parse.ExtractPhones(resp))
		return &foxhound.Result{Items: []*foxhound.Item{item}}, nil
	}),
	Writers: []foxhound.Writer{jsonlWriter},
})
h.Run(context.Background())
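The example above mentions Bayesian domain learning that escalates to the browser when a domain blocks static requests. The README does not describe the model, so the following is only a minimal sketch of one way such a scorer could work: a Beta posterior over the block probability per domain, with escalation once the posterior mean passes a threshold. The `Scorer` type, its methods, the uniform prior, and the 0.5 threshold are all assumptions for illustration, not foxhound's `DomainScorer`.

```go
package main

import "fmt"

// domainStats holds a Beta(blocked, ok) posterior over the probability that
// a static (non-browser) request to a domain gets blocked.
// Hypothetical type for illustration only.
type domainStats struct {
	blocked float64 // prior pseudo-count + observed blocked responses
	ok      float64 // prior pseudo-count + observed successful responses
}

// Scorer tracks one posterior per domain. Not safe for concurrent use.
type Scorer struct {
	threshold float64
	domains   map[string]*domainStats
}

// NewScorer returns a scorer that recommends the browser once the posterior
// mean block probability exceeds threshold.
func NewScorer(threshold float64) *Scorer {
	return &Scorer{threshold: threshold, domains: make(map[string]*domainStats)}
}

func (s *Scorer) stats(domain string) *domainStats {
	st, found := s.domains[domain]
	if !found {
		st = &domainStats{blocked: 1, ok: 1} // uniform Beta(1, 1) prior
		s.domains[domain] = st
	}
	return st
}

// Observe records the outcome of one static fetch.
func (s *Scorer) Observe(domain string, wasBlocked bool) {
	st := s.stats(domain)
	if wasBlocked {
		st.blocked++
	} else {
		st.ok++
	}
}

// BlockProbability returns the posterior mean blocked / (blocked + ok).
func (s *Scorer) BlockProbability(domain string) float64 {
	st := s.stats(domain)
	return st.blocked / (st.blocked + st.ok)
}

// UseBrowser reports whether the domain should be fetched with the browser.
func (s *Scorer) UseBrowser(domain string) bool {
	return s.BlockProbability(domain) > s.threshold
}

func main() {
	s := NewScorer(0.5)
	for i := 0; i < 3; i++ {
		s.Observe("example.com", true)
	}
	s.Observe("example.com", false)
	fmt.Printf("%.2f %v\n", s.BlockProbability("example.com"), s.UseBrowser("example.com"))
}
```

Because the prior starts at 0.5, a single block is enough to tip an unseen domain toward the browser in this sketch; a real scorer would likely use a prior tuned per site category.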
// Login trail (reusable across sessions with WithStorageState)
login := engine.Login("ig-login",
	"https://www.instagram.com/accounts/login/",
	"input[name='username']", "input[name='password']", "button[type='submit']",
	os.Getenv("IG_USER"), os.Getenv("IG_PASS"),
)
// Feed scraping trail
feed := engine.NewTrail("ig-feed").
	Navigate("https://www.instagram.com/explore/").
	WaitOptional("article", 10*time.Second).
	InfiniteScrollUntil("article", 100, 500).
	Evaluate(`() => {
		const posts = document.querySelectorAll('a[href*="/p/"]');
		return Array.from(posts).map(a => a.href);
	}`)
result, _ := parse.AutoExtract(resp)
switch result.Type {
case parse.ContentArticle:
	fmt.Println(result.Article.Title, result.Article.WordCount, "words")
case parse.ContentListing:
	for _, listing := range result.Listings {
		fmt.Println(listing.Name, listing.Phone, listing.Rating)
	}
case parse.ContentProduct:
	fmt.Println("Product page detected")
}
// Extract preloaded JS data (Next.js, Nuxt, Redux, Apollo)
data, _ := parse.ExtractPreloadedData(resp)
fmt.Println("Framework:", data.Framework) // "nextjs", "nuxt", "react"...
// Detect pagination and follow next pages
links := parse.DetectPagination(resp) // multi-signal scoring (50pt threshold)
for _, link := range links {
	fmt.Println(link.Direction, link.URL, "score:", link.Score)
}
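The pagination snippet above mentions multi-signal scoring with a 50-point threshold. The signals and weights foxhound uses are not listed here, so the sketch below only illustrates the general idea: each independent hint that a link leads to the next page adds points, and a link is accepted once it reaches the threshold. The `Link` type, the four signals, and their weights are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// Link is a simplified anchor element (hypothetical type for illustration).
type Link struct {
	URL   string
	Text  string
	Rel   string
	Class string
}

// threshold mirrors the 50-point cutoff mentioned in the README.
const threshold = 50

// scoreNext adds points for each signal suggesting the link leads to the
// next page. The weights are illustrative assumptions.
func scoreNext(l Link) int {
	score := 0
	if strings.EqualFold(l.Rel, "next") {
		score += 50 // explicit rel="next" is decisive on its own
	}
	text := strings.ToLower(strings.TrimSpace(l.Text))
	if text == "next" || text == "next page" || text == "»" {
		score += 30
	}
	if strings.Contains(strings.ToLower(l.Class), "next") {
		score += 20
	}
	if strings.Contains(l.URL, "page=") || strings.Contains(l.URL, "/page/") {
		score += 20
	}
	return score
}

// detectNext keeps only the links whose score reaches the threshold.
func detectNext(links []Link) []Link {
	var out []Link
	for _, l := range links {
		if scoreNext(l) >= threshold {
			out = append(out, l)
		}
	}
	return out
}

func main() {
	links := []Link{
		{URL: "/catalogue?page=2", Text: "Next"},
		{URL: "/about", Text: "About", Class: "nav-next-steps"},
	}
	for _, l := range detectNext(links) {
		fmt.Println(l.URL, "score:", scoreNext(l))
	}
}
```

Requiring several weak signals to agree keeps a stray class name such as `nav-next-steps` from being mistaken for pagination.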
Most scrapers break the moment a target site renames a CSS class. Foxhound's adaptive selectors learn an element signature (tag, classes, text prefix, parent, depth, position) on the first successful match, then fall back to similarity matching when the primary CSS selector stops working — so a class rename, a wrapper-div change, or a sibling reordering does not break extraction.
Enable adaptive mode on a Hunt with WithAdaptive(savePath) (pass an empty string for in-memory only, or a JSON path to persist learned signatures across runs), then use the adaptive helpers on Response:
hunt := engine.NewHunt(engine.HuntConfig{
	Name: "shop",
	Fetcher: fetcher,
	Queue: q,
	Processor: foxhound.ProcessorFunc(func(ctx context.Context, resp *foxhound.Response) (*foxhound.Result, error) {
		// Inline: register and extract in one call. The signature is
		// learned automatically and persisted by the Hunt.
		title := resp.CSSAdaptive("h1.product-title", "title").Text()
		price := resp.CSSAdaptive(".price", "price").Text()
		// On future runs, even if .product-title gets renamed to
		// .item-name, similarity matching will recover the element.
		// Use Adaptive(name) for selectors registered earlier (e.g.
		// via Trail.Adaptive or a previous CSSAdaptive call).
		_ = resp.Adaptive("title")
		item := foxhound.NewItem()
		item.Set("title", title)
		item.Set("price", price)
		return &foxhound.Result{Items: []*foxhound.Item{item}}, nil
	}),
}).WithAdaptive("./adaptive_signatures.json")
You can also declare adaptive selectors at the Trail level:
trail := engine.NewTrail("books").
	Navigate("https://books.toscrape.com/").
	Adaptive("book_title", ".product_pod h3 a").
	Adaptive("book_price", ".product_pod .price_color")
See examples/adaptive/ for a complete runnable example demonstrating an adaptive selector surviving a CSS class rename.
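To make the signature idea concrete, here is a minimal sketch of similarity matching over the six fields named above (tag, classes, text prefix, parent, depth, position). It is not foxhound's implementation: the `Signature` struct, the weights, and the 0.6 acceptance threshold are assumptions chosen only to show how a renamed class can still leave a candidate close enough to the learned element.

```go
package main

import "fmt"

// Signature describes an element as learned on the first successful match.
// Hypothetical type; field names follow the README's description.
type Signature struct {
	Tag        string
	Classes    []string
	TextPrefix string
	Parent     string
	Depth      int
	Position   int
}

// jaccard returns |A ∩ B| / |A ∪ B| for two class lists.
func jaccard(a, b []string) float64 {
	if len(a) == 0 && len(b) == 0 {
		return 1
	}
	setA := make(map[string]bool, len(a))
	for _, c := range a {
		setA[c] = true
	}
	setB := make(map[string]bool, len(b))
	inter := 0
	for _, c := range b {
		if setB[c] {
			continue // count each distinct class once
		}
		setB[c] = true
		if setA[c] {
			inter++
		}
	}
	return float64(inter) / float64(len(setA)+len(setB)-inter)
}

// Similarity returns a weighted score in [0, 1]. Weights are assumptions.
func Similarity(a, b Signature) float64 {
	score := 0.20 * jaccard(a.Classes, b.Classes)
	if a.Tag == b.Tag {
		score += 0.25
	}
	if a.TextPrefix == b.TextPrefix {
		score += 0.25
	}
	if a.Parent == b.Parent {
		score += 0.15
	}
	if a.Depth == b.Depth {
		score += 0.075
	}
	if a.Position == b.Position {
		score += 0.075
	}
	return score
}

// BestMatch returns the index of the most similar candidate, or -1 when no
// candidate reaches minScore.
func BestMatch(learned Signature, candidates []Signature, minScore float64) (int, float64) {
	best, bestScore := -1, 0.0
	for i, c := range candidates {
		if s := Similarity(learned, c); s >= minScore && s > bestScore {
			best, bestScore = i, s
		}
	}
	return best, bestScore
}

func main() {
	learned := Signature{"h1", []string{"product-title"}, "Blue Yoga", "div", 4, 0}
	candidates := []Signature{
		{"span", []string{"price"}, "$19", "div", 5, 1},
		{"h1", []string{"item-name"}, "Blue Yoga", "div", 4, 0}, // class renamed
	}
	idx, score := BestMatch(learned, candidates, 0.6)
	fmt.Printf("match=%d score=%.2f\n", idx, score)
}
```

In this sketch the class rename costs only the 0.20 class weight, so the renamed `h1` still scores well above the unrelated `span`.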
fetch.NewStealth ships in two flavours selected by build tag:
- Default build (no -tags tls): crypto/tls ClientHello with a well-known JA3, trivially detected. Use for tests, CI, or non-bot-protected targets only.
- -tags tls build: full JA3 / Akamai HTTP/2 / HTTP/3 impersonation via azuretls-client. Use for production scraping.

The same fetch.NewStealth API exists in both, but the underlying TLS layer is completely different. Confirm at startup:
f := fetch.NewStealth(fetch.WithIdentity(profile))
if !f.IsImpersonating() {
	log.Fatal("built without -tags tls; refusing to start in production")
}
Or check the binary directly:
go tool nm /path/to/binary | grep -q azuretls && echo "✅ TLS impersonation active" || echo "❌ Built without -tags tls"
WithIdentity is the only thing you need for fingerprint consistency. It sets the azuretls browser family to match the profile ("firefox" for a Firefox profile — foxhound's primary target since Camoufox is Firefox-based) and lets azuretls's built-in GetLastFirefoxVersion produce the ClientHello at request time:
import "github.com/sadewadee/foxhound/fetch"
f := fetch.NewStealth(fetch.WithIdentity(profile))
The HTTP/2 layer is left to azuretls's browser-aware initHTTP2(browser) so TLS, headers, and HTTP/2 all agree on Firefox. Manual WithHTTP2Fingerprint is supported for power users but logs a startup warning when paired with WithJA3 (see issue #41).
Verified against https://www.bing.com/search and https://duckduckgo.com/ through a datacenter proxy: both return 200 with WithIdentity alone.
NewStealth now sets InsecureSkipVerify=true by default. This disables azuretls's built-in DefaultPinManager, which performs an extra TLS handshake per new host to capture SPKI fingerprints and then fails on subsequent requests if a different CDN edge serves a different certificate. Multi-edge targets (Bing, Google, Cloudflare) rotate certificates continuously, making the default PinManager behaviour incompatible with sustained scraping.
foxhound's threat model is bot detection avoidance, not MITM prevention. The default is safe for scraping public sites over a controlled proxy path.
To re-enable full certificate chain, hostname, and pin verification:
f := fetch.NewStealth(
	fetch.WithIdentity(profile),
	fetch.WithStrictTLSVerify(), // re-enables chain + hostname + pin checks
)
The startup log includes tls_verify=true when strict mode is active and tls_verify=false otherwise (the default).
Capture your own Firefox JA3 from tls.peet.ws when the curated preset lags real Firefox:
f := fetch.NewStealth(
	fetch.WithIdentity(profile),
	fetch.WithJA3(myCapturedJA3), // overrides the auto-applied preset
)
For per-recycle rotation, supply a pool of multiple Firefox captures:
pool := []string{ja3FromYesterday, ja3FromLastWeek, presets.FirefoxLatest().JA3}
f := fetch.NewStealth(
	fetch.WithIdentity(profile),
	fetch.WithJA3Pool(pool),
)
Without -tags tls these options compile but log an error at startup — the underlying net/http transport cannot customise the TLS ClientHello.
By default, foxhound matches the identity locale to the proxy exit IP (anti-detection principle #6). When scraping English-language content through a proxy in a non-English-speaking country, the locale-query mismatch is itself a detection signal. Use LocalePolicyEnglishDefault to force en-US while keeping timezone and geo coordinates proxy-matched:
id := identity.Generate(
	identity.WithCountry("RU"), // timezone=Europe/Moscow, geo=Moscow
	identity.WithLocalePolicy(identity.LocalePolicyEnglishDefault), // locale=en-US
)
// Accept-Language: en-US,en;q=0.5 (regardless of proxy country)
// navigator.language = "en-US"
// Timezone = "Europe/Moscow" (unchanged — physical location is coherent)
An explicit WithLocale(locale, langs...) call always takes precedence over any policy.
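The precedence described above (explicit locale first, then policy, then the proxy country) can be sketched in a few lines. This is an illustration under stated assumptions, not foxhound's code: `resolveLocale`, `acceptLanguage`, the `LocalePolicy` constants, and the tiny country table are hypothetical, and the header format simply follows the `en-US,en;q=0.5` value shown in the example.

```go
package main

import (
	"fmt"
	"strings"
)

// LocalePolicy selects how the locale is derived (hypothetical type).
type LocalePolicy int

const (
	PolicyMatchProxy     LocalePolicy = iota // locale follows the proxy country
	PolicyEnglishDefault                     // force en-US regardless of country
)

// countryLocale is a tiny illustrative lookup, not a complete table.
var countryLocale = map[string]string{
	"RU": "ru-RU",
	"DE": "de-DE",
	"US": "en-US",
}

// resolveLocale applies the precedence: explicit > policy > proxy country.
func resolveLocale(country string, policy LocalePolicy, explicit string) string {
	if explicit != "" {
		return explicit
	}
	if policy == PolicyEnglishDefault {
		return "en-US"
	}
	if loc, ok := countryLocale[country]; ok {
		return loc
	}
	return "en-US" // fallback for countries missing from the table
}

// acceptLanguage builds a Firefox-style header value such as "en-US,en;q=0.5".
func acceptLanguage(locale string) string {
	base, _, found := strings.Cut(locale, "-")
	if !found {
		return locale // no region subtag, nothing to append
	}
	return locale + "," + base + ";q=0.5"
}

func main() {
	loc := resolveLocale("RU", PolicyEnglishDefault, "")
	fmt.Println("Accept-Language:", acceptLanguage(loc))

	loc = resolveLocale("RU", PolicyMatchProxy, "")
	fmt.Println("Accept-Language:", acceptLanguage(loc))
}
```

Timezone and geo coordinates are deliberately left out of the sketch: per the text above they stay tied to the proxy country whichever locale policy is chosen.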
| Target | Mode | Items | Block Avoidance | Notes |
|---|---|---|---|---|
| Google Maps (10 queries) | Camoufox + proxy | 100 places | 100% | 1,297 items/hour, 0 CAPTCHAs |
| Alibaba (yoga mat) | Camoufox + proxy | 10 products | 100% | Prices + suppliers extracted |
| bot.sannysoft.com | Camoufox | 29/30 PASS | — | webdriver NOT detected |
| CreepJS | Camoufox | Trust: HIGH | — | Fingerprint consistent |
Measured on hachibi (AMD Ryzen 7 5700G, Docker container, 2 cores / 4GB RAM, Ubuntu 24.04).
| Library | Language | Time | vs Foxhound |
|---|---|---|---|
| Foxhound CSS | Go | 13.6ms | 1.0x |
| Raw goquery | Go | 13.0ms | 0.96x |
| stdlib html | Go | 17.7ms | 1.3x slower |
| Raw lxml | Python/C | 195.8ms | 14.4x slower |
| BeautifulSoup | Python | 245.6ms | 18.1x slower |
| Method | Time | Memory | Allocs | Notes |
|---|---|---|---|---|
| Foxhound CSS | 13.6ms | 6.5 MB | 100K | ~5% overhead vs raw goquery |
| Foxhound Adaptive | 17.3ms | 6.2 MB | 95K | Zero overhead when selector works |
| Foxhound Schema | 31.3ms | 13.3 MB | 320K | 3 fields per item |
| Foxhound TextExtract | 22.5ms | 10.0 MB | 270K | 3 fields per item |
| FindByText | 24.6ms | 12.1 MB | 165K | Full DOM text search |
| Regex extract | 6.7ms | 1.1 MB | 15K | Pattern matching on body |
| Similarity score | 96ns | 0 B | 0 | Zero allocation |
| Item.ToJSON | 1.2µs | 432 B | 10 | — |
| Item.ToMarkdown | 716ns | 376 B | 8 | — |
| Benchmark | 1K elements | 5K elements | 10K elements | Scaling |
|---|---|---|---|---|
| Foxhound CSS | 2.3ms | 13.6ms | 29.6ms | ~linear |
| Regex extract | 1.5ms | 6.7ms | 15.7ms | ~linear |
| stdlib html | 3.1ms | 17.7ms | 31.4ms | ~linear |
# Run yourself
go test -bench=. -benchmem ./benchmarks/
# Run in Docker with resource limits
docker run --cpus=2 --memory=4g foxhound-benchmark:latest \
go test -bench=. -benchmem ./benchmarks/
| File | Contents |
|---|---|
| docs/getting-started.md | Install, first scrape, running modes |
| docs/configuration.md | Full config.yaml reference |
| docs/cli.md | All CLI commands and flags |
| docs/api.md | Go types, interfaces, Hunt/Stream API |
| docs/anti-detection.md | Identity system, TLS, behavior simulation |
| docs/parsing.md | Table, preload, directory, pagination, auto-detection parsers |
| docs/middleware.md | All 13 middleware, chain order |
| docs/pipeline.md | Pipeline stages and all 9 export formats |
| docs/proxy.md | Proxy pool, rotation, providers, geo matching |
| docs/browser.md | Camoufox setup, options, human simulation |
| docs/examples.md | E-commerce, Maps, adaptive parsing, streaming |
| docs/deployment.md | Docker, scaling, environment variables |
| Format | Constructor | Notes |
|---|---|---|
| JSON array | export.NewJSON(path, export.JSONArray) | Single file, full array |
| JSON Lines | export.NewJSON(path, export.JSONLines) | One object per line, streaming-friendly |
| CSV | export.NewCSV(path, cols...) | Fixed or auto-inferred columns |
| Markdown table | export.NewMarkdown(path, export.MarkdownTable) | GFM pipe table |
| Markdown list | export.NewMarkdown(path, export.MarkdownList) | Bullet list, first field bolded |
| Markdown cards | export.NewMarkdown(path, export.MarkdownCards) | H2 heading + bullet fields |
| Plain text lines | export.NewText(path, export.TextLines) | key=value per line |
| Plain text pretty | export.NewText(path, export.TextPretty) | Labelled blocks with separators |
| XML | export.NewXML(path, root, item) | Configurable root/item element names |
| SQLite | export.NewSQLite(dbPath, table) | Auto-creates and extends schema |
| PostgreSQL | export.NewPostgres(dsn, table) | Upsert support, batch inserts |
| Webhook | export.NewWebhook(url) | HTTP POST, optional batch size |
Job → rate limit → dedup → behavior timing → header enrichment
→ Smart Fetcher (static TLS or Camoufox browser)
→ Block detection (9 vendor patterns) → retry with backoff
→ Parser (CSS / XPath / JSON / Regex / Adaptive / Similarity)
→ User Process() → Result{Items, NextJobs}
→ Pipeline (validate, clean, dedup) → Writers (9 formats)
→ Queue (memory / Redis / SQLite)
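The flow above layers stages around a fetcher, which is what `middleware.Chain(...).Wrap(smart)` does in the earlier example. A minimal sketch of that wrapping pattern using only the standard library follows; `Fetcher`, `Middleware`, `Chain`, `Dedup`, and `Retry` are hypothetical stand-ins written for this illustration, not foxhound's types, and the real middleware have richer behavior (backoff, block detection, rate limiting).

```go
package main

import (
	"errors"
	"fmt"
)

// Fetcher is the minimal interface every stage wraps (hypothetical).
type Fetcher interface {
	Fetch(url string) (string, error)
}

// FetcherFunc adapts a plain function to the Fetcher interface.
type FetcherFunc func(url string) (string, error)

func (f FetcherFunc) Fetch(url string) (string, error) { return f(url) }

// Middleware wraps a Fetcher with extra behavior.
type Middleware func(next Fetcher) Fetcher

var (
	ErrDuplicate = errors.New("duplicate url")
	ErrTemporary = errors.New("temporary failure")
)

// Chain composes middleware so the first one listed is the outermost layer.
func Chain(mws ...Middleware) Middleware {
	return func(next Fetcher) Fetcher {
		for i := len(mws) - 1; i >= 0; i-- {
			next = mws[i](next)
		}
		return next
	}
}

// Dedup rejects URLs that were already requested. Not safe for concurrent use.
func Dedup() Middleware {
	seen := map[string]bool{}
	return func(next Fetcher) Fetcher {
		return FetcherFunc(func(url string) (string, error) {
			if seen[url] {
				return "", ErrDuplicate
			}
			seen[url] = true
			return next.Fetch(url)
		})
	}
}

// Retry calls the wrapped fetcher up to attempts times until it succeeds.
func Retry(attempts int) Middleware {
	if attempts < 1 {
		attempts = 1
	}
	return func(next Fetcher) Fetcher {
		return FetcherFunc(func(url string) (string, error) {
			var body string
			var err error
			for i := 0; i < attempts; i++ {
				if body, err = next.Fetch(url); err == nil {
					return body, nil
				}
			}
			return "", err
		})
	}
}

func main() {
	calls := 0
	flaky := FetcherFunc(func(url string) (string, error) {
		calls++
		if calls < 3 {
			return "", ErrTemporary
		}
		return "<html>ok</html>", nil
	})

	f := Chain(Dedup(), Retry(3))(flaky)
	body, err := f.Fetch("https://example.com/")
	fmt.Println(body, err, calls)

	_, err = f.Fetch("https://example.com/")
	fmt.Println(err)
}
```

Ordering matters: with dedup outside retry, the retries of a single job are not mistaken for duplicate requests.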
MIT