
Research
/Security News
npm Package Uses Prompt Injection and Token Flooding to Disrupt AI Malware Scanners
A new npm package tests AI malware scanners with prompt injection, safety-triggering comments, context flooding, and obfuscated JavaScript.
@fre4x/benchmark
Advanced tools
This package exposes a consistent MCP workflow for deterministic benchmark-driven agent evaluation.
The rebuilt core is organized around challenge catalogs, typed task assets, and explicit checker kinds so coding, web, and OS-style tasks can share one MCP surface without relying on LLM judges.
| Tool | Purpose |
|---|---|
| list_challenges | List deterministic benchmark suites with family, runner, and checker metadata |
| get_catalog_status | Inspect catalog source configuration, cache state, and availability |
| sync_catalog | Fetch and cache the remote benchmark catalog when a URL source is configured |
| start_challenge | Start an attempt and return the first task |
| submit_solution | Grade one task and return checker evidence plus the next task or final score |
| get_asset | Read an attached benchmark asset by asset_id |
| get_attempt | Inspect attempt status, current task, and paginated evaluation history |
| cancel_attempt | Cancel an active attempt |
Typical flow:
1. list_challenges
2. Pick a challenge_id
3. start_challenge
4. get_asset for any attached assets
5. submit_solution until the response reports done: true

Each response includes machine-readable guidance for the most likely next tool call.
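The tool order above can be read as a small state machine on the client side. The following is a minimal sketch, not the package's own code: `AttemptState` and `nextTool` are hypothetical names, and the state fields are assumptions made for illustration rather than the server's response schema. Only the tool names, `challenge_id`, `asset_id`, and `done` come from this README.

```typescript
// Client-side view of an attempt. All field names here are illustrative
// assumptions, not the server's actual response schema.
interface AttemptState {
  challengeId?: string;      // chosen from list_challenges output
  attemptStarted: boolean;   // true once start_challenge has returned
  pendingAssetIds: string[]; // asset_ids on the current task not yet read
  done: boolean;             // mirrors done: true from submit_solution
}

type ToolName =
  | "list_challenges"
  | "start_challenge"
  | "get_asset"
  | "submit_solution";

// Returns the next tool to call, or null when the attempt is finished.
function nextTool(state: AttemptState): ToolName | null {
  if (state.done) return null;
  if (state.challengeId === undefined) return "list_challenges";
  if (!state.attemptStarted) return "start_challenge";
  if (state.pendingAssetIds.length > 0) return "get_asset";
  return "submit_solution";
}
```

In practice the server's own guidance field should take priority over a hard-coded order like this; the sketch only shows the default path through the tools.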
Run with the bundled fallback catalog and no extra configuration:
npx @fre4x/benchmark
Or from this repo:
cd /home/fritzprix/my_works/b1te
npm run inspector -w @fre4x/benchmark
Run with the same bundled fallback catalog in mock mode:
MOCK=true npx @fre4x/benchmark
BENCHMARK_CATALOG_FILE=/absolute/path/to/benchmark-catalog.json
BENCHMARK_CATALOG_URL=https://example.com/benchmark-catalog.json
BENCHMARK_CACHE_DIR=/absolute/path/to/catalog-cache
BENCHMARK_CACHE_TTL_SECONDS=3600
BENCHMARK_STATE_DIR=/absolute/path/to/store-attempt-json
BENCHMARK_MOCK=true
- BENCHMARK_CATALOG_FILE: Optional JSON file with challenge definitions in the rebuilt deterministic catalog format
- BENCHMARK_CATALOG_URL: Optional remote JSON catalog URL for fetch/cache based ingestion
- BENCHMARK_CACHE_DIR: Optional cache directory for remote catalog snapshots
- BENCHMARK_CACHE_TTL_SECONDS: Freshness window for remote catalog cache reuse
- BENCHMARK_STATE_DIR: Where attempt files and lock directories are persisted
- BENCHMARK_MOCK: Alternate mock-mode flag

BENCHMARK_GAIA_DATA_FILE is still accepted as a backward-compatible alias, but the rebuilt package is no longer GAIA-first.
When BENCHMARK_CATALOG_URL is set, the package will reuse a fresh cached copy when available and can be explicitly refreshed with sync_catalog.
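The cache reuse rule can be illustrated with a short freshness check. This is a sketch under assumptions: `CachedCatalog`, `isFresh`, and `shouldFetch` are hypothetical names, and the package's real cache metadata and boundary behavior may differ. The `forceSync` flag models an explicit sync_catalog call.

```typescript
// Metadata a cache layer might store next to a catalog snapshot (assumed shape).
interface CachedCatalog {
  fetchedAtMs: number; // epoch milliseconds when the snapshot was downloaded
  body: string;        // raw JSON text of the catalog
}

// True when the snapshot is younger than the TTL and can be reused.
function isFresh(cache: CachedCatalog, nowMs: number, ttlSeconds: number): boolean {
  const ageMs = nowMs - cache.fetchedAtMs;
  return ageMs >= 0 && ageMs < ttlSeconds * 1000;
}

// Decide between reusing the cache and refetching. forceSync models an
// explicit sync_catalog call, which refreshes regardless of freshness.
function shouldFetch(
  cache: CachedCatalog | undefined,
  nowMs: number,
  ttlSeconds: number,
  forceSync: boolean,
): boolean {
  if (forceSync || cache === undefined) return true;
  return !isFresh(cache, nowMs, ttlSeconds);
}
```

With BENCHMARK_CACHE_TTL_SECONDS=3600, a snapshot fetched 59 minutes ago would be reused under this sketch, while one fetched exactly an hour ago would be refetched.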
External catalogs must be a JSON array of challenge definitions shaped like:
[
  {
    "challenge_id": "custom_suite",
    "benchmark_id": "custom",
    "family": "code",
    "runner_kind": "code_runner",
    "title": "Custom Challenge",
    "description": "Deterministic single-task suite",
    "version": "v1",
    "source": "external",
    "tasks": [
      {
        "task_id": "custom-1",
        "title": "Return yes",
        "prompt": "Return only yes.",
        "response_format": "text",
        "difficulty": 1,
        "assets": [],
        "checkers": [
          {
            "checker_id": "custom-yes",
            "kind": "exact_text",
            "expected": "yes"
          }
        ]
      }
    ]
  }
]
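Before pointing the server at an external catalog, it can help to check the file against the shape shown above. The types below mirror the example entry; which fields are actually required is an assumption, and `parseCatalog` is a hypothetical helper, not an export of this package.

```typescript
// Types mirroring the example catalog entry. Optionality is assumed.
interface Checker {
  checker_id: string;
  kind: string;
  expected?: string;
}

interface CatalogTask {
  task_id: string;
  title: string;
  prompt: string;
  response_format: string;
  difficulty: number;
  assets: unknown[];
  checkers: Checker[];
}

interface Challenge {
  challenge_id: string;
  benchmark_id: string;
  family: string;
  runner_kind: string;
  title: string;
  description: string;
  version: string;
  source: string;
  tasks: CatalogTask[];
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Parses catalog text and performs basic structural checks only.
// Throws on malformed JSON or on the first structural problem found.
function parseCatalog(raw: string): Challenge[] {
  const parsed: unknown = JSON.parse(raw);
  if (!Array.isArray(parsed)) {
    throw new Error("catalog must be a JSON array of challenge definitions");
  }
  parsed.forEach((entry: unknown, i: number) => {
    if (!isRecord(entry) || typeof entry["challenge_id"] !== "string") {
      throw new Error(`entry ${i}: missing string challenge_id`);
    }
    const tasks = entry["tasks"];
    if (!Array.isArray(tasks)) {
      throw new Error(`entry ${i}: tasks must be an array`);
    }
    tasks.forEach((task: unknown, j: number) => {
      if (!isRecord(task) || typeof task["task_id"] !== "string") {
        throw new Error(`entry ${i}, task ${j}: missing string task_id`);
      }
      if (!Array.isArray(task["checkers"])) {
        throw new Error(`entry ${i}, task ${j}: checkers must be an array`);
      }
    });
  });
  return parsed as Challenge[];
}
```

The final cast trusts fields that were not checked individually, so this sketch is a quick sanity check rather than full schema validation.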
Supported checker kinds today:
- exact_text
- normalized_text
- regex_match
- contains_all_text
- json_field_equals
- runner_fact_equals
- runner_log_contains_text

Each submission now also records runner execution metadata.
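To make the idea of deterministic checking concrete, here is a rough sketch of how three of the text-based checker kinds might be evaluated. The semantics are assumptions, not the package's implementation: the normalization rule (trim, lowercase, collapse whitespace) is a guess, `pattern` is a hypothetical field for regex_match, and `runChecker` is an illustrative name. Only `checker_id`, `kind`, and `expected` appear in the catalog example.

```typescript
// Checker config for this sketch. "expected" appears in the catalog example;
// "pattern" is a hypothetical field for regex_match.
interface CheckerSpec {
  checker_id: string;
  kind: "exact_text" | "normalized_text" | "regex_match";
  expected?: string;
  pattern?: string;
}

interface CheckerEvidence {
  checker_id: string;
  passed: boolean;
}

// Assumed normalization: trim, lowercase, collapse runs of whitespace.
function normalize(text: string): string {
  return text.trim().toLowerCase().replace(/\s+/g, " ");
}

// Evaluates one checker against a submission without any LLM judgment.
function runChecker(spec: CheckerSpec, submission: string): CheckerEvidence {
  let passed = false;
  switch (spec.kind) {
    case "exact_text":
      passed = spec.expected !== undefined && submission === spec.expected;
      break;
    case "normalized_text":
      passed =
        spec.expected !== undefined &&
        normalize(submission) === normalize(spec.expected);
      break;
    case "regex_match":
      passed =
        spec.pattern !== undefined && new RegExp(spec.pattern).test(submission);
      break;
  }
  return { checker_id: spec.checker_id, passed };
}
```

Because each checker is a pure function of the spec and the submission, the same input always produces the same verdict, which is the property the catalog format relies on.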
The repo includes the bundled fallback catalog source at:
benchmark/catalogs/expanded-catalog.json
Use it like this:
cd /home/fritzprix/my_works/b1te
BENCHMARK_CATALOG_FILE=/home/fritzprix/my_works/b1te/benchmark/catalogs/expanded-catalog.json npm run inspector -w @fre4x/benchmark
If no catalog env is provided at runtime, the published package falls back to the bundled copy of this catalog automatically.
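The fallback behavior can be pictured as a small resolution function over the environment. This is a sketch with loudly assumed details: the precedence between a file and a URL source is not stated in this README, BENCHMARK_GAIA_DATA_FILE is treated here as an alias for the file source, and `resolveCatalogSource` is a hypothetical name. Only the final fallback to the bundled catalog is documented behavior.

```typescript
type CatalogSource =
  | { kind: "file"; path: string }
  | { kind: "url"; url: string }
  | { kind: "bundled" };

// env is passed in explicitly so the function stays pure and easy to test.
function resolveCatalogSource(
  env: Record<string, string | undefined>,
): CatalogSource {
  // Assumed precedence: explicit file, then the legacy alias, then URL.
  const file = env["BENCHMARK_CATALOG_FILE"] || env["BENCHMARK_GAIA_DATA_FILE"];
  if (file) return { kind: "file", path: file };

  const url = env["BENCHMARK_CATALOG_URL"];
  if (url) return { kind: "url", url };

  // Documented behavior: with no catalog env set, use the bundled copy.
  return { kind: "bundled" };
}
```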
{
  "mcpServers": {
    "benchmark": {
      "command": "npx",
      "args": ["-y", "@fre4x/benchmark"],
      "env": {
        "BENCHMARK_CATALOG_URL": "https://example.com/benchmark-catalog.json",
        "BENCHMARK_CACHE_DIR": "/absolute/path/to/benchmark-cache"
      }
    }
  }
}
npm install
npm run build -w @fre4x/benchmark
npm run typecheck -w @fre4x/benchmark
npm test -w @fre4x/benchmark
MOCK=true npm run inspector -w @fre4x/benchmark
FAQs
A deterministic benchmark MCP server for agent evaluation workflows.
We found that @fre4x/benchmark demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.