@fre4x/benchmark

A deterministic benchmark MCP server for agent evaluation workflows.

Version: 1.1.1

Version published: last week

Weekly downloads: 15

Maintainers: 1

Created: last month

benchmark — Deterministic Agent Evaluation

This package exposes a consistent MCP workflow for deterministic benchmark-driven agent evaluation.

The rebuilt core is organized around challenge catalogs, typed task assets, and explicit checker kinds so coding, web, and OS-style tasks can share one MCP surface without relying on LLM judges.

Tools

| Tool | Purpose |
| --- | --- |
| `list_challenges` | List deterministic benchmark suites with family, runner, and checker metadata |
| `get_catalog_status` | Inspect catalog source configuration, cache state, and availability |
| `sync_catalog` | Fetch and cache the remote benchmark catalog when a URL source is configured |
| `start_challenge` | Start an attempt and return the first task |
| `submit_solution` | Grade one task and return checker evidence plus the next task or final score |
| `get_asset` | Read an attached benchmark asset by `asset_id` |
| `get_attempt` | Inspect attempt status, current task, and paginated evaluation history |
| `cancel_attempt` | Cancel an active attempt |

Workflow

1. Call list_challenges
2. Pick a challenge_id
3. Call start_challenge
4. If the task has assets, call get_asset
5. Call submit_solution
6. Repeat until done: true

Each response includes machine-readable guidance for the most likely next tool call.
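The workflow above can be sketched as a small driver loop. This is a minimal sketch, not the package's client: `callTool` stands in for whatever MCP client you use, and apart from `challenge_id`, `asset_id`, and `done`, every field name (`attempt_id`, `task`, `task_id`, `prompt`, `asset_ids`, `answer`, `score`) is a hypothetical placeholder. It assumes steps 1 and 2 have already produced a `challenge_id`.

```typescript
// Hypothetical response shape: only `done` is named in this README.
interface WorkflowTask {
  task_id: string;
  prompt: string;
  asset_ids: string[]; // assumed: ids to pass to get_asset
}

interface StepResult {
  attempt_id: string;
  done: boolean;
  task?: WorkflowTask;
  score?: number;
}

// Adapter signature: wrap your MCP client's tool-call method to match this.
type CallTool = (name: string, args: Record<string, unknown>) => Promise<unknown>;

// Produces an answer for one task from its prompt and any fetched assets.
type Solver = (task: WorkflowTask, assets: unknown[]) => Promise<string>;

async function runChallenge(
  callTool: CallTool,
  challengeId: string,
  solve: Solver,
): Promise<StepResult> {
  // Step 3: start an attempt and receive the first task.
  let step = (await callTool("start_challenge", {
    challenge_id: challengeId,
  })) as StepResult;

  // Steps 4-6: fetch assets, submit, and repeat until the server reports done.
  while (!step.done && step.task) {
    const task = step.task;
    const assets: unknown[] = [];
    for (const assetId of task.asset_ids) {
      assets.push(await callTool("get_asset", { asset_id: assetId }));
    }
    const answer = await solve(task, assets);
    step = (await callTool("submit_solution", {
      attempt_id: step.attempt_id,
      task_id: task.task_id,
      answer,
    })) as StepResult;
  }
  return step;
}
```

Taking `callTool` as a parameter keeps the loop independent of any particular MCP SDK and makes it easy to exercise against a stub.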

Fallback benchmark families

Code — deterministic JSON/text answers backed by explicit checkers
Web — DOM snapshot extraction tasks with JSON field assertions
OS — filesystem/log review tasks with deterministic text grading

Zero-config run

Run with the bundled fallback catalog and no extra configuration:

npx @fre4x/benchmark

Or from a local checkout of this repo (the path below is the author's checkout; substitute your own):

cd /home/fritzprix/my_works/b1te
npm run inspector -w @fre4x/benchmark

Mock Mode

Run with the same bundled fallback catalog in mock mode:

MOCK=true npx @fre4x/benchmark

Optional environment

BENCHMARK_CATALOG_FILE=/absolute/path/to/benchmark-catalog.json
BENCHMARK_CATALOG_URL=https://example.com/benchmark-catalog.json
BENCHMARK_CACHE_DIR=/absolute/path/to/catalog-cache
BENCHMARK_CACHE_TTL_SECONDS=3600
BENCHMARK_STATE_DIR=/absolute/path/to/store-attempt-json
BENCHMARK_MOCK=true

BENCHMARK_CATALOG_FILE: Optional JSON file with challenge definitions in the rebuilt deterministic catalog format
BENCHMARK_CATALOG_URL: Optional remote JSON catalog URL for fetch/cache based ingestion
BENCHMARK_CACHE_DIR: Optional cache directory for remote catalog snapshots
BENCHMARK_CACHE_TTL_SECONDS: Freshness window for remote catalog cache reuse
BENCHMARK_STATE_DIR: Where attempt files and lock directories are persisted
BENCHMARK_MOCK: Alternate flag for enabling mock mode, alongside MOCK

BENCHMARK_GAIA_DATA_FILE is still accepted as a backward-compatible alias, but the rebuilt package is no longer GAIA-first.
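As an illustration of how these variables might be resolved, here is a minimal sketch. It is not the package's actual loader: `BenchmarkConfig` and `readConfig` are hypothetical names, and two behaviors are assumptions rather than documented facts: that BENCHMARK_CATALOG_FILE takes precedence over the BENCHMARK_GAIA_DATA_FILE alias, and that either mock flag set to the string "true" enables mock mode.

```typescript
// Hypothetical config shape; the package's internal types may differ.
interface BenchmarkConfig {
  catalogFile?: string;
  catalogUrl?: string;
  cacheDir?: string;
  cacheTtlSeconds?: number;
  stateDir?: string;
  mock: boolean;
}

type Env = Record<string, string | undefined>;

function readConfig(env: Env): BenchmarkConfig {
  // Parse the TTL defensively: ignore missing, empty, negative, or non-numeric values.
  const rawTtl = env["BENCHMARK_CACHE_TTL_SECONDS"];
  const ttl = rawTtl === undefined || rawTtl === "" ? NaN : Number(rawTtl);

  return {
    // Assumption: the new name wins when both it and the legacy alias are set.
    catalogFile: env["BENCHMARK_CATALOG_FILE"] ?? env["BENCHMARK_GAIA_DATA_FILE"],
    catalogUrl: env["BENCHMARK_CATALOG_URL"],
    cacheDir: env["BENCHMARK_CACHE_DIR"],
    cacheTtlSeconds: Number.isFinite(ttl) && ttl >= 0 ? ttl : undefined,
    stateDir: env["BENCHMARK_STATE_DIR"],
    // Assumption: either flag set to "true" turns mock mode on.
    mock: env["MOCK"] === "true" || env["BENCHMARK_MOCK"] === "true",
  };
}
```

In a Node.js process you would call `readConfig(process.env)`; passing the environment in as an argument keeps the function easy to test.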

When BENCHMARK_CATALOG_URL is set, the package reuses the cached copy while it is still fresh; call sync_catalog to refresh the cache explicitly.
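The cache-reuse rule can be pictured as a freshness check against BENCHMARK_CACHE_TTL_SECONDS. The sketch below is an assumption about how such a check could work, not the package's code: `isCacheFresh` and `chooseCatalogSource` are hypothetical helpers, and `forceSync` models an explicit sync_catalog call.

```typescript
type CatalogSource = "cache" | "network";

// A cached snapshot is fresh while its age is below the TTL window.
function isCacheFresh(fetchedAtMs: number, ttlSeconds: number, nowMs: number): boolean {
  const ageMs = nowMs - fetchedAtMs;
  return ageMs >= 0 && ageMs < ttlSeconds * 1000;
}

// Decide where the catalog should come from on this request.
function chooseCatalogSource(
  cachedAtMs: number | undefined, // undefined when nothing has been cached yet
  ttlSeconds: number,
  forceSync: boolean, // true models an explicit sync_catalog call
  nowMs: number = Date.now(),
): CatalogSource {
  if (forceSync || cachedAtMs === undefined) return "network";
  return isCacheFresh(cachedAtMs, ttlSeconds, nowMs) ? "cache" : "network";
}
```

With a TTL of 3600 seconds, a snapshot fetched 30 minutes ago would be reused, while one fetched 61 minutes ago would trigger a new fetch.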

Catalog shape

External catalogs must be a JSON array of challenge definitions shaped like:

[
  {
    "challenge_id": "custom_suite",
    "benchmark_id": "custom",
    "family": "code",
    "runner_kind": "code_runner",
    "title": "Custom Challenge",
    "description": "Deterministic single-task suite",
    "version": "v1",
    "source": "external",
    "tasks": [
      {
        "task_id": "custom-1",
        "title": "Return yes",
        "prompt": "Return only yes.",
        "response_format": "text",
        "difficulty": 1,
        "assets": [],
        "checkers": [
          {
            "checker_id": "custom-yes",
            "kind": "exact_text",
            "expected": "yes"
          }
        ]
      }
    ]
  }
]
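Before pointing BENCHMARK_CATALOG_FILE at a hand-written catalog, it can help to sanity-check its shape. The following is a minimal sketch derived only from the example above: it assumes every field shown there is required, which the README does not state, and `validateCatalog` is a hypothetical helper rather than part of the package.

```typescript
// Field lists are taken from the example catalog; treating them all as
// required is an assumption.
const CHALLENGE_STRING_FIELDS = [
  "challenge_id", "benchmark_id", "family", "runner_kind",
  "title", "description", "version", "source",
] as const;
const TASK_STRING_FIELDS = ["task_id", "title", "prompt", "response_format"] as const;

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Returns a list of human-readable problems; an empty list means the shape matches.
function validateCatalog(raw: unknown): string[] {
  if (!Array.isArray(raw)) return ["catalog must be a JSON array"];
  const problems: string[] = [];

  raw.forEach((challenge: unknown, i: number) => {
    if (!isRecord(challenge)) {
      problems.push(`[${i}] must be an object`);
      return;
    }
    for (const field of CHALLENGE_STRING_FIELDS) {
      if (typeof challenge[field] !== "string") problems.push(`[${i}].${field} must be a string`);
    }
    const tasks = challenge["tasks"];
    if (!Array.isArray(tasks)) {
      problems.push(`[${i}].tasks must be an array`);
      return;
    }
    tasks.forEach((task: unknown, j: number) => {
      const at = `[${i}].tasks[${j}]`;
      if (!isRecord(task)) {
        problems.push(`${at} must be an object`);
        return;
      }
      for (const field of TASK_STRING_FIELDS) {
        if (typeof task[field] !== "string") problems.push(`${at}.${field} must be a string`);
      }
      if (typeof task["difficulty"] !== "number") problems.push(`${at}.difficulty must be a number`);
      if (!Array.isArray(task["assets"])) problems.push(`${at}.assets must be an array`);
      if (!Array.isArray(task["checkers"])) problems.push(`${at}.checkers must be an array`);
    });
  });
  return problems;
}
```

A typical use is `validateCatalog(JSON.parse(fileContents))`, failing fast if the returned list is non-empty.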

Supported checker kinds today:

exact_text
normalized_text
regex_match
contains_all_text
json_field_equals
runner_fact_equals
runner_log_contains_text
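To make the idea of deterministic checkers concrete, here is a sketch of how the text and JSON kinds could be evaluated without an LLM judge. The semantics are assumptions inferred from the kind names, not the package's documented behavior, and the `pattern`, `needles`, and `field` properties are hypothetical; only `kind` and `expected` appear in the catalog example. The runner-based kinds are omitted because they depend on runner output.

```typescript
// Hypothetical checker specs; only `kind` and `expected` are shown in this README.
type CheckerSpec =
  | { kind: "exact_text"; expected: string }
  | { kind: "normalized_text"; expected: string }
  | { kind: "regex_match"; pattern: string }
  | { kind: "contains_all_text"; needles: string[] }
  | { kind: "json_field_equals"; field: string; expected: unknown };

// Assumed normalization: trim, collapse whitespace runs, lowercase.
const normalize = (s: string): string => s.trim().replace(/\s+/g, " ").toLowerCase();

function check(checker: CheckerSpec, answer: string): boolean {
  switch (checker.kind) {
    case "exact_text":
      return answer === checker.expected;
    case "normalized_text":
      return normalize(answer) === normalize(checker.expected);
    case "regex_match":
      return new RegExp(checker.pattern).test(answer);
    case "contains_all_text":
      return checker.needles.every((needle) => answer.includes(needle));
    case "json_field_equals": {
      try {
        const parsed: unknown = JSON.parse(answer);
        if (typeof parsed !== "object" || parsed === null) return false;
        const value = (parsed as Record<string, unknown>)[checker.field];
        // Compare by serialized form so nested values are compared by content.
        return JSON.stringify(value) === JSON.stringify(checker.expected);
      } catch {
        return false; // an answer that is not valid JSON fails the check
      }
    }
  }
}
```

Because each branch is a pure function of the answer and the checker definition, the same submission always receives the same grade.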

Each submission now also records runner execution metadata:

workspace directory
materialized asset and submission artifacts
runner facts
runner logs

Example external catalog

The repo includes the bundled fallback catalog source at:

benchmark/catalogs/expanded-catalog.json

Use it like this, substituting the path to your own checkout:

cd /home/fritzprix/my_works/b1te
BENCHMARK_CATALOG_FILE=/home/fritzprix/my_works/b1te/benchmark/catalogs/expanded-catalog.json npm run inspector -w @fre4x/benchmark

If no catalog env is provided at runtime, the published package falls back to the bundled copy of this catalog automatically.

Claude Desktop

{
  "mcpServers": {
    "benchmark": {
      "command": "npx",
      "args": ["-y", "@fre4x/benchmark"],
      "env": {
        "BENCHMARK_CATALOG_URL": "https://example.com/benchmark-catalog.json",
        "BENCHMARK_CACHE_DIR": "/absolute/path/to/benchmark-cache"
      }
    }
  }
}

Development

npm install
npm run build -w @fre4x/benchmark
npm run typecheck -w @fre4x/benchmark
npm test -w @fre4x/benchmark
MOCK=true npm run inspector -w @fre4x/benchmark

Package last updated on 08 Jun 2026
