@fre4x/benchmark


A deterministic benchmark MCP server for agent evaluation workflows.

Version: 1.1.1 (latest, npm)
Weekly downloads: 15
Maintainers: 1

benchmark — Deterministic Agent Evaluation

This package exposes a consistent MCP workflow for deterministic benchmark-driven agent evaluation.

The rebuilt core is organized around challenge catalogs, typed task assets, and explicit checker kinds so coding, web, and OS-style tasks can share one MCP surface without relying on LLM judges.

Tools

  • list_challenges: List deterministic benchmark suites with family, runner, and checker metadata
  • get_catalog_status: Inspect catalog source configuration, cache state, and availability
  • sync_catalog: Fetch and cache the remote benchmark catalog when a URL source is configured
  • start_challenge: Start an attempt and return the first task
  • submit_solution: Grade one task and return checker evidence plus the next task or final score
  • get_asset: Read an attached benchmark asset by asset_id
  • get_attempt: Inspect attempt status, current task, and paginated evaluation history
  • cancel_attempt: Cancel an active attempt

Workflow

  • Call list_challenges
  • Pick a challenge_id
  • Call start_challenge
  • If the task has assets, call get_asset
  • Call submit_solution
  • Repeat until done: true

Each response includes machine-readable guidance for the most likely next tool call.
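The workflow above can be sketched as a small driver loop. This is only an illustration under stated assumptions: the tool names, challenge_id, asset_id, and done come from this README, while attempt_id, task_id, solution, score, and the overall response shape are hypothetical. The in-memory makeFakeServer stands in for a real MCP client so the loop can be run without a server.

```typescript
// Hypothetical shapes: only challenge_id, asset_id, and done appear in the README.
type Task = { task_id: string; prompt: string; assets: { asset_id: string }[] };
type StepResult = { done: boolean; task?: Task; score?: number };
type CallTool = (name: string, args: Record<string, unknown>) => Promise<any>;

// Drive one attempt: start, fetch assets, submit, repeat until done is true.
async function runChallenge(
  callTool: CallTool,
  challengeId: string,
  solve: (task: Task, assets: unknown[]) => Promise<string>,
): Promise<StepResult> {
  const started = await callTool("start_challenge", { challenge_id: challengeId });
  const attemptId: string = started.attempt_id; // assumed field name
  let task: Task | undefined = started.task;
  let last: StepResult = { done: false, task };
  while (task) {
    const assets: unknown[] = [];
    for (const asset of task.assets) {
      assets.push(await callTool("get_asset", { asset_id: asset.asset_id }));
    }
    const answer = await solve(task, assets);
    last = await callTool("submit_solution", {
      attempt_id: attemptId, // assumed argument names
      task_id: task.task_id,
      solution: answer,
    });
    task = last.done ? undefined : last.task;
  }
  return last;
}

// In-memory stand-in for the server, used only to exercise the loop.
function makeFakeServer(): CallTool {
  const tasks: Task[] = [
    { task_id: "t1", prompt: "Return only yes.", assets: [] },
    { task_id: "t2", prompt: "Return only no.", assets: [{ asset_id: "a1" }] },
  ];
  const expected = ["yes", "no"];
  let index = 0;
  let passed = 0;
  return async (name, args) => {
    switch (name) {
      case "start_challenge":
        return { attempt_id: "attempt-1", task: tasks[0] };
      case "get_asset":
        return { asset_id: args["asset_id"], text: "placeholder asset" };
      case "submit_solution": {
        if (args["solution"] === expected[index]) passed += 1;
        index += 1;
        return index < tasks.length
          ? { done: false, task: tasks[index] }
          : { done: true, score: passed / tasks.length };
      }
      default:
        throw new Error(`unexpected tool: ${name}`);
    }
  };
}
```

With a real client, callTool would wrap whatever tool-call method that client exposes; the loop itself would not need to change as long as the response carries a done flag and the next task.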

Fallback benchmark families

  • Code — deterministic JSON/text answers backed by explicit checkers
  • Web — DOM snapshot extraction tasks with JSON field assertions
  • OS — filesystem/log review tasks with deterministic text grading

Zero-config run

Run with the bundled fallback catalog and no extra configuration:

npx @fre4x/benchmark

Or from this repo:

cd /home/fritzprix/my_works/b1te
npm run inspector -w @fre4x/benchmark

Mock Mode

Run with the same bundled fallback catalog in mock mode:

MOCK=true npx @fre4x/benchmark

Optional environment

BENCHMARK_CATALOG_FILE=/absolute/path/to/benchmark-catalog.json
BENCHMARK_CATALOG_URL=https://example.com/benchmark-catalog.json
BENCHMARK_CACHE_DIR=/absolute/path/to/catalog-cache
BENCHMARK_CACHE_TTL_SECONDS=3600
BENCHMARK_STATE_DIR=/absolute/path/to/store-attempt-json
BENCHMARK_MOCK=true

  • BENCHMARK_CATALOG_FILE: Optional JSON file with challenge definitions in the rebuilt deterministic catalog format
  • BENCHMARK_CATALOG_URL: Optional remote JSON catalog URL for fetch/cache based ingestion
  • BENCHMARK_CACHE_DIR: Optional cache directory for remote catalog snapshots
  • BENCHMARK_CACHE_TTL_SECONDS: Freshness window for remote catalog cache reuse
  • BENCHMARK_STATE_DIR: Where attempt files and lock directories are persisted
  • BENCHMARK_MOCK: Alternate mock-mode flag, accepted alongside MOCK=true

BENCHMARK_GAIA_DATA_FILE is still accepted as a backward-compatible alias, but the rebuilt package is no longer GAIA-first.
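One way to picture how these variables might combine is the sketch below. It is not the package's actual resolution logic. It assumes that BENCHMARK_GAIA_DATA_FILE aliases BENCHMARK_CATALOG_FILE, that a file source wins over a URL source, and that empty strings count as unset; the README does not confirm any of these. The names resolveCatalogSource, isMockMode, and CatalogSource are hypothetical.

```typescript
type Env = Record<string, string | undefined>;

type CatalogSource =
  | { kind: "file"; path: string }
  | { kind: "url"; url: string }
  | { kind: "bundled" };

// Assumed precedence: file (or its legacy alias), then URL, then the bundled fallback.
function resolveCatalogSource(env: Env): CatalogSource {
  // `||` rather than `??` so that an empty string falls through to the alias.
  const file = env["BENCHMARK_CATALOG_FILE"] || env["BENCHMARK_GAIA_DATA_FILE"];
  if (file) return { kind: "file", path: file };
  const url = env["BENCHMARK_CATALOG_URL"];
  if (url) return { kind: "url", url };
  return { kind: "bundled" };
}

// Either flag enables mock mode in this sketch; only the literal "true" is accepted.
function isMockMode(env: Env): boolean {
  return env["MOCK"] === "true" || env["BENCHMARK_MOCK"] === "true";
}
```

In a Node process, both functions could be called with process.env directly.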

When BENCHMARK_CATALOG_URL is set, the package will reuse a fresh cached copy when available and can be explicitly refreshed with sync_catalog.
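The reuse-or-refresh behaviour described here can be sketched as a TTL check. This is a minimal illustration, not the package's cache code: isCacheFresh, loadCatalog, CacheEntry, and the forceRefresh option (standing in for an explicit sync_catalog call) are all hypothetical names.

```typescript
type CacheEntry<T> = { fetchedAtMs: number; value: T };

// A cached copy is fresh while its age is below the TTL (BENCHMARK_CACHE_TTL_SECONDS).
function isCacheFresh(fetchedAtMs: number, ttlSeconds: number, nowMs: number = Date.now()): boolean {
  const ageMs = nowMs - fetchedAtMs;
  return ageMs >= 0 && ageMs < ttlSeconds * 1000;
}

// Reuse a fresh cached entry; otherwise (or when forced) fetch and restamp.
async function loadCatalog<T>(
  cached: CacheEntry<T> | undefined,
  ttlSeconds: number,
  fetchRemote: () => Promise<T>,
  opts: { forceRefresh?: boolean; nowMs?: number } = {},
): Promise<CacheEntry<T>> {
  const now = opts.nowMs ?? Date.now();
  if (!opts.forceRefresh && cached && isCacheFresh(cached.fetchedAtMs, ttlSeconds, now)) {
    return cached;
  }
  return { fetchedAtMs: now, value: await fetchRemote() };
}
```

Passing the clock in through nowMs keeps the freshness decision deterministic, which makes this kind of logic easy to test.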

Catalog shape

External catalogs must be a JSON array of challenge definitions shaped like:

[
  {
    "challenge_id": "custom_suite",
    "benchmark_id": "custom",
    "family": "code",
    "runner_kind": "code_runner",
    "title": "Custom Challenge",
    "description": "Deterministic single-task suite",
    "version": "v1",
    "source": "external",
    "tasks": [
      {
        "task_id": "custom-1",
        "title": "Return yes",
        "prompt": "Return only yes.",
        "response_format": "text",
        "difficulty": 1,
        "assets": [],
        "checkers": [
          {
            "checker_id": "custom-yes",
            "kind": "exact_text",
            "expected": "yes"
          }
        ]
      }
    ]
  }
]
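For authors of external catalogs, a pre-flight check along the following lines may help catch structural mistakes before pointing the server at a file. The interfaces simply mirror the example above. Which fields are actually required is an assumption, and validateCatalog and parseCatalog are hypothetical helpers, not part of the package.

```typescript
// Field names mirror the example catalog; required vs optional is assumed.
interface CheckerDef { checker_id: string; kind: string; expected?: unknown }
interface TaskDef {
  task_id: string; title: string; prompt: string; response_format: string;
  difficulty: number; assets: unknown[]; checkers: CheckerDef[];
}
interface ChallengeDef {
  challenge_id: string; benchmark_id: string; family: string; runner_kind: string;
  title: string; description: string; version: string; source: string; tasks: TaskDef[];
}

const isRecord = (v: unknown): v is Record<string, unknown> =>
  typeof v === "object" && v !== null && !Array.isArray(v);

// Returns a list of human-readable problems; an empty list means the shape looks plausible.
function validateCatalog(raw: unknown): string[] {
  if (!Array.isArray(raw)) return ["catalog must be a JSON array"];
  const problems: string[] = [];
  raw.forEach((challenge: unknown, i: number) => {
    const id = isRecord(challenge) ? challenge["challenge_id"] : undefined;
    if (!isRecord(challenge) || typeof id !== "string") {
      problems.push(`challenge[${i}]: missing string challenge_id`);
      return;
    }
    const tasks = challenge["tasks"];
    if (!Array.isArray(tasks) || tasks.length === 0) {
      problems.push(`${id}: tasks must be a non-empty array`);
      return;
    }
    tasks.forEach((task: unknown, j: number) => {
      if (!isRecord(task) || typeof task["task_id"] !== "string" || typeof task["prompt"] !== "string") {
        problems.push(`${id}.tasks[${j}]: missing task_id or prompt`);
        return;
      }
      const checkers = task["checkers"];
      const ok = Array.isArray(checkers) && checkers.length > 0 &&
        checkers.every((c: unknown) => isRecord(c) && typeof c["kind"] === "string");
      if (!ok) problems.push(`${id}.${task["task_id"]}: needs at least one checker with a kind`);
    });
  });
  return problems;
}

// Parse JSON text and throw with all problems listed if the shape is off.
function parseCatalog(text: string): ChallengeDef[] {
  const raw: unknown = JSON.parse(text);
  const problems = validateCatalog(raw);
  if (problems.length > 0) throw new Error(problems.join("; "));
  return raw as ChallengeDef[];
}
```

The check is deliberately shallow: it confirms the nesting of challenges, tasks, and checkers but does not verify that a checker kind is one the server supports.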

Supported checker kinds today:

  • exact_text
  • normalized_text
  • regex_match
  • contains_all_text
  • json_field_equals
  • runner_fact_equals
  • runner_log_contains_text
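To make the text-based checker kinds concrete, here is a rough sketch of how such checks could be evaluated without an LLM judge. The semantics are guesses based on the kind names alone. Only kind and expected appear in the README; pattern, values, field, and the normalization rule are hypothetical, and the two runner_* kinds are omitted because they depend on runner state not described here.

```typescript
// Hypothetical checker configs; only `kind` and `expected` are taken from the README.
type CheckerSpec =
  | { kind: "exact_text"; expected: string }
  | { kind: "normalized_text"; expected: string }
  | { kind: "regex_match"; pattern: string }
  | { kind: "contains_all_text"; values: string[] }
  | { kind: "json_field_equals"; field: string; expected: unknown };

// Assumed normalization: trim, lowercase, collapse internal whitespace.
const normalize = (s: string): string => s.trim().toLowerCase().replace(/\s+/g, " ");

function runChecker(checker: CheckerSpec, submission: string): boolean {
  switch (checker.kind) {
    case "exact_text":
      return submission === checker.expected;
    case "normalized_text":
      return normalize(submission) === normalize(checker.expected);
    case "regex_match":
      return new RegExp(checker.pattern).test(submission);
    case "contains_all_text":
      return checker.values.every((v) => submission.includes(v));
    case "json_field_equals": {
      try {
        const parsed: unknown = JSON.parse(submission);
        if (typeof parsed !== "object" || parsed === null) return false;
        const actual = (parsed as Record<string, unknown>)[checker.field];
        // Structural comparison via JSON text; sensitive to object key order.
        return JSON.stringify(actual) === JSON.stringify(checker.expected);
      } catch {
        return false; // submission was not valid JSON
      }
    }
  }
}
```

Every branch is a pure function of the checker config and the submission, which is what makes grading repeatable from run to run.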

Each submission now also records runner execution metadata:

  • workspace directory
  • materialized asset and submission artifacts
  • runner facts
  • runner logs

Example external catalog

The repo includes the bundled fallback catalog source at:

benchmark/catalogs/expanded-catalog.json

Use it like this:

cd /home/fritzprix/my_works/b1te
BENCHMARK_CATALOG_FILE=/home/fritzprix/my_works/b1te/benchmark/catalogs/expanded-catalog.json npm run inspector -w @fre4x/benchmark

If no catalog env is provided at runtime, the published package falls back to the bundled copy of this catalog automatically.

Claude Desktop

{
  "mcpServers": {
    "benchmark": {
      "command": "npx",
      "args": ["-y", "@fre4x/benchmark"],
      "env": {
        "BENCHMARK_CATALOG_URL": "https://example.com/benchmark-catalog.json",
        "BENCHMARK_CACHE_DIR": "/absolute/path/to/benchmark-cache"
      }
    }
  }
}

Development

npm install
npm run build -w @fre4x/benchmark
npm run typecheck -w @fre4x/benchmark
npm test -w @fre4x/benchmark
MOCK=true npm run inspector -w @fre4x/benchmark

Keywords

mcp

Package last updated on 08 Jun 2026
