@fre4x/benchmark
+1
-1
 {
   "name": "@fre4x/benchmark",
-  "version": "1.1.0-beta.4",
+  "version": "1.1.0-beta.6",
   "description": "A deterministic benchmark MCP server for agent evaluation workflows.",
@@ -5,0 +5,0 @@ "type": "module",
+13
-13
@@ -11,18 +11,18 @@ # benchmark — Deterministic Agent Evaluation | ||
 |------|---------|
-| `benchmark_list_challenges` | List deterministic benchmark suites with family, runner, and checker metadata |
-| `benchmark_get_catalog_status` | Inspect catalog source configuration, cache state, and availability |
-| `benchmark_sync_catalog` | Fetch and cache the remote benchmark catalog when a URL source is configured |
-| `benchmark_start_challenge` | Start an attempt and return the first task |
-| `benchmark_submit_solution` | Grade one task and return checker evidence plus the next task or final score |
-| `benchmark_get_asset` | Read an attached benchmark asset by `asset_id` |
-| `benchmark_get_attempt` | Inspect attempt status, current task, and paginated evaluation history |
-| `benchmark_cancel_attempt` | Cancel an active attempt |
+| `list_challenges` | List deterministic benchmark suites with family, runner, and checker metadata |
+| `get_catalog_status` | Inspect catalog source configuration, cache state, and availability |
+| `sync_catalog` | Fetch and cache the remote benchmark catalog when a URL source is configured |
+| `start_challenge` | Start an attempt and return the first task |
+| `submit_solution` | Grade one task and return checker evidence plus the next task or final score |
+| `get_asset` | Read an attached benchmark asset by `asset_id` |
+| `get_attempt` | Inspect attempt status, current task, and paginated evaluation history |
+| `cancel_attempt` | Cancel an active attempt |
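Reviewer note: every tool in this table drops the `benchmark_` prefix, so clients that hard-code the old names will need updating. A minimal sketch of a client-side shim follows; the helper name `toCurrentToolName` is hypothetical and not part of the package.

```typescript
// Prefix removed from all eight tool names in this diff.
const LEGACY_PREFIX = "benchmark_";

// Hypothetical helper: map a pre-rename tool name to its current name.
// Names that do not carry the legacy prefix are returned unchanged.
function toCurrentToolName(name: string): string {
  return name.startsWith(LEGACY_PREFIX)
    ? name.slice(LEGACY_PREFIX.length)
    : name;
}
```

Applying the helper at the single place where a client dispatches tool calls keeps the rename out of the rest of the calling code.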
 ## Workflow
-1. Call `benchmark_list_challenges`
+1. Call `list_challenges`
 2. Pick a `challenge_id`
-3. Call `benchmark_start_challenge`
-4. If the task has assets, call `benchmark_get_asset`
-5. Call `benchmark_submit_solution`
+3. Call `start_challenge`
+4. If the task has assets, call `get_asset`
+5. Call `submit_solution`
 6. Repeat until `done: true`
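Reviewer note: the workflow steps can be sketched as a small driver loop using the new tool names. Only `challenge_id`, `asset_id`, and `done` appear in the README; the `callTool` function, the `attempt_id`, `task`, `asset_ids`, `challenges`, and `solution` fields, and the `solve` callback are assumptions made for illustration and may not match the server's actual schema.

```typescript
// Assumed response shapes; field names other than `done` are hypothetical.
type Task = { task_id: string; asset_ids?: string[] };
type StepResult = { done?: boolean; task?: Task };

// Stand-in for whatever client method invokes an MCP tool by name.
type CallTool = (name: string, args: Record<string, unknown>) => Promise<any>;

async function runChallenge(
  callTool: CallTool,
  solve: (task: Task, assets: unknown[]) => unknown,
): Promise<StepResult> {
  // Steps 1-2: list suites and pick a challenge_id (here, simply the first).
  const listing = await callTool("list_challenges", {});
  const challengeId: string = listing.challenges[0].challenge_id;

  // Step 3: start an attempt; the first task is returned with it.
  const started = await callTool("start_challenge", { challenge_id: challengeId });
  const attemptId: string = started.attempt_id;
  let step: StepResult = started;

  // Step 6: repeat until the server reports done: true.
  while (!step.done && step.task) {
    // Step 4: fetch any assets attached to the current task.
    const assets: unknown[] = [];
    for (const assetId of step.task.asset_ids ?? []) {
      assets.push(await callTool("get_asset", { asset_id: assetId }));
    }
    // Step 5: submit a solution and receive the next task or the final score.
    const solution = solve(step.task, assets);
    step = await callTool("submit_solution", { attempt_id: attemptId, solution });
  }
  return step;
}
```

Passing `callTool` in as a parameter keeps the loop independent of any particular MCP client library and makes it easy to exercise against a fake server.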
@@ -81,3 +81,3 @@
-When `BENCHMARK_CATALOG_URL` is set, the package will reuse a fresh cached copy when available and can be explicitly refreshed with `benchmark_sync_catalog`.
+When `BENCHMARK_CATALOG_URL` is set, the package will reuse a fresh cached copy when available and can be explicitly refreshed with `sync_catalog`.
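Reviewer note: the "reuse a fresh cached copy, refresh explicitly" behaviour described here might look roughly like the sketch below. The diff does not show the package's cache format or freshness window, so `CacheEntry`, `ASSUMED_TTL_MS`, `getCatalog`, and the `force` option are all hypothetical; `force: true` plays the role of an explicit `sync_catalog` call.

```typescript
// Hypothetical cache record; the real on-disk format is not shown in this diff.
type CacheEntry = { fetchedAt: number; catalog: unknown };

// Assumed freshness window of one hour, purely for illustration.
const ASSUMED_TTL_MS = 60 * 60 * 1000;

async function getCatalog(
  cache: { entry?: CacheEntry },
  fetchCatalog: () => Promise<unknown>,
  opts: { force?: boolean; now?: number; ttlMs?: number } = {},
): Promise<unknown> {
  const now = opts.now ?? Date.now();
  const ttlMs = opts.ttlMs ?? ASSUMED_TTL_MS;
  const entry = cache.entry;

  // Reuse the cached copy while it is fresh, unless a refresh is forced.
  if (!opts.force && entry !== undefined && now - entry.fetchedAt < ttlMs) {
    return entry.catalog;
  }

  // Otherwise fetch from the remote source and overwrite the cache.
  const catalog = await fetchCatalog();
  cache.entry = { fetchedAt: now, catalog };
  return catalog;
}
```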
@@ -84,0 +84,0 @@ ## Catalog shape |