agentbench

Lighthouse for AI coding harnesses. Benchmark your Claude Code setup and get a score out of 100.

Version: 1.0.0 (latest, npm)
Maintainers: 1

Why

You spend hours tweaking CLAUDE.md, skills, hooks, and rules, but you have no way to measure whether your changes actually help. People share configs on X and GitHub with no evidence that they work better than the defaults.

Superpowers has 89K stars. Everyone has a different CLAUDE.md. Nobody can prove theirs is better.

agentbench gives you a number. Run it before and after every harness change. Share your score. Compete.

Install

npx agentbench

Or install globally:

npm install -g agentbench

Or clone and run locally:

git clone https://github.com/naman10parikh/agentbench.git
cd agentbench
pnpm install
pnpm dev

Quick Start

# Run the full benchmark (10 tasks, ~2 min, ~$0.50 API cost)
npx agentbench

# Run a single task to test quickly
npx agentbench --task 1

# Get JSON output for CI/scripts
npx agentbench --json

# Compare your harness against another CLAUDE.md
npx agentbench --compare ~/other-project/CLAUDE.md

# Re-run baseline from scratch (ignores cache)
npx agentbench --no-cache

# Use a specific model
npx agentbench --model claude-sonnet-4-6

# See detailed output for each task
npx agentbench --verbose

Example Output

  ┌─────────────────────────────────────────────┐
  │                 agentbench                   │
  │                                              │
  │  HARNESS SCORE: 73 / 100                    │
  │                                              │
  │  Completion    ████████░░  82%              │
  │  Efficiency    ██████░░░░  61%              │
  │  Tool Use      ████████░░  78%              │
  │  Recovery      █████░░░░░  54%              │
  │  Quality       █████████░  91%              │
  │                                              │
  │  vs. baseline: +31 points                   │
  │  Top tip: Add error recovery skills          │
  └─────────────────────────────────────────────┘

  Full report: .agentbench/report-2026-03-17.json

What It Measures

Your harness is scored across 5 dimensions (0-100 each):

| Dimension | What It Measures | How It's Scored |
| --- | --- | --- |
| Completion | Did the agent finish each task correctly? | Pass/fail per task, averaged |
| Efficiency | How many tokens did it use vs. baseline? | Ratio against bare defaults |
| Tool Use | Did it pick the right tools for the job? | Expected vs. actual tool selection |
| Recovery | Did it recover from injected errors? | Error recovery rate on seeded failures |
| Quality | How good is the output code? | LLM-as-judge evaluation (via Haiku) |

The overall score is the average across all 5 dimensions.
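
As a rough illustration of that averaging, here is a minimal sketch using the dimension values from the Example Output above. The `DimensionScore` type and `overallScore` function are hypothetical names, and rounding to the nearest integer is an assumption; the text only says "average".

```typescript
// Hypothetical shape for one scored dimension (not agentbench's actual types).
interface DimensionScore {
  name: string;
  score: number; // 0-100
}

// Unweighted mean of the dimension scores.
// Assumption: the result is rounded to the nearest integer.
function overallScore(dimensions: DimensionScore[]): number {
  if (dimensions.length === 0) {
    throw new Error("at least one dimension is required");
  }
  const total = dimensions.reduce((sum, d) => sum + d.score, 0);
  return Math.round(total / dimensions.length);
}

// Values taken from the Example Output scorecard.
const exampleDimensions: DimensionScore[] = [
  { name: "completion", score: 82 },
  { name: "efficiency", score: 61 },
  { name: "toolUse", score: 78 },
  { name: "recovery", score: 54 },
  { name: "quality", score: 91 },
];

console.log(overallScore(exampleDimensions)); // 366 / 5 = 73.2, rounds to 73
```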

The 10 Benchmark Tasks

Each task runs in an isolated temporary workspace — your actual code is never touched.

| # | Task | Difficulty | What It Tests |
| --- | --- | --- | --- |
| 1 | Fix a typo in a TypeScript file | Easy | Basic navigation + edit |
| 2 | Add a function with a specific signature | Easy | Code generation accuracy |
| 3 | Refactor a function to reduce complexity | Medium | Code understanding + simplification |
| 4 | Write unit tests for an existing module | Medium | Test generation quality |
| 5 | Fix a failing test by reading error output | Medium | Error comprehension + debugging |
| 6 | Add error handling to a throwing function | Medium | Recovery pattern knowledge |
| 7 | Resolve a merge conflict in git | Hard | Multi-file coordination |
| 8 | Find and fix a SQL injection vulnerability | Hard | Security awareness |
| 9 | Refactor 3 files to extract a shared utility | Hard | Cross-file reasoning |
| 10 | Add a REST endpoint with validation + tests | Hard | Full-stack completion |

Tasks 1-6 use automated checks (compilation, test pass, diff comparison). Tasks 7-10 use LLM-as-judge (Claude Haiku evaluates against a rubric).

How It Works

1. Detect harness  →  Reads CLAUDE.md, .claude/settings.json, .claude/skills/, .claude/rules/
2. Run tasks       →  10 coding tasks in isolated temp workspaces via Claude API
3. Evaluate        →  Automated checks + LLM-as-judge scoring
4. Compare         →  Score vs. cached baseline (bare Claude Code defaults)
5. Report          →  Terminal scorecard + JSON report + recommendations

Harness Detection

agentbench auto-detects your harness by reading:

| File/Dir | What It Extracts |
| --- | --- |
| CLAUDE.md | System prompt, rules, operating model |
| .claude/settings.json | Hook definitions, permissions |
| .claude/skills/*.md | Skill count and content |
| .claude/rules/*.md | Rule count and content |

This config is injected into the Claude API system prompt when running each task, exactly as Claude Code would use it.
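
To make the detection step concrete, here is a minimal sketch that classifies a list of project-relative paths into the four sources listed above. `HarnessSummary` and `summarizeHarness` are hypothetical names, loosely modeled on the `harnessConfig` object in the JSON report; this is not agentbench's actual implementation, and it only counts files rather than reading their content.

```typescript
// Hypothetical summary shape (an assumption, not agentbench's internal type).
interface HarnessSummary {
  hasClaudeMd: boolean;
  hasSettings: boolean;
  skillCount: number;
  ruleCount: number;
}

// Classify project-relative paths (forward slashes assumed) by harness source.
function summarizeHarness(paths: string[]): HarnessSummary {
  const skillPattern = /^\.claude\/skills\/[^/]+\.md$/;
  const rulePattern = /^\.claude\/rules\/[^/]+\.md$/;
  return {
    hasClaudeMd: paths.some((p) => p === "CLAUDE.md"),
    hasSettings: paths.some((p) => p === ".claude/settings.json"),
    skillCount: paths.filter((p) => skillPattern.test(p)).length,
    ruleCount: paths.filter((p) => rulePattern.test(p)).length,
  };
}

// Made-up file list for illustration only.
const harnessSummary = summarizeHarness([
  "CLAUDE.md",
  ".claude/settings.json",
  ".claude/skills/troubleshoot.md",
  ".claude/skills/review.md",
  ".claude/rules/style.md",
  "src/index.ts",
]);

console.log(harnessSummary);
```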

Baseline Caching

The first time you run agentbench, it benchmarks bare Claude Code defaults and caches the result. Subsequent runs only benchmark your harness and compare against the cache. The cache key is sha256(task_suite_version + model_version).
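
A minimal sketch of how such a cache key could be derived with Node's built-in `crypto` module. The function name `baselineCacheKey` is hypothetical, and the exact concatenation (no separator, UTF-8, hex digest) is an assumption; the text above only gives `sha256(task_suite_version + model_version)`.

```typescript
import { createHash } from "node:crypto";

// Assumption: the two version strings are concatenated directly and the
// digest is hex-encoded. agentbench may encode the key differently.
function baselineCacheKey(taskSuiteVersion: string, modelVersion: string): string {
  return createHash("sha256")
    .update(taskSuiteVersion + modelVersion, "utf8")
    .digest("hex");
}

// The same inputs always give the same key, so the baseline is reused
// until the task suite version or the model changes.
const cacheKeyA = baselineCacheKey("1.0.0", "claude-sonnet-4-6");
const cacheKeyB = baselineCacheKey("1.0.0", "claude-sonnet-4-6");
console.log(cacheKeyA === cacheKeyB); // true
```

One consequence of this scheme is that editing your own harness never invalidates the baseline; only a new task suite version or a different `--model` does.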

To bypass the cache and re-run the baseline: npx agentbench --no-cache

Cost Estimate

| Component | Cost |
| --- | --- |
| 10 tasks via Sonnet | ~$0.30-0.60 |
| LLM-as-judge (4 tasks via Haiku) | ~$0.05-0.10 |
| Baseline run (first time only) | ~$0.30-0.60 |
| Typical run | ~$0.50 |
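
The ranges above can be added up to estimate a first run versus later runs. This is only a sketch of the arithmetic: the names are hypothetical, amounts are kept in cents to avoid floating-point drift, and it assumes the baseline row is simply added once on the first run.

```typescript
// A cost range in US cents.
interface CentRange {
  low: number;
  high: number;
}

// Ranges copied from the cost table.
const TASKS_COST: CentRange = { low: 30, high: 60 };
const JUDGE_COST: CentRange = { low: 5, high: 10 };
const BASELINE_COST: CentRange = { low: 30, high: 60 };

// Add ranges component-wise.
function sumRanges(...ranges: CentRange[]): CentRange {
  return ranges.reduce(
    (acc, r) => ({ low: acc.low + r.low, high: acc.high + r.high }),
    { low: 0, high: 0 },
  );
}

function formatUsd(range: CentRange): string {
  return `$${(range.low / 100).toFixed(2)}-$${(range.high / 100).toFixed(2)}`;
}

const firstRunCost = sumRanges(TASKS_COST, JUDGE_COST, BASELINE_COST);
const laterRunCost = sumRanges(TASKS_COST, JUDGE_COST);

console.log(formatUsd(firstRunCost)); // $0.65-$1.30
console.log(formatUsd(laterRunCost)); // $0.35-$0.70
```

The "typical run" figure of ~$0.50 sits inside the later-run range.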

Configuration

| Flag | Default | Description |
| --- | --- | --- |
| `--task <n>` | all | Run a single task by number (1-10) |
| `--json` | false | Output results as JSON |
| `--compare <path>` | none | Compare against another CLAUDE.md |
| `--no-cache` | false | Ignore cached baseline, re-run from scratch |
| `--model <id>` | claude-sonnet-4-6 | Model to benchmark with |
| `--verbose` | false | Show detailed per-task output |

JSON Output

With --json, agentbench outputs a structured report:

{
  "version": "0.1.0",
  "timestamp": "2026-03-17T10:30:00Z",
  "model": "claude-sonnet-4-6",
  "overallScore": 73,
  "baselineScore": 42,
  "dimensions": [
    { "name": "completion", "score": 82, "details": "8/10 tasks completed" },
    {
      "name": "efficiency",
      "score": 61,
      "details": "12,400 tokens (baseline: 18,200)"
    },
    { "name": "toolUse", "score": 78, "details": "7 unique tools used" },
    { "name": "recovery", "score": 54, "details": "2/4 errors recovered" },
    { "name": "quality", "score": 91, "details": "LLM judge average" }
  ],
  "recommendations": [
    "Add error recovery skills or an error escalation protocol to your harness"
  ],
  "harnessConfig": {
    "hasClaudeMd": true,
    "claudeMdLines": 142,
    "skillCount": 8,
    "ruleCount": 3,
    "hookCount": 5
  }
}
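
One way to use this report in CI is a small score gate. The sketch below assumes only the `overallScore` and `dimensions` fields shown above; the `gate` function, its thresholds, and the trimmed sample report are hypothetical and not part of agentbench.

```typescript
// Subset of the report fields used by the gate.
interface ReportDimension {
  name: string;
  score: number;
}

interface Report {
  overallScore: number;
  dimensions: ReportDimension[];
}

interface GateResult {
  ok: boolean;
  failing: string[]; // names of dimensions below the per-dimension floor
}

// Pass only if the overall score meets minOverall and no dimension
// falls below minDimension. Both thresholds are chosen by the caller.
function gate(reportJson: string, minOverall: number, minDimension: number): GateResult {
  const report = JSON.parse(reportJson) as Report;
  const failing = report.dimensions
    .filter((d) => d.score < minDimension)
    .map((d) => d.name);
  return { ok: report.overallScore >= minOverall && failing.length === 0, failing };
}

// Trimmed copy of the sample report.
const sampleReportJson = JSON.stringify({
  overallScore: 73,
  dimensions: [
    { name: "completion", score: 82 },
    { name: "efficiency", score: 61 },
    { name: "toolUse", score: 78 },
    { name: "recovery", score: 54 },
    { name: "quality", score: 91 },
  ],
});

console.log(gate(sampleReportJson, 70, 50)); // { ok: true, failing: [] }
console.log(gate(sampleReportJson, 70, 60)); // { ok: false, failing: [ 'recovery' ] }
```

In a pipeline, the JSON string would come from the output of `npx agentbench --json`, and the job would exit non-zero when `ok` is false.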

Adding Custom Tasks

Each task is a directory under tasks/ with:

tasks/NN-task-name/
├── task.json       # Task definition, prompt, scoring rubric
├── workspace/      # Initial repo state (the broken/incomplete code)
└── expected/       # Reference solution (for automated comparison)

task.json Format

{
  "id": 11,
  "name": "Your task name",
  "category": "bug-fix",
  "difficulty": "medium",
  "description": "What the task tests",
  "prompt": "The exact prompt sent to the agent",
  "expectedTools": ["Read", "Edit", "Bash"],
  "scoring": {
    "automated": true,
    "checks": ["tsc --noEmit exits 0", "specific check here"]
  }
}
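
When writing a custom task, it can help to check the file before running it. The following is a minimal validation sketch based only on the fields in the example above; `validateTask` is a hypothetical helper, and the allowed difficulty values (`easy`, `medium`, `hard`) are an assumption inferred from the task table.

```typescript
// Assumed difficulty values, lowercased as in the example task.json.
const DIFFICULTIES = ["easy", "medium", "hard"];

function isStringArray(value: unknown): boolean {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Returns a list of problems; an empty list means the task looks well-formed.
function validateTask(raw: unknown): string[] {
  if (typeof raw !== "object" || raw === null) {
    return ["task must be a JSON object"];
  }
  const task = raw as Record<string, unknown>;
  const problems: string[] = [];

  const id = task["id"];
  if (typeof id !== "number" || !Number.isInteger(id) || id < 1) {
    problems.push("id must be a positive integer");
  }
  for (const key of ["name", "category", "description", "prompt"]) {
    const value = task[key];
    if (typeof value !== "string" || value.trim() === "") {
      problems.push(`${key} must be a non-empty string`);
    }
  }
  const difficulty = task["difficulty"];
  if (typeof difficulty !== "string" || DIFFICULTIES.indexOf(difficulty) === -1) {
    problems.push("difficulty must be one of: " + DIFFICULTIES.join(", "));
  }
  if (!isStringArray(task["expectedTools"])) {
    problems.push("expectedTools must be an array of strings");
  }
  const scoring = task["scoring"];
  if (typeof scoring !== "object" || scoring === null) {
    problems.push("scoring must be an object");
  } else {
    const s = scoring as Record<string, unknown>;
    if (typeof s["automated"] !== "boolean") {
      problems.push("scoring.automated must be a boolean");
    }
    if (!isStringArray(s["checks"])) {
      problems.push("scoring.checks must be an array of strings");
    }
  }
  return problems;
}

const exampleTask = {
  id: 11,
  name: "Your task name",
  category: "bug-fix",
  difficulty: "medium",
  description: "What the task tests",
  prompt: "The exact prompt sent to the agent",
  expectedTools: ["Read", "Edit", "Bash"],
  scoring: { automated: true, checks: ["tsc --noEmit exits 0"] },
};

console.log(validateTask(exampleTask)); // []
```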

Improving Your Score

Common harness improvements by dimension:

| Low Score In | What to Add |
| --- | --- |
| Completion | Clearer task completion criteria in CLAUDE.md |
| Efficiency | Model routing rules (Haiku for simple subtasks) |
| Tool Use | Tool preference rules ("Use Read instead of cat") |
| Recovery | Error escalation protocol, /troubleshoot skill |
| Quality | Code style rules, pre-commit quality gate hook |
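
The table above can be read as a lookup keyed by your weakest dimension. Here is a minimal sketch of that idea; `topTip` and the "lowest score wins" rule are assumptions for illustration, not a description of how agentbench picks its recommendations.

```typescript
type DimensionName = "completion" | "efficiency" | "toolUse" | "recovery" | "quality";

const DIMENSION_NAMES: DimensionName[] = [
  "completion",
  "efficiency",
  "toolUse",
  "recovery",
  "quality",
];

// Suggestions copied from the table.
const SUGGESTIONS: Record<DimensionName, string> = {
  completion: "Clearer task completion criteria in CLAUDE.md",
  efficiency: "Model routing rules (Haiku for simple subtasks)",
  toolUse: 'Tool preference rules ("Use Read instead of cat")',
  recovery: "Error escalation protocol, /troubleshoot skill",
  quality: "Code style rules, pre-commit quality gate hook",
};

// Assumption: the suggestion for the lowest-scoring dimension is the top tip.
// Ties go to the dimension listed first.
function topTip(scores: Record<DimensionName, number>): string {
  let weakest: DimensionName = "completion";
  for (const name of DIMENSION_NAMES) {
    if (scores[name] < scores[weakest]) {
      weakest = name;
    }
  }
  return SUGGESTIONS[weakest];
}

// Scores from the Example Output scorecard; recovery (54) is the lowest.
const tip = topTip({
  completion: 82,
  efficiency: 61,
  toolUse: 78,
  recovery: 54,
  quality: 91,
});

console.log(tip); // Error escalation protocol, /troubleshoot skill
```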

Comparison to Existing Benchmarks

| Benchmark | What It Tests | Our Difference |
| --- | --- | --- |
| SWE-bench | Model ability on GitHub issues | We test the harness, not the model |
| HumanEval | Code generation | Single-function, no tool use |
| MBPP | Python programming | No harness awareness |
| Aider benchmark | Edit accuracy | Doesn't score efficiency or recovery |

Unlike the benchmarks above, agentbench holds the model constant and measures the harness. Same model, different scaffold: 42% vs 78%. The harness IS the product.

Requirements

  • Node.js 18+
  • ANTHROPIC_API_KEY environment variable
  • ~$0.50 per benchmark run

Contributing

See CONTRIBUTING.md. We especially welcome:

  • New benchmark tasks — the more diverse, the better the signal
  • Scoring improvements — better rubrics, tighter automated checks
  • Bug reports — if a score feels wrong, tell us

License

MIT

Keywords

claude-code

Package last updated on 20 Mar 2026
