Big News: Socket raises $60M Series C at a $1B valuation to secure software supply chains for AI-driven development.Announcement
Sign In

claude-test-bench

Package Overview
Dependencies
Maintainers
1
Versions
2
Alerts
File Explorer

Advanced tools

Socket logo

Install Socket

Detect and block malicious and high-risk dependencies

Install

claude-test-bench

Benchmark AI agent behavior — A/B test CLAUDE.md instructions, compare models, and grade runs with LLM-based evaluation

latest
Source
npmnpm
Version
1.0.1
Version published
Maintainers
1
Created
Source

Claude Test Bench

Do your CLAUDE.md instructions actually work? Now you can measure it.

npm version npm downloads GitHub stars GitHub issues License

Node.js TypeScript React 19 Express 5 357 Tests Visitors

What It Does  ·  Use Cases  ·  Quick Start  ·  Test Suites  ·  Evaluation  ·  Concepts  ·  Architecture  ·  Dev

What It Does

Claude Test Bench is a local benchmarking tool for testing AI agent configurations using the Claude Agent SDK. You define a Provider (connection credentials + model), pair it with a Scenario (task, agent config, workspace files, and grading rubric), run it, and evaluate the results through a web UI — with LLM-based grading that tracks which specific instructions were followed, violated, or deemed not applicable.

The core insight: writing CLAUDE.md instructions is easy, but knowing whether they actually change agent behavior is hard. Claude Test Bench solves this by running controlled experiments and grading the results with a multi-stage evaluation pipeline.

Provider (credentials + model)  ──────────────────────┐
                                                       ├──▶  Run (SDK query)  ──▶  Evaluation
Scenario (prompt + CLAUDE.md + agent config + rubric)  ┘

Use Cases

Use CaseWhat You TestWhat You Learn
CLAUDE.md A/B TestingSame task with and without CLAUDE.mdWhether your instructions actually change agent behavior
Model ComparisonSame scenario across Sonnet, Opus, etc.Which model reasons best for your specific task
Skill & Subagent TestingSkill and subagent definitions in a scenarioWhether your custom tools work as intended
Instruction EffectivenessIndividual CLAUDE.md rulesWhich specific instructions were followed vs. ignored
Regression TestingSame scenario across code changesWhether prompt/config changes improve or degrade quality
Evaluation CalibrationSame run with different evaluator configsWhether your grading rubric produces consistent scores

Quick Start

Install from npm

npm install -g claude-test-bench
claude-test-bench
# Opens browser at http://localhost:3847

Or run from source

git clone https://github.com/Z-M-Huang/claude-test-bench.git
cd claude-test-bench
npm install
npm run build && node dist/bin/ctb.js

Once the UI opens:

  • Create a Provider — Add your API key, base URL, and model on the Providers page
  • Pick a Scenario — 8 built-in scenarios are seeded on first launch, or create your own
  • Run — Pair a provider with a scenario and execute
  • Evaluate — Grade the completed run with an LLM evaluator

CLI flags:

FlagDefaultDescription
--port N3847HTTP port
--log-levelinfodebug, info, warn, error
--open / --no-open--openAuto-open browser on launch

Built-in Test Suites

Claude Test Bench ships with 8 ready-to-run scenarios in 4 paired suites. Each pair has a baseline (no CLAUDE.md) and an instruction-guided version (with CLAUDE.md) so you can directly measure instruction effectiveness.

SuiteBaseline CategoryWith CLAUDE.md CategoryWhat It TestsKey Trap
Migration PlanningreasoningreasoningCritical path reasoning under time constraintsNaive sequential scheduling misses the window
Car Wash Testreasoninginstruction-followingGoal-oriented physical reasoningRecommending walking 50m instead of driving the car to the car wash
Negative Analysisreasoninginstruction-followingFailure-first evaluation of a startup pitchSycophantic "looks great!" response vs. structured risk analysis
Golden Rulesinstruction-followinginstruction-followingAuth refactor with 7 deliberate trapsSycophancy bait, fake dead code, timing attack, push-to-main

Each suite is in docs/schemas/ and auto-seeds on first launch. The instruction-guided versions include CLAUDE.md content designed to improve performance on that specific task — the evaluation then measures whether the instructions were actually effective.

How Evaluation Works

Every completed run is graded by an LLM evaluator using a split-query strategy — multiple focused queries instead of one monolithic prompt:

Run Transcript
     │
     ▼
┌─────────────────┐     ┌──────────────────┐
│  1. Score Query  │     │ 2. Compliance    │
│                  │     │    Query         │
│ • Dimension      │     │                  │
│   scores (0-10)  │     │ • Instruction    │
│ • Answer         │     │   compliance     │
│   comparison     │     │   (followed /    │
│ • Critical       │     │    violated /    │
│   requirements   │     │    not_applicable│
│                  │     │    per rule)     │
└────────┬────────┘     └────────┬─────────┘
         │                       │
         ▼                       ▼
┌─────────────────────────────────────────┐
│          3. Debate Rounds               │
│  (optional, multi-evaluator consensus)  │
└────────────────┬────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────┐
│          4. Synthesis Query             │
│  Final weighted scores + confidence     │
│  + dissenting opinions                  │
└─────────────────────────────────────────┘

The evaluation produces:

  • Per-dimension scores (0-10) for each scoring dimension
  • Answer comparison — match status, similarity (0-1), explanation
  • Critical requirement results — binary pass/fail with evidence
  • Instruction compliance report — each CLAUDE.md instruction classified as followed, violated, or not_applicable
  • Strengths and weaknesses — summary assessment from the evaluator
  • Weighted total score with confidence level and dissent tracking
  • Cost tracking — aggregated API cost per evaluator

Core Concepts

Provider

A Provider defines how to connect to an LLM. It contains only connection credentials and model configuration:

{
  "name": "Sonnet via API",
  "description": "Anthropic API with claude-sonnet-4-6",
  "provider": {
    "kind": "api",
    "baseUrl": "https://api.anthropic.com",
    "apiKey": "<ANTHROPIC_API_KEY>",
    "model": "claude-sonnet-4-6"
  },
  "timeoutSeconds": 300
}

Providers support two auth modes: api (API key + base URL) and oauth (OAuth token). See docs/schemas/provider.example.json, docs/schemas/provider-oauth.example.json, and docs/schemas/provider-api.example.json.

Security note: Provider JSON files are stored in .claude-test-bench/providers/ which is gitignored. Credentials are stored locally in plaintext and transmitted to the configured provider endpoint at run time.

Scenario

A Scenario defines the complete test: the task, agent configuration, workspace, and grading rubric. Agent behavior settings (CLAUDE.md, rules, skills, subagents, MCP servers, permissions) live here — not on the provider — so you can test the same model with different configurations.

{
  "name": "Migration Planning Scenario",
  "category": "reasoning",
  "claudeMdFiles": [
    { "role": "project", "content": "# Rules\n- Always calculate the critical path first" }
  ],
  "rules": [],
  "skills": [],
  "subagents": [],
  "mcpServers": [],
  "permissionMode": "acceptEdits",
  "allowedTools": ["Read", "Write", "Edit"],
  "prompt": "Read migration_plan.md and create an optimal schedule...",
  "workspaceFiles": [{ "path": "migration_plan.md", "content": "..." }],
  "expectedAnswer": "The migration CANNOT fit in the 4-hour window...",
  "criticalRequirements": ["Must identify the window is exceeded"],
  "gradingGuidelines": "Grade on correctness, reasoning quality...",
  "scoringDimensions": [
    { "name": "Correctness", "weight": 0.4, "description": "..." }
  ]
}

Key fields:

FieldPurpose
claudeMdFilesCLAUDE.md entries (project and/or user role) written to the workspace
rules, skills, subagentsNamed agent configuration entries
mcpServersMCP server configs (stdio, http, or sse transport)
permissionModeSDK permission mode (default, acceptEdits, bypassPermissions, plan, dontAsk)
allowedTools / disallowedToolsTool allow/deny lists
maxTurnsMaximum conversation turns
promptThe task given to the agent
workspaceFilesFiles written to a temp directory before execution
expectedAnswerGround truth for comparison
criticalRequirementsMust-pass checks (binary pass/fail)
gradingGuidelinesThe LLM grading prompt — tells the evaluator what to look for
scoringDimensionsNamed dimensions with weights (must sum to 1.0), each scored 0-10

Categories: planning, instruction-following, reasoning, tool-strategy, error-handling, ambiguity-handling, scope-management, custom.

See docs/schemas/scenario.example.json and docs/schemas/scenario-baseline.example.json for complete examples.

Run

A Run pairs one Provider with one Scenario. The Claude Agent SDK's query() function executes the scenario prompt inside an isolated temp workspace with the scenario's agent configuration (CLAUDE.md, rules, skills, etc.). The full execution transcript — tool calls, thinking, output — is captured and stored.

Evaluation

An Evaluation grades a completed Run using the split-query strategy described above. One or more LLM evaluators read the full transcript and produce scores, compliance reports, and a synthesized verdict.

How to Test CLAUDE.md Effectiveness

Step 1: Create a Provider

Create a provider with your API credentials on the Providers page, or write JSON to .claude-test-bench/providers/{id}.json.

Step 2: Create two Scenarios — baseline and with instructions

Baseline (no CLAUDE.md):

{
  "name": "Migration Baseline",
  "claudeMdFiles": [],
  "prompt": "Read migration_plan.md and create an optimal schedule...",
  "...": "..."
}

With instructions (CLAUDE.md configured):

{
  "name": "Migration With Instructions",
  "claudeMdFiles": [{
    "role": "project",
    "content": "# Reasoning Guidelines\n\n1. Always calculate the critical path before proposing a schedule\n2. Verify all constraints are satisfied before declaring success\n3. If a constraint cannot be met, state this clearly rather than forcing a solution\n4. Show all arithmetic work"
  }],
  "prompt": "Read migration_plan.md and create an optimal schedule...",
  "...": "..."
}

Both scenarios should have identical prompts, workspace files, and grading criteria — only the claudeMdFiles differ.

Step 3: Write instruction-aware grading guidelines

Write gradingGuidelines that evaluate both correctness AND whether each CLAUDE.md instruction visibly influenced behavior. See docs/schemas/scenario-with-claude-md.example.json for a full example.

Step 4: Run both scenarios against the same provider

Use the Run page to execute each scenario with your provider.

Step 5: Evaluate and compare

Evaluate both runs. The compliance query will report which instructions were followed, violated, or not_applicable. Compare the two evaluation reports to see whether instructions changed behavior and improved scores.

Creating Providers & Scenarios

You can create providers and scenarios through the web UI or by writing JSON files directly:

  • Via UI — Navigate to the Providers or Scenarios page in the web interface
  • Via JSON — Write files to .claude-test-bench/providers/{id}.json or .claude-test-bench/scenarios/custom/{id}.json
  • Via AI — Ask an AI assistant to generate JSON matching the schemas in docs/schemas/

Reference examples:

FileDescription
docs/schemas/provider.example.jsonAPI provider with CLAUDE.md
docs/schemas/provider-oauth.example.jsonOAuth provider (baseline)
docs/schemas/provider-api.example.jsonAPI provider with reasoning instructions
docs/schemas/scenario.example.jsonMulti-file refactor scenario
docs/schemas/scenario-baseline.example.jsonMigration scenario (baseline grading)
docs/schemas/scenario-with-claude-md.example.jsonMigration scenario (instruction-aware grading)
docs/schemas/scenario-carwash-baseline.example.jsonCar Wash Test (baseline)
docs/schemas/scenario-carwash-with-claude-md.example.jsonCar Wash Test (with goal-first reasoning)
docs/schemas/scenario-negative-analysis-baseline.example.jsonStartup evaluation (baseline)
docs/schemas/scenario-negative-analysis-with-claude-md.example.jsonStartup evaluation (with negative-first framework)
docs/schemas/scenario-golden-rules-baseline.example.jsonAuth refactor with 7 traps (baseline)
docs/schemas/scenario-golden-rules-with-claude-md.example.jsonAuth refactor with 7 traps (with Golden CLAUDE.md)

Data Storage & Security

All data lives in .claude-test-bench/ in the current working directory (typically the repo root):

.claude-test-bench/
  providers/{id}.json           # Provider records (contains API keys)
  scenarios/custom/{id}.json    # User-created scenarios
  runs/{id}.json                # Run records with full transcript
  evaluations/{id}.json         # Evaluation records with scores
  logs/ctb.log                  # JSON log file (rotated at 2MB, 25 files max)

This entire directory is gitignored. Credentials are stored locally in plaintext and transmitted to the configured provider endpoint at run time. Never commit .claude-test-bench/ or .env files.

Architecture

bin/ctb.ts  ──▶  Express app (src/server/index.ts)  ──▶  React SPA (src/web/)
                      │
         IStorage  IRunner  IEvaluator  ILogger  IWorkspaceBuilder
            │          │         │          │          │
     JsonFileStorage  ScenarioRunner  EvaluationOrchestrator  JsonLogger  WorkspaceBuilder
            │          │         │
     .claude-test-bench/   SDK query()   SDK query() (split-query eval)

Tech Stack

LayerTechnology
BackendExpress 5, Node.js (ESM)
FrontendReact 19, React Router 7, Tailwind CSS 4
BuildTypeScript 5, Vite 6
Agent SDK@anthropic-ai/claude-agent-sdk (query, streaming)
TestingVitest (unit), Playwright (E2E), Supertest (routes)
StorageFile-based JSON in .claude-test-bench/
StreamingServer-Sent Events (SSE) for run and eval progress

Design Patterns

  • Interface-first — Every service has an interface in src/server/interfaces/. Route factories accept interfaces, not implementations. This enables testing with mocks.
  • Route factories — All route files export createXxxRoutes(deps) returning an Express Router.
  • FsAdapter — File system operations go through an adapter for testability.
  • Workspace isolation — Each run gets its own temp directory with CLAUDE.md, rules, skills, and workspace files.

Development

# Run tests
npm test

# Run tests with coverage
npm run test:coverage

# Lint (source files)
npm run lint

# Format (source files)
npm run format

# Dev mode (watch server TypeScript compilation)
npm run dev

# Build everything (server + web)
npm run build

npm run dev watches only the server TypeScript. For frontend development, use npx vite dev in the src/web/ directory or run npm run build:web after changes.

Project Structure

claude-test-bench/
  bin/
    ctb.ts                          # CLI entry point (port, log-level, --open/--no-open)
  docs/
    schemas/                        # Example JSON for providers and scenarios
  e2e/                              # Playwright end-to-end tests
  src/
    server/
      index.ts                      # Express app factory (createApp)
      interfaces/
        evaluator.ts                # IEvaluator
        logger.ts                   # ILogger
        runner.ts                   # IRunner
        storage.ts                  # IStorage
        workspace.ts                # IWorkspaceBuilder
      routes/
        providers.ts                # /api/providers CRUD
        scenarios.ts                # /api/scenarios CRUD
        runs.ts                     # /api/runs + run execution
        evaluations.ts              # /api/evaluations + eval pipeline
        run-queue.ts                # Run queue management
        eval-queue.ts               # Eval queue management
        run-sse.ts                  # SSE streaming for run progress
      services/
        storage.ts                  # JsonFileStorage (file-based JSON)
        runner.ts                   # ScenarioRunner (SDK query())
        evaluator.ts                # EvaluationOrchestrator (split-query eval)
        workspace.ts                # WorkspaceBuilder (temp dir per run)
        agent-mapper.ts             # Scenario types -> SDK option types
        env-builder.ts              # Provider config -> env vars
        eval-prompts.ts             # Prompt builders for eval queries
        eval-parsers.ts             # Response parsers for eval results
        eval-parsers-debate-impl.ts # Debate round parsing
        eval-helpers.ts             # Consensus, answer comparison, compliance merge
        instruction-parser.ts       # Splits CLAUDE.md into testable blocks
        transcript-formatter.ts     # Formats run messages for eval context
        fs-adapter.ts               # File system abstraction for testing
        logger.ts                   # JsonLogger with file output
        log-rotator.ts              # Log rotation (2MB/file, 25 files max)
        seeder.ts                   # Seed storage on first launch
      types/
        provider.ts                 # Provider, ProviderConfig, ClaudeMdEntry, etc.
        scenario.ts                 # Scenario, WorkspaceFile, ScenarioCategory
        run.ts                      # Run, RunStatus, SDKMessageRecord
        evaluation.ts               # Evaluation, EvaluationRound, InstructionCompliance
        index.ts                    # Re-exports
    web/
      src/
        App.tsx                     # React router (all page routes)
        api.ts                      # API client
        main.tsx                    # Entry point
        index.css                   # Tailwind CSS entry
        components/                 # Shared UI components
        hooks/                      # Shared React hooks (useLiveProcess, etc.)
        pages/
          Dashboard.tsx             # /
          ProviderList.tsx          # /providers
          ProviderEditor.tsx        # /providers/new, /providers/:id/edit
          ScenarioList.tsx          # /scenarios
          ScenarioEditor.tsx        # /scenarios/new, /scenarios/:id
          RunPage.tsx               # /run
          RunHistory.tsx            # /history (split-panel with eval sidebar)
          RunDetail.tsx             # /runs/:id
          EvalConfig.tsx            # /runs/:id/evaluate
          ReportView.tsx            # /evaluations/:id
  .env.example                      # Environment template
  package.json                      # Scripts, deps, bin entries
  tsconfig.json                     # Base TypeScript config
  tsconfig.server.json              # Server build config
  tsconfig.bin.json                 # CLI build config
  vite.config.ts                    # Vite config (web build)
  vitest.config.ts                  # Test runner config
  playwright.config.ts              # E2E test config
  tailwind.config.ts                # Tailwind configuration

Contributing

Contributions welcome! Please open an issue first to discuss what you'd like to change.

git clone https://github.com/Z-M-Huang/claude-test-bench.git
cd claude-test-bench
npm install
npm test
npm run build
npm run lint

License

Apache-2.0

Built for testing Claude Code agent configurations. Not affiliated with or endorsed by Anthropic.

Keywords

claude

FAQs

Package last updated on 27 Mar 2026

Did you know?

Socket

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

Related posts