@agent-pattern-labs/iso-eval

Behavioral eval runner for AI coding agents — snapshot a workspace, hand it to a runner with a task prompt, score the resulting filesystem/git state.

Latest version: 0.4.1 · Source: npm · Maintainers: 1
Behavioral eval runner for AI coding agents.

agentmd lints prompt structure, isolint lints prompt prose, and iso-harness fans out the compiled source into every harness file layout. None of them answers the next question: did the agent actually do the task? That's what @agent-pattern-labs/iso-eval scores.

You give it a suite of tasks — each with a baseline workspace, a prompt, and a set of checks — and it snapshots the workspace per trial, hands it to a runner, then verifies the resulting filesystem / command state against your checks.
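To make the per-trial snapshot step concrete, here is a minimal sketch of copying a baseline workspace into a fresh temp directory. The function name `snapshotWorkspace` and the directory prefix are hypothetical; iso-eval's own snapshot code may differ.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

/**
 * Copy a baseline workspace into a new temp directory and return its path.
 * Hypothetical helper: illustrates the idea, not iso-eval's internals.
 */
function snapshotWorkspace(baselineDir: string, prefix = "iso-eval-trial-"): string {
  // mkdtempSync appends random characters, so each trial gets its own directory.
  const trialDir = fs.mkdtempSync(path.join(os.tmpdir(), prefix));
  // Recursively copy the baseline contents into the trial directory.
  fs.cpSync(baselineDir, trialDir, { recursive: true });
  return trialDir;
}
```

Because every trial works on its own copy, whatever the runner writes there leaves the baseline untouched.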

Built-in runners today:

  • fake — deterministic CI/offline runner that executes $ ... lines from the prompt as shell in the snapshotted workspace.
  • codex — real-agent runner that shells out to codex exec in the per-trial workspace and captures the final assistant message.
  • claude-code — real-agent runner that shells out to claude -p in the per-trial workspace.
  • cursor — real-agent runner that shells out to cursor-agent --print in the per-trial workspace.
  • opencode — real-agent runner that shells out to opencode run in the per-trial workspace.

The library API still accepts any RunnerFn, so you can plug in other harnesses without waiting on a packaged runner.
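The fake runner described above executes `$ ...` lines from the prompt. The exact parsing rules are not spelled out here, so the sketch below assumes a command line is any line that, once trimmed, starts with `$ `. The helper name `extractShellLines` is hypothetical.

```typescript
/**
 * Return the shell commands found on lines that start with "$ ".
 * Assumption: leading whitespace is ignored and empty commands are dropped.
 */
function extractShellLines(prompt: string): string[] {
  return prompt
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.startsWith("$ "))
    .map((line) => line.slice(2).trim())
    .filter((cmd) => cmd.length > 0);
}
```

Under that assumption, a prompt file containing the line `$ echo hello > greeting.txt` would yield one command, which is enough to drive a suite deterministically in CI.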

Install

npm install -D @agent-pattern-labs/iso-eval

Suite shape

# eval.yml
suite: refactor-basic
runner: fake              # fake | codex | claude-code | cursor | opencode
timeoutMs: 120000
harness:
  source: ../dist         # optional: stage generated harness files into each trial

tasks:
  - id: write-greeting
    prompt: tasks/write-greeting.md    # path (relative to eval.yml) or inline
    workspace: workspace/              # baseline dir, copied per-trial into tmpdir
    trials: 1
    checks:
      - { type: file_exists,       path: greeting.txt }
      - { type: file_contains,     path: greeting.txt, value: "hello" }
      - { type: file_not_contains, path: greeting.txt, value: "TODO"  }
      - { type: command, run: "test -f greeting.txt", expectExit: 0 }

Supported checks

| type | asserts |
| --- | --- |
| command | shell command exits with expectExit (default 0); optional stdout contains/matches |
| file_exists | file at path exists in the workspace |
| file_contains | file at path contains the literal substring value |
| file_not_contains | file at path does NOT contain value |
| file_matches | file at path matches the regex matches |
| llm_judge | a user-supplied JudgeFn answers yes to prompt against runner stdout/stderr |
| agentmd_adherence | per-rule pass rate from agentmd test meets minPassRate; optional ruleId filter |
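As an illustration of what the four file checks assert, here is a small sketch that evaluates them against a workspace directory. The `FileCheck` type and `runFileCheck` function are hypothetical and only mirror the YAML fields shown above. Treating a missing file as a failure for every check, including file_not_contains, is an assumption.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical types mirroring the YAML check fields.
type FileCheck =
  | { type: "file_exists"; path: string }
  | { type: "file_contains"; path: string; value: string }
  | { type: "file_not_contains"; path: string; value: string }
  | { type: "file_matches"; path: string; matches: string };

/** Evaluate one file check relative to workspaceDir; true means the check passed. */
function runFileCheck(workspaceDir: string, check: FileCheck): boolean {
  const target = path.join(workspaceDir, check.path);
  // Assumption: a missing (or non-regular) file fails every check type.
  if (!fs.existsSync(target) || !fs.statSync(target).isFile()) return false;
  if (check.type === "file_exists") return true;

  const text = fs.readFileSync(target, "utf8");
  if (check.type === "file_contains") return text.includes(check.value);
  if (check.type === "file_not_contains") return !text.includes(check.value);
  // Remaining case: file_matches, where `matches` is a regex source string.
  return new RegExp(check.matches).test(text);
}
```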

agentmd_adherence

- type: agentmd_adherence
  promptFile: ../agent.md         # path to agentmd source (relative to eval.yml)
  fixtures: ../fixtures.yml       # path to agentmd fixture file
  ruleId: H3                      # optional — score only this rule
  minPassRate: 0.9                # required — pass rate floor in [0, 1]
  via: claude-code                # optional — default claude-code (api | claude-code | fake)
  model: claude-haiku-4-5         # optional — forwarded as --model
  timeoutMs: 180000               # optional — subprocess timeout

Shells out to the agentmd CLI (bundled as a runtime dependency) via agentmd test <promptFile> --fixtures <fixtures> --format json, parses the per-rule check outcomes, computes the pass rate for ruleId (or overall if omitted), and fails the check when the rate is below minPassRate. Tests can inject a fake subprocess runner via the library API (AgentmdSpawnFn) so CI doesn't need an API key.
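The pass-rate comparison can be sketched as follows. The JSON shape that agentmd emits is not documented here, so `RuleOutcome` is a hypothetical stand-in for one parsed per-rule check outcome. Scoring an empty result set as 0 is also an assumption.

```typescript
// Hypothetical shape for one parsed check outcome.
interface RuleOutcome {
  ruleId: string;
  passed: boolean;
}

/** Fraction of passing outcomes, optionally restricted to a single rule. */
function passRate(outcomes: RuleOutcome[], ruleId?: string): number {
  const scoped = ruleId ? outcomes.filter((o) => o.ruleId === ruleId) : outcomes;
  // Assumption: no matching outcomes scores 0 rather than passing vacuously.
  if (scoped.length === 0) return 0;
  return scoped.filter((o) => o.passed).length / scoped.length;
}

/** The check passes when the rate is at or above the floor. */
function adherencePasses(outcomes: RuleOutcome[], minPassRate: number, ruleId?: string): boolean {
  return passRate(outcomes, ruleId) >= minPassRate;
}
```

With minPassRate: 0.9 and ruleId: H3, nine passing H3 outcomes out of ten would pass the check, and eight out of ten would fail it.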

CLI

iso-eval run  examples/suites/echo-basic/eval.yml
iso-eval plan examples/suites/echo-basic/eval.yml

iso-eval run eval.yml --filter write-greeting --concurrency 2 --json
iso-eval run eval.yml --runner claude-code --harness-source ../dist
iso-eval run eval.yml --runner cursor --harness-source ../dist
iso-eval run eval.yml --runner opencode --harness-source ../dist
iso-eval run eval.yml --keep-workspaces           # skip tmpdir cleanup for debugging

run exits 0 on all-pass, 1 on any failure, 2 on invalid invocation.
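A CI wrapper may want to tell failed checks apart from a misconfigured invocation. The sketch below maps the three documented exit codes to labels. The `RunOutcome` labels and the `interpretExit` helper are hypothetical, and any other code is reported as unexpected.

```typescript
type RunOutcome = "all-pass" | "failure" | "invalid-invocation" | "unexpected";

/** Map an `iso-eval run` exit code to a label a CI script can branch on. */
function interpretExit(code: number | null): RunOutcome {
  switch (code) {
    case 0:
      return "all-pass";
    case 1:
      return "failure";
    case 2:
      return "invalid-invocation";
    default:
      // Includes null, which Node reports when a process is killed by a signal.
      return "unexpected";
  }
}
```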

--runner and --harness-source let you replay the same suite through a different packaged harness without rewriting the checks in eval.yml.

Real runners and harness staging

Set runner: in YAML, or override it at the CLI with --runner. harness.source is optional; when present, iso-eval stages the generated harness files you want the runner to see into each snapshotted workspace.

codex

suite: refactor-basic
runner: codex
timeoutMs: 180000
harness:
  source: ../dist

Accepted harness.source shapes:

  • a project directory containing AGENTS.md and/or .codex/
  • a direct AGENTS.md path
  • a direct .codex/config.toml path
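To show how the three accepted shapes could be handled, here is a rough staging sketch for the codex case. `stageCodexHarness` is a hypothetical helper and the real staging logic may differ. It assumes a direct config.toml path sits inside a .codex/ directory and should land at .codex/config.toml in the workspace.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

/**
 * Copy Codex harness files from `source` into `workspaceDir`.
 * Returns the workspace-relative paths that were staged.
 */
function stageCodexHarness(source: string, workspaceDir: string): string[] {
  // Shape 1: a project directory containing AGENTS.md and/or .codex/
  if (fs.statSync(source).isDirectory()) {
    const staged: string[] = [];
    for (const entry of ["AGENTS.md", ".codex"]) {
      const from = path.join(source, entry);
      if (fs.existsSync(from)) {
        fs.cpSync(from, path.join(workspaceDir, entry), { recursive: true });
        staged.push(entry);
      }
    }
    return staged;
  }

  const name = path.basename(source);
  // Shape 2: a direct AGENTS.md path
  if (name === "AGENTS.md") {
    fs.copyFileSync(source, path.join(workspaceDir, "AGENTS.md"));
    return ["AGENTS.md"];
  }
  // Shape 3: a direct .codex/config.toml path
  if (name === "config.toml" && path.basename(path.dirname(source)) === ".codex") {
    fs.mkdirSync(path.join(workspaceDir, ".codex"), { recursive: true });
    fs.copyFileSync(source, path.join(workspaceDir, ".codex", "config.toml"));
    return [path.join(".codex", "config.toml")];
  }
  throw new Error(`Unsupported harness source: ${source}`);
}
```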

claude-code

Accepted harness.source shapes:

  • a project directory containing CLAUDE.md, .claude/, and/or .mcp.json
  • a direct CLAUDE.md path
  • a direct .claude/ path
  • a direct .claude/settings.json path
  • a direct .mcp.json path

The runner shells out to claude -p --no-session-persistence and passes .mcp.json through --mcp-config when present.

opencode

Accepted harness.source shapes:

  • a project directory containing AGENTS.md, opencode.json, and/or .opencode/
  • a direct AGENTS.md path
  • a direct opencode.json path
  • a direct .opencode/ path

The runner shells out to opencode run --dir <workspace> and defaults to --pure so each trial stays self-contained.

cursor

Accepted harness.source shapes:

  • a project directory containing .cursor/, AGENTS.md, and/or CLAUDE.md
  • a direct .cursor/ path
  • a direct .cursor/rules/ path
  • a direct .cursor/rules/*.mdc path
  • a direct .cursor/mcp.json path
  • a direct AGENTS.md path
  • a direct CLAUDE.md path

The runner shells out to cursor-agent --print --output-format text --workspace <workspace> and stages any Cursor harness files you exported with iso-harness into the per-trial workspace first.

This lets one suite exported from iso-trace be replayed across the packaged runners with the same task prompt and checks.
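The per-runner invocations described in this section can be summarized in one sketch. The flags are the ones named above. Where the task prompt goes on each command line is an assumption (a positional argument here), and `buildCommand` is a hypothetical helper, not part of the package API.

```typescript
type RunnerName = "codex" | "claude-code" | "cursor" | "opencode";

interface Command {
  bin: string;
  args: string[];
}

/**
 * Build the argv for one packaged runner.
 * Assumption: the prompt is passed as a positional argument in every case.
 */
function buildCommand(
  runner: RunnerName,
  workspaceDir: string,
  prompt: string,
  mcpConfigPath?: string, // only meaningful for claude-code
): Command {
  switch (runner) {
    case "codex":
      return { bin: "codex", args: ["exec", prompt] };
    case "claude-code": {
      const args = ["-p", prompt, "--no-session-persistence"];
      // .mcp.json is forwarded only when the harness provided one.
      if (mcpConfigPath) args.push("--mcp-config", mcpConfigPath);
      return { bin: "claude", args };
    }
    case "opencode":
      return { bin: "opencode", args: ["run", "--dir", workspaceDir, "--pure", prompt] };
    case "cursor":
      return {
        bin: "cursor-agent",
        args: ["--print", "--output-format", "text", "--workspace", workspaceDir, prompt],
      };
  }
}
```

In this sketch, codex and claude-code take no workspace flag, so they rely on the process being spawned with its working directory set to the per-trial workspace.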

Library API

import { loadSuite, run, formatReport, fakeRunner } from "@agent-pattern-labs/iso-eval";

const suite = loadSuite("./eval.yml");
const report = await run(suite, {
  runner: fakeRunner,
  concurrency: 2,
  onTaskComplete: (t) => console.log(t.id, t.passed ? "✓" : "✗"),
});
console.log(formatReport(report));
process.exit(report.passed ? 0 : 1);

Bring your own runner

The YAML runner: field selects from shipped runners; the library accepts any RunnerFn:

import type { RunnerFn } from "@agent-pattern-labs/iso-eval";

const myRunner: RunnerFn = async ({ workspaceDir, taskPrompt, timeoutMs, harnessSource }) => {
  // spawn your agent (claude -p / codex exec / …) with cwd = workspaceDir
  // optionally stage files from harnessSource before invoking it
  // return { exitCode, stdout, stderr, durationMs }
};
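A fuller version of the stub might look like the sketch below. So that it stands alone, it declares local `RunnerInput` and `RunnerResult` types that mirror the fields in the stub. These are assumptions, not the package's exported types. The `my-agent` binary and its `--prompt` flag are placeholders for whatever agent CLI you actually run.

```typescript
import { spawnSync } from "node:child_process";

// Local stand-ins for the shapes implied by the RunnerFn stub.
interface RunnerInput {
  workspaceDir: string;
  taskPrompt: string;
  timeoutMs: number;
  harnessSource?: string;
}
interface RunnerResult {
  exitCode: number;
  stdout: string;
  stderr: string;
  durationMs: number;
}

/** Run one command with cwd set to the trial workspace and capture its output. */
function runInWorkspace(bin: string, args: string[], cwd: string, timeoutMs: number): RunnerResult {
  const started = Date.now();
  const result = spawnSync(bin, args, { cwd, timeout: timeoutMs, encoding: "utf8" });
  return {
    // status is null on timeout, signal, or spawn failure; mapped to 1 in this sketch.
    exitCode: result.status ?? 1,
    stdout: result.stdout || "",
    stderr: result.stderr || (result.error ? String(result.error) : ""),
    durationMs: Date.now() - started,
  };
}

// "my-agent" and "--prompt" are placeholders, not a real CLI.
const myRunner = async (input: RunnerInput): Promise<RunnerResult> =>
  runInWorkspace("my-agent", ["--prompt", input.taskPrompt], input.workspaceDir, input.timeoutMs);
```

spawnSync keeps the sketch short, but it blocks the event loop. A runner meant to be used with concurrency above 1 would more likely use the asynchronous spawn instead.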

Bring your own judge (for llm_judge checks)

import type { JudgeFn } from "@agent-pattern-labs/iso-eval";

const judge: JudgeFn = async (prompt, output) => {
  // call your model; return true if the rule was followed
};

await run(suite, { runner: fakeRunner, judge });
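One way to structure a judge is to keep the model call separate from the verdict parsing. The sketch below declares a local `JudgeFn` type matching the stub's signature, which is an assumption about the exported type. `callModel` is a hypothetical function you would supply for your own provider. It follows the convention that yes means the rule was followed.

```typescript
// Local stand-in matching the stub's (prompt, output) => boolean shape.
type JudgeFn = (prompt: string, output: string) => Promise<boolean>;

/** True only when the reply starts with the word "yes" (case-insensitive). */
function parseVerdict(reply: string): boolean {
  return /^\s*yes\b/i.test(reply);
}

/** Wrap a model-calling function (supplied by you) into a JudgeFn. */
function makeJudge(callModel: (question: string) => Promise<string>): JudgeFn {
  return async (prompt, output) => {
    const question = `${prompt}\n\nAgent output:\n${output}\n\nAnswer yes or no.`;
    return parseVerdict(await callModel(question));
  };
}
```

Anything other than a leading "yes" counts as a failure here, so an ambiguous or malformed model reply fails the check rather than passing it.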

How this fits the rest of the pipeline

agent.md  →  agentmd lint  →  agentmd render  →  isolint lint  →  iso-harness build
                                                                         │
                                                                         ▼
                                                          project w/ CLAUDE.md etc.
                                                                         │  iso-eval run
                                                                         ▼
                                                                per-task pass / fail
  • @agent-pattern-labs/agentmd measures per-rule adherence on text output (input string → output string → check).
  • @agent-pattern-labs/iso-eval measures task success on a real workspace (snapshot dir → agent acts → filesystem state → check).

The two compose: an iso-eval suite can include llm_judge checks that reuse the same judge convention (yes = rule followed), plus agentmd_adherence checks that fold a fixture-level adherence score into the task report.

License

MIT — see LICENSE.

Keywords

agent

Package last updated on 21 May 2026
