@agent-pattern-labs/iso-eval

Behavioral eval runner for AI coding agents — snapshot a workspace, hand it to a runner with a task prompt, score the resulting filesystem/git state.

Version: 0.4.1

Version published: last month

Weekly downloads: 29

Maintainers: 1

Created: last month

@agent-pattern-labs/iso-eval

Behavioral eval runner for AI coding agents.

agentmd lints prompt structure, isolint lints prompt prose, iso-harness fans out the compiled source into every harness file layout. None of them answer the next question: did the agent actually do the task? That's what @agent-pattern-labs/iso-eval scores.

You give it a suite of tasks — each with a baseline workspace, a prompt, and a set of checks — and it snapshots the workspace per trial, hands it to a runner, then verifies the resulting filesystem / command state against your checks.
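
The per-trial snapshot step can be pictured as a plain directory copy into a fresh temp folder. The sketch below is illustrative only: `snapshotWorkspace` and `removeSnapshot` are hypothetical names, not part of the package's API, and the real implementation may differ.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper: copy the baseline workspace into a fresh temp directory
// so each trial starts from the same files and cannot modify the baseline.
function snapshotWorkspace(baselineDir: string, prefix = "iso-eval-trial-"): string {
  const trialDir = fs.mkdtempSync(path.join(os.tmpdir(), prefix));
  // Copies the contents of baselineDir into the (already existing) trialDir.
  fs.cpSync(baselineDir, trialDir, { recursive: true });
  return trialDir;
}

// Hypothetical cleanup step, i.e. what happens when workspaces are not kept.
function removeSnapshot(trialDir: string): void {
  fs.rmSync(trialDir, { recursive: true, force: true });
}
```

Copying rather than running in place is what makes repeated trials comparable: every trial sees the same starting state, and a failed trial leaves the baseline untouched.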

Built-in runners today:

fake — deterministic CI/offline runner that executes $ ... lines from the prompt as shell in the snapshotted workspace.
codex — real-agent runner that shells out to codex exec in the per-trial workspace and captures the final assistant message.
claude-code — real-agent runner that shells out to claude -p in the per-trial workspace.
cursor — real-agent runner that shells out to cursor-agent --print in the per-trial workspace.
opencode — real-agent runner that shells out to opencode run in the per-trial workspace.

The library API still accepts any RunnerFn, so you can plug in other harnesses without waiting on a packaged runner.
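
To make the fake runner's `$ ...` convention concrete, here is a minimal sketch of pulling shell commands out of a prompt. `extractShellCommands` is a hypothetical name, and the assumption that a command line starts with `$ ` after trimming is mine; the packaged runner's exact parsing rules are not documented here.

```typescript
// Hypothetical parser: collect commands from lines that start with "$ ".
// Every other line is treated as prose and ignored.
function extractShellCommands(prompt: string): string[] {
  return prompt
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.startsWith("$ "))
    .map((line) => line.slice(2).trim())
    .filter((command) => command.length > 0);
}

const examplePrompt = [
  "Create a greeting file.",
  "$ echo hello > greeting.txt",
  "Then confirm it exists.",
  "$ test -f greeting.txt",
].join("\n");

// Logs the two commands, in the order they appear in the prompt.
console.log(extractShellCommands(examplePrompt));
```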

Install

npm install -D @agent-pattern-labs/iso-eval

Suite shape

# eval.yml
suite: refactor-basic
runner: fake              # fake | codex | claude-code | cursor | opencode
timeoutMs: 120000
harness:
  source: ../dist         # optional: stage generated harness files into each trial

tasks:
  - id: write-greeting
    prompt: tasks/write-greeting.md    # path (relative to eval.yml) or inline
    workspace: workspace/              # baseline dir, copied per-trial into tmpdir
    trials: 1
    checks:
      - { type: file_exists,       path: greeting.txt }
      - { type: file_contains,     path: greeting.txt, value: "hello" }
      - { type: file_not_contains, path: greeting.txt, value: "TODO"  }
      - { type: command, run: "test -f greeting.txt", expectExit: 0 }

Supported checks

| type | asserts |
| --- | --- |
| `command` | shell command exits with `expectExit` (default 0); optional stdout contains/matches |
| `file_exists` | file at `path` exists in the workspace |
| `file_contains` | file at `path` contains the literal substring `value` |
| `file_not_contains` | file at `path` does NOT contain `value` |
| `file_matches` | file at `path` matches the regex `matches` |
| `llm_judge` | a user-supplied `JudgeFn` answers yes to `prompt` against runner stdout/stderr |
| `agentmd_adherence` | per-rule pass rate from `agentmd test` meets `minPassRate`; optional `ruleId` filter |
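
As an illustration of how the four file checks could be evaluated against a finished workspace, consider the sketch below. The `FileCheck` type and `evaluateFileCheck` function are hypothetical and only mirror the fields listed above. Treating a missing file as a failure for every check type, including `file_not_contains`, is an assumption rather than documented behavior.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Assumed shape: the file-based subset of the check types listed above.
type FileCheck =
  | { type: "file_exists"; path: string }
  | { type: "file_contains"; path: string; value: string }
  | { type: "file_not_contains"; path: string; value: string }
  | { type: "file_matches"; path: string; matches: string };

// Hypothetical evaluator: returns true when the check passes in workspaceDir.
function evaluateFileCheck(workspaceDir: string, check: FileCheck): boolean {
  const target = path.join(workspaceDir, check.path);
  // Assumption: a missing file fails every check type.
  if (!fs.existsSync(target)) return false;
  if (check.type === "file_exists") return true;

  const text = fs.readFileSync(target, "utf8");
  switch (check.type) {
    case "file_contains":
      return text.includes(check.value);
    case "file_not_contains":
      return !text.includes(check.value);
    case "file_matches":
      return new RegExp(check.matches).test(text);
  }
}
```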

`agentmd_adherence`

- type: agentmd_adherence
  promptFile: ../agent.md         # path to agentmd source (relative to eval.yml)
  fixtures: ../fixtures.yml       # path to agentmd fixture file
  ruleId: H3                      # optional — score only this rule
  minPassRate: 0.9                # required — pass rate floor in [0, 1]
  via: claude-code                # optional — default claude-code (api | claude-code | fake)
  model: claude-haiku-4-5         # optional — forwarded as --model
  timeoutMs: 180000               # optional — subprocess timeout

This check shells out to the agentmd CLI (bundled as a runtime dependency) by running agentmd test <promptFile> --fixtures <fixtures> --format json. It parses the per-rule check outcomes, computes the pass rate for ruleId (or the overall rate if ruleId is omitted), and fails when the rate is below minPassRate. Tests can inject a fake subprocess runner through the library API (AgentmdSpawnFn), so CI doesn't need an API key.
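
The pass-rate comparison can be sketched as follows. The `RuleOutcome` shape is an assumption made for illustration; the JSON that `agentmd test --format json` actually emits may use different field names. Returning a rate of 0 when no outcomes match is also an assumption.

```typescript
// Assumed shape: one entry per rule check parsed from the agentmd output.
interface RuleOutcome {
  ruleId: string;
  passed: boolean;
}

// Hypothetical helper: pass rate for one rule, or overall when ruleId is omitted.
function passRate(outcomes: RuleOutcome[], ruleId?: string): number {
  const selected = ruleId ? outcomes.filter((o) => o.ruleId === ruleId) : outcomes;
  if (selected.length === 0) return 0; // assumption: no data counts as a failure
  return selected.filter((o) => o.passed).length / selected.length;
}

// The check passes only when the rate reaches the configured floor.
function adherencePasses(outcomes: RuleOutcome[], minPassRate: number, ruleId?: string): boolean {
  return passRate(outcomes, ruleId) >= minPassRate;
}

const sampleOutcomes: RuleOutcome[] = [
  { ruleId: "H3", passed: true },
  { ruleId: "H3", passed: true },
  { ruleId: "H3", passed: true },
  { ruleId: "H3", passed: false },
  { ruleId: "H1", passed: false },
];
```

With `sampleOutcomes`, rule `H3` passes three of its four checks, so a `minPassRate` of 0.9 would fail the check while 0.75 would pass it.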

CLI

iso-eval run  examples/suites/echo-basic/eval.yml
iso-eval plan examples/suites/echo-basic/eval.yml

iso-eval run eval.yml --filter write-greeting --concurrency 2 --json
iso-eval run eval.yml --runner claude-code --harness-source ../dist
iso-eval run eval.yml --runner cursor --harness-source ../dist
iso-eval run eval.yml --runner opencode --harness-source ../dist
iso-eval run eval.yml --keep-workspaces           # skip tmpdir cleanup for debugging

run exits 0 on all-pass, 1 on any failure, 2 on invalid invocation.

--runner and --harness-source let you replay the same suite through a different packaged harness without editing eval.yml.

Real runners and harness staging

Set runner: in the YAML, or override it on the command line with --runner. harness.source is optional; when present, iso-eval copies the generated harness files it points to into each snapshotted workspace so the runner can see them.

`codex`

suite: refactor-basic
runner: codex
timeoutMs: 180000
harness:
  source: ../dist

Accepted harness.source shapes:

a project directory containing AGENTS.md and/or .codex/
a direct AGENTS.md path
a direct .codex/config.toml path
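
One way to picture staging for the three codex shapes above is the sketch below. `stageCodexHarness` is a hypothetical helper written for illustration, not the package's implementation. Identifying the direct-file shapes by file name alone is an assumption.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical staging helper: copies harness files into the trial workspace
// and returns the workspace-relative paths it staged.
function stageCodexHarness(source: string, workspaceDir: string): string[] {
  const staged: string[] = [];
  const copy = (from: string, relativeTarget: string): void => {
    const to = path.join(workspaceDir, relativeTarget);
    fs.mkdirSync(path.dirname(to), { recursive: true });
    fs.cpSync(from, to, { recursive: true });
    staged.push(relativeTarget);
  };

  if (fs.statSync(source).isDirectory()) {
    // Project directory: pick up AGENTS.md and/or .codex/ when present.
    for (const name of ["AGENTS.md", ".codex"]) {
      const candidate = path.join(source, name);
      if (fs.existsSync(candidate)) copy(candidate, name);
    }
  } else if (path.basename(source) === "AGENTS.md") {
    copy(source, "AGENTS.md");
  } else if (path.basename(source) === "config.toml") {
    copy(source, path.join(".codex", "config.toml"));
  } else {
    throw new Error(`Unsupported harness source: ${source}`);
  }
  return staged;
}
```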

`claude-code`

Accepted harness.source shapes:

a project directory containing CLAUDE.md, .claude/, and/or .mcp.json
a direct CLAUDE.md path
a direct .claude/ path
a direct .claude/settings.json path
a direct .mcp.json path

The runner shells out to claude -p --no-session-persistence and passes .mcp.json through --mcp-config when present.
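
The conditional `--mcp-config` behavior could look roughly like the sketch below. Only the flags named above come from this README. `buildClaudeArgs` is a hypothetical helper, and passing the task prompt as the argument that follows `-p` is an assumption about how the packaged runner invokes the CLI.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical argv builder for a `claude` invocation in a trial workspace.
function buildClaudeArgs(workspaceDir: string, taskPrompt: string): string[] {
  const args = ["-p", taskPrompt, "--no-session-persistence"];
  // Only forward an MCP config when one was staged into the workspace.
  const mcpConfig = path.join(workspaceDir, ".mcp.json");
  if (fs.existsSync(mcpConfig)) args.push("--mcp-config", mcpConfig);
  return args;
}
```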

`opencode`

Accepted harness.source shapes:

a project directory containing AGENTS.md, opencode.json, and/or .opencode/
a direct AGENTS.md path
a direct opencode.json path
a direct .opencode/ path

The runner shells out to opencode run --dir <workspace> and defaults to --pure so each trial stays self-contained.

`cursor`

Accepted harness.source shapes:

a project directory containing .cursor/, AGENTS.md, and/or CLAUDE.md
a direct .cursor/ path
a direct .cursor/rules/ path
a direct .cursor/rules/*.mdc path
a direct .cursor/mcp.json path
a direct AGENTS.md path
a direct CLAUDE.md path

The runner shells out to cursor-agent --print --output-format text --workspace <workspace> and stages any Cursor harness files you exported with iso-harness into the per-trial workspace first.

This lets one suite exported from iso-trace be replayed across the packaged runners with the same task prompt and checks.

Library API

import { loadSuite, run, formatReport, fakeRunner } from "@agent-pattern-labs/iso-eval";

const suite = loadSuite("./eval.yml");
const report = await run(suite, {
  runner: fakeRunner,
  concurrency: 2,
  onTaskComplete: (t) => console.log(t.id, t.passed ? "✓" : "✗"),
});
console.log(formatReport(report));
process.exit(report.passed ? 0 : 1);

Bring your own runner

The YAML runner: field selects from shipped runners; the library accepts any RunnerFn:

import type { RunnerFn } from "@agent-pattern-labs/iso-eval";

const myRunner: RunnerFn = async ({ workspaceDir, taskPrompt, timeoutMs, harnessSource }) => {
  // spawn your agent (claude -p / codex exec / …) with cwd = workspaceDir
  // optionally stage files from harnessSource before invoking it
  // return { exitCode, stdout, stderr, durationMs }
};
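
To show what the body of such a runner might contain, here is a minimal sketch that runs one command in the trial workspace and times it. The `RunnerInput` and `RunnerResult` interfaces are assumed shapes mirrored from the stub above, and the real `RunnerFn` type may carry more fields. `my-agent` is a placeholder binary name, not a real CLI.

```typescript
import { spawnSync } from "node:child_process";

// Assumed shapes, mirrored from the destructured arguments and the
// return comment in the stub above.
interface RunnerInput {
  workspaceDir: string;
  taskPrompt: string;
  timeoutMs: number;
  harnessSource?: string;
}
interface RunnerResult {
  exitCode: number;
  stdout: string;
  stderr: string;
  durationMs: number;
}

// Hypothetical helper: run one command with cwd = workspaceDir and time it.
function runInWorkspace(
  command: string,
  args: string[],
  workspaceDir: string,
  timeoutMs: number,
): RunnerResult {
  const started = Date.now();
  const child = spawnSync(command, args, {
    cwd: workspaceDir,
    timeout: timeoutMs,
    encoding: "utf8",
  });
  return {
    // A null status means the process was killed (e.g. timeout) or never started.
    exitCode: child.status ?? 1,
    stdout: child.stdout ?? "",
    stderr: child.stderr ?? "",
    durationMs: Date.now() - started,
  };
}

// "my-agent" is a placeholder; swap in the agent CLI you actually use.
const sketchRunner = async (input: RunnerInput): Promise<RunnerResult> =>
  runInWorkspace("my-agent", [input.taskPrompt], input.workspaceDir, input.timeoutMs);
```

The synchronous spawn keeps the sketch short; a runner meant for use with a concurrency setting above 1 would more likely use the asynchronous `spawn` so trials do not block each other.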

Bring your own judge (for `llm_judge` checks)

import type { JudgeFn } from "@agent-pattern-labs/iso-eval";

const judge: JudgeFn = async (prompt, output) => {
  // call your model; return true if the rule was followed
};

await run(suite, { runner: fakeRunner, judge });
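
A judge ultimately has to turn a model's free-text reply into a boolean. The sketch below shows one conservative way to do that. The local `JudgeFn` alias is an assumed signature mirrored from the snippet above, and `AskModel`, `parseVerdict`, and `makeJudge` are hypothetical names introduced for illustration.

```typescript
// Assumed signature, mirrored from the snippet above.
type JudgeFn = (prompt: string, output: string) => Promise<boolean>;

// Hypothetical: any function that sends text to a model and returns its reply.
type AskModel = (text: string) => Promise<string>;

// Count a reply as "yes" only when its first word is "yes" (case-insensitive).
function parseVerdict(reply: string): boolean {
  return /^\s*yes\b/i.test(reply);
}

// Wrap a model call into a judge that follows the yes = rule followed convention.
function makeJudge(ask: AskModel): JudgeFn {
  return async (prompt, output) => {
    const reply = await ask(`${prompt}\n\nAgent output:\n${output}\n\nAnswer yes or no.`);
    return parseVerdict(reply);
  };
}
```

Anchoring the match at the start of the reply means an answer such as "No, although yes in part" is scored as a failure, which keeps ambiguous replies from passing a check.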

How this fits the rest of the pipeline

agent.md  →  agentmd lint  →  agentmd render  →  isolint lint  →  iso-harness build
                                                                         │
                                                                         ▼
                                                          project w/ CLAUDE.md etc.
                                                                         │  iso-eval run
                                                                         ▼
                                                                per-task pass / fail

@agent-pattern-labs/agentmd measures per-rule adherence on text output (input string → output string → check).
@agent-pattern-labs/iso-eval measures task success on a real workspace (snapshot dir → agent acts → filesystem state → check).

The two compose: an iso-eval suite can include llm_judge checks that reuse the same judge convention (yes = rule followed), plus agentmd_adherence checks that fold a fixture-level adherence score into the task report.

License

MIT — see LICENSE.

Package last updated on 21 May 2026
