
@agent-pattern-labs/iso-eval
Behavioral eval runner for AI coding agents — snapshot a workspace, hand it to a runner with a task prompt, score the resulting filesystem/git state.
agentmd lints prompt structure, isolint lints prompt prose,
iso-harness fans out the compiled source into every harness file layout.
None of them answer the next question: did the agent actually do the
task? That's what @agent-pattern-labs/iso-eval scores.
You give it a suite of tasks — each with a baseline workspace, a prompt, and a set of checks — and it snapshots the workspace per trial, hands it to a runner, then verifies the resulting filesystem / command state against your checks.
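The per-trial snapshot step can be pictured with a minimal sketch. This is not iso-eval's implementation: `snapshotWorkspace` and `cleanupWorkspace` are hypothetical helper names, and the sketch only assumes that a snapshot is a recursive copy of the baseline directory into a fresh temp directory.

```typescript
import { cpSync, existsSync, mkdtempSync, readFileSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: copy a baseline directory into a fresh temp directory
// so each trial starts from the same state and cannot touch the baseline.
function snapshotWorkspace(baselineDir: string, prefix = "iso-eval-sketch-"): string {
  const trialDir = mkdtempSync(join(tmpdir(), prefix));
  cpSync(baselineDir, trialDir, { recursive: true });
  return trialDir;
}

// Hypothetical helper: remove the per-trial copy once checks have run.
function cleanupWorkspace(trialDir: string): void {
  rmSync(trialDir, { recursive: true, force: true });
}

// Usage: build a throwaway baseline, snapshot it, and mutate only the copy.
const baseline = mkdtempSync(join(tmpdir(), "baseline-"));
writeFileSync(join(baseline, "README.md"), "baseline\n");

const trial = snapshotWorkspace(baseline);
writeFileSync(join(trial, "greeting.txt"), "hello\n");

const baselineUntouched = !existsSync(join(baseline, "greeting.txt"));
const copiedReadme = readFileSync(join(trial, "README.md"), "utf8");

cleanupWorkspace(trial);
cleanupWorkspace(baseline);
```

Because the agent only ever sees the copy, a trial that deletes or corrupts files leaves the baseline ready for the next trial.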
Built-in runners today:
- fake: deterministic CI/offline runner that executes $ ... lines from the prompt as shell in the snapshotted workspace.
- codex: real-agent runner that shells out to codex exec in the per-trial workspace and captures the final assistant message.
- claude-code: real-agent runner that shells out to claude -p in the per-trial workspace.
- cursor: real-agent runner that shells out to cursor-agent --print in the per-trial workspace.
- opencode: real-agent runner that shells out to opencode run in the per-trial workspace.

The library API still accepts any RunnerFn, so you can plug in other
harnesses without waiting on a packaged runner.
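To make the fake runner's behavior concrete, here is a rough sketch of how `$ ...` lines might be pulled out of a prompt. The exact parsing rules are not documented above, so `extractShellLines` is a hypothetical helper that assumes a command is any line whose first non-blank characters are `$ `.

```typescript
// Hypothetical helper: collect shell commands from a prompt by taking every
// line that starts with "$ " (after optional leading whitespace).
function extractShellLines(prompt: string): string[] {
  return prompt
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.startsWith("$ "))
    .map((line) => line.slice(2).trim())
    .filter((cmd) => cmd.length > 0);
}

const samplePrompt = [
  "Create a greeting file.",
  "$ echo hello > greeting.txt",
  "Then confirm it exists.",
  "$ test -f greeting.txt",
].join("\n");

// Only the two "$ " lines survive; the prose lines are ignored.
const commands = extractShellLines(samplePrompt);
```

A prompt written this way doubles as a deterministic script, which is what makes the fake runner usable in CI without an agent or API key.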
npm install -D @agent-pattern-labs/iso-eval
# eval.yml
suite: refactor-basic
runner: fake # fake | codex | claude-code | cursor | opencode
timeoutMs: 120000
harness:
  source: ../dist # optional: stage generated harness files into each trial
tasks:
  - id: write-greeting
    prompt: tasks/write-greeting.md # path (relative to eval.yml) or inline
    workspace: workspace/ # baseline dir, copied per-trial into tmpdir
    trials: 1
    checks:
      - { type: file_exists, path: greeting.txt }
      - { type: file_contains, path: greeting.txt, value: "hello" }
      - { type: file_not_contains, path: greeting.txt, value: "TODO" }
      - { type: command, run: "test -f greeting.txt", expectExit: 0 }
| type | asserts |
|---|---|
| command | shell command exits with expectExit (default 0); optional stdout contains/matches |
| file_exists | file at path exists in the workspace |
| file_contains | file at path contains the literal substring value |
| file_not_contains | file at path does NOT contain value |
| file_matches | file at path matches the regex matches |
| llm_judge | a user-supplied JudgeFn answers yes to prompt against runner stdout/stderr |
| agentmd_adherence | per-rule pass rate from agentmd test meets minPassRate; optional ruleId filter |
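As an illustration of the file-based rows in the table, the sketch below evaluates the four file checks against a directory. The `FileCheck` type and `evaluateFileCheck` function are assumptions made for this example (field names follow the YAML shown earlier), and treating a missing file as a failure for the content checks is also an assumption rather than documented behavior.

```typescript
import { existsSync, mkdtempSync, readFileSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Assumed shapes for illustration; field names mirror the eval.yml example.
type FileCheck =
  | { type: "file_exists"; path: string }
  | { type: "file_contains"; path: string; value: string }
  | { type: "file_not_contains"; path: string; value: string }
  | { type: "file_matches"; path: string; matches: string };

function evaluateFileCheck(workspaceDir: string, check: FileCheck): boolean {
  const target = join(workspaceDir, check.path);
  if (check.type === "file_exists") return existsSync(target);
  // Assumption: content checks fail when the file is missing.
  if (!existsSync(target)) return false;
  const text = readFileSync(target, "utf8");
  if (check.type === "file_contains") return text.includes(check.value);
  if (check.type === "file_not_contains") return !text.includes(check.value);
  return new RegExp(check.matches).test(text);
}

// Usage: a throwaway workspace holding the file the example task should create.
const workspace = mkdtempSync(join(tmpdir(), "checks-sketch-"));
writeFileSync(join(workspace, "greeting.txt"), "hello world\n");

const checks: FileCheck[] = [
  { type: "file_exists", path: "greeting.txt" },
  { type: "file_contains", path: "greeting.txt", value: "hello" },
  { type: "file_not_contains", path: "greeting.txt", value: "TODO" },
  { type: "file_matches", path: "greeting.txt", matches: "^hello" },
];
const results = checks.map((check) => evaluateFileCheck(workspace, check));
const missing = evaluateFileCheck(workspace, { type: "file_exists", path: "absent.txt" });

rmSync(workspace, { recursive: true, force: true });
```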
agentmd_adherence
- type: agentmd_adherence
  promptFile: ../agent.md # path to agentmd source (relative to eval.yml)
  fixtures: ../fixtures.yml # path to agentmd fixture file
  ruleId: H3 # optional: score only this rule
  minPassRate: 0.9 # required: pass rate floor in [0, 1]
  via: claude-code # optional: default claude-code (api | claude-code | fake)
  model: claude-haiku-4-5 # optional: forwarded as --model
  timeoutMs: 180000 # optional: subprocess timeout
Shells out to the agentmd CLI (bundled as a runtime dependency) via
agentmd test <promptFile> --fixtures <fixtures> --format json, parses
the per-rule check outcomes, computes the pass rate for ruleId (or
overall if omitted), and fails the check when the rate is below
minPassRate. Tests can inject a fake subprocess runner via the
library API (AgentmdSpawnFn) so CI doesn't need an API key.
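The pass-rate comparison described above reduces to a small calculation. The sketch below uses an assumed `RuleOutcome` shape; the JSON that agentmd test actually emits may use different field names, and returning 0 when no outcomes match is a choice made for this example.

```typescript
// Assumed outcome shape; the real agentmd JSON report may differ.
interface RuleOutcome {
  ruleId: string;
  passed: boolean;
}

// Pass rate for one rule, or across all rules when ruleId is omitted.
function passRate(outcomes: RuleOutcome[], ruleId?: string): number {
  const relevant = ruleId ? outcomes.filter((o) => o.ruleId === ruleId) : outcomes;
  // Assumption: no matching outcomes counts as a rate of 0, not a pass.
  if (relevant.length === 0) return 0;
  return relevant.filter((o) => o.passed).length / relevant.length;
}

// The check passes only when the rate reaches the configured floor.
function meetsFloor(outcomes: RuleOutcome[], minPassRate: number, ruleId?: string): boolean {
  return passRate(outcomes, ruleId) >= minPassRate;
}

const outcomes: RuleOutcome[] = [
  { ruleId: "H3", passed: true },
  { ruleId: "H3", passed: true },
  { ruleId: "H3", passed: false },
  { ruleId: "H1", passed: true },
];

const overallRate = passRate(outcomes); // 3 of 4 outcomes passed
const h3Rate = passRate(outcomes, "H3"); // 2 of 3 outcomes passed
const overallOk = meetsFloor(outcomes, 0.9);
const h1Ok = meetsFloor(outcomes, 0.9, "H1");
```

Filtering by ruleId lets a suite gate on one critical rule while tolerating noise in the others.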
iso-eval run examples/suites/echo-basic/eval.yml
iso-eval plan examples/suites/echo-basic/eval.yml
iso-eval run eval.yml --filter write-greeting --concurrency 2 --json
iso-eval run eval.yml --runner claude-code --harness-source ../dist
iso-eval run eval.yml --runner cursor --harness-source ../dist
iso-eval run eval.yml --runner opencode --harness-source ../dist
iso-eval run eval.yml --keep-workspaces # skip tmpdir cleanup for debugging
run exits 0 on all-pass, 1 on any failure, 2 on invalid invocation.
--runner and --harness-source let you replay the same suite through a
different packaged harness without rewriting the suite's checks.
Set runner: in YAML, or override it at the CLI with --runner.
harness.source is optional; when present, iso-eval stages the generated
harness files you want the runner to see into each snapshotted workspace.
codex
suite: refactor-basic
runner: codex
timeoutMs: 180000
harness:
  source: ../dist
Accepted harness.source shapes:
- AGENTS.md and/or .codex/
- AGENTS.md path
- .codex/config.toml path

claude-code
Accepted harness.source shapes:
- CLAUDE.md, .claude/, and/or .mcp.json
- CLAUDE.md path
- .claude/ path
- .claude/settings.json path
- .mcp.json path

The runner shells out to claude -p --no-session-persistence and passes
.mcp.json through --mcp-config when present.

opencode
Accepted harness.source shapes:
- AGENTS.md, opencode.json, and/or .opencode/
- AGENTS.md path
- opencode.json path
- .opencode/ path

The runner shells out to opencode run --dir <workspace> and defaults to
--pure so each trial stays self-contained.

cursor
Accepted harness.source shapes:
- .cursor/, AGENTS.md, and/or CLAUDE.md
- .cursor/ path
- .cursor/rules/ path
- .cursor/rules/*.mdc path
- .cursor/mcp.json path
- AGENTS.md path
- CLAUDE.md path

The runner shells out to cursor-agent --print --output-format text --workspace <workspace> and stages any Cursor harness files you exported
with iso-harness into the per-trial workspace first.
This lets one suite exported from iso-trace be replayed across the
packaged runners with the same task prompt and checks.
import { loadSuite, run, formatReport, fakeRunner } from "@agent-pattern-labs/iso-eval";
const suite = loadSuite("./eval.yml");
const report = await run(suite, {
  runner: fakeRunner,
  concurrency: 2,
  onTaskComplete: (t) => console.log(t.id, t.passed ? "✓" : "✗"),
});
console.log(formatReport(report));
process.exit(report.passed ? 0 : 1);
The YAML runner: field selects from shipped runners; the library
accepts any RunnerFn:
import type { RunnerFn } from "@agent-pattern-labs/iso-eval";
const myRunner: RunnerFn = async ({ workspaceDir, taskPrompt, timeoutMs, harnessSource }) => {
  // spawn your agent (claude -p / codex exec / …) with cwd = workspaceDir
  // optionally stage files from harnessSource before invoking it
  // return { exitCode, stdout, stderr, durationMs }
};
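One way to fill in that skeleton is sketched below. The `RunnerInput` and `RunnerResult` interfaces are local stand-ins inferred from the fields named above, not the package's exported types, and `runCli` is a hypothetical helper. It uses spawnSync for brevity, which blocks the event loop; a runner meant to run with concurrency greater than 1 would likely use the asynchronous spawn instead.

```typescript
import { spawnSync } from "node:child_process";

// Local stand-ins for the shapes shown above; the package's own types may differ.
interface RunnerInput {
  workspaceDir: string;
  taskPrompt: string;
  timeoutMs: number;
  harnessSource?: string;
}
interface RunnerResult {
  exitCode: number;
  stdout: string;
  stderr: string;
  durationMs: number;
}

// Hypothetical helper: run a CLI with the prompt as its last argument,
// using the per-trial workspace as the working directory.
function runCli(command: string, args: string[], input: RunnerInput): RunnerResult {
  const started = Date.now();
  const result = spawnSync(command, [...args, input.taskPrompt], {
    cwd: input.workspaceDir,
    timeout: input.timeoutMs,
    encoding: "utf8",
  });
  return {
    // status is null when the process was killed (e.g. on timeout) or never started.
    exitCode: result.status ?? 1,
    stdout: result.stdout || "",
    stderr: result.stderr || (result.error ? String(result.error) : ""),
    durationMs: Date.now() - started,
  };
}

// A runner in the same shape as myRunner; Node stands in for a real agent CLI
// and simply echoes the prompt back on stdout.
const echoScript = "process.stdout.write(process.argv[process.argv.length - 1])";
const echoRunner = async (input: RunnerInput): Promise<RunnerResult> =>
  runCli(process.execPath, ["-e", echoScript], input);

// Synchronous smoke run of the helper itself.
const sample = runCli(process.execPath, ["-e", echoScript], {
  workspaceDir: process.cwd(),
  taskPrompt: "say hello",
  timeoutMs: 10_000,
});
```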
llm_judge checks
import type { JudgeFn } from "@agent-pattern-labs/iso-eval";
const judge: JudgeFn = async (prompt, output) => {
  // call your model; return true if the rule was followed
};
await run(suite, { runner: fakeRunner, judge });
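A judge usually has two parts: asking a model, then turning its reply into a boolean. The sketch below keeps the model behind a hypothetical `AskModel` callback so no particular SDK is assumed. `parseVerdict`, `makeJudge`, and the question wording are illustrative, and the local `JudgeFn` alias only mirrors the `(prompt, output)` signature used above.

```typescript
// Local alias mirroring the (prompt, output) => boolean signature shown above.
type JudgeFn = (prompt: string, output: string) => Promise<boolean>;

// Hypothetical model client: takes a question, resolves to the model's reply text.
type AskModel = (question: string) => Promise<string>;

// Treat a reply as "yes" only when it begins with the whole word "yes".
function parseVerdict(reply: string): boolean {
  return /^\s*yes\b/i.test(reply);
}

// Build a judge from any model client, following the yes = rule followed convention.
function makeJudge(ask: AskModel): JudgeFn {
  return async (prompt, output) => {
    const question = `${prompt}\n\nAgent output:\n${output}\n\nAnswer yes or no.`;
    return parseVerdict(await ask(question));
  };
}

// Offline stand-in for a model, useful in tests that must not call an API.
const offlineJudge = makeJudge(async (question) =>
  question.includes("greeting.txt") ? "Yes." : "No.",
);

const verdicts = [
  parseVerdict("Yes, the rule was followed."),
  parseVerdict("No."),
  parseVerdict("yesterday it worked"),
];
```

Keeping the parsing separate from the model call makes the yes/no convention easy to unit-test without network access.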
agent.md → agentmd lint → agentmd render → isolint lint → iso-harness build
                                                                  │
                                                                  ▼
                                                      project w/ CLAUDE.md etc.
                                                                  │  iso-eval run
                                                                  ▼
                                                        per-task pass / fail
- @agent-pattern-labs/agentmd measures per-rule adherence on text output (input string → output string → check).
- @agent-pattern-labs/iso-eval measures task success on a real workspace (snapshot dir → agent acts → filesystem state → check).

The two compose: an iso-eval suite can include llm_judge checks that
reuse the same judge convention (yes = rule followed), plus
agentmd_adherence checks that fold a fixture-level adherence score into
the task report.
MIT — see LICENSE.