libeval provides the runtime and tool surface for multi-LLM coordination — an agent talks to a supervisor, a facilitator chairs a meeting, a lead drives an asynchronous discussion — plus a CLI suite that runs evals, queries the traces they produce, and edits skill files under controlled conditions.

CLIs

CLI	Purpose
`fit-eval`	Run agents in `run`/`supervise`/`facilitate`/`discuss` subcommands.
`fit-trace`	Download, query, and analyze NDJSON traces produced by `fit-eval`.
`fit-benchmark`	Run task families for N runs each and aggregate pass@k.
`fit-selfedit`	Write stdin to `.claude/**` paths, gated by settings.json + branch.

fit-eval's multi-participant subcommands (supervise, facilitate, discuss) share one orchestration loop and one async tool surface, both described below. The judge role is a profile passed to supervise.

Modes

Mode	Lead	Participants	Terminal tool
`run`	(none)	one agent	task completion
`supervise`	`supervisor`	one `agent`	`Conclude`
`facilitate`	`facilitator`	N named	`Conclude`
`discuss`	`lead`	N named	`Adjourn` or `Recess`
`judge`	`judge`	(none)	`Conclude`

run and judge are one-shot. The other three share OrchestrationLoop plus an async Ask/Answer/Announce/RollCall tool surface; the loop fans messages out over an in-memory bus and emits a {source, seq, event} NDJSON envelope for every line.

Async Ask / Answer / Announce

Ask({ question, to? })       →  { askIds: [N, …] }
Answer({ message, askId? })  →  routed to the asker
Announce({ message })        →  broadcast, no reply expected

Every Ask returns immediately and registers a pending entry keyed by an askId. The reply arrives later on the asker's inbox as [answer#N] <participant>: <text>. To broadcast, omit `to` on a multi-participant lead. Answer's askId is optional, and the handler is forgiving:

Provided + matches an ask owed by the caller → routes to that asker.
Provided but unknown or wrong addressee → isError with a pointed message.
Omitted + exactly one ask owed to the caller → auto-picks it.
Omitted + zero or more than one ask owed to the caller → broadcast as an Announce.
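The four cases above can be read as a small decision function. The sketch below is illustrative only: `routeAnswer`, the `pending` map, and the shape of the returned objects are assumptions for this example, not libeval's actual handler.

```javascript
// Hypothetical sketch of the four Answer routing cases.
// `pending` maps askId -> { from, to }; every name here is illustrative.
function routeAnswer(pending, caller, askId) {
  if (askId !== undefined) {
    const ask = pending.get(askId);
    // Provided but unknown, or owed by someone else: report an error.
    if (!ask || ask.to !== caller) {
      return { isError: true, message: `askId ${askId} is not owed by ${caller}` };
    }
    // Provided and owed by the caller: route to the asker.
    return { kind: "answer", askId, to: ask.from };
  }
  // Omitted: look at every ask the caller still owes.
  const owed = [...pending].filter(([, ask]) => ask.to === caller);
  if (owed.length === 1) {
    const [id, ask] = owed[0];
    return { kind: "answer", askId: id, to: ask.from };
  }
  // Zero or several owed asks: fall back to a broadcast.
  return { kind: "announce" };
}
```

Returning a plain object per case keeps the decision separate from delivery, so the same function could back both the tool handler and a unit test.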

Inbox lines on resume:

[ask#42]     facilitator: What is your current condition?
[answer#41]  agent-1:     We're at 7 out of 10.
[shared]     agent-2:     FYI I'm switching to Bun 1.2.
[system]     @orchestrator: You have an unanswered ask from facilitator (askId=42)…
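For tooling that post-processes these lines, a regular expression is usually enough. This is a minimal sketch that assumes the tag, sender, and text layout shown in the sample above; `parseInboxLine` is a hypothetical helper, not part of libeval.

```javascript
// Illustrative parser for the inbox line shapes shown above.
// Tag kinds and whitespace handling are inferred from the sample lines only.
const INBOX_LINE = /^\[(ask|answer|shared|system)(?:#(\d+))?\]\s+([^:]+):\s+(.*)$/;

function parseInboxLine(line) {
  const match = INBOX_LINE.exec(line);
  if (!match) return null; // not an inbox line this sketch understands
  const [, kind, id, sender, text] = match;
  return { kind, id: id === undefined ? null : Number(id), sender, text };
}
```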

Async means the lead can issue Asks, end its turn, and plan in the gap while participants work in parallel — nothing blocks the LLM thread.
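To make the non-blocking behavior concrete, here is a toy in-memory bus. It is a sketch under stated assumptions, not the real message-bus.js: the `SketchBus` class, its method names, and the choice to mint one askId per addressee on a broadcast are all hypothetical.

```javascript
// Hypothetical in-memory bus: per-participant queues plus a pending-ask registry.
class SketchBus {
  constructor(names) {
    this.queues = new Map(names.map((name) => [name, []]));
    this.pending = new Map(); // askId -> { from, to }
    this.nextAskId = 1;
  }

  // Returns immediately with the askIds; nothing waits for a reply.
  ask(from, question, to) {
    const targets = to ? [to] : [...this.queues.keys()].filter((n) => n !== from);
    const askIds = targets.map((target) => {
      const id = this.nextAskId++;
      this.pending.set(id, { from, to: target });
      this.queues.get(target).push(`[ask#${id}] ${from}: ${question}`);
      return id;
    });
    return { askIds };
  }

  // Delivers the reply to the asker's queue and clears the pending entry.
  answer(from, askId, message) {
    const ask = this.pending.get(askId);
    if (!ask) return false;
    this.pending.delete(askId);
    this.queues.get(ask.from).push(`[answer#${askId}] ${from}: ${message}`);
    return true;
  }

  // Hands over everything queued for a participant and empties the queue.
  drain(name) {
    return this.queues.get(name).splice(0);
  }
}
```

Because `ask` only enqueues and records, the lead can end its turn right after calling it and pick up answers on a later drain.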

Discuss-mode replies

In discussion mode, Answer calls routed to the lead are streamed to the discussion thread as they are produced — each agent's Answer becomes a separate reply posted immediately, not batched at session end. The lead and agents can also call Acknowledge to post brief messages directly to the thread (status updates, human follow-up responses). The message bus intercepts answers and appends them to ctx.replies[].

RequestForComment is a separate coordination tool available on agent roles (facilitated agents and discuss agents). It queues an intent to open a new Discussion thread for long-horizon coordination on open questions; these are accumulated in ctx.rfcs[], separate from the thread replies in ctx.replies[].

Orchestration loop

Each participant drains the bus (or waits for messages), then runs or resumes the LLM with the drained messages as tagged lines. If an Ask owed by that participant is still unanswered, the loop injects one synthetic reminder; if it remains unanswered after that, the loop emits protocol_violation and unblocks the asker with a synthetic null answer.
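The reminder-then-violation escalation can be sketched as a two-step function. The `emit` and `deliver` callbacks, the `reminded` flag, and the payload shapes are assumptions made for this example; only the reminder, protocol_violation, and null-answer behavior come from the description above.

```javascript
// Hypothetical per-ask escalation: one reminder first, then a violation.
function escalateUnanswered(ask, emit, deliver) {
  if (!ask.reminded) {
    ask.reminded = true;
    deliver(
      ask.to,
      `[system] @orchestrator: You have an unanswered ask from ${ask.from} (askId=${ask.id})`,
    );
    return "reminded";
  }
  // Second miss: record the violation in the trace...
  emit({ type: "protocol_violation", askId: ask.id, participant: ask.to });
  // ...and unblock the asker with a synthetic null answer.
  deliver(ask.from, { askId: ask.id, answer: null });
  return "violated";
}
```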

Termination uses two flags. ctx.concluded is set by an explicit Conclude, Adjourn, or Recess, which also cancels in-flight Asks so askers see why their question won't be answered. stopped is broader: it also covers a lead error, an agent crash, and the abort path. Loops watch stopped; ctx.concluded only feeds the summary's success/verdict.
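One way to picture the two flags is the sketch below. Only ctx.concluded and stopped are taken from the text; the state shape, the `verdict` and `stopReason` fields, and the function names are hypothetical, and the sketch assumes that concluding also sets stopped.

```javascript
// Hypothetical session state for illustrating the two termination flags.
function createState() {
  return {
    ctx: { concluded: false, verdict: null },
    stopped: false,
    stopReason: null,
    pending: new Map(), // askId -> { from, to }
  };
}

// Explicit Conclude/Adjourn/Recess: set both flags and cancel in-flight Asks.
function conclude(state, tool, verdict) {
  state.ctx.concluded = true;
  state.ctx.verdict = verdict;
  state.stopped = true;
  state.stopReason = tool;
  const cancelled = [...state.pending.keys()];
  state.pending.clear();
  return cancelled; // askIds whose askers should be told no answer is coming
}

// Lead error, agent crash, or abort: stop the loops without concluding.
function abort(state, reason) {
  state.stopped = true;
  state.stopReason = reason;
}

// Loops check only `stopped`.
function shouldContinue(state) {
  return !state.stopped;
}
```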

Tool surface, by role

Role	Ask	Answer	Announce	RollCall	Conclude	Other
Facilitator	✓	✓	✓	✓	✓
Fac. agent	✓	✓	✓	✓		`RequestForComment`
Supervisor	✓	✓	✓	✓	✓
Sup. agent	✓	✓	✓	✓
Discuss lead	✓	✓	✓	✓		`Recess`, `Adjourn`, `Acknowledge`
Discuss agt	✓	✓	✓	✓		`RequestForComment`, `Acknowledge`
Judge					✓

Ask's `to` parameter accepts a participant name on multi-participant roles (facilitator, discuss lead, all participants). The supervise pair has only one possible target, so `to` is rejected there.

Minimal example: two-participant facilitator

import { createFacilitator, createRedactor } from "@forwardimpact/libeval";
import { query } from "@anthropic-ai/claude-agent-sdk";

const facilitator = createFacilitator({
  facilitatorCwd: process.cwd(),
  agentConfigs: [
    { name: "alice", role: "explorer", agentProfile: "alice" },
    { name: "bob",   role: "tester",   agentProfile: "bob" },
  ],
  query,
  output: process.stdout,
  redactor: createRedactor(),
  facilitatorProfile: "improvement-coach",
});

const result = await facilitator.run("Run a kata storyboard meeting.");
// result.success / result.turns / NDJSON trace on process.stdout

The facilitator gets Ask/Answer/Announce/RollCall/Conclude; each agent gets the same set minus Conclude, plus RequestForComment. Every tool call, bus message, and orchestrator event becomes one trace line.

Trace format and redaction

Each line is { "source": "<participant|orchestrator>", "seq": N, "event": {…} }. seq is monotonic across the whole trace; orchestrator emits session_start, agent_start, protocol_violation, lead_turn_limit, and summary. event is the SDK event verbatim or the orchestrator payload. fit-trace consumes this format.
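A trace in this shape can be loaded with a few lines of plain JavaScript. The sketch below is not fit-trace; `parseTrace` and `countBySource` are hypothetical helpers, and treating "monotonic" as strictly increasing is an assumption.

```javascript
// Parse NDJSON trace text into envelopes and check that seq keeps increasing.
function parseTrace(ndjson) {
  const envelopes = ndjson
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
  for (let i = 1; i < envelopes.length; i++) {
    // Assumption: seq is strictly increasing across the whole trace.
    if (envelopes[i].seq <= envelopes[i - 1].seq) {
      throw new Error(`seq went from ${envelopes[i - 1].seq} to ${envelopes[i].seq}`);
    }
  }
  return envelopes;
}

// Count trace lines per source (participant name or "orchestrator").
function countBySource(envelopes) {
  const counts = {};
  for (const { source } of envelopes) {
    counts[source] = (counts[source] ?? 0) + 1;
  }
  return counts;
}
```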

Redaction is on by default for fit-eval run/supervise/facilitate and composes two layers:

Env-var allowlist — ANTHROPIC_API_KEY, GH_TOKEN, GITHUB_TOKEN by default; override with LIBEVAL_REDACTION_ENV_VARS=NAME1,… (replaces, not extends). Runtime values become [REDACTED:env:NAME] everywhere they appear.
Credential-shape patterns — sk-ant-, ghp_, ghs_, gho_, github_pat_. Hits become [REDACTED:pattern:KIND].

Set LIBEVAL_REDACTION_DISABLED=1 to disable redaction (one stderr warning per run). Never disable it on CI for a public repo: workflow artifacts stay downloadable for the whole retention period.
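The two redaction layers described above compose roughly as follows. This is a sketch, not redaction.js: the function names, the regular expressions, and the pattern kind labels (`anthropic`, `github`) are assumptions chosen for illustration.

```javascript
const DEFAULT_ENV_VARS = ["ANTHROPIC_API_KEY", "GH_TOKEN", "GITHUB_TOKEN"];

// Hypothetical kind labels and regexes for the credential prefixes listed above.
const PATTERNS = {
  anthropic: /sk-ant-[A-Za-z0-9_-]+/g,
  github: /(?:ghp_|ghs_|gho_|github_pat_)[A-Za-z0-9_]+/g,
};

// LIBEVAL_REDACTION_ENV_VARS replaces the default list rather than extending it.
function allowlistFromEnv(env) {
  const override = env.LIBEVAL_REDACTION_ENV_VARS;
  if (!override) return DEFAULT_ENV_VARS;
  return override.split(",").map((name) => name.trim()).filter(Boolean);
}

function redact(text, env) {
  let out = text;
  // Layer 1: replace the runtime value of each allowlisted variable.
  for (const name of allowlistFromEnv(env)) {
    const value = env[name];
    if (value) out = out.split(value).join(`[REDACTED:env:${name}]`);
  }
  // Layer 2: replace anything that looks like a known credential shape.
  for (const [kind, pattern] of Object.entries(PATTERNS)) {
    out = out.replace(pattern, `[REDACTED:pattern:${kind}]`);
  }
  return out;
}
```

Running the env-var layer first means a known secret is labeled by name even when it also matches a credential shape.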

Module map

Module	Purpose
`agent-runner.js`	One Claude Agent SDK session; emits NDJSON via the redactor.
`message-bus.js`	Per-participant queues + `waitForMessages` Promise wakeup.
`orchestration-toolkit.js`	Shared Ask/Answer/Announce/Conclude/RollCall/RequestForComment handlers + builders.
`orchestration-loop.js`	Unified lead+participant loop; reminder/violation handling.
`facilitator.js` / `supervisor.js` / `discusser.js` / `judge.js`	Per-mode class + factory + system prompt.
`discuss-tools.js`	Discuss-only `Recess`/`Adjourn`/`Acknowledge`.
`reply-emitter.js`	Fire-and-forget POST of reply/ack events to the callback URL.
`inbox-poller.js`	Long-poll the bridge inbox for injected human messages.
`trace-collector.js` / `trace-query.js` / `trace-github.js`	Trace ingestion / querying / GitHub-attachment helpers.
`redaction.js`	Env-var allowlist + credential-shape pattern redaction.

fit-selfedit

A narrow, audited bypass for sessions where Edit/Write (and bash writes) are blocked against paths the project's own allowlist permits. It reads stdin, writes the target, and exits 0 on success, 2 on a safeguard violation, or 1 on an I/O error.

echo "<content>" | bunx fit-selfedit <path>

Two safeguards, checked in order:

1. Settings-allow. Walk upward from the target with Finder.findUpward to find the nearest .claude/settings.json. The target path, taken relative to the settings file's grandparent directory (the directory that contains .claude/), must match at least one Edit(<glob>) rule in permissions.allow[] (matched with minimatch, dot: true). settings.json is the single source of truth: widen the project allowlist and the CLI follows. Traversal like .claude/../README.md is rejected as a side effect: path.resolve collapses .. first, then the resolved path is tested against the rules.
2. Branch scope. git rev-parse --abbrev-ref HEAD must not be HEAD (detached) or main. Edits ride a feature branch through whatever merge gates the project has configured.

Failure messages name the safeguard that rejected; safeguard 1 also lists the Edit() rules that were tried.
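The order of the two checks and their exit codes can be sketched in plain JavaScript. Everything named here (`editRules`, `checkSafeguards`, the result fields) is hypothetical, and the glob matcher is injected as `matches` so the sketch stays dependency-free; the text above says the real CLI uses minimatch with dot: true.

```javascript
// Pull the glob out of each Edit(<glob>) entry in permissions.allow[].
function editRules(settings) {
  const allow = settings?.permissions?.allow ?? [];
  return allow
    .map((rule) => /^Edit\((.+)\)$/.exec(rule))
    .filter(Boolean)
    .map((match) => match[1]);
}

// Apply the two safeguards in order; exit code 2 means a safeguard rejected.
// `matches(path, glob)` is supplied by the caller (minimatch in practice).
function checkSafeguards({ settings, relativeTarget, branch, matches }) {
  const rules = editRules(settings);
  if (!rules.some((glob) => matches(relativeTarget, glob))) {
    return { exitCode: 2, safeguard: "settings-allow", rulesTried: rules };
  }
  if (branch === "HEAD" || branch === "main") {
    return { exitCode: 2, safeguard: "branch-scope" };
  }
  return { exitCode: 0 };
}
```

Returning the rules that were tried alongside the failure mirrors the failure message behavior described above.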

Documentation

Agent Evaluations Guide — how to run an eval and read its trace.
Agent Collaboration Guide — supervise / facilitate / discuss in depth.
Trace Analysis Guide — analysing NDJSON traces with fit-trace.

Package last updated on 10 Jun 2026
