@forwardimpact/libeval

Agent evaluation framework — prove whether agent changes improved outcomes with reproducible evidence.

Source: npm
Version: 0.1.48
Weekly downloads: 504 (-49.3%)
Maintainers: 1

libeval

libeval provides the runtime and tool surface for multi-LLM coordination — an agent talks to a supervisor, a facilitator chairs a meeting, a lead drives an asynchronous discussion — plus a CLI suite that runs evals, queries the traces they produce, and edits skill files under controlled conditions.

CLIs

CLI            Purpose
fit-eval       Run agents in run/supervise/facilitate/discuss subcommands.
fit-trace      Download, query, and analyze NDJSON traces produced by fit-eval.
fit-benchmark  Run task families for N runs each and aggregate pass@k.
fit-selfedit   Write stdin to .claude/** paths, gated by settings.json + branch.

fit-eval's supervise, facilitate, and discuss subcommands share one orchestration loop and one async tool surface, both described below. The judge role is a profile passed to supervise.

Modes

Mode        Lead         Participants  Terminal tool
run         (none)       one agent     task completion
supervise   supervisor   one agent     Conclude
facilitate  facilitator  N named       Conclude
discuss     lead         N named       Adjourn or Recess
judge       judge        (none)        Conclude

run and judge are one-shot. The other three share OrchestrationLoop plus an async Ask/Answer/Announce/RollCall tool surface; the loop fans messages out over an in-memory bus and emits a {source, seq, event} NDJSON envelope for every line.
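
As an illustration of the two mechanisms named above, here is a minimal sketch of an in-memory bus with per-participant queues and a Promise wakeup, plus an emitter that stamps every line with one shared seq counter. The names SketchBus, send, drain, and makeEmitter are hypothetical, and starting seq at 1 is an assumption; waitForMessages is named in the module map, but its real signature may differ.

```javascript
// Hypothetical in-memory bus: one queue per participant plus a Promise wakeup.
class SketchBus {
  constructor(names) {
    this.queues = new Map(names.map((name) => [name, []]));
    this.waiters = new Map(); // name -> resolve function of a pending wait
  }

  // Deliver to one participant, or fan out to everyone except the sender.
  send(from, text, to) {
    const targets = to ? [to] : [...this.queues.keys()].filter((n) => n !== from);
    for (const name of targets) {
      this.queues.get(name).push({ from, text });
      const wake = this.waiters.get(name);
      if (wake) {
        this.waiters.delete(name);
        wake();
      }
    }
  }

  // Take everything currently queued for a participant.
  drain(name) {
    return this.queues.get(name).splice(0);
  }

  // Resolve immediately if mail is waiting, otherwise on the next send().
  waitForMessages(name) {
    if (this.queues.get(name).length > 0) return Promise.resolve();
    return new Promise((resolve) => this.waiters.set(name, resolve));
  }
}

// One shared counter, so seq increases across all sources (assumed to start at 1).
function makeEmitter(write) {
  let seq = 0;
  return (source, event) => {
    seq += 1;
    write(JSON.stringify({ source, seq, event }) + "\n");
  };
}
```

A lead that calls send without a target reaches every other participant, and each participant's loop can await waitForMessages before draining its own queue.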

Async Ask / Answer / Announce

Ask({ question, to? })       →  { askIds: [N, …] }
Answer({ message, askId? })  →  routed to the asker
Announce({ message })        →  broadcast, no reply expected

Every Ask returns immediately and registers a pending entry keyed by an askId. The reply arrives later in the asker's inbox as [answer#N] <participant>: <text>. To broadcast from a multi-participant lead, omit the to argument. Answer's askId is optional, and the handler is forgiving:

  • Provided + matches an ask owed by the caller → routes to that asker.
  • Provided but unknown or wrong addressee → isError with a pointed message.
  • Omitted + exactly one ask owed to the caller → auto-picks it.
  • Omitted + 0 or many asks owed → broadcasts as Announce.
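
The four Answer rules above can be sketched as a small pending-ask registry. This is an illustration under stated assumptions, not the library's handler: createAskRegistry, the return shapes (routedTo, broadcast), and the error wording are all hypothetical.

```javascript
// Hypothetical pending-ask registry; names and return shapes are illustrative.
function createAskRegistry() {
  let nextId = 1;
  const pending = new Map(); // askId -> { asker, addressee }

  // Ask returns at once with one askId per addressee.
  function ask(asker, addressees) {
    const askIds = addressees.map((addressee) => {
      const id = nextId++;
      pending.set(id, { asker, addressee });
      return id;
    });
    return { askIds };
  }

  // Apply the four Answer rules on behalf of `caller`.
  function answer(caller, askId) {
    if (askId !== undefined) {
      const entry = pending.get(askId);
      if (!entry || entry.addressee !== caller) {
        // Unknown id, or an ask addressed to someone else.
        return { isError: true, message: `askId ${askId} is not owed by ${caller}` };
      }
      pending.delete(askId);
      return { routedTo: entry.asker, askId };
    }
    const owed = [...pending].filter(([, entry]) => entry.addressee === caller);
    if (owed.length === 1) {
      const [id, entry] = owed[0];
      pending.delete(id);
      return { routedTo: entry.asker, askId: id }; // auto-picked
    }
    return { broadcast: true }; // zero or many owed: treated as Announce
  }

  return { ask, answer };
}

const registry = createAskRegistry();
const { askIds } = registry.ask("facilitator", ["alice", "bob"]);
const autoPicked = registry.answer("alice");         // exactly one owed
const wrongOwner = registry.answer("alice", askIds[1]); // owed by bob, not alice
```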

Inbox lines on resume:

[ask#42]     facilitator: What is your current condition?
[answer#41]  agent-1:     We're at 7 out of 10.
[shared]     agent-2:     FYI I'm switching to Bun 1.2.
[system]     @orchestrator: You have an unanswered ask from facilitator (askId=42)…
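
For tooling that needs to read these lines back, a small parser might look like the sketch below. It assumes the tag set is exactly the four shown above (ask, answer, shared, system) and that sender names contain no whitespace; parseInboxLine is a hypothetical helper, not part of the package.

```javascript
// Hypothetical parser for the tagged inbox lines shown above.
const INBOX_LINE = /^\[(ask|answer|shared|system)(?:#(\d+))?\]\s+(\S+):\s+(.*)$/;

function parseInboxLine(line) {
  const match = INBOX_LINE.exec(line);
  if (!match) return null; // not an inbox line
  const [, kind, id, from, text] = match;
  return { kind, id: id === undefined ? null : Number(id), from, text };
}

const parsedAsk = parseInboxLine("[ask#42]     facilitator: What is your current condition?");
const parsedShared = parseInboxLine("[shared]     agent-2:     FYI I'm switching to Bun 1.2.");
```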

Async means the lead can issue Asks, end its turn, and plan in the gap while participants work in parallel — nothing blocks the LLM thread.

Discuss-mode replies

In discussion mode, Answer calls routed to the lead are captured as thread replies delivered via the bridge callback. The lead delegates work via Ask; each agent's Answer becomes a separate reply posted to the discussion thread. No explicit reply tool is needed on the lead surface — the message bus intercepts answers and appends them to ctx.replies[].

RequestForComment is a separate coordination tool available on agent roles (facilitated agents and discuss agents). It queues an intent to open a new Discussion thread for long-horizon coordination on open questions; these are accumulated in ctx.rfcs[], separate from the thread replies in ctx.replies[].

Orchestration loop

Each participant drains the bus (or waits for messages), then runs or resumes the LLM with the drained messages as tagged lines. If the participant still owes an answer to an Ask, the loop injects one synthetic reminder; if the Ask remains unanswered after that, it emits protocol_violation and unblocks the asker with a synthetic null answer.
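
The reminder-then-violation escalation can be sketched as a per-ask step function. This is a sketch under assumptions: escalateUnanswered, the reminded flag, and the payload fields of the violation event are hypothetical; only the event name protocol_violation and the reminder wording come from this README.

```javascript
// Hypothetical escalation for one unanswered ask: remind once, then violate.
function escalateUnanswered(ask) {
  if (!ask.reminded) {
    ask.reminded = true;
    return {
      action: "remind",
      line: `[system] @orchestrator: You have an unanswered ask from ${ask.asker} (askId=${ask.id})`,
    };
  }
  return {
    action: "violate",
    // Payload fields other than the event name are assumptions.
    event: { type: "protocol_violation", askId: ask.id, participant: ask.addressee },
    // The asker is unblocked with a synthetic null answer.
    answerToAsker: { askId: ask.id, from: ask.addressee, message: null },
  };
}

const owedAsk = { id: 42, asker: "facilitator", addressee: "agent-1", reminded: false };
const firstStep = escalateUnanswered(owedAsk);  // reminder
const secondStep = escalateUnanswered(owedAsk); // violation + null answer
```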

Termination uses two flags. ctx.concluded records an explicit Conclude, Adjourn, or Recess; setting it also cancels in-flight Asks so that askers see why their question won't be answered. stopped is broader and also covers lead errors, agent crashes, and the abort path. Loops watch stopped; ctx.concluded only feeds the summary's success/verdict.
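
A rough sketch of how the two flags could relate, assuming that an explicit conclusion also sets stopped (the text implies this but does not state it). createRunState, conclude, halt, and the cancellation notice shape are hypothetical; only ctx.concluded and stopped are named in the text.

```javascript
// Hypothetical run state; only ctx.concluded and stopped are named in the text.
function createRunState(pendingAsks) {
  const ctx = { concluded: false };
  let stopped = false;

  return {
    ctx,
    isStopped: () => stopped,

    // Explicit Conclude/Adjourn/Recess: mark concluded, stop, cancel open asks.
    conclude(reason) {
      ctx.concluded = true;
      stopped = true;
      const cancelled = pendingAsks.splice(0);
      return cancelled.map((ask) => ({
        to: ask.asker,
        askId: ask.id,
        cancelledBecause: reason,
      }));
    },

    // Lead error, agent crash, or abort: stop without concluding.
    halt() {
      stopped = true;
    },
  };
}

const crashed = createRunState([]);
crashed.halt();

const finished = createRunState([{ id: 7, asker: "lead" }]);
const notices = finished.conclude("Adjourn");
```

Loops in this sketch would poll isStopped(), while a summary step would read ctx.concluded.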

Tool surface, by role

Role          Ask  Answer  Announce  RollCall  Conclude  Other
Facilitator
Fac. agent                                               RequestForComment
Supervisor
Sup. agent
Discuss lead                                             Recess, Adjourn
Discuss agt                                              RequestForComment
Judge

Ask's to argument accepts a participant name on multi-participant roles (facilitator, discuss lead, all participants). The supervise pair has only one possible target, so to is rejected there.

Minimal example: two-participant facilitator

import { createFacilitator, createRedactor } from "@forwardimpact/libeval";
import { query } from "@anthropic-ai/claude-agent-sdk";

// One facilitator chairs a meeting with two named agents.
const facilitator = createFacilitator({
  facilitatorCwd: process.cwd(),
  agentConfigs: [
    { name: "alice", role: "explorer", agentProfile: "alice" },
    { name: "bob",   role: "tester",   agentProfile: "bob" },
  ],
  query,                      // Claude Agent SDK entry point
  output: process.stdout,     // the NDJSON trace is written here
  redactor: createRedactor(), // trace lines pass through the redactor
  facilitatorProfile: "improvement-coach",
});

const result = await facilitator.run("Run a kata storyboard meeting.");
// result.success / result.turns / NDJSON trace on process.stdout

The facilitator gets Ask/Answer/Announce/RollCall/Conclude; each agent gets the same minus Conclude. Every tool call, bus message, and orchestrator event becomes one trace line.

Trace format and redaction

Each line is { "source": "<participant|orchestrator>", "seq": N, "event": {…} }. seq is monotonic across the whole trace; orchestrator emits session_start, agent_start, protocol_violation, lead_turn_limit, and summary. event is the SDK event verbatim or the orchestrator payload. fit-trace consumes this format.
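
A minimal sketch of reading this format, assuming seq is strictly increasing and that event payloads carry a type field (the type field is an assumption made for the sample data). parseTrace and countBySource are hypothetical helpers, not fit-trace's API.

```javascript
// Hypothetical reader for the {source, seq, event} NDJSON envelope.
function parseTrace(ndjson) {
  const envelopes = ndjson
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
  // seq is described as monotonic across the whole trace; verify it.
  for (let i = 1; i < envelopes.length; i++) {
    if (envelopes[i].seq <= envelopes[i - 1].seq) {
      throw new Error(`seq did not increase at record ${i + 1}`);
    }
  }
  return envelopes;
}

// Count trace lines per source (participant name or "orchestrator").
function countBySource(envelopes) {
  const counts = {};
  for (const { source } of envelopes) {
    counts[source] = (counts[source] ?? 0) + 1;
  }
  return counts;
}

const sampleTrace = [
  JSON.stringify({ source: "orchestrator", seq: 1, event: { type: "session_start" } }),
  JSON.stringify({ source: "alice", seq: 2, event: { type: "assistant" } }),
  JSON.stringify({ source: "orchestrator", seq: 3, event: { type: "summary" } }),
].join("\n");

const sampleCounts = countBySource(parseTrace(sampleTrace));
```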

Redaction is on by default for fit-eval run/supervise/facilitate and composes two layers:

  • Env-var allowlist: ANTHROPIC_API_KEY, GH_TOKEN, GITHUB_TOKEN by default; override with LIBEVAL_REDACTION_ENV_VARS=NAME1,… (replaces, not extends). Runtime values become [REDACTED:env:NAME] everywhere they appear.
  • Credential-shape patterns: sk-ant-, ghp_, ghs_, gho_, github_pat_. Hits become [REDACTED:pattern:KIND].

Set LIBEVAL_REDACTION_DISABLED=1 to disable redaction (one stderr warning per run). Never disable it on CI for a public repo: workflow artifacts remain downloadable for the whole retention period.
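
To make the layering concrete, here is a sketch of a two-layer redactor. It is not the library's createRedactor: makeSketchRedactor is hypothetical, the KIND labels (anthropic, github) are invented for illustration because the source does not list them, and the character classes that follow each prefix are assumptions.

```javascript
// Hypothetical two-layer redactor; not the library's createRedactor.
const DEFAULT_ENV_VARS = ["ANTHROPIC_API_KEY", "GH_TOKEN", "GITHUB_TOKEN"];

// KIND labels and token body shapes are illustrative assumptions.
const PATTERNS = [
  { kind: "anthropic", re: /sk-ant-[A-Za-z0-9_-]+/g },
  { kind: "github", re: /(?:ghp_|ghs_|gho_|github_pat_)[A-Za-z0-9_]+/g },
];

function makeSketchRedactor(env) {
  if (env.LIBEVAL_REDACTION_DISABLED === "1") {
    console.error("warning: trace redaction is disabled");
    return (text) => text;
  }
  // The override replaces the default list rather than extending it.
  const override = env.LIBEVAL_REDACTION_ENV_VARS;
  const names = override
    ? override.split(",").map((n) => n.trim()).filter(Boolean)
    : DEFAULT_ENV_VARS;
  const secrets = names
    .map((name) => ({ name, value: env[name] }))
    .filter((s) => typeof s.value === "string" && s.value.length > 0);

  return (text) => {
    let out = text;
    // Layer 1: exact runtime values of allowlisted env vars.
    for (const { name, value } of secrets) {
      out = out.split(value).join(`[REDACTED:env:${name}]`);
    }
    // Layer 2: anything that merely looks like a credential.
    for (const { kind, re } of PATTERNS) {
      out = out.replace(re, `[REDACTED:pattern:${kind}]`);
    }
    return out;
  };
}

const redact = makeSketchRedactor({ GH_TOKEN: "ghp_abc123" });
const redacted = redact("token ghp_abc123 and ghp_zzz999");
```

Running the exact-value layer first means a known secret is labeled by its variable name, and the pattern layer only catches credentials that were not in the environment.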

Module map

Module                                                    Purpose
agent-runner.js                                           One Claude Agent SDK session; emits NDJSON via the redactor.
message-bus.js                                            Per-participant queues + waitForMessages Promise wakeup.
orchestration-toolkit.js                                  Shared Ask/Answer/Announce/Conclude/RollCall/RequestForComment handlers + builders.
orchestration-loop.js                                     Unified lead+participant loop; reminder/violation handling.
facilitator.js / supervisor.js / discusser.js / judge.js  Per-mode class + factory + system prompt.
discuss-tools.js                                          Discuss-only Recess/Adjourn.
trace-collector.js / trace-query.js / trace-github.js     Trace ingestion / querying / GitHub-attachment helpers.
redaction.js                                              Env-var allowlist + credential-shape pattern redaction.

fit-selfedit

A narrow, audited bypass for sessions where Edit/Write (and bash writes) are blocked against paths the project's own allowlist permits — see #1162 and #441 for the original episodes. Reads stdin, writes the target, exits 0 / 2 (safeguard violation) / 1 (I/O error).

echo "<content>" | bunx fit-selfedit <path>

Two safeguards, checked in order:

  • Settings-allow. Walk upward from the target with Finder.findUpward to find the nearest .claude/settings.json. The target path, taken relative to that file's grandparent directory (the directory that contains .claude), must match at least one Edit(<glob>) rule in permissions.allow[] (matched with minimatch, dot: true). settings.json is the single source of truth: widen the project allowlist and the CLI follows. Traversal such as .claude/../README.md is rejected as a side effect, because path.resolve collapses .. first and the resolved path is then tested against the rules.

  • Branch scope. git rev-parse --abbrev-ref HEAD must not be HEAD (detached) or main. Edits ride a feature branch through whatever merge gates the project has configured.

Failure messages name the safeguard that rejected the write; the settings-allow safeguard also lists the Edit() rules that were tried.
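
The ordering of the two safeguards can be sketched as follows. This is an illustration, not the CLI's code: editGlobs and checkSafeguards are hypothetical names, the target is assumed to be already resolved and made relative, and glob matching is injected as a function. The text says the real check uses minimatch with dot: true; the prefixMatch stand-in below only handles a trailing ** and exists solely to keep the sketch dependency-free.

```javascript
// Pull the glob out of each Edit(<glob>) rule in permissions.allow[].
function editGlobs(settings) {
  const allow = settings?.permissions?.allow ?? [];
  return allow
    .map((rule) => /^Edit\((.+)\)$/.exec(rule))
    .filter(Boolean)
    .map((match) => match[1]);
}

// Hypothetical check, in the documented order; exit code 2 = safeguard violation.
function checkSafeguards({ relTarget, settings, branch, matches }) {
  const globs = editGlobs(settings);
  if (!globs.some((glob) => matches(relTarget, glob))) {
    return { exitCode: 2, safeguard: "settings-allow", tried: globs };
  }
  // "HEAD" is what git rev-parse --abbrev-ref HEAD prints when detached.
  if (branch === "HEAD" || branch === "main") {
    return { exitCode: 2, safeguard: "branch-scope" };
  }
  return { exitCode: 0 };
}

// Stand-in matcher for the sketch only; it understands just a trailing "**".
const prefixMatch = (path, glob) => path.startsWith(glob.replace(/\*\*$/, ""));

const sampleSettings = {
  permissions: { allow: ["Edit(.claude/skills/**)", "Bash(ls)"] },
};

const allowed = checkSafeguards({
  relTarget: ".claude/skills/review.md",
  settings: sampleSettings,
  branch: "feature/skill-update",
  matches: prefixMatch,
});
const onMain = checkSafeguards({
  relTarget: ".claude/skills/review.md",
  settings: sampleSettings,
  branch: "main",
  matches: prefixMatch,
});
```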

Package last updated on 26 May 2026
