
@forwardimpact/libeval
Agent evaluation framework — prove whether agent changes improved outcomes with reproducible evidence.
libeval provides the runtime and tool surface for multi-LLM coordination —
an agent talks to a supervisor, a facilitator chairs a meeting, a lead drives
an asynchronous discussion — plus a CLI suite that runs evals, queries the
traces they produce, and edits skill files under controlled conditions.
| CLI | Purpose |
|---|---|
| fit-eval | Run agents in run/supervise/facilitate/discuss subcommands. |
| fit-trace | Download, query, and analyze NDJSON traces produced by fit-eval. |
| fit-benchmark | Run task families for N runs each and aggregate pass@k. |
| fit-selfedit | Write stdin to .claude/** paths, gated by settings.json + branch. |
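The table does not say how fit-benchmark computes pass@k. As a rough illustration only, the sketch below uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k) for n runs with c passes, averaged over task families. The function names and the shape of the `families` input are assumptions made for this example, not part of libeval.

```javascript
// Unbiased pass@k estimator for one task family: n runs, c of them passed.
// Computed as a running product to avoid large binomial coefficients.
function passAtK(n, c, k) {
  if (k > n) throw new RangeError("k must not exceed n");
  if (n - c < k) return 1; // every size-k sample contains at least one pass
  let allFail = 1;
  for (let i = n - c + 1; i <= n; i++) {
    allFail *= 1 - k / i;
  }
  return 1 - allFail;
}

// Hypothetical aggregation: mean pass@k across task families.
function aggregatePassAtK(families, k) {
  const scores = families.map(({ runs, passes }) => passAtK(runs, passes, k));
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

const families = [
  { name: "refactor", runs: 10, passes: 3 },
  { name: "bugfix", runs: 10, passes: 10 },
];
console.log(aggregatePassAtK(families, 1));
```

With k = 1 the estimator reduces to the plain pass rate c / n for each family.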
fit-eval's subcommands share one orchestration loop and one async tool
surface, below. The judge role is a profile passed to supervise.
| Mode | Lead | Participants | Terminal tool |
|---|---|---|---|
| run | (none) | one agent | task completion |
| supervise | supervisor | one agent | Conclude |
| facilitate | facilitator | N named | Conclude |
| discuss | lead | N named | Adjourn or Recess |
| judge | judge | (none) | Conclude |
run and judge are one-shot. The other three share OrchestrationLoop
plus an async Ask/Answer/Announce/RollCall tool surface; the loop fans
messages out over an in-memory bus and emits a {source, seq, event}
NDJSON envelope for every line.
Ask({ question, to? }) → { askIds: [N, …] }
Answer({ message, askId? }) → routed to the asker
Announce({ message }) → broadcast, no reply expected
Every Ask returns immediately and registers a pending entry keyed by an
askId. The reply arrives later in the asker's inbox as
`[answer#N] <participant>: <text>`. To broadcast, omit `to` on a
multi-participant lead. Answer's `askId` is optional and the handler is
forgiving about it; a call it cannot accept returns `isError` with a
pointed message.

Inbox lines on resume:
[ask#42] facilitator: What is your current condition?
[answer#41] agent-1: We're at 7 out of 10.
[shared] agent-2: FYI I'm switching to Bun 1.2.
[system] @orchestrator: You have an unanswered ask from facilitator (askId=42)…
Async means the lead can issue Asks, end its turn, and plan in the gap while participants work in parallel — nothing blocks the LLM thread.
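The Ask/Answer bookkeeping described above can be pictured with a small in-memory registry. This is a minimal sketch, not libeval's implementation: the `AskRegistry` class, its fields, and its error text are invented for illustration, and unlike the real handler it requires an explicit `askId` on every answer.

```javascript
// Hypothetical sketch of pending-ask bookkeeping; not libeval's actual code.
class AskRegistry {
  constructor(participants) {
    this.participants = participants; // names of everyone on the bus
    this.nextId = 1;
    this.pending = new Map(); // askId -> { from, to }
    this.inboxes = new Map(participants.map((name) => [name, []]));
  }

  // Returns immediately with one askId per recipient; omitting `to` broadcasts.
  ask(from, { question, to }) {
    if (to && !this.inboxes.has(to)) {
      return { isError: true, message: `Unknown participant: ${to}` };
    }
    const targets = to ? [to] : this.participants.filter((p) => p !== from);
    const askIds = targets.map((target) => {
      const askId = this.nextId++;
      this.pending.set(askId, { from, to: target });
      this.inboxes.get(target).push(`[ask#${askId}] ${from}: ${question}`);
      return askId;
    });
    return { askIds };
  }

  // Routes the reply to the asker's inbox and clears the pending entry.
  answer(from, { message, askId }) {
    const entry = this.pending.get(askId);
    if (!entry || entry.to !== from) {
      return { isError: true, message: `No pending ask #${askId} for ${from}` };
    }
    this.pending.delete(askId);
    this.inboxes.get(entry.from).push(`[answer#${askId}] ${from}: ${message}`);
    return { isError: false };
  }
}

const bus = new AskRegistry(["facilitator", "agent-1", "agent-2"]);
const { askIds } = bus.ask("facilitator", { question: "What is your current condition?" });
bus.answer("agent-1", { message: "We're at 7 out of 10.", askId: askIds[0] });
console.log(bus.inboxes.get("facilitator"));
```

Because `ask` only records the pending entry and queues an inbox line, the caller never waits on the recipient; the answer shows up whenever the asker next drains its inbox.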
In discussion mode, Answer calls routed to the lead are streamed to
the discussion thread as they are produced — each agent's Answer becomes
a separate reply posted immediately, not batched at session end. The
lead and agents can also call Acknowledge to post brief messages
directly to the thread (status updates, human follow-up responses).
The message bus intercepts answers and appends them to ctx.replies[].
RequestForComment is a separate coordination tool available on agent
roles (facilitated agents and discuss agents). It queues an intent to
open a new Discussion thread for long-horizon coordination on open
questions; these are accumulated in ctx.rfcs[], separate from the
thread replies in ctx.replies[].
Each participant drains the bus (or waits), runs/resumes the LLM with
drained messages as tagged lines, and on an unanswered owed Ask injects
one synthetic reminder before emitting protocol_violation and
unblocking the asker with a synthetic null answer.
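The reminder-then-violation sequence above can be sketched as a two-step escalation per owed Ask. Everything here is an assumption made for illustration: the `onTurnEnd` name, the shape of `state`, the reminder wording, and the fields on the emitted event are not taken from libeval.

```javascript
// Hypothetical escalation for one owed Ask, called after each participant turn.
function onTurnEnd(state, emit) {
  if (state.owedAskId === null) return "idle";

  if (!state.reminded) {
    // First miss: inject exactly one synthetic reminder into the inbox.
    state.reminded = true;
    state.inbox.push(
      `[system] @orchestrator: unanswered ask from ${state.asker} (askId=${state.owedAskId})`
    );
    return "reminded";
  }

  // Second miss: record the violation and unblock the asker with a null answer.
  emit({ type: "protocol_violation", askId: state.owedAskId });
  state.resolveAsker(null);
  state.owedAskId = null;
  return "violation";
}

const events = [];
const answers = [];
const state = {
  owedAskId: 42,
  asker: "facilitator",
  reminded: false,
  inbox: [],
  resolveAsker: (value) => answers.push(value),
};
const first = onTurnEnd(state, (e) => events.push(e));
const second = onTurnEnd(state, (e) => events.push(e));
console.log(first, second, events, answers);
```

The synthetic null answer matters as much as the violation event: without it, the asker would keep a pending entry that nothing can ever resolve.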
Termination uses two flags. ctx.concluded is set by an explicit
Conclude/Adjourn/Recess, which also cancels in-flight Asks so askers
see why their question won't be answered. stopped is broader: it also
covers lead error, agent crash, and the abort path. Loops watch stopped;
ctx.concluded only feeds the summary's success/verdict.
| Role | Ask | Answer | Announce | RollCall | Conclude | Other |
|---|---|---|---|---|---|---|
| Facilitator | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Fac. agent | ✓ | ✓ | ✓ | ✓ | | RequestForComment |
| Supervisor | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Sup. agent | ✓ | ✓ | ✓ | ✓ | | |
| Discuss lead | ✓ | ✓ | ✓ | ✓ | | Recess, Adjourn, Acknowledge |
| Discuss agt | ✓ | ✓ | ✓ | ✓ | | RequestForComment, Acknowledge |
| Judge | | | | | ✓ | |
Ask's `to` parameter accepts a participant name on multi-participant
roles (facilitator, discuss lead, all participants). The supervise pair
has only one possible target, so `to` is rejected there.
```javascript
import { createFacilitator, createRedactor } from "@forwardimpact/libeval";
import { query } from "@anthropic-ai/claude-agent-sdk";

// One facilitator lead plus two named agents.
const facilitator = createFacilitator({
  facilitatorCwd: process.cwd(),
  agentConfigs: [
    { name: "alice", role: "explorer", agentProfile: "alice" },
    { name: "bob", role: "tester", agentProfile: "bob" },
  ],
  query, // Claude Agent SDK entry point
  output: process.stdout, // the NDJSON trace is written here
  redactor: createRedactor(),
  facilitatorProfile: "improvement-coach",
});

const result = await facilitator.run("Run a kata storyboard meeting.");
// result.success / result.turns / NDJSON trace on process.stdout
```
The facilitator gets Ask/Answer/Announce/RollCall/Conclude;
each agent gets the same minus Conclude. Every tool call, bus
message, and orchestrator event becomes one trace line.
Each line is { "source": "<participant|orchestrator>", "seq": N, "event": {…} }. seq is monotonic across the whole trace; orchestrator emits
session_start, agent_start, protocol_violation, lead_turn_limit,
and summary. event is the SDK event verbatim or the orchestrator
payload. fit-trace consumes this format.
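A consumer of this format only needs to split on newlines and parse each line as JSON. The sketch below is illustrative and is not how fit-trace works internally: the sample lines are made up, the `event.type` field is an assumption about the payload shape, and "monotonic" is read here as strictly increasing.

```javascript
// Parse NDJSON trace text into envelopes and check that seq only increases.
function parseTrace(ndjson) {
  const envelopes = ndjson
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));

  for (let i = 1; i < envelopes.length; i++) {
    if (envelopes[i].seq <= envelopes[i - 1].seq) {
      throw new Error(`seq is not increasing at line ${i + 1}`);
    }
  }
  return envelopes;
}

// Hypothetical sample; real events are SDK events or orchestrator payloads.
const sample = [
  '{"source":"orchestrator","seq":0,"event":{"type":"session_start"}}',
  '{"source":"alice","seq":1,"event":{"type":"assistant"}}',
  '{"source":"orchestrator","seq":2,"event":{"type":"summary"}}',
].join("\n");

const orchestratorEvents = parseTrace(sample)
  .filter((envelope) => envelope.source === "orchestrator")
  .map((envelope) => envelope.event.type);
console.log(orchestratorEvents);
```

Filtering on `source` separates orchestrator bookkeeping from participant output without inspecting the event payloads.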
Redaction is on by default for fit-eval run/supervise/facilitate
and composes two layers:

- Env-var allowlist: ANTHROPIC_API_KEY, GH_TOKEN, GITHUB_TOKEN
  by default; override with LIBEVAL_REDACTION_ENV_VARS=NAME1,…
  (replaces, not extends). Runtime values become [REDACTED:env:NAME]
  everywhere they appear.
- Credential-shape patterns: sk-ant-, ghp_, ghs_, gho_,
  github_pat_. Hits become [REDACTED:pattern:KIND].

Set LIBEVAL_REDACTION_DISABLED=1 to disable redaction (one stderr
warning per run). Never disable it on CI for a public repo: workflow
artifacts stay downloadable for the whole retention period.
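The two layers can be sketched as an exact-value pass followed by a pattern pass. This is a simplified illustration, not the code in redaction.js: `makeRedactor`, the regular expressions, and the `anthropic` / `github` KIND labels are assumptions chosen for the example.

```javascript
// Hypothetical two-layer redactor; regexes and KIND labels are illustrative guesses.
const DEFAULT_ENV_VARS = ["ANTHROPIC_API_KEY", "GH_TOKEN", "GITHUB_TOKEN"];

const PATTERNS = [
  { kind: "anthropic", regex: /sk-ant-[A-Za-z0-9_-]+/g },
  { kind: "github", regex: /(?:ghp_|ghs_|gho_|github_pat_)[A-Za-z0-9_]+/g },
];

function makeRedactor(env, names = DEFAULT_ENV_VARS) {
  // Layer 1 setup: capture runtime values of the listed variables (skip unset ones).
  const secrets = names
    .filter((name) => env[name])
    .map((name) => ({ name, value: env[name] }));

  return (text) => {
    let out = text;
    // Layer 1: replace every literal occurrence of a known secret value.
    for (const { name, value } of secrets) {
      out = out.split(value).join(`[REDACTED:env:${name}]`);
    }
    // Layer 2: mask anything that merely looks like a credential.
    for (const { kind, regex } of PATTERNS) {
      out = out.replace(regex, `[REDACTED:pattern:${kind}]`);
    }
    return out;
  };
}

const redact = makeRedactor({ GH_TOKEN: "hunter2-token" });
console.log(redact("token=hunter2-token key=sk-ant-abc123 pat=ghp_XYZ789"));
```

The first layer catches secrets whatever they look like, as long as they are present in the environment; the second catches credential-shaped strings that never passed through an environment variable.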
| Module | Purpose |
|---|---|
| agent-runner.js | One Claude Agent SDK session; emits NDJSON via the redactor. |
| message-bus.js | Per-participant queues + waitForMessages Promise wakeup. |
| orchestration-toolkit.js | Shared Ask/Answer/Announce/Conclude/RollCall/RequestForComment handlers + builders. |
| orchestration-loop.js | Unified lead+participant loop; reminder/violation handling. |
| facilitator.js / supervisor.js / discusser.js / judge.js | Per-mode class + factory + system prompt. |
| discuss-tools.js | Discuss-only Recess/Adjourn/Acknowledge. |
| reply-emitter.js | Fire-and-forget POST of reply/ack events to the callback URL. |
| inbox-poller.js | Long-poll the bridge inbox for injected human messages. |
| trace-collector.js / trace-query.js / trace-github.js | Trace ingestion / querying / GitHub-attachment helpers. |
| redaction.js | Env-var allowlist + credential-shape pattern redaction. |
fit-selfedit is a narrow, audited bypass for sessions where Edit/Write
(and bash writes) are blocked against paths that the project's own
allowlist permits. It reads stdin, writes the target, and exits with
0 (success), 2 (safeguard violation), or 1 (I/O error).
echo "<content>" | bunx fit-selfedit <path>
Two safeguards, checked in order:

1. Settings-allow. Walk upward from the target with
   Finder.findUpward to find the nearest .claude/settings.json.
   The target path, relative to that file's grandparent directory,
   must match at least one Edit(<glob>) rule in permissions.allow[]
   (matched with minimatch, dot: true). Settings.json is the single
   source of truth: widen the project allowlist and the CLI follows.
   Traversal like .claude/../README.md is rejected as a side effect:
   path.resolve collapses .. first, then the resolved path is tested
   against the rules.
2. Branch scope. git rev-parse --abbrev-ref HEAD must not return
   HEAD (detached) or main. Edits ride a feature branch through
   whatever merge gates the project has configured.
Failure messages name the safeguard that rejected; safeguard 1 also
lists the Edit() rules that were tried.
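The ordering of the safeguards and the exit codes can be sketched as follows. This is not the CLI's source: `selfEdit`, `checkBranchScope`, and the injected `isAllowedBySettings` and `write` callbacks are hypothetical stand-ins, and the settings lookup, glob matching, and git call are deliberately left out.

```javascript
// Exit codes as described above: 0 success, 2 safeguard violation, 1 I/O error.
const EXIT = { OK: 0, IO_ERROR: 1, SAFEGUARD: 2 };

// Safeguard 2: refuse a detached HEAD and the main branch.
function checkBranchScope(branch) {
  if (branch === "HEAD") return { ok: false, reason: "branch-scope: detached HEAD" };
  if (branch === "main") return { ok: false, reason: "branch-scope: refusing to edit on main" };
  return { ok: true };
}

// Runs the safeguards in order; the callbacks are hypothetical stand-ins.
function selfEdit({ target, branch, isAllowedBySettings, write }) {
  if (!isAllowedBySettings(target)) {
    console.error(`settings-allow: no Edit() rule matches ${target}`);
    return EXIT.SAFEGUARD;
  }
  const scope = checkBranchScope(branch);
  if (!scope.ok) {
    console.error(scope.reason);
    return EXIT.SAFEGUARD;
  }
  try {
    write(target);
    return EXIT.OK;
  } catch (err) {
    console.error(`I/O error: ${err.message}`);
    return EXIT.IO_ERROR;
  }
}

const allowClaudeOnly = (target) => target.startsWith(".claude/");
const code = selfEdit({
  target: ".claude/skills/review.md",
  branch: "feature/tune-skill",
  isAllowedBySettings: allowClaudeOnly,
  write: () => {},
});
console.log(code);
```

Checking the allowlist before the branch keeps the cheaper, more specific rejection first, and each failure message names the safeguard that fired.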