
CLI for evaluating AI agent harnesses across GitHub Copilot, Claude Code, Codex, and custom agent workflows.
47 deterministic checks | Passive observation | 4 harness targets | Ablation attribution
BS Buster is a CLI for evaluating AI agent harnesses. It helps teams measure the orchestration layer behind GitHub Copilot, Claude Code, Codex, and custom agent workflows so you can separate model problems from harness problems.
Best for: AI engineering teams, agent platform builders, developer tooling teams, and anyone comparing coding assistants in real workflows.
Install locally in your project:
npm install bs-buster
That's it. No build step, no config files, no Docker. One command.
After installing, all commands use npx:
npx bs-buster init
npx bs-buster start
npx bs-buster stop
npx bs-buster report
Or install globally to drop the npx prefix:
npm install -g bs-buster
bs-buster init # no npx needed when global
Insert coin. Bust BS.
Note: If you installed locally (not globally), you must use `npx` to run commands. Running `bs-buster start` directly won't work; use `npx bs-buster start` instead.
npx bs-buster init
BS Buster Setup
Don't blame the model. Measure the harness.
───────────────────────────────────────────
Scanning for installed harnesses...
[✓] Claude Code Found .jsonl files in ~/.claude/projects (high confidence)
[✓] GitHub Copilot Found extension(s) in ~/.vscode-insiders (high confidence)
[✗] OpenAI Codex Not detected
[✓] Generic Filesystem watcher available for any harness
Select your primary harness:
1) Claude Code (recommended)
2) GitHub Copilot
3) Generic
Enter number [1]: 2
Configuration saved to .bs-buster/config.json
Auto-detects Claude Code, Codex, and Copilot (including VS Code Insiders). Also checks your project root for .claude/, CLAUDE.md, and .github/copilot-instructions.md. Saves config so all future commands need zero flags.
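To make the saved state concrete, here is a plausible shape for `.bs-buster/config.json`, expressed as a TypeScript type. The field names are illustrative assumptions inferred from the CLI flags, not the package's documented schema:

```typescript
// Hypothetical shape of .bs-buster/config.json -- field names are
// illustrative assumptions based on the flags the CLI accepts.
interface BsBusterConfig {
  target: "claude-code" | "codex" | "copilot" | "generic"; // chosen at init
  dir: string;                          // workspace directory to watch
  format: "html" | "json" | "markdown"; // default report format
  harnessName?: string;                 // optional label shown in reports
}

const exampleConfig: BsBusterConfig = {
  target: "copilot",
  dir: ".",
  format: "html",
};
```

Because these values persist, `start`, `stop`, and `report` can all run flag-free afterwards.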
npx bs-buster start # Starts background observer
# ... use your harness normally ...
npx bs-buster stop # Stops observer, finalizes data
npx bs-buster report # Generates an HTML report
The report can be generated in three formats: `--format html` (default), `--format json`, or `--format markdown`.
Run the checks. Find the actual culprit.
npx bs-buster <command> [options]
| Command | Description |
|---|---|
| `npx bs-buster init` | Auto-detect harnesses, save config |
| `npx bs-buster start` | Start background observer |
| `npx bs-buster stop` | Stop observer, finalize events |
| `npx bs-buster status` | Check if observer is running |
| `npx bs-buster report` | Generate evaluation report |
| `npx bs-buster sessions` | List past observation sessions |
| Option | Description | Default |
|---|---|---|
| `--target`, `-t` | `claude-code`, `codex`, `copilot`, `generic` | auto (from init) |
| `--dir`, `-d` | Workspace directory to watch | `.` |
| `--format`, `-f` | Report format: `html`, `json`, `markdown` | `html` |
| `--session`, `-s` | Session ID for report | latest |
| `--harness-name` | Label the harness being tested (e.g., "Ruflo Swarm") | auto |
| `--harness-desc` | Custom description for the harness in reports | auto |
# Override saved config:
npx bs-buster start --target codex --dir ./my-project
npx bs-buster report --session abc123 --format json
The AI industry has a hand-waving problem. When an agent fails — hallucinating, looping, destroying production data — the diagnosis is always the same: "the model needs to be smarter."
This is the model attribution error, and it is the most expensive misdiagnosis in AI engineering.
Same model, different harness = different outcomes. The model was fine. The orchestration wasn't.
"Your agent's reliability problem is not a model problem. We can prove it."
| The BS Claim | What People Say | What the Data Shows |
|---|---|---|
| Model Blame | "We need a smarter model" | Same model, different harness = different outcomes |
| Benchmark Theater | "Our model scores 92% on SWE-bench" | Benchmarks test models in isolation. Production agents fail at the harness layer |
| Prompt Engineering | "We just need better prompts" | Prompts break on model updates. Harness guarantees persist across models |
| Context Window Copium | "We need a bigger context window" | You have 200K tokens and filled them with raw dumps. Lifecycle management was nonexistent |
| Scale Solves Everything | "Next year's model will fix it" | Next year's model still needs stopping rules, tool schemas, and a policy layer |
| Alignment Hand-Wave | "The model doesn't understand consequences" | You handed it rm -rf and git push --force with no permission gate |
The industry's default diagnosis, visualized.
Two real harness evaluations. Same evaluation framework. Different orchestration maturity.
| Metric | Ruflo Swarm (A-) | ATV Agent Orchestrator (B) |
|---|---|---|
| Underlying Assistant | Claude Code | GitHub Copilot |
| Overall Score | 91.1 / 100 | 83.7 / 100 |
| Grade | A- | B |
| Observation Coverage | 100% | 75% (no token tracking) |
| Turns / Tool Calls | 12 turns, 16 tool calls | 381 turns, 103 tool calls |
| Worst Pillar | Context Assembly at 87.5% (4 of 5 pillars at 100%) | Loop Discipline at 33.3% (4 of 5 pillars at 100%) |
| Attribution (harness vs model) | 0% harness failures -- all minor issues attributed to model behavior | 0% harness failures -- eval framework-specific checks were corrected |
| Infinite Loop Signature | Zero infinite loops, zero idle turns, explicit termination | 381 turns with 99% duplicate tool calls, 26 idle turns, no checkpoints |
Don't blame the model. Measure the harness. Both Claude Code and GitHub Copilot are capable coding assistants -- the difference in scores reflects harness orchestration maturity, not model quality. Ruflo's mature harness design -- iteration ceilings, idle detection, progress tracking -- prevented the failure modes that ATV's orchestrator still exhibits. The remaining gap is a harness engineering gap: loop discipline and checkpoint strategy, not model intelligence.
Full reports: Ruflo Swarm (A-) | ATV Agent Orchestrator (B) | Comparative Analysis
When the harness is missing a policy layer, even HAL has receipts.
| # | Pillar | What It Measures | Checks |
|---|---|---|---|
| 1 | Context Assembly | System prompt, tool declarations, state injection | 7 |
| 2 | Tool Integrity | Schema validation, error messages, self-correction | 8 |
| 3 | Loop Discipline | Stopping rules, iteration ceilings, stall detection | 10 |
| 4 | Policy Enforcement | Permission gates, destructive action blocking | 8 |
| 5 | Context Lifecycle | Token management, compaction, delegation | 8 |
| — | Cross-Pillar | Emergent interactions between pillars | 6 |
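A weighted composite over the five pillars can be sketched as follows. The equal weights and the 0-100 scale are illustrative assumptions, not BS Buster's actual scoring constants:

```typescript
// Sketch: combine per-pillar pass rates into one 0-100 composite score.
// The weights here are illustrative assumptions, not the real constants.
type PillarResult = { name: string; passed: number; total: number };

const WEIGHTS: Record<string, number> = {
  "Context Assembly": 0.2,
  "Tool Integrity": 0.2,
  "Loop Discipline": 0.2,
  "Policy Enforcement": 0.2,
  "Context Lifecycle": 0.2,
};

function compositeScore(pillars: PillarResult[]): number {
  let score = 0;
  for (const p of pillars) {
    // Each pillar contributes its pass rate, scaled by its weight.
    score += (WEIGHTS[p.name] ?? 0) * (p.passed / p.total) * 100;
  }
  return Math.round(score * 10) / 10; // one decimal place
}
```

Under this sketch, a harness that fails half its Loop Discipline checks but aces everything else lands around 90, which is roughly the pattern the case study below exhibits.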
BS Buster uses passive observation — it watches your harness operate in its natural environment and reconstructs what happened from side effects. No synthetic sandboxes, no test doubles.
npx bs-buster start → Observer watches harness in background
(use your harness normally)
npx bs-buster stop → Finalize event collection
npx bs-buster report → Reconstruct → 47 Checks → Attribution → Score → HTML Report
fs.watch / file tailing
→ Observer emits ObserverEvents
→ EventCollector writes JSONL to disk
→ reconstructOutput() builds AgentOutput
→ evaluateObservation() runs 47 checks
→ Scoring engine → HarnessReport
Every check is a pure function: (AgentOutput) → { pass, detail }. No model calls. No randomness. Reproducible across runs.
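To illustrate that shape, here is a sketch of a Loop Discipline-style check. The `AgentOutput` fields, the check name, and the 50% threshold are illustrative assumptions, not one of the real 47 checks:

```typescript
// Sketch of a deterministic check in the (AgentOutput) -> { pass, detail }
// shape. Field names, logic, and threshold are illustrative assumptions.
type ToolCall = { name: string; args: unknown };
type AgentOutput = { turns: number; toolCalls: ToolCall[] };

// Flag sessions dominated by duplicate tool calls (a loop signature).
function checkDuplicateToolCalls(
  output: AgentOutput
): { pass: boolean; detail: string } {
  const seen = new Set(
    output.toolCalls.map((c) => `${c.name}:${JSON.stringify(c.args)}`)
  );
  const duplicateRate =
    output.toolCalls.length === 0 ? 0 : 1 - seen.size / output.toolCalls.length;
  return {
    pass: duplicateRate < 0.5,
    detail: `duplicate tool-call rate: ${(duplicateRate * 100).toFixed(1)}%`,
  };
}
```

Because the function depends only on its input, rerunning it over the same observed session always yields the same verdict.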
| Target | Observer Method | Captures |
|---|---|---|
| Claude Code | JSONL tail + workspace fs.watch | Model responses, tool calls, token usage, file changes |
| Codex | JSON output tail + workspace fs.watch | Model responses, tool calls, token usage, file changes |
| Copilot | Workspace snapshot + fs.watch + .git watcher | File changes, git activity |
| Generic | Workspace fs.watch with glob patterns | File changes (any harness) |
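The Generic target's approach can be sketched as a thin wrapper around Node's `fs.watch` that turns filesystem changes into timestamped events. The event shape and class name here are assumptions, not BS Buster's internal types:

```typescript
import { watch, FSWatcher } from "node:fs";

// Sketch of a generic passive observer: filesystem changes in, timestamped
// events out. Event shape and class name are illustrative assumptions.
type FileEvent = { type: "change" | "rename"; path: string; at: number };

class GenericObserver {
  readonly events: FileEvent[] = [];
  private watcher?: FSWatcher;

  start(dir: string): void {
    // Recursive fs.watch is supported on macOS and Windows; on Linux a
    // per-directory fallback is needed (omitted in this sketch).
    this.watcher = watch(dir, { recursive: true }, (eventType, filename) => {
      this.record(eventType as "change" | "rename", String(filename ?? ""));
    });
  }

  // Pure event shaping, kept separate from fs.watch so it is testable.
  record(type: "change" | "rename", path: string): void {
    this.events.push({ type, path, at: Date.now() });
  }

  stop(): void {
    this.watcher?.close();
  }
}
```

Keeping `record` pure mirrors the pipeline above: the watcher only emits events, and all interpretation happens later in reconstruction.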
```ts
import {
  evaluateObservation,
  reconstructOutput,
  EventCollector,
  generateHtmlReport,
} from "bs-buster";

// Example identifiers -- substitute your own session values.
const sessionId = "abc123";
const target = "claude-code";
const harnessId = "my-harness";
const modelId = "my-model";

// Read collected events
const events = EventCollector.readEvents("path/to/session.events.jsonl");

// Reconstruct agent output from raw events
const result = reconstructOutput(events, sessionId, target, harnessId, modelId);

// Run 47-check evaluation
const report = await evaluateObservation(
  result.output,
  "claude-code",
  result.observation_coverage
);

console.log(`Score: ${report.overall_score} (${report.overall_grade})`);

// Generate standalone HTML report
const html = generateHtmlReport({
  session_id: sessionId,
  target: "claude-code",
  observation_coverage: result.observation_coverage,
  warnings: result.warnings,
  evaluation: report,
  summary: { turns: 830, tool_calls: 976, total_tokens: 366570, duration_ms: 0 },
});
```
src/
├── cli/ # CLI entry point
│ ├── index.ts # Argument parsing, 6 commands
│ ├── init-wizard.ts # Guided setup with auto-detection
│ ├── harness-detector.ts # Scans system for installed harnesses
│ ├── html-report.ts # Self-contained HTML report generator
│ ├── config.ts # .bs-buster/config.json persistence
│ └── daemon.ts # Background process with PID management
├── observer/ # Passive observation engine
│ ├── types.ts # HarnessTarget, ObserverEvent, HarnessObserver
│ ├── eval-bridge.ts # Bridges observation → eval pipeline
│ ├── event-collector.ts # JSONL event writer/reader
│ ├── observer-registry.ts # Factory for target-specific observers
│ ├── observers/ # Per-harness observer implementations
│ │ ├── claude-code.observer.ts # JSONL tail + workspace watcher
│ │ ├── codex.observer.ts # JSON output tail + workspace watcher
│ │ ├── copilot.observer.ts # Workspace snapshot + fs.watch + git
│ │ └── generic.observer.ts # Glob-filtered filesystem watcher
│ └── reconstruction/
│ └── output-builder.ts # Events → AgentOutput reconstruction
└── eval/ # Evaluation engine
├── types.ts # 40+ interfaces
├── check-registry.ts # Registration, lookup, lazy ESM loading
├── checks/ # 47 deterministic checks
├── scoring/ # Weighted composite scoring + attribution
├── reporters/ # JSON & Markdown renderers
└── harnesses/ # Ablation testing
| Document | What It Is |
|---|---|
| Agent Harness Whitepaper | The thesis: five pillars framework, failure modes, architectural patterns |
| BS Buster Philosophy | How 47 checks kill the model attribution error |
| Architecture | System architecture, module design, data flow |
| Eval Methodology | Phil Schmid adaptation for harness testing |
| Pillar Strategy | Per-pillar check design and metrics |
| Comparative Analysis | Real-world A- vs B harness comparison: Ruflo Swarm vs ATV Agent Orchestrator |
Runtime dependencies: zod (for schema validation).