arc-eval
CLI-native, framework-agnostic evaluation for AI agents. Real-world reliability, compliance, and risk - improve performance with every run.
ARC-Eval tests any agent—regardless of framework—against 378 enterprise-grade scenarios in finance, security, and machine learning. Instantly spot risks like data leaks, bias, or compliance gaps. With four simple CLI workflows, ARC-Eval delivers actionable insights, continuous improvement, and audit-ready reports—no code changes required.
It's built to be agent-agnostic, meaning you can bring your own agent (BYOA) regardless of the framework (LangChain, OpenAI, Google, Agno, etc.) and get actionable insights with minimal setup.
💡 Pro Tip: Use `--quick-start` for an instant, no-setup demo. See Flexible Input & Auto-Detection for all ingestion options.
# 1. Install ARC-Eval (Python 3.9+ required)
pip install arc-eval
# 2. Try it instantly with sample data (no local files needed!)
arc-eval compliance --domain finance --quick-start
# 3. See all available commands and options
arc-eval --help
⚠️ Important: For agent-as-judge evaluation, set your API key:
export ANTHROPIC_API_KEY="your-anthropic-api-key"
See Flexible Input & Auto-Detection for details.
# For agent-as-judge evaluation (optional but highly recommended for deeper insights)
export ANTHROPIC_API_KEY="your-anthropic-api-key" # Or your preferred LLM provider API key
# For any agent framework - just point to your output file
arc-eval debug --input your_agent_trace.json
arc-eval compliance --domain security --input your_agent_outputs.json
arc-eval improve --from-evaluation latest
# Or get the complete picture in one command
arc-eval analyze --input your_agent_outputs.json --domain finance
# Get guided help and explore workflows anytime from the interactive menu
arc-eval
See Scenario Libraries & Regulations for full coverage details.
See Flexible Input & Auto-Detection for all ingestion options.
arc-eval debug --input your_agent_trace.json
arc-eval compliance --domain finance --input your_agent_outputs.json
# Or try it instantly with sample data (no local files or API keys needed!)
arc-eval compliance --domain security --quick-start
arc-eval improve --from-evaluation latest # Uses insights from your last evaluation
arc-eval analyze --input your_agent_outputs.json --domain finance
This command automatically runs the debug process, then the compliance checks, and finally presents the interactive menu for next steps.
Agent-as-a-Judge: ARC-Eval uses LLMs as domain-specific judges to evaluate agent outputs, provide continuous feedback, and drive improvement. This is implemented in `agent_eval/evaluation/judges/`, with feedback and retraining handled by `agent_eval/analysis/self_improvement.py` and adaptive scenario generation in `agent_eval/core/scenario_bank.py`.
After each workflow (such as debug or compliance), an interactive menu guides you to logical next steps, making it easy to navigate the platform's capabilities.
🔍 What would you like to do?
════════════════════════════════════════
[1] Run compliance check on these outputs (Recommended)
[2] Ask questions about failures (Interactive Mode)
[3] Export debug report (PDF/CSV/JSON)
[4] View learning dashboard & submit patterns (Improve ARC-Eval)
The system tracks failure patterns and improvements over time, providing valuable insights into your agent's reliability. Findings can be exported (PDF/CSV/JSON) for easy sharing or archiving.
ARC-Eval is designed to seamlessly fit into your existing workflows with flexible input methods and intelligent format detection.
You can feed your agent's traces and outputs to ARC-Eval using several convenient methods:
# 1. Direct file input (most common)
arc-eval compliance --domain finance --input your_agent_traces.json
# 2. Auto-scan current directory for JSON files
# Ideal when you have multiple trace files in a folder.
arc-eval compliance --domain finance --folder-scan
# 3. Paste traces directly from your clipboard
# Useful for quick, one-off evaluations. (Requires pyperclip: pip install pyperclip)
arc-eval compliance --domain finance --input clipboard
# 4. Instant demo with built-in sample data (no files needed!)
arc-eval compliance --domain finance --quick-start
# 5. For automation/CI-CD (skips interactive prompts)
arc-eval compliance --domain finance --input your_agent_traces.json --no-interactive
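For CI scripts that need more control than a bare shell command, the `--no-interactive` invocation above can be wrapped in Python. A minimal sketch; the helper names (`build_command`, `run_compliance`) are ours for illustration, not part of ARC-Eval's API, and the exit-code behavior should be confirmed for your version:

```python
import subprocess


def build_command(input_file: str, domain: str = "finance") -> list[str]:
    """Build a non-interactive arc-eval compliance command for CI."""
    return [
        "arc-eval", "compliance",
        "--domain", domain,
        "--input", input_file,
        "--no-interactive",
    ]


def run_compliance(input_file: str, domain: str = "finance") -> int:
    """Run arc-eval and return its exit code so the CI job can gate on it."""
    return subprocess.run(build_command(input_file, domain)).returncode


# In your CI job, something like:
#   sys.exit(run_compliance("your_agent_traces.json"))
```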
For faster evaluation, include a `scenario_id` field in your agent outputs to limit evaluation to specific scenarios:
{
"output": "Transaction approved after KYC verification",
"scenario_id": "fin_001"
}
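If your existing logs lack `scenario_id`, you can tag them before evaluation. A minimal sketch, assuming your outputs are plain dicts; the `tag_outputs` helper is ours, not ARC-Eval's API:

```python
import json


def tag_outputs(outputs: list[dict], scenario_ids: list[str]) -> list[dict]:
    """Attach a scenario_id to each raw agent output dict."""
    return [dict(out, scenario_id=sid) for out, sid in zip(outputs, scenario_ids)]


raw = [{"output": "Transaction approved after KYC verification"}]
tagged = tag_outputs(raw, ["fin_001"])

# Write the tagged outputs so they can be passed to arc-eval via --input
with open("tagged_outputs.json", "w") as f:
    json.dump(tagged, f, indent=2)
```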
No need to reformat your agent logs. ARC-Eval automatically detects and parses outputs from many common agent frameworks and LLM API responses. Just point ARC-Eval to your data, and it will handle the rest.
Examples of auto-detected formats:
// Simple, generic format (works with any custom agent)
{
"output": "Transaction approved for account X9876.",
"scenario_id": "fin_001", // Optional: for faster evaluation (limits to specific scenarios)
"error": null, // Optional: include if an error occurred
"metadata": {"user_id": "user123", "timestamp": "2024-05-27T10:30:00Z"} // Optional metadata
}
// OpenAI / Anthropic API style logs (and similar LLM provider formats)
{
"id": "msg_abc123",
"choices": [
{
"message": {
"role": "assistant",
"content": "The capital of France is Paris.",
"tool_calls": [ // If your agent uses tools
{"id": "call_def456", "type": "function", "function": {"name": "get_capital_city", "arguments": "{\"country\": \"France\"}"}}
]
}
}
],
"usage": {"prompt_tokens": 50, "completion_tokens": 10}
}
// LangChain / CrewAI / LangGraph style traces (capturing intermediate steps)
{
"input": "What is the weather in London?",
"intermediate_steps": [
[
{
"tool": "weather_api",
"tool_input": "London",
"log": "Invoking weather_api with London\n"
},
"Rainy, 10°C"
]
],
"output": "The weather in London is Rainy, 10°C.",
"metadata": {"run_id": "run_789"}
}
ARC-Eval intelligently extracts the core agent response, tool calls, and relevant metadata for evaluation. For adding custom parsers, see agent_eval/core/parser_registry.py.
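As an illustration of what such a parser does, the function below normalizes a hypothetical custom trace format into the generic shape shown above. The input field names (`final_answer`, `case_id`, `trace_id`) and the function itself are invented for this sketch and are not ARC-Eval's registry API:

```python
def parse_my_framework(raw: dict) -> dict:
    """Normalize a custom trace dict into ARC-Eval's generic output shape."""
    return {
        "output": raw["final_answer"],                # required: the agent's final response
        "scenario_id": raw.get("case_id"),            # optional: limits evaluation scope
        "error": raw.get("exception"),                # optional: populated on failure
        "metadata": {"run_id": raw.get("trace_id")},  # optional: context for audit trails
    }


normalized = parse_my_framework(
    {"final_answer": "Access denied.", "case_id": "sec_042", "trace_id": "run_789"}
)
```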
ARC-Eval provides 378 enterprise scenarios across three critical domains:
| Domain | Scenarios | Key Regulations | Use Cases |
|---|---|---|---|
| Finance | 110 scenarios | SOX, KYC/AML, PCI-DSS, GDPR, EU AI Act | Financial reporting, fraud detection, loan processing |
| Security | 120 scenarios | OWASP LLM Top 10, NIST AI RMF, ISO 27001 | Prompt injection, data leakage, model theft |
| ML/AI | 148 scenarios | EU AI Act, IEEE P7000, Model Cards | Bias detection, explainability, model governance |
Why This Matters: While many evaluation platforms focus on "helpfulness" and "harmlessness," ARC-Eval specializes in regulatory compliance and enterprise risk scenarios that can result in significant fines or regulatory action.
graph LR
A[Agent Output] --> B[Debug]
B --> C[Compliance]
C --> D[Dashboard & Report]
D --> E[Improve]
E --> F[Re-evaluate]
F --> B
The Arc Loop: ARC-Eval learns from every failure to build smarter, more reliable agents.
📖 Complete Implementation Guide: See Core Product Loops Documentation for detailed step-by-step instructions on implementing The Arc Loop and Data Flywheel in your development workflow.
🔄 ARC-Eval Data Flywheel: Complete End-to-End Flow
📊 Core Architecture Overview
Static Domain Knowledge → Dynamic Learning → Performance Analysis → Adaptive Improvement
↓ ↓ ↓ ↓
finance.yaml ScenarioBank SelfImprovementEngine FlywheelExperiment
(110 scenarios) (pattern learning) (performance tracking) (ACL curriculum)
↑ ↑ ↑ ↑
└──────────────── Continuous Feedback Loop ────────────────────────┘
📚 Complete Documentation: See `docs/` for comprehensive guides including:
- 🔄 Core Product Loops - The Arc Loop & Data Flywheel (Essential!)
- Quick Start Guide - Get running in 5 minutes
- Workflows Guide - Debug, compliance, and improvement workflows
- Prediction System - Hybrid reliability prediction framework
- Framework Integration - Support for 10+ agent frameworks
- API Reference - Complete Python SDK documentation
- Testing Guide - Comprehensive testing methodology and validation
- Troubleshooting Guide - Common issues, solutions, and optimization
- Enterprise Integration - CI/CD pipeline integration
🔧 Practical Examples: Explore `examples/` for:
- Framework-specific integration examples
- CI/CD pipeline templates
- Sample agent outputs and configurations
- Prediction testing and validation
from agent_eval.core import EvaluationEngine, AgentOutput
from agent_eval.core.types import EvaluationResult, EvaluationSummary

# Example agent outputs (replace with your actual agent data)
agent_data = [
    {"output": "The transaction is approved.", "metadata": {"scenario": "finance_scenario_1"}},
    {"output": "Access denied due to security policy.", "metadata": {"scenario": "security_scenario_3"}},
]
agent_outputs = [AgentOutput.from_raw(data) for data in agent_data]

# Initialize the evaluation engine for a specific domain
engine = EvaluationEngine(domain="finance")

# Run the evaluation. You can optionally pass specific scenarios;
# otherwise the domain's default scenario pack is used.
results: list[EvaluationResult] = engine.evaluate(agent_outputs=agent_outputs)

# Get a summary of the results
summary: EvaluationSummary = engine.get_summary(results)
print(f"Total Scenarios: {summary.total_scenarios}")
print(f"Passed: {summary.passed}")
print(f"Failed: {summary.failed}")
print(f"Pass Rate: {summary.pass_rate:.2f}%")

for result in results:
    if not result.passed:
        print(f"Failed Scenario: {result.scenario_name}, Reason: {result.failure_reason}")
See the GitHub Actions workflow example.
Pro Tip: Use `--quick-start` for an instant demo evaluation with sample data.
Note: ARC-Eval auto-detects many common agent output formats, so there is no need to reformat your logs.
Warning: For agent-as-a-judge evaluation, you must set your API key (see above for details).
This project is licensed under the MIT License - see the LICENSE file for details.