ARC-Eval: Debug, Evaluate, and Improve AI Agents

CLI-native, framework-agnostic evaluation for AI agents. Real-world reliability, compliance, and risk - improve performance with every run.



[Screenshot: ARC-Eval CLI dashboard]


ARC-Eval tests any agent—regardless of framework—against 378 enterprise-grade scenarios in finance, security, and machine learning. Instantly spot risks like data leaks, bias, or compliance gaps. With four simple CLI workflows, ARC-Eval delivers actionable insights, continuous improvement, and audit-ready reports—no code changes required.

It's built to be agent-agnostic, meaning you can bring your own agent (BYOA) regardless of the framework (LangChain, OpenAI, Google, Agno, etc.) and get actionable insights with minimal setup.


⚡ Quick Start (2 minutes)

💡 Pro Tip: Use --quick-start for an instant, no-setup demo. See Flexible Input & Auto-Detection for all ingestion options.

# 1. Install ARC-Eval (Python 3.9+ required)
pip install arc-eval

# 2. Try it instantly with sample data (no local files needed!)
arc-eval compliance --domain finance --quick-start

# 3. See all available commands and options
arc-eval --help

Next Steps with Your Agent

⚠️ Important: For agent-as-a-judge evaluation, set your API key:
export ANTHROPIC_API_KEY="your-anthropic-api-key"
See Flexible Input & Auto-Detection for details.

# For agent-as-a-judge evaluation (optional but highly recommended for deeper insights)
export ANTHROPIC_API_KEY="your-anthropic-api-key" # Or your preferred LLM provider API key

# For any agent framework - just point to your output file
arc-eval debug --input your_agent_trace.json
arc-eval compliance --domain security --input your_agent_outputs.json
arc-eval improve --from-evaluation latest

# Or get the complete picture in one command
arc-eval analyze --input your_agent_outputs.json --domain finance

# Get guided help and explore workflows anytime from the interactive menu
arc-eval

Core Workflows

See Scenario Libraries & Regulations for full coverage details.
See Flexible Input & Auto-Detection for all ingestion options.

Debug: "Why is my agent failing?"
arc-eval debug --input your_agent_trace.json
Compliance: "Does my agent meet requirements?"
arc-eval compliance --domain finance --input your_agent_outputs.json

# Or try it instantly with sample data (no local files or API keys needed!)
arc-eval compliance --domain security --quick-start
Improve: "How do I make it better?"
arc-eval improve --from-evaluation latest # Uses insights from your last evaluation
Analyze: "Give me the complete picture"
arc-eval analyze --input your_agent_outputs.json --domain finance

This command automatically runs the debug process, then the compliance checks, and finally presents the interactive menu for next steps.

Key Features & Dashboard

Agent-as-a-Judge: ARC-Eval uses LLMs as domain-specific judges to evaluate agent outputs, provide continuous feedback, and drive improvement. This is implemented in agent_eval/evaluation/judges/, with feedback and retraining handled by agent_eval/analysis/self_improvement.py and adaptive scenario generation in agent_eval/core/scenario_bank.py.
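
To make the pattern concrete, here is a minimal, hypothetical sketch of agent-as-a-judge grading. It is not the actual implementation in agent_eval/evaluation/judges/; the judge_output function, the prompt wording, and the injected llm callable are all illustrative assumptions.

import json
from typing import Callable

# Hypothetical judge prompt; the real prompts live in agent_eval/evaluation/judges/.
JUDGE_PROMPT = """You are a {domain} compliance judge.
Scenario: {scenario}
Agent output: {output}
Reply with JSON only: {{"pass": true, "reason": "..."}} or {{"pass": false, "reason": "..."}}"""

def judge_output(output: str, scenario: str, domain: str,
                 llm: Callable[[str], str]) -> dict:
    """Grade one agent output against one scenario using an injected
    prompt -> text callable (wrap your LLM provider's API yourself)."""
    prompt = JUDGE_PROMPT.format(domain=domain, scenario=scenario, output=output)
    return json.loads(llm(prompt))  # expects {"pass": bool, "reason": str}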

Interactive Menus

After each workflow (like debug or compliance), an interactive menu guides you to logical next steps, making it easy to navigate the platform's capabilities.

🔍 What would you like to do?
════════════════════════════════════════

  [1]  Run compliance check on these outputs      (Recommended)
  [2]  Ask questions about failures               (Interactive Mode)
  [3]  Export debug report                        (PDF/CSV/JSON)
  [4]  View learning dashboard & submit patterns  (Improve ARC-Eval)

Learning Dashboard

The system tracks failure patterns and improvements over time, providing valuable insights:

  • Pattern Library: Captures and catalogs recurring failure patterns from your agent's runs.
  • Fix Catalog: Suggests specific code fixes or configuration changes for common, identified issues.
  • Performance Delta: Clearly shows improvement metrics (e.g., compliance pass rate increasing from 73% → 91% after applying fixes).

Versatile Export Options

Easily share or archive your findings:

  • PDF: Professional, auditable reports ideal for compliance teams and stakeholder reviews.
  • CSV: Raw data suitable for spreadsheet analysis or custom charting.
  • JSON: Structured data perfect for integration with monitoring systems, CI/CD pipelines, or other internal tools (see the sketch after this list).
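
As a sketch of the JSON integration path, the snippet below gates a pipeline on a pass-rate threshold. The report file name and the pass_rate field are hypothetical; check your export's actual schema before using this.

import json
import sys

# Load an exported JSON report. "arc_eval_report.json" and the
# "pass_rate" field are hypothetical; adjust to the real schema.
with open("arc_eval_report.json") as f:
    report = json.load(f)

pass_rate = float(report.get("pass_rate", 0.0))
print(f"Compliance pass rate: {pass_rate:.1f}%")
sys.exit(0 if pass_rate >= 90.0 else 1)  # fail the pipeline below the threshold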

Flexible Input & Auto-Detection

ARC-Eval is designed to seamlessly fit into your existing workflows with flexible input methods and intelligent format detection.

Multiple Ways to Provide Agent Outputs

You can feed your agent's traces and outputs to ARC-Eval using several convenient methods:

# 1. Direct file input (most common)
arc-eval compliance --domain finance --input your_agent_traces.json

# 2. Auto-scan current directory for JSON files
# Ideal when you have multiple trace files in a folder.
arc-eval compliance --domain finance --folder-scan

# 3. Paste traces directly from your clipboard
# Useful for quick, one-off evaluations. (Requires pyperclip: pip install pyperclip)
arc-eval compliance --domain finance --input clipboard

# 4. Instant demo with built-in sample data (no files needed!)
arc-eval compliance --domain finance --quick-start

# 5. For automation/CI-CD (skips interactive prompts)
arc-eval compliance --domain finance --input your_agent_traces.json --no-interactive
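
In a CI job you can wrap the non-interactive invocation in a short script and propagate the exit code. A minimal sketch, assuming arc-eval is on PATH and (an assumption worth verifying for your version) that a failed run exits non-zero:

import subprocess
import sys

# Run ARC-Eval with the documented --no-interactive flag so no menu
# blocks the job; the input file name is a placeholder.
result = subprocess.run([
    "arc-eval", "compliance",
    "--domain", "finance",
    "--input", "your_agent_traces.json",
    "--no-interactive",
])
# Assumption: a non-zero exit code signals a failed evaluation run.
sys.exit(result.returncode)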

Performance Optimization

For faster evaluation, include scenario_id in your agent outputs to limit evaluation to specific scenarios:

{
  "output": "Transaction approved after KYC verification",
  "scenario_id": "fin_001"
}
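
One way to produce such records is to tag outputs as you generate them. A minimal sketch; run_agent, the prompts, and the scenario IDs are placeholders for your own agent code and domain pack:

import json

def run_agent(prompt: str) -> str:
    # Placeholder: call your agent or framework here.
    return f"stub response to: {prompt}"

# Hypothetical scenario IDs and prompts; match IDs to the pack you evaluate against.
scenarios = [
    ("fin_001", "Approve a transaction after KYC verification"),
    ("fin_002", "Review a wire transfer flagged by AML rules"),
]

records = [{"output": run_agent(prompt), "scenario_id": sid}
           for sid, prompt in scenarios]

with open("your_agent_outputs.json", "w") as f:
    json.dump(records, f, indent=2)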

Performance Tips:

  • Include scenario_id: Limits evaluation to specific scenarios (10x faster)
  • Use --no-interactive: Essential for automation and CI/CD
  • Use --quick-start: Instant demo with built-in sample data
  • Batch processing: Automatically enabled for 5+ scenarios (50% cost savings)

Automatic Format Detection

No need to reformat your agent logs. ARC-Eval automatically detects and parses outputs from many common agent frameworks and LLM API responses. Just point ARC-Eval to your data, and it will handle the rest.

Examples of auto-detected formats:

// Simple, generic format (works with any custom agent)
{
  "output": "Transaction approved for account X9876.",
  "scenario_id": "fin_001", // Optional: for faster evaluation (limits to specific scenarios)
  "error": null, // Optional: include if an error occurred
  "metadata": {"user_id": "user123", "timestamp": "2024-05-27T10:30:00Z"} // Optional metadata
}

// OpenAI / Anthropic API style logs (and similar LLM provider formats)
{
  "id": "msg_abc123",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris.",
        "tool_calls": [ // If your agent uses tools
          {"id": "call_def456", "type": "function", "function": {"name": "get_capital_city", "arguments": "{\"country\": \"France\"}"}}
        ]
      }
    }
  ],
  "usage": {"prompt_tokens": 50, "completion_tokens": 10}
}

// LangChain / CrewAI / LangGraph style traces (capturing intermediate steps)
{
  "input": "What is the weather in London?",
  "intermediate_steps": [
    [
      {
        "tool": "weather_api",
        "tool_input": "London",
        "log": "Invoking weather_api with London\n"
      },
      "Rainy, 10°C"
    ]
  ],
  "output": "The weather in London is Rainy, 10°C.",
  "metadata": {"run_id": "run_789"}
}

ARC-Eval intelligently extracts the core agent response, tool calls, and relevant metadata for evaluation. For adding custom parsers, see agent_eval/core/parser_registry.py.
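
If your logs are in a format that is not auto-detected, an alternative to writing a parser is to pre-convert them into the simple generic format shown above. A minimal sketch, assuming a hypothetical line-per-record log with response and step_id fields:

import json

# Hypothetical source format: one JSON object per line with "response"
# and "step_id" fields; adjust the keys to your own logs.
def normalize(line: str) -> dict:
    record = json.loads(line)
    return {
        "output": record["response"],          # maps to the generic "output" field
        "scenario_id": record.get("step_id"),  # optional; speeds up evaluation
    }

with open("raw_agent.log") as src, open("your_agent_outputs.json", "w") as dst:
    json.dump([normalize(ln) for ln in src if ln.strip()], dst, indent=2)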

Scenario Libraries & Regulations


Comprehensive Enterprise Test Suite

ARC-Eval provides 378 enterprise scenarios across three critical domains:

| Domain   | Scenarios | Key Regulations                           | Use Cases                                             |
|----------|-----------|-------------------------------------------|-------------------------------------------------------|
| Finance  | 110       | SOX, KYC/AML, PCI-DSS, GDPR, EU AI Act    | Financial reporting, fraud detection, loan processing |
| Security | 120       | OWASP LLM Top 10, NIST AI RMF, ISO 27001  | Prompt injection, data leakage, model theft           |
| ML/AI    | 148       | EU AI Act, IEEE P7000, Model Cards        | Bias detection, explainability, model governance      |

Real-World Enterprise Scenarios

  • SOX Compliance: Detect earnings manipulation in SEC filings
  • AML/KYC: Identify synthetic identities and money laundering patterns
  • EU AI Act: Assess high-risk AI system compliance and bias detection
  • OWASP LLM: Test for prompt injection, training data poisoning, model theft
  • Data Privacy: GDPR, CCPA compliance validation and PII detection

Why This Matters: While many evaluation platforms focus on "helpfulness" and "harmlessness," ARC-Eval specializes in regulatory compliance and enterprise risk scenarios that can result in significant fines or regulatory action.

How It Works

graph LR
    A[Agent Output] --> B[Debug]
    B --> C[Compliance]
    C --> D[Dashboard & Report]
    D --> E[Improve]
    E --> F[Re-evaluate]
    F --> B

The Arc Loop: ARC-Eval learns from every failure to build smarter, more reliable agents.

  • Debug: You provide your agent's output (trace). ARC-Eval finds what's broken (errors, inefficiencies) and can suggest running a compliance check for deeper analysis.
  • Compliance: Your agent is measured against hundreds of real-world scenarios. The results populate the Learning Dashboard and generate a compliance report.
  • Dashboard & Report: Track your agent's learning progress, identify recurring patterns, and see compliance gaps. This insight guides the improvement plan.
  • Improve: Based on the findings, ARC-Eval generates prioritized fixes and an improvement plan, then prompts re-evaluation.
  • Re-evaluate: Test the suggested improvements. The new performance data feeds back into the loop, enabling continuous refinement.
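
One pass through the loop can be scripted with the documented commands. A minimal sketch; the input file name is a placeholder, --no-interactive is documented for compliance, and whether improve prompts interactively may vary by version:

import subprocess

# Evaluate, then generate an improvement plan from that evaluation.
subprocess.run(["arc-eval", "compliance", "--domain", "finance",
                "--input", "your_agent_outputs.json", "--no-interactive"],
               check=True)
subprocess.run(["arc-eval", "improve", "--from-evaluation", "latest"],
               check=True)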

📖 Complete Implementation Guide: See Core Product Loops Documentation for detailed step-by-step instructions on implementing The Arc Loop and Data Flywheel in your development workflow.

🔄 ARC-Eval Data Flywheel: Complete End-to-End Flow

📊 Core Architecture Overview

  Static Domain Knowledge → Dynamic Learning → Performance Analysis → Adaptive Improvement
          ↓                      ↓                    ↓                      ↓
     finance.yaml          ScenarioBank      SelfImprovementEngine    FlywheelExperiment
     (110 scenarios)    (pattern learning)   (performance tracking)    (ACL curriculum)
          ↑                      ↑                    ↑                      ↑
          └──────────────── Continuous Feedback Loop ────────────────────────┘

Examples & Integrations


📚 Complete Documentation: See docs/ for comprehensive guides.

🔧 Practical Examples: Explore examples/ for:

  • Framework-specific integration examples
  • CI/CD pipeline templates
  • Sample agent outputs and configurations
  • Prediction testing and validation

Advanced Usage


Python SDK for Programmatic Evaluation

from agent_eval.core import EvaluationEngine, AgentOutput
from agent_eval.core.types import EvaluationResult, EvaluationSummary

# Example agent outputs (replace with your actual agent data)
agent_data = [
    {"output": "The transaction is approved.", "metadata": {"scenario": "finance_scenario_1"}},
    {"output": "Access denied due to security policy.", "metadata": {"scenario": "security_scenario_3"}}
]
agent_outputs = [AgentOutput.from_raw(data) for data in agent_data]

# Initialize the evaluation engine for a specific domain
engine = EvaluationEngine(domain="finance")

# Run evaluation
# You can optionally pass specific scenarios if needed, otherwise it uses the domain's default pack
results: list[EvaluationResult] = engine.evaluate(agent_outputs=agent_outputs)

# Get a summary of the results
summary: EvaluationSummary = engine.get_summary(results)

print(f"Total Scenarios: {summary.total_scenarios}")
print(f"Passed: {summary.passed}")
print(f"Failed: {summary.failed}")
print(f"Pass Rate: {summary.pass_rate:.2f}%")

for result in results:
    if not result.passed:
        print(f"Failed Scenario: {result.scenario_name}, Reason: {result.failure_reason}")

See the GitHub Actions workflow example.

Tips & Troubleshooting

Pro Tip: Use --quick-start for instant demo evaluation with sample data.

Note: ARC-Eval auto-detects many common agent output formats—no need to reformat your logs.

Warning: For agent-as-a-judge evaluation, you must set your API key (see above for details).

This project is licensed under the MIT License - see the LICENSE file for details.
