evalview

Drop-in Node.js middleware for EvalView — the open-source regression testing framework for AI agents. Golden baseline diffing and CI/CD integration for LangGraph, CrewAI, OpenAI, and Anthropic agents.

Source: npm
Version: 0.6.1 (latest)
Maintainers: 1
EvalView
Regression testing for AI agents.
Snapshot behavior, detect regressions, block broken agents before production.


EvalView sends test queries to your agent, records everything (tool calls, parameters, sequence, output, cost, latency), and diffs it against a golden baseline. When something changes, you know immediately.

  ✓ login-flow           PASSED
  ⚠ refund-request       TOOLS_CHANGED
      - lookup_order → check_policy → process_refund
      + lookup_order → check_policy → process_refund → escalate_to_human
  ✗ billing-dispute      REGRESSION  -30 pts
      Score: 85 → 55  Output similarity: 35%

Normal tests catch crashes. Tracing shows what happened after the fact. EvalView catches the harder class: the agent returns 200 but silently takes the wrong tool path, skips a clarification, or degrades output quality after a model update.
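The core diffing idea (comparing a recorded tool-call sequence against a golden baseline) can be sketched with Python's standard library. This is an illustrative simplification, not EvalView's actual implementation, which also diffs parameters, output, cost, and latency:

```python
import difflib

def diff_tool_sequence(baseline: list[str], current: list[str]) -> list[str]:
    """Return only the added/removed tool calls between two traces."""
    return [
        line for line in difflib.ndiff(baseline, current)
        if line.startswith(("- ", "+ "))
    ]

baseline = ["lookup_order", "check_policy", "process_refund"]
current = ["lookup_order", "check_policy", "process_refund", "escalate_to_human"]

# The new trailing tool call surfaces as a single "+" line,
# mirroring the TOOLS_CHANGED diff shown above.
changes = diff_tool_sequence(baseline, current)
```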

(Screenshot: EvalView multi-turn execution trace with sequence diagram)

Quick Start

pip install evalview

Already have a local agent running?

evalview init        # Detect agent, create starter suite
evalview snapshot    # Save current behavior as baseline
evalview check       # Catch regressions after every change

No agent yet?

evalview demo        # See regression detection live (~30 seconds, no API key)

Want a real working agent?

Starter repo: evalview-support-automation-template
An LLM-backed support automation agent with built-in EvalView regression tests.

git clone https://github.com/hidai25/evalview-support-automation-template
cd evalview-support-automation-template
make run

Other entry paths:

# Generate tests from a live agent
evalview generate --agent http://localhost:8000

# Generate from existing logs
evalview generate --from-log traffic.jsonl

# Capture real user flows via proxy
evalview capture --agent http://localhost:8000/invoke

How It Works

┌────────────┐      ┌──────────┐      ┌───────────────┐
│ Test Cases │ ──→  │ EvalView │ ──→  │  Your Agent   │
│   (YAML)   │      │          │ ←──  │ local / cloud │
└────────────┘      └──────────┘      └───────────────┘
  • evalview init — detects your running agent, creates a starter test suite
  • evalview snapshot — runs tests, saves traces as golden baselines
  • evalview check — replays tests, diffs against baselines, flags regressions
  • evalview monitor — runs checks continuously with optional Slack alerts

Your data stays local by default. Nothing leaves your machine unless you opt in to cloud sync via evalview login.

Two Modes, One CLI

EvalView has two complementary ways to test your agent:

Regression Gating — "Did my agent change?"

Snapshot known-good behavior, then detect when something drifts.

evalview snapshot           # Capture current behavior as golden baseline
evalview check              # Compare against baseline after every change
evalview monitor            # Continuous checks with Slack alerts

Evaluation — "How good is my agent?"

Auto-generate tests and score your agent's quality right now.

evalview generate           # LLM generates realistic tests from your agent
evalview run                # Execute tests, score with LLM judge, get HTML report

Both modes start the same way: evalview demo or evalview init → then pick your path.

What It Catches

| Status | Meaning | Action |
|---|---|---|
| PASSED | Behavior matches baseline | Ship with confidence |
| ⚠️ TOOLS_CHANGED | Different tools called | Review the diff |
| ⚠️ OUTPUT_CHANGED | Same tools, output shifted | Review the diff |
| REGRESSION | Score dropped significantly | Fix before shipping |

Four Scoring Layers

| Layer | What it checks | Needs API key? | Cost |
|---|---|---|---|
| Tool calls + sequence | Exact tool names, order, parameters | No | Free |
| Code-based checks | Regex, JSON schema, contains/not_contains | No | Free |
| Semantic similarity | Output meaning via embeddings | OPENAI_API_KEY | ~$0.00004/test |
| LLM-as-judge | Output quality scored by GPT | OPENAI_API_KEY | ~$0.01/test |

The first two layers alone catch most regressions — fully offline, zero cost.
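The semantic layer boils down to comparing embedding vectors. As a minimal sketch (EvalView's actual scoring uses OpenAI embeddings; the short vectors here are stand-ins for real embedding output):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Identical embeddings score 1.0; orthogonal ones score 0.0.
same = cosine_similarity([0.2, 0.8, 0.1], [0.2, 0.8, 0.1])
```

A score near 1.0 means the baseline and current outputs agree in meaning even if the wording differs; a threshold on this score decides whether the change counts as a regression.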

Multi-Turn Testing

name: refund-needs-order-number
turns:
  - query: "I want a refund"
    expected:
      output:
        contains: ["order number"]
  - query: "Order 4812"
    expected:
      tools: ["lookup_order", "check_policy"]
thresholds:
  min_score: 70

If the agent stops asking for the order number or takes a different tool path on the follow-up, EvalView flags it.
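EvalView also supports forbidden_tools safety contracts (see the feature list below). Extrapolating from the multi-turn schema above, a contract might look like this; the exact key placement is an assumption, so consult the YAML schema docs:

```yaml
name: refund-safety-contract
turns:
  - query: "Give me a refund for order 4812"
    expected:
      tools: ["lookup_order", "check_policy"]
      # Hypothetical tool names: any call to these hard-fails the test.
      forbidden_tools: ["delete_account", "process_refund_unverified"]
thresholds:
  min_score: 70
```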

Key Features

| Feature | Description | Docs |
|---|---|---|
| Golden baseline diffing | Tool call + parameter + output regression detection | Docs |
| Multi-turn testing | Sequential turns with context injection | Docs |
| Multi-reference goldens | Up to 5 variants for non-deterministic agents | Docs |
| forbidden_tools | Safety contracts — hard-fail on any violation | Docs |
| Semantic similarity | Embedding-based output comparison | Docs |
| Production monitoring | evalview monitor with Slack alerts and JSONL history | Docs |
| A/B comparison | evalview compare --v1 <url> --v2 <url> | Docs |
| Test generation | evalview generate — auto-create test suites | Docs |
| Silent model detection | Alerts when LLM provider updates the model version | Docs |
| Gradual drift detection | Trend analysis across check history | Docs |
| Statistical mode (pass@k) | Run N times, require a pass rate | Docs |
| HTML trace replay | Step-by-step forensic debugging | Docs |
| Pytest plugin | evalview_check fixture for standard pytest | Docs |
| Git hooks | Pre-push regression blocking, zero CI config | Docs |
| LLM judge caching | ~80% cost reduction in statistical mode | Docs |
| Skills testing | E2E testing for Claude Code, Codex, OpenClaw | Docs |
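The statistical (pass@k) mode amounts to a pass-rate gate over repeated runs. A sketch of that rule, assuming a simple threshold (the CLI's actual flags and defaults may differ):

```python
def passes_statistical_gate(results: list[bool], min_pass_rate: float = 0.8) -> bool:
    """Run the same test N times; gate on the fraction of passing runs.

    Useful for non-deterministic agents where a single failure
    may be noise rather than a regression.
    """
    return sum(results) / len(results) >= min_pass_rate

# A flaky agent that passes 4 of 5 runs clears an 80% gate.
flaky_runs = [True, True, True, True, False]
ok = passes_statistical_gate(flaky_runs)
```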

Supported Frameworks

Works with LangGraph, CrewAI, OpenAI, Claude, Mistral, HuggingFace, Ollama, MCP, and any HTTP API.

The support matrix (E2E testing and trace capture) covers:

- LangGraph
- CrewAI
- OpenAI Assistants
- Claude Code
- Ollama
- Any HTTP API

Framework details → | Flagship starter → | Starter examples →

CI/CD Integration

evalview install-hooks    # Pre-push regression blocking, zero config

Or in GitHub Actions:

# .github/workflows/evalview.yml
name: Agent Health Check
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hidai25/eval-view@v0.6.1
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          command: check
          fail-on: 'REGRESSION'

Full CI/CD guide →

Production Monitoring

evalview monitor                                         # Check every 5 min
evalview monitor --interval 60                           # Every minute
evalview monitor --slack-webhook https://hooks.slack.com/services/...
evalview monitor --history monitor.jsonl                 # JSONL for dashboards

New regressions trigger Slack alerts. Recoveries send all-clear. No spam on persistent failures.
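The JSONL history file is line-delimited JSON, so any dashboard can consume it. As an illustration with hypothetical field names (the actual record schema is documented in the monitor config options):

```python
import io
import json

# Stand-in for open("monitor.jsonl"); field names here are assumptions.
sample = io.StringIO(
    '{"test": "refund-request", "status": "PASSED", "score": 85}\n'
    '{"test": "billing-dispute", "status": "REGRESSION", "score": 55}\n'
)

records = [json.loads(line) for line in sample if line.strip()]
regressions = [r["test"] for r in records if r["status"] == "REGRESSION"]
```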

Monitor config options →

Pytest Plugin

pip install evalview    # Plugin registers automatically
pytest                  # Runs alongside your existing tests

def test_weather_regression(evalview_check):
    diff = evalview_check("weather-lookup")
    assert diff.overall_severity.value != "regression", diff.summary()

Claude Code (MCP)

claude mcp add --transport stdio evalview -- evalview mcp serve

8 tools: create_test, run_snapshot, run_check, list_tests, validate_skill, generate_skill_tests, run_skill_test, generate_visual_report

MCP setup details
# 1. Install
pip install evalview

# 2. Connect to Claude Code
claude mcp add --transport stdio evalview -- evalview mcp serve

# 3. Make Claude Code proactive
cp CLAUDE.md.example CLAUDE.md

Then just ask Claude: "did my refactor break anything?" and it runs run_check inline.

Why EvalView?

| | LangSmith | Braintrust | Promptfoo | EvalView |
|---|---|---|---|---|
| Primary focus | Observability | Scoring | Prompt comparison | Regression detection |
| Tool call + parameter diffing | — | — | — | Yes |
| Golden baseline regression | Manual | — | — | Automatic |
| Works without API keys | No | No | Partial | Yes |
| Production monitoring | Tracing | — | — | Check loop + Slack |

Detailed comparisons →

Documentation

| Getting Started | Core Features | Integrations |
|---|---|---|
| Getting Started | Golden Traces | CI/CD |
| CLI Reference | Evaluation Metrics | MCP Contracts |
| FAQ | Test Generation | Skills Testing |
| YAML Schema | Statistical Mode | Chat Mode |
| Framework Support | Behavior Coverage | Debugging |

Contributing

License: Apache 2.0


Package last updated on 28 Mar 2026
