
Drop-in Node.js middleware for EvalView — the open-source regression testing framework for AI agents. Golden baseline diffing and CI/CD integration for LangGraph, CrewAI, OpenAI, and Anthropic agents.
Regression testing for AI agents.
Snapshot behavior, detect regressions, block broken agents before production.
EvalView sends test queries to your agent, records everything (tool calls, parameters, sequence, output, cost, latency), and diffs it against a golden baseline. When something changes, you know immediately.
✓ login-flow PASSED
⚠ refund-request TOOLS_CHANGED
- lookup_order → check_policy → process_refund
+ lookup_order → check_policy → process_refund → escalate_to_human
✗ billing-dispute REGRESSION -30 pts
Score: 85 → 55 Output similarity: 35%
Normal tests catch crashes. Tracing shows what happened after the fact. EvalView catches the harder class: the agent returns 200 but silently takes the wrong tool path, skips a clarification, or degrades output quality after a model update.
pip install evalview
Already have a local agent running?
evalview init # Detect agent, create starter suite
evalview snapshot # Save current behavior as baseline
evalview check # Catch regressions after every change
No agent yet?
evalview demo # See regression detection live (~30 seconds, no API key)
Want a real working agent?
Starter repo: evalview-support-automation-template
An LLM-backed support automation agent with built-in EvalView regression tests.
git clone https://github.com/hidai25/evalview-support-automation-template
cd evalview-support-automation-template
make run
Other entry paths:
# Generate tests from a live agent
evalview generate --agent http://localhost:8000
# Generate from existing logs
evalview generate --from-log traffic.jsonl
# Capture real user flows via proxy
evalview capture --agent http://localhost:8000/invoke
┌────────────┐     ┌──────────┐     ┌───────────────┐
│ Test Cases │ ──→ │ EvalView │ ──→ │  Your Agent   │
│   (YAML)   │     │          │ ←── │ local / cloud │
└────────────┘     └──────────┘     └───────────────┘
evalview init — detects your running agent, creates a starter test suite
evalview snapshot — runs tests, saves traces as golden baselines
evalview check — replays tests, diffs against baselines, flags regressions
evalview monitor — runs checks continuously with optional Slack alerts
Your data stays local by default. Nothing leaves your machine unless you opt in to cloud sync via evalview login.
EvalView has two complementary ways to test your agent:
Snapshot known-good behavior, then detect when something drifts.
evalview snapshot # Capture current behavior as golden baseline
evalview check # Compare against baseline after every change
evalview monitor # Continuous checks with Slack alerts
Auto-generate tests and score your agent's quality right now.
evalview generate # LLM generates realistic tests from your agent
evalview run # Execute tests, score with LLM judge, get HTML report
Both modes start the same way: evalview demo → evalview init → then pick your path.
| Status | Meaning | Action |
|---|---|---|
| ✅ PASSED | Behavior matches baseline | Ship with confidence |
| ⚠️ TOOLS_CHANGED | Different tools called | Review the diff |
| ⚠️ OUTPUT_CHANGED | Same tools, output shifted | Review the diff |
| ❌ REGRESSION | Score dropped significantly | Fix before shipping |
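A hypothetical sketch of how a CI gate might act on these statuses. The function names and gating policy here (fail the build only on REGRESSION, count changes as warnings) are illustrative assumptions, not EvalView's internals:

```python
# Illustrative sketch: map check statuses to a CI pass/fail decision.
# Status strings mirror the table above; the policy is an assumption.

WARN_STATUSES = {"TOOLS_CHANGED", "OUTPUT_CHANGED"}

def should_block(statuses, fail_on="REGRESSION"):
    """Return True if any test status should fail the build."""
    return any(s == fail_on for s in statuses)

def summarize(statuses):
    """Count passes, warnings, and regressions for a one-line report."""
    return {
        "passed": sum(s == "PASSED" for s in statuses),
        "warned": sum(s in WARN_STATUSES for s in statuses),
        "regressed": sum(s == "REGRESSION" for s in statuses),
    }
```

With this policy, a suite containing only tool or output changes still passes CI but surfaces a warning count in the summary.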
| Layer | What it checks | Needs API key? | Cost |
|---|---|---|---|
| Tool calls + sequence | Exact tool names, order, parameters | No | Free |
| Code-based checks | Regex, JSON schema, contains/not_contains | No | Free |
| Semantic similarity | Output meaning via embeddings | OPENAI_API_KEY | ~$0.00004/test |
| LLM-as-judge | Output quality scored by GPT | OPENAI_API_KEY | ~$0.01/test |
The first two layers alone catch most regressions — fully offline, zero cost.
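To picture what the semantic-similarity layer measures, here is a minimal cosine-similarity sketch over toy embedding vectors. This is pure illustration; EvalView's actual embedding model, dimensions, and thresholds may differ:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings": a baseline output vs. two candidates.
baseline  = [0.9, 0.1, 0.0]
similar   = [0.8, 0.2, 0.1]   # paraphrase of the baseline
different = [0.0, 0.2, 0.9]   # semantically unrelated output
```

A paraphrased output scores close to 1.0 against the baseline, while an unrelated one scores near 0, which is why embeddings catch meaning drift that exact-match string checks miss.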
name: refund-needs-order-number
turns:
  - query: "I want a refund"
    expected:
      output:
        contains: ["order number"]
  - query: "Order 4812"
    expected:
      tools: ["lookup_order", "check_policy"]
thresholds:
  min_score: 70
If the agent stops asking for the order number or takes a different tool path on the follow-up, EvalView flags it.
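The tool-path check above can be pictured as a simple diff of expected versus recorded tool calls. A hypothetical sketch, not EvalView's internals:

```python
def diff_tool_sequence(expected, actual):
    """Classify a recorded tool-call sequence against a golden baseline.

    Returns ("PASSED", None) on an exact match, otherwise
    ("TOOLS_CHANGED", detail) describing the first divergence.
    """
    if expected == actual:
        return "PASSED", None
    # Find the first index where the two sequences diverge.
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return "TOOLS_CHANGED", f"step {i}: expected {e}, got {a}"
    # One sequence is a prefix of the other: tools were added or dropped.
    return "TOOLS_CHANGED", f"length {len(expected)} -> {len(actual)}"
```

An agent that appends an extra escalate_to_human step, as in the sample output earlier, would surface here as a length change rather than a step mismatch.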
| Feature | Description | Docs |
|---|---|---|
| Golden baseline diffing | Tool call + parameter + output regression detection | Docs |
| Multi-turn testing | Sequential turns with context injection | Docs |
| Multi-reference goldens | Up to 5 variants for non-deterministic agents | Docs |
| forbidden_tools | Safety contracts — hard-fail on any violation | Docs |
| Semantic similarity | Embedding-based output comparison | Docs |
| Production monitoring | evalview monitor with Slack alerts and JSONL history | Docs |
| A/B comparison | evalview compare --v1 <url> --v2 <url> | Docs |
| Test generation | evalview generate — auto-create test suites | Docs |
| Silent model detection | Alerts when LLM provider updates the model version | Docs |
| Gradual drift detection | Trend analysis across check history | Docs |
| Statistical mode (pass@k) | Run N times, require a pass rate | Docs |
| HTML trace replay | Step-by-step forensic debugging | Docs |
| Pytest plugin | evalview_check fixture for standard pytest | Docs |
| Git hooks | Pre-push regression blocking, zero CI config | Docs |
| LLM judge caching | ~80% cost reduction in statistical mode | Docs |
| Skills testing | E2E testing for Claude Code, Codex, OpenClaw | Docs |
Works with LangGraph, CrewAI, OpenAI, Claude, Mistral, HuggingFace, Ollama, MCP, and any HTTP API.
| Agent | E2E Testing | Trace Capture |
|---|---|---|
| LangGraph | ✅ | ✅ |
| CrewAI | ✅ | ✅ |
| OpenAI Assistants | ✅ | ✅ |
| Claude Code | ✅ | ✅ |
| Ollama | ✅ | ✅ |
| Any HTTP API | ✅ | ✅ |
Framework details → | Flagship starter → | Starter examples →
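Since any HTTP API can be a target, the smallest possible agent is a plain JSON endpoint. A minimal stdlib sketch; the /invoke path and the request/response shape are assumptions, so match whatever contract your agent actually exposes:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AgentHandler(BaseHTTPRequestHandler):
    """Toy agent endpoint: echoes the query and reports a fake tool trace."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        reply = {
            "output": f"You asked: {payload.get('query', '')}",
            "tools": ["lookup_order"],  # pretend tool-call trace
        }
        body = json.dumps(reply).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep request logging quiet

# To serve for real:
# HTTPServer(("localhost", 8000), AgentHandler).serve_forever()
```

Point the CLI at it the same way as any other agent, e.g. with an --agent URL of http://localhost:8000.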
evalview install-hooks # Pre-push regression blocking, zero config
Or in GitHub Actions:
# .github/workflows/evalview.yml
name: Agent Health Check
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hidai25/eval-view@v0.6.1
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          command: check
          fail-on: 'REGRESSION'
evalview monitor # Check every 5 min
evalview monitor --interval 60 # Every minute
evalview monitor --slack-webhook https://hooks.slack.com/services/...
evalview monitor --history monitor.jsonl # JSONL for dashboards
New regressions trigger Slack alerts. Recoveries send all-clear. No spam on persistent failures.
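That no-spam behavior amounts to alerting only on state transitions. A hypothetical sketch of the idea, not EvalView's implementation:

```python
def alert_decision(prev_failing, now_failing):
    """Decide whether to notify, given the previous and current failure state.

    Alerts fire only on transitions: a fresh failure or a recovery.
    Persistent failures and continued passes stay silent.
    """
    if now_failing and not prev_failing:
        return "alert: new regression"
    if prev_failing and not now_failing:
        return "all-clear: recovered"
    return None  # no state change, no message
```

A monitor loop would call this once per check cycle, carrying the previous cycle's state forward.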
def test_weather_regression(evalview_check):
    diff = evalview_check("weather-lookup")
    assert diff.overall_severity.value != "regression", diff.summary()
pip install evalview # Plugin registers automatically
pytest # Runs alongside your existing tests
claude mcp add --transport stdio evalview -- evalview mcp serve
8 tools: create_test, run_snapshot, run_check, list_tests, validate_skill, generate_skill_tests, run_skill_test, generate_visual_report
# 1. Install
pip install evalview
# 2. Connect to Claude Code
claude mcp add --transport stdio evalview -- evalview mcp serve
# 3. Make Claude Code proactive
cp CLAUDE.md.example CLAUDE.md
Then just ask Claude: "did my refactor break anything?" and it runs run_check inline.
| | LangSmith | Braintrust | Promptfoo | EvalView |
|---|---|---|---|---|
| Primary focus | Observability | Scoring | Prompt comparison | Regression detection |
| Tool call + parameter diffing | — | — | — | Yes |
| Golden baseline regression | — | Manual | — | Automatic |
| Works without API keys | No | No | Partial | Yes |
| Production monitoring | Tracing | — | — | Check loop + Slack |
evalview feedback or open an issue.
License: Apache 2.0