@vercel/agent-eval
Test AI coding agents on your framework. Measure what actually works.
Why?
You're building a frontend framework and want AI agents to work well with it. But how do you know if:
- Your documentation helps agents write correct code?
- Adding an MCP server improves agent success rates?
- Sonnet performs as well as Opus for your use cases?
- Your latest API changes broke agent compatibility?
agent-eval gives you answers. Run controlled experiments, measure pass rates, and compare techniques.
Quick Start
npx @vercel/agent-eval init my-framework-evals
cd my-framework-evals
npm install
cp .env.example .env                # fill in your keys (see Environment Variables below)
npx @vercel/agent-eval cc --dry     # preview the scaffolded `cc` experiment (no API calls, no cost)
npx @vercel/agent-eval cc           # run it for real
A/B Testing AI Techniques
The real power is comparing different approaches. Create multiple experiment configs:
Control: Baseline Agent
import type { ExperimentConfig } from 'agent-eval';

const config: ExperimentConfig = {
  agent: 'vercel-ai-gateway/claude-code',
  model: 'opus',
  runs: 10,
  earlyExit: false,
};

export default config;
Treatment: Agent with MCP Server
import type { ExperimentConfig } from 'agent-eval';

const config: ExperimentConfig = {
  agent: 'vercel-ai-gateway/claude-code',
  model: 'opus',
  runs: 10,
  earlyExit: false,
  setup: async (sandbox) => {
    // Make the MCP server available inside the sandbox...
    await sandbox.runCommand('npm', ['install', '-g', '@myframework/mcp-server']);
    // ...and register it with the agent.
    await sandbox.writeFiles({
      '.claude/settings.json': JSON.stringify({
        mcpServers: {
          myframework: { command: 'myframework-mcp' },
        },
      }),
    });
  },
};

export default config;
Run Both & Compare
npx @vercel/agent-eval control --dry
npx @vercel/agent-eval with-mcp --dry
npx @vercel/agent-eval control
npx @vercel/agent-eval with-mcp
Compare results:
Control (baseline): 7/10 passed (70%)
With MCP: 9/10 passed (90%)
Creating Evals for Your Framework
Each eval tests one specific task an agent should be able to do with your framework.
Example: Testing Component Creation
evals/
  create-button-component/
    PROMPT.md       # Task for the agent
    EVAL.ts         # Tests to verify success (or EVAL.tsx for JSX)
    package.json    # Your framework as a dependency
    src/            # Starter code
EVAL.ts vs EVAL.tsx
Use EVAL.tsx when your tests require JSX syntax (React Testing Library, component testing):
import { test, expect } from 'vitest';
import { render, screen } from '@testing-library/react';
import { Button } from './src/components/Button';

test('Button renders with label', () => {
  render(<Button label="Click me" onClick={() => {}} />);
  expect(screen.getByText('Click me')).toBeDefined();
});
Use EVAL.ts for tests that don't need JSX:
import { test, expect } from 'vitest';
import { existsSync } from 'fs';

test('Button component exists', () => {
  expect(existsSync('src/components/Button.tsx')).toBe(true);
});
Note: You only need one eval file per fixture. Choose .tsx if any test needs JSX, otherwise use .ts.
PROMPT.md - What you want the agent to do:
Create a Button component using MyFramework.
Requirements:
- Export a Button component from src/components/Button.tsx
- Accept `label` and `onClick` props
- Use the framework's styling system for hover states
EVAL.ts (or EVAL.tsx) - How you verify it worked:
import { test, expect } from 'vitest';
import { readFileSync, existsSync } from 'fs';
import { execSync } from 'child_process';

test('Button component exists', () => {
  expect(existsSync('src/components/Button.tsx')).toBe(true);
});

test('has required props', () => {
  const content = readFileSync('src/components/Button.tsx', 'utf-8');
  expect(content).toContain('label');
  expect(content).toContain('onClick');
});

test('project builds', () => {
  // execSync throws if the build exits non-zero, which fails the test.
  execSync('npm run build', { stdio: 'pipe' });
});
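The prompt also asks for the framework's hover-state styling, which the checks above don't cover. A loose, string-based test like the following could be added to the same EVAL file; the regex is a placeholder heuristic, not part of agent-eval, so match on whatever API your framework actually exposes:

test('uses the styling system for hover states', () => {
  const content = readFileSync('src/components/Button.tsx', 'utf-8');
  // Crude check: look for any hover-related usage in the component source.
  expect(content).toMatch(/hover/i);
});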
package.json - Include your framework:
{
  "name": "create-button-component",
  "type": "module",
  "scripts": { "build": "tsc" },
  "dependencies": {
    "myframework": "^2.0.0"
  }
}
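The EVAL files above import from vitest (and, for EVAL.tsx, @testing-library/react plus a DOM environment such as jsdom). The docs don't say whether the harness supplies the test runner itself; if it doesn't, the fixture presumably needs those packages as devDependencies too, along these lines (versions illustrative):

{
  "devDependencies": {
    "vitest": "^2.0.0",
    "@testing-library/react": "^16.0.0",
    "jsdom": "^25.0.0"
  }
}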
Experiment Ideas
| Experiment | Control | Treatment |
| --- | --- | --- |
| MCP impact | No MCP | With MCP server |
| Model comparison | Haiku | Sonnet / Opus |
| Documentation | Minimal docs | Rich examples |
| System prompt | Default | Framework-specific |
| Tool availability | Read/write only | + custom tools |
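The documentation and system-prompt rows can be set up the same way as the MCP treatment above: use setup to seed the sandbox with extra context before the agent starts. A sketch, where the file paths and contents are placeholders rather than part of agent-eval:

import type { ExperimentConfig } from 'agent-eval';

const config: ExperimentConfig = {
  agent: 'vercel-ai-gateway/claude-code',
  model: 'opus',
  runs: 10,
  earlyExit: false,
  setup: async (sandbox) => {
    // Seed the sandbox with extra context before the agent starts.
    await sandbox.writeFiles({
      // Framework documentation the agent can discover and read (path and content are placeholders).
      'docs/myframework.md': '<paste or generate your framework docs here>',
      // Claude Code reads CLAUDE.md as project instructions; other agents use different conventions.
      'CLAUDE.md': 'Follow MyFramework idioms. Use its styling system for hover states.',
    });
  },
};

export default config;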
Configuration Reference
Agent Selection
Choose your agent and authentication method:
agent: 'vercel-ai-gateway/claude-code'   // Claude Code via the Vercel AI Gateway (AI_GATEWAY_API_KEY)
agent: 'claude-code'                     // Claude Code against the Anthropic API (ANTHROPIC_API_KEY)
agent: 'codex'                           // Codex against the OpenAI API (OPENAI_API_KEY)
See the Environment Variables section below for setup instructions.
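For example, switching an experiment from the gateway to a direct Anthropic key should only require changing the agent string; the rest of the config keeps the shape shown below (this is inferred from the Environment Variables table, not a documented guarantee):

import type { ExperimentConfig } from 'agent-eval';

const config: ExperimentConfig = {
  agent: 'claude-code', // authenticated with ANTHROPIC_API_KEY instead of AI_GATEWAY_API_KEY
  model: 'opus',
  runs: 10,
  earlyExit: false,
};

export default config;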
Full Configuration
import type { ExperimentConfig } from 'agent-eval';

const config: ExperimentConfig = {
  agent: 'vercel-ai-gateway/claude-code', // which agent to run and how it authenticates
  model: 'opus',
  runs: 10,                   // runs per eval; more runs give a more reliable pass rate
  earlyExit: false,           // keep false so every run executes (see Tips)
  scripts: ['build', 'lint'], // package.json scripts, presumably run against the agent's output as extra checks
  timeout: 300,               // per-run timeout (presumably in seconds)
  evals: '*',                 // which evals to run ('*' = all)
  setup: async (sandbox) => {
    // Optional sandbox preparation before the agent starts.
    await sandbox.writeFiles({ '.env': 'API_KEY=test' });
    await sandbox.runCommand('npm', ['run', 'setup']);
  },
};

export default config;
CLI Commands
init <name>
Create a new eval project:
npx @vercel/agent-eval init my-evals
<experiment>
Run an experiment:
npx @vercel/agent-eval cc
Dry run - preview without executing (no API calls, no cost):
npx @vercel/agent-eval cc --dry
Results
Results are saved to results/<experiment>/<timestamp>/:
results/
  with-mcp/
    2026-01-27T10-30-00Z/
      experiment.json        # Config and summary
      create-button/
        summary.json         # { totalRuns: 10, passedRuns: 9, passRate: "90%" }
        run-1/
          result.json        # Individual run result
          transcript.jsonl   # Agent conversation
          outputs/           # Test/script output
Analyzing Results
cat results/control/*/experiment.json | jq '.evals[] | {name, passRate}'
cat results/with-mcp/*/experiment.json | jq '.evals[] | {name, passRate}'
Rough guide to reading pass rates:
| Pass rate | Interpretation |
| --- | --- |
| 90-100% | Agent handles this reliably |
| 70-89% | Usually works, room for improvement |
| 50-69% | Unreliable, needs investigation |
| < 50% | Task too hard or prompt needs work |
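When jq one-liners aren't enough, the same data can be scripted. A minimal sketch in Node/TypeScript, assuming experiment.json exposes the evals[].{name, passRate} shape used by the jq queries above and that results/ contains only experiment and timestamp directories:

import { readdirSync, readFileSync } from 'fs';
import { join } from 'path';

// Print per-eval pass rates for every experiment and run under results/.
for (const experiment of readdirSync('results')) {
  for (const timestamp of readdirSync(join('results', experiment))) {
    const file = join('results', experiment, timestamp, 'experiment.json');
    const { evals } = JSON.parse(readFileSync(file, 'utf-8'));
    for (const { name, passRate } of evals) {
      console.log(`${experiment}\t${timestamp}\t${name}\t${passRate}`);
    }
  }
}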
Environment Variables
Every run requires two things: an API key for the agent and a token for the Vercel sandbox. The exact variables depend on which authentication mode you use.
| Variable | When it's needed | Description |
| --- | --- | --- |
| AI_GATEWAY_API_KEY | agent: 'vercel-ai-gateway/...' | Vercel AI Gateway key — works for all agents |
| ANTHROPIC_API_KEY | agent: 'claude-code' | Direct Anthropic API key (sk-ant-...) |
| OPENAI_API_KEY | agent: 'codex' | Direct OpenAI API key (sk-proj-...) |
| VERCEL_TOKEN | Always (pick one) | Vercel personal access token — for local dev |
| VERCEL_OIDC_TOKEN | Always (pick one) | Vercel OIDC token — for CI/CD pipelines |
You always need one agent key + one sandbox token.
Vercel AI Gateway (Recommended)
Use agents with the vercel-ai-gateway/ prefix. One key works for all models.
AI_GATEWAY_API_KEY=your-ai-gateway-api-key
VERCEL_TOKEN=your-vercel-token
Direct API Keys (Alternative)
Remove the vercel-ai-gateway/ prefix and use provider keys directly:
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-proj-...
VERCEL_TOKEN=your-vercel-token
.env Setup
The init command generates a .env.example file. Copy it and fill in your keys:
cp .env.example .env
The framework loads .env automatically via dotenv.
Vercel Employees
To get the environment variables, link your local checkout to the vercel-labs/agent-eval project on Vercel and pull them:
vc link vercel-labs/agent-eval
vc env pull
This writes a .env.local file with all the required environment variables (AI_GATEWAY_API_KEY, ANTHROPIC_API_KEY, OPENAI_API_KEY, VERCEL_OIDC_TOKEN) — no manual key setup needed. The framework automatically loads from both .env and .env.local.
Tips
Start with --dry: Always preview before running to verify your config and avoid unexpected costs.
Use multiple runs: Single runs don't tell you reliability. Use runs: 10 and earlyExit: false for meaningful data.
Isolate variables: Change one thing at a time between experiments. Don't compare "Opus with MCP" to "Haiku without MCP".
Test incrementally: Start with simple tasks, add complexity as you learn what works.
License
MIT