@vercel/agent-eval

Test AI coding agents on your framework. Measure what actually works.

Why?

You're building a frontend framework and want AI agents to work well with it. But how do you know if:

  • Your documentation helps agents write correct code?
  • Adding an MCP server improves agent success rates?
  • Sonnet performs as well as Opus for your use cases?
  • Your latest API changes broke agent compatibility?

This framework gives you answers. Run controlled experiments, measure pass rates, compare techniques.

Quick Start

# Create a new eval project
npx @vercel/agent-eval init my-framework-evals
cd my-framework-evals

# Install dependencies
npm install

# Add your API keys
cp .env.example .env
# Edit .env with your AI_GATEWAY_API_KEY and VERCEL_TOKEN

# Preview what will run (no API calls, no cost)
npx @vercel/agent-eval cc --dry

# Run the evals
npx @vercel/agent-eval cc

A/B Testing AI Techniques

The real power is comparing different approaches. Create multiple experiment configs:

Control: Baseline Agent

// experiments/control.ts
import type { ExperimentConfig } from 'agent-eval';

const config: ExperimentConfig = {
  agent: 'vercel-ai-gateway/claude-code',
  model: 'opus',
  runs: 10,        // Multiple runs for statistical significance
  earlyExit: false, // Run all attempts to measure reliability
};

export default config;

Treatment: Agent with MCP Server

// experiments/with-mcp.ts
import type { ExperimentConfig } from 'agent-eval';

const config: ExperimentConfig = {
  agent: 'vercel-ai-gateway/claude-code',
  model: 'opus',
  runs: 10,
  earlyExit: false,

  setup: async (sandbox) => {
    // Install your framework's MCP server
    await sandbox.runCommand('npm', ['install', '-g', '@myframework/mcp-server']);

    // Configure Claude to use it
    await sandbox.writeFiles({
      '.claude/settings.json': JSON.stringify({
        mcpServers: {
          myframework: { command: 'myframework-mcp' }
        }
      })
    });
  },
};

export default config;

Run Both & Compare

# Preview first
npx @vercel/agent-eval control --dry
npx @vercel/agent-eval with-mcp --dry

# Run experiments
npx @vercel/agent-eval control
npx @vercel/agent-eval with-mcp

Compare results:

Control (baseline):    7/10 passed (70%)
With MCP:              9/10 passed (90%)

Creating Evals for Your Framework

Each eval tests one specific task an agent should be able to do with your framework.

Example: Testing Component Creation

evals/
  create-button-component/
    PROMPT.md           # Task for the agent
    EVAL.ts             # Tests to verify success (or EVAL.tsx for JSX)
    package.json        # Your framework as a dependency
    src/                # Starter code

EVAL.ts vs EVAL.tsx

Use EVAL.tsx when your tests require JSX syntax (React Testing Library, component testing):

// EVAL.tsx - use when testing React components
import { test, expect } from 'vitest';
import { render, screen } from '@testing-library/react';
import { Button } from './src/components/Button';

test('Button renders with label', () => {
  render(<Button label="Click me" onClick={() => {}} />);
  expect(screen.getByText('Click me')).toBeDefined();
});

Use EVAL.ts for tests that don't need JSX:

// EVAL.ts - use for file checks, build tests, etc.
import { test, expect } from 'vitest';
import { existsSync } from 'fs';

test('Button component exists', () => {
  expect(existsSync('src/components/Button.tsx')).toBe(true);
});

Note: You only need one eval file per fixture. Choose .tsx if any test needs JSX, otherwise use .ts.

PROMPT.md - What you want the agent to do:

Create a Button component using MyFramework.

Requirements:
- Export a Button component from src/components/Button.tsx
- Accept `label` and `onClick` props
- Use the framework's styling system for hover states

EVAL.ts (or EVAL.tsx) - How you verify it worked:

import { test, expect } from 'vitest';
import { readFileSync, existsSync } from 'fs';
import { execSync } from 'child_process';

test('Button component exists', () => {
  expect(existsSync('src/components/Button.tsx')).toBe(true);
});

test('has required props', () => {
  const content = readFileSync('src/components/Button.tsx', 'utf-8');
  expect(content).toContain('label');
  expect(content).toContain('onClick');
});

test('project builds', () => {
  execSync('npm run build', { stdio: 'pipe' });
});

package.json - Include your framework:

{
  "name": "create-button-component",
  "type": "module",
  "scripts": { "build": "tsc" },
  "dependencies": {
    "myframework": "^2.0.0"
  }
}

Experiment Ideas

Experiment          Control            Treatment
MCP impact          No MCP             With MCP server
Model comparison    Haiku              Sonnet / Opus
Documentation       Minimal docs       Rich examples
System prompt       Default            Framework-specific
Tool availability   Read/write only    + custom tools
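
For example, the "Documentation" treatment can be expressed entirely through the setup hook, which writes richer docs into the sandbox before the agent starts. A minimal sketch reusing only the config options shown in this README; the file path and its contents are placeholders for your real documentation:

// experiments/rich-docs.ts (hypothetical treatment for the "Documentation" row)
import type { ExperimentConfig } from 'agent-eval';

const config: ExperimentConfig = {
  agent: 'vercel-ai-gateway/claude-code',
  model: 'opus',
  runs: 10,
  earlyExit: false,

  setup: async (sandbox) => {
    // Write richer framework docs into the workspace before the agent starts.
    // The path and contents below are placeholders; point this at your real docs.
    await sandbox.writeFiles({
      'docs/GUIDE.md': [
        '# MyFramework guide',
        '- Components live in src/components',
        '- Use the styling system for hover states',
      ].join('\n'),
    });
  },
};

export default config;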

Configuration Reference

Agent Selection

Choose your agent and authentication method:

// Vercel AI Gateway (recommended - unified billing & observability)
agent: 'vercel-ai-gateway/claude-code'  // or 'vercel-ai-gateway/codex'

// Direct API (uses provider keys directly)
agent: 'claude-code'  // requires ANTHROPIC_API_KEY
agent: 'codex'        // requires OPENAI_API_KEY

See the Environment Variables section below for setup instructions.

Full Configuration

import type { ExperimentConfig } from 'agent-eval';

const config: ExperimentConfig = {
  // Required: which agent and authentication to use
  agent: 'vercel-ai-gateway/claude-code',

  // Model to use (defaults: 'opus' for claude-code, 'openai/gpt-5.2-codex' for codex)
  model: 'opus',

  // How many times to run each eval
  runs: 10,

  // Stop after first success? (false for reliability measurement)
  earlyExit: false,

  // npm scripts that must pass after agent finishes
  scripts: ['build', 'lint'],

  // Timeout per run in seconds
  timeout: 300,

  // Filter which evals to run (pick one)
  evals: '*',                                // all (default)
  // evals: ['specific-eval'],              // by name
  // evals: (name) => name.startsWith('api-'), // by function

  // Setup function for environment configuration
  setup: async (sandbox) => {
    await sandbox.writeFiles({ '.env': 'API_KEY=test' });
    await sandbox.runCommand('npm', ['run', 'setup']);
  },
};

export default config;

CLI Commands

init <name>

Create a new eval project:

npx @vercel/agent-eval init my-evals

<experiment>

Run an experiment:

npx @vercel/agent-eval cc

Dry run - preview without executing (no API calls, no cost):

npx @vercel/agent-eval cc --dry

# Output:
# Found 5 valid fixture(s), will run 5:
#   - create-button
#   - add-routing
#   - setup-state
#   - ...
# Running 5 eval(s) x 10 run(s) = 50 total runs
# Agent: claude-code, Model: opus, Timeout: 300s
# [DRY RUN] Would execute evals here

Results

Results are saved to results/<experiment>/<timestamp>/:

results/
  with-mcp/
    2026-01-27T10-30-00Z/
      experiment.json       # Config and summary
      create-button/
        summary.json        # { totalRuns: 10, passedRuns: 9, passRate: "90%" }
        run-1/
          result.json       # Individual run result
          transcript.jsonl  # Agent conversation
          outputs/          # Test/script output

Analyzing Results

# Quick comparison
cat results/control/*/experiment.json | jq '.evals[] | {name, passRate}'
cat results/with-mcp/*/experiment.json | jq '.evals[] | {name, passRate}'

Pass Rate   Interpretation
90-100%     Agent handles this reliably
70-89%      Usually works, room for improvement
50-69%      Unreliable, needs investigation
< 50%       Task too hard or prompt needs work
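
If you prefer scripting the comparison over jq, the same fields can be read with a short Node script. A rough sketch, assuming experiment.json exposes an evals array with name and passRate fields, as the jq query above implies; this helper is not part of the framework:

// compare.ts (hypothetical helper; run with tsx or ts-node)
import { readdirSync, readFileSync } from 'fs';
import { join } from 'path';

function latestExperimentJson(experiment: string): string {
  // Pick the most recent timestamped run directory for this experiment.
  const dir = join('results', experiment);
  const timestamps = readdirSync(dir).sort();
  return join(dir, timestamps[timestamps.length - 1], 'experiment.json');
}

for (const experiment of ['control', 'with-mcp']) {
  const summary = JSON.parse(readFileSync(latestExperimentJson(experiment), 'utf-8'));
  console.log(experiment);
  // Assumes the same shape the jq query reads: .evals[] | {name, passRate}
  for (const evalResult of summary.evals) {
    console.log(`  ${evalResult.name}: ${evalResult.passRate}`);
  }
}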

Environment Variables

Every run requires two things: an API key for the agent and a token for the Vercel sandbox. The exact variables depend on which authentication mode you use.

Variable             Required when                      Description
AI_GATEWAY_API_KEY   agent: 'vercel-ai-gateway/...'     Vercel AI Gateway key (works for all agents)
ANTHROPIC_API_KEY    agent: 'claude-code'               Direct Anthropic API key (sk-ant-...)
OPENAI_API_KEY       agent: 'codex'                     Direct OpenAI API key (sk-proj-...)
VERCEL_TOKEN         Always (pick one)                  Vercel personal access token (local dev)
VERCEL_OIDC_TOKEN    Always (pick one)                  Vercel OIDC token (CI/CD pipelines)

You always need one agent key plus one sandbox token.

Vercel AI Gateway (Recommended)

Use vercel-ai-gateway/ prefixed agents: one key covers all models.

# Agent access — get yours at https://vercel.com/dashboard -> AI Gateway
AI_GATEWAY_API_KEY=your-ai-gateway-api-key

# Sandbox access — create at https://vercel.com/account/tokens
VERCEL_TOKEN=your-vercel-token
# OR for CI/CD:
# VERCEL_OIDC_TOKEN=your-oidc-token

Direct API Keys (Alternative)

Remove the vercel-ai-gateway/ prefix and use provider keys directly:

# For agent: 'claude-code'
ANTHROPIC_API_KEY=sk-ant-...

# For agent: 'codex'
OPENAI_API_KEY=sk-proj-...

# Sandbox access is still required
VERCEL_TOKEN=your-vercel-token

.env Setup

The init command generates a .env.example file. Copy it and fill in your keys:

cp .env.example .env

The framework loads .env automatically via dotenv.

Vercel Employees

To get the environment variables, link to vercel-labs/agent-eval on Vercel:

# Link to the vercel-labs/agent-eval project
vc link vercel-labs/agent-eval

# Pull environment variables
vc env pull

This writes a .env.local file with all the required environment variables (AI_GATEWAY_API_KEY, ANTHROPIC_API_KEY, OPENAI_API_KEY, VERCEL_OIDC_TOKEN) — no manual key setup needed. The framework automatically loads from both .env and .env.local.

Tips

Start with --dry: Always preview before running to verify your config and avoid unexpected costs.

Use multiple runs: Single runs don't tell you reliability. Use runs: 10 and earlyExit: false for meaningful data.

Isolate variables: Change one thing at a time between experiments. Don't compare "Opus with MCP" to "Haiku without MCP".
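
For instance, a model-comparison pair can share every setting and differ in a single field. A sketch, assuming 'haiku' is accepted as a model name the same way 'opus' is:

// experiments/model-haiku.ts (hypothetical; pair it with an otherwise identical model-opus.ts)
import type { ExperimentConfig } from 'agent-eval';

const config: ExperimentConfig = {
  agent: 'vercel-ai-gateway/claude-code',
  model: 'haiku',   // the only field that differs between the two experiments
  runs: 10,
  earlyExit: false,
};

export default config;

Run both, then compare pass rates as described in Analyzing Results.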

Test incrementally: Start with simple tasks, add complexity as you learn what works.

License

MIT
