
@poofnew/vibe-check
A comprehensive AI agent evaluation framework for testing and benchmarking LLM agents.
Built with TypeScript and optimized for Bun runtime.
Features • Installation • Quick Start • Docs • Examples • X • Discord
Building reliable AI agents is an incredibly challenging endeavor. Unlike traditional software where inputs and outputs are deterministic, AI agents operate in a complex, non-deterministic environment where the smallest changes can have unexpected and far-reaching consequences. A minor prompt modification, a slight adjustment to system instructions, or even a change in model parameters can ripple through the entire system, causing subtle failures that are difficult to detect and diagnose.
Testing AI agents presents unique challenges that traditional testing frameworks cannot adequately address. How do you validate that an agent correctly interprets user intent? How do you ensure tool invocations are appropriate and executed in the right sequence? How do you catch regressions when a prompt change breaks edge cases you didn't anticipate? These questions become exponentially more complex when dealing with multi-turn conversations, code generation, and complex routing decisions.
We built vibe-check internally to rigorously test and validate poof.new. As we iterated on prompts, refined agent behaviors, and added new capabilities, we needed a systematic way to ensure our changes didn't break existing functionality—and to catch issues before they reached production. Traditional testing approaches fell short, so we created a framework specifically designed for AI agent evaluation.
After using vibe-check extensively in our own development process, we're now open-sourcing it to help the broader AI agent development community. We believe that robust testing and evaluation frameworks are essential for building production-ready AI systems, and we hope vibe-check will help others navigate the complexities of agent development with more confidence.
Building reliable AI agents is hard. Traditional testing approaches fall short when evaluating LLM behavior, tool usage, and multi-turn interactions. vibe-check provides a comprehensive framework specifically designed for AI agent evaluation:
| Feature | vibe-check | Manual Testing | Unit Tests |
|---|---|---|---|
| Agent-specific evaluation | ✅ | ❌ | ❌ |
| Tool call validation | ✅ | ⚠️ Difficult | ❌ |
| Multi-turn conversations | ✅ | ⚠️ Manual | ❌ |
| Learning from failures | ✅ | ❌ | ❌ |
| Isolated workspaces | ✅ | ❌ | ⚠️ Manual |
| Parallel execution | ✅ | ❌ | ✅ |
| Framework agnostic | ✅ | ✅ | ⚠️ Limited |
| TypeScript-native | ✅ | N/A | ✅ |
For claude-code agents, tool calls are automatically extracted from JSONL logs.
# Using bun (recommended)
bun add @poofnew/vibe-check
# Using npm
npm install @poofnew/vibe-check
# Using pnpm
pnpm add @poofnew/vibe-check
bunx vibe-check init
This creates:
vibe-check.config.ts - Configuration file with agent function stub
__evals__/example.eval.json - Example evaluation case
Edit vibe-check.config.ts to implement your agent function:
import { defineConfig } from "@poofnew/vibe-check";
export default defineConfig({
testDir: "./__evals__",
agent: async (prompt, context) => {
// Your agent implementation here
const result = await yourAgent.run(prompt, {
cwd: context.workingDirectory,
});
return {
output: result.text,
success: result.success,
toolCalls: result.tools,
};
},
});
Create JSON files in __evals__/ directory:
{
"id": "create-hello-file",
"name": "Create Hello File",
"description": "Test that agent can create a simple file",
"category": "code-gen",
"prompt": "Create a file called hello.ts that exports a greet function",
"targetFiles": ["hello.ts"],
"expectedPatterns": [
{
"file": "hello.ts",
"patterns": ["export", "function greet"]
}
],
"judges": ["file-existence", "pattern-match"]
}
bunx vibe-check run
Create vibe-check.config.ts in your project root:
import { defineConfig } from "@poofnew/vibe-check";
export default defineConfig({
// Required: Your agent function
agent: async (prompt, context) => {
return { output: "", success: true };
},
// Optional settings with defaults shown
agentType: "generic", // 'claude-code' | 'generic' - use 'claude-code' for automatic JSONL tool extraction
testDir: "./__evals__", // Directory containing eval cases
rubricsDir: "./__evals__/rubrics", // Directory for LLM judge rubrics
testMatch: ["**/*.eval.json"], // Glob patterns for eval files
parallel: true, // Run evals in parallel
maxConcurrency: 3, // Max concurrent evals
timeout: 300000, // Timeout per eval (ms)
maxRetries: 2, // Retry failed evals
retryDelayMs: 1000, // Initial retry delay
retryBackoffMultiplier: 2, // Exponential backoff multiplier
trials: 1, // Number of trials per eval
trialPassThreshold: 0.5, // Required pass rate for trials
verbose: false, // Verbose output
preserveWorkspaces: false, // Keep workspace dirs after eval (for debugging)
outputDir: "./__evals__/results",
// Custom judges
judges: [],
// Workspace hooks (customize workspace creation)
createWorkspace: async () => {
return { id: "my-workspace", path: "/path/to/workspace" };
},
cleanupWorkspace: async (workspace) => {
// Custom cleanup logic
},
// Lifecycle hooks
setup: async () => {},
teardown: async () => {},
beforeEach: async (evalCase) => {},
afterEach: async (result) => {},
// Learning system config
learning: {
enabled: false,
ruleOutputDir: "./prompts",
minFailuresForPattern: 2,
autoApprove: false,
},
});
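The retry settings above imply an exponential backoff schedule. A minimal sketch of the delay progression those parameters suggest (the exact formula is an assumption inferred from the parameter names, not vibe-check's internal implementation):

```typescript
// Sketch: backoff schedule implied by retryDelayMs and retryBackoffMultiplier.
// Assumed formula: delay(attempt) = retryDelayMs * multiplier^attempt.
function backoffDelays(
  maxRetries: number,
  retryDelayMs: number,
  retryBackoffMultiplier: number,
): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    delays.push(retryDelayMs * retryBackoffMultiplier ** attempt);
  }
  return delays;
}

// With the defaults shown above (maxRetries: 2, retryDelayMs: 1000, multiplier: 2):
console.log(backoffDelays(2, 1000, 2)); // [1000, 2000]
```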
The agent function receives a prompt and context, and must return an
AgentResult:
interface AgentContext {
workingDirectory: string; // Isolated temp directory for this eval
evalId: string; // Unique eval case ID
evalName: string; // Eval case name
sessionId?: string; // For multi-turn sessions
timeout?: number; // Eval timeout in ms
}
interface AgentResult {
output: string; // Agent's text output
success: boolean; // Whether agent completed successfully
toolCalls?: ToolCall[]; // Record of tool invocations
sessionId?: string; // Session ID for multi-turn
error?: Error; // Error if failed
duration?: number; // Execution time in ms
usage?: {
inputTokens: number;
outputTokens: number;
totalCostUsd?: number;
};
}
interface ToolCall {
toolName: string;
input: unknown;
output?: unknown;
isError?: boolean;
timestamp?: number; // When the tool was called
duration?: number; // How long the call took (ms)
}
Basic (category: basic) - Simple prompt-response evaluations.
{
"id": "basic-greeting",
"name": "Basic Greeting",
"description": "Test basic response",
"category": "basic",
"prompt": "Say hello",
"expectedBehavior": "Should respond with a greeting",
"judges": []
}
Tool (category: tool) - Validates that specific tools are invoked correctly.
{
"id": "read-file-test",
"name": "Read File Test",
"description": "Test file reading capability",
"category": "tool",
"prompt": "Read the contents of package.json",
"expectedToolCalls": [
{
"toolName": "Read",
"minCalls": 1,
"maxCalls": 3
}
],
"judges": ["tool-invocation"]
}
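The expectedToolCalls check can be pictured as a per-tool count comparison. A simplified sketch of the semantics (types are pared down; this is not vibe-check's actual judge code):

```typescript
// Sketch: count calls per tool and compare against minCalls/maxCalls.
// Assumed defaults: minCalls = 1, maxCalls = unbounded.
interface ToolCallLike { toolName: string }
interface ToolExpectation { toolName: string; minCalls?: number; maxCalls?: number }

function checkToolCalls(
  calls: ToolCallLike[],
  expected: ToolExpectation[],
): boolean {
  return expected.every(({ toolName, minCalls = 1, maxCalls = Infinity }) => {
    const count = calls.filter((c) => c.toolName === toolName).length;
    return count >= minCalls && count <= maxCalls;
  });
}

const calls = [{ toolName: "Read" }, { toolName: "Read" }];
console.log(checkToolCalls(calls, [{ toolName: "Read", minCalls: 1, maxCalls: 3 }])); // true
console.log(checkToolCalls(calls, [{ toolName: "Write" }])); // false: Write never called
```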
Code Generation (category: code-gen) - Validates file creation and content patterns.
{
"id": "create-component",
"name": "Create React Component",
"description": "Test component generation",
"category": "code-gen",
"prompt": "Create a React button component in src/Button.tsx",
"targetFiles": ["src/Button.tsx"],
"expectedPatterns": [
{
"file": "src/Button.tsx",
"patterns": [
"export.*Button",
"React",
"onClick"
]
}
],
"syntaxValidation": true,
"buildVerification": false,
"judges": ["file-existence", "pattern-match"]
}
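Each entry in expectedPatterns reads like a regular-expression source tested against the generated file's contents. A sketch of the assumed matching semantics (every pattern must match somewhere in the file):

```typescript
// Sketch: treat each pattern string as a RegExp and require all to match.
// This is an assumption about the semantics, not vibe-check's internal code.
function matchesAllPatterns(content: string, patterns: string[]): boolean {
  return patterns.every((p) => new RegExp(p).test(content));
}

const generated = `import React from "react";
export const Button = ({ onClick }: { onClick: () => void }) => null;`;
console.log(matchesAllPatterns(generated, ["export.*Button", "React", "onClick"])); // true
```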
Routing (category: routing) - Validates request routing to appropriate agents.
{
"id": "route-to-coding",
"name": "Route to Coding Agent",
"description": "Test routing for code tasks",
"category": "routing",
"prompt": "Write a sorting algorithm",
"expectedAgent": "coding",
"shouldNotRoute": ["research", "conversational"],
"judges": []
}
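The routing check reduces to comparing the chosen agent against expectedAgent and the shouldNotRoute deny-list. A simplified sketch of that logic (an assumption about the semantics, not the framework's actual judge):

```typescript
// Sketch: pass when the chosen agent matches expectedAgent and is not
// in the shouldNotRoute deny-list.
function routingPassed(
  chosenAgent: string,
  expectedAgent: string,
  shouldNotRoute: string[],
): boolean {
  return chosenAgent === expectedAgent && !shouldNotRoute.includes(chosenAgent);
}

console.log(routingPassed("coding", "coding", ["research", "conversational"])); // true
console.log(routingPassed("research", "coding", ["research", "conversational"])); // false
```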
Multi-Turn (category: multi-turn) - Validates multi-turn conversational flows with optional per-turn evaluation.
{
"id": "iterative-refinement",
"name": "Iterative Code Refinement",
"description": "Test multi-turn improvements",
"category": "multi-turn",
"turns": [
{
"prompt": "Create a basic add function",
"expectedBehavior": "Creates initial function",
"judges": ["file-existence"]
},
{
"prompt": "Add input validation",
"expectedBehavior": "Adds type checking",
"judges": ["pattern-match"]
},
{
"prompt": "Add JSDoc comments",
"expectedBehavior": "Documents the function"
}
],
"judges": ["syntax-validation"],
"sessionPersistence": true
}
All eval cases support these fields:
| Field | Type | Description |
|---|---|---|
| id | string | Unique identifier |
| name | string | Display name |
| description | string | Description of what's being tested |
| category | string | One of: basic, tool, code-gen, routing, multi-turn |
| tags | string[] | Optional tags for filtering |
| enabled | boolean | Enable/disable (default: true) |
| timeout | number | Override default timeout (ms) |
| trials | object | { count: number, passThreshold: number } |
File Existence - Validates that expected files were created. Checks that targetFiles exist in the workspace.
Tool Invocation - Validates tool call counts. Checks expectedToolCalls against actual calls, enforces minCalls and maxCalls constraints, and defaults minCalls to 1 if not specified.
Pattern Match - Validates that file content matches expectedPatterns.
Syntax Validation - Validates that generated code has valid syntax.
Skill Invocation - Validates that specific skills were invoked. Checks expectedSkills against actual skill calls and enforces minCalls constraints.
LLM Judges - Evaluate outputs using an LLM with custom rubrics:
llm-code-quality - Evaluate code against the code-quality.md rubric
llm-response-quality - Evaluate responses against the response-quality.md rubric
llm-routing-quality - Evaluate routing decisions
llm-conversation-quality - Evaluate conversation quality
Configure the rubrics directory in your config:
export default defineConfig({
rubricsDir: "./__evals__/rubrics",
// ...
});
Create rubric files (e.g., code-quality.md) with evaluation criteria.
Use referenceSolution in eval cases for pairwise comparison:
{
"id": "create-function",
"category": "code-gen",
"prompt": "Create an add function",
"referenceSolution": {
"description": "A properly typed add function",
"code": "function add(a: number, b: number): number {\n return a + b;\n}"
},
"judges": ["llm-code-quality"]
}
LLM judges will compare the agent's output against the reference solution.
Create custom judges by extending BaseJudge:
import {
BaseJudge,
getJudgeRegistry,
type JudgeContext,
type JudgeResult,
type JudgeType,
} from "@poofnew/vibe-check";
class ResponseLengthJudge extends BaseJudge {
id = "response-length";
name = "Response Length Judge";
type: JudgeType = "code";
constructor(
private minLength: number = 10,
private maxLength: number = 1000,
) {
super();
}
async evaluate(context: JudgeContext): Promise<JudgeResult> {
const length = context.executionResult.output.length;
if (length < this.minLength) {
return this.createResult({
passed: false,
score: 0,
reasoning: `Response too short: ${length} chars`,
});
}
if (length > this.maxLength) {
return this.createResult({
passed: false,
score: 50,
reasoning: `Response too long: ${length} chars`,
});
}
return this.createResult({
passed: true,
score: 100,
reasoning: `Response length ${length} is acceptable`,
});
}
}
// Register globally
const registry = getJudgeRegistry();
registry.register(new ResponseLengthJudge(20, 500));
// Or add to config
export default defineConfig({
judges: [new ResponseLengthJudge(20, 500)],
// ...
});
interface JudgeContext {
evalCase: EvalCase; // The eval case being judged
executionResult: ExecutionResult;
workingDirectory: string; // Workspace path
turnIndex?: number; // For multi-turn evals
}
interface ExecutionResult {
success: boolean;
output: string;
error?: Error;
toolCalls: ToolCallRecord[];
duration: number;
numTurns?: number;
sessionId?: string;
workingDirectory?: string;
transcript?: Transcript; // Full conversation transcript
progressUpdates?: ProgressRecord[]; // Progress tracking
usage?: {
inputTokens: number;
outputTokens: number;
totalCostUsd?: number;
};
}
The learning system automatically analyzes test failures and generates prompt improvements to enhance your agent's performance over time.
Enable learning in your config:
export default defineConfig({
learning: {
enabled: true,
ruleOutputDir: "./prompts", // Where to save rules
minFailuresForPattern: 2, // Min failures to form a pattern
autoApprove: false, // Require manual review
},
// ...
});
# Run full learning iteration
vibe-check learn run --source eval
# Analyze failures without generating rules
vibe-check learn analyze --source both
# Review pending rules
vibe-check learn review
# Show learning statistics
vibe-check learn stats
# Auto-approve high-confidence rules (use with caution)
vibe-check learn run --auto-approve
prompts/
├── learned-rules.json # Approved rules
├── pending-rules.json # Rules awaiting review
├── history.json # Learning history
└── iterations/
└── iteration-{timestamp}.json
{
"ruleId": "rule-1234",
"ruleContent": "When creating files, always verify the parent directory exists before writing",
"targetSection": "file-operations",
"rationale": "Multiple failures showed agents attempting to write to non-existent directories",
"addressesPatterns": ["pattern-dir-not-found"],
"expectedImpact": {
"failureIds": ["eval-123", "eval-456"],
"confidenceScore": 0.89
},
"status": "approved"
}
autoApprove: false - review all rules manually
minFailuresForPattern: 2 - catch recurring issues
Run the evaluation suite.
vibe-check run [options]
Options:
-c, --config <path> Path to config file
--category <categories...> Filter by category (tool, code-gen, routing, multi-turn, basic)
--tag <tags...> Filter by tag
--id <ids...> Filter by eval ID
-v, --verbose Verbose output
Examples:
# Run all evals
vibe-check run
# Run only code-gen evals
vibe-check run --category code-gen
# Run evals with specific tags
vibe-check run --tag critical --tag regression
# Run specific evals by ID
vibe-check run --id create-file --id read-file
# Verbose output
vibe-check run -v
List available eval cases.
vibe-check list [options]
Options:
-c, --config <path> Path to config file
--category <categories...> Filter by category
--tag <tags...> Filter by tag
--json Output as JSON
Initialize vibe-check in a project.
vibe-check init [options]
Options:
--typescript Create TypeScript config (default)
Learning system commands for analyzing failures and generating rules.
# Run full learning iteration
vibe-check learn run [options]
--source <source> Data source (eval, jsonl, both)
--auto-approve Auto-approve high-confidence rules
--save-pending Save rules for later review
# Analyze failures without generating rules
vibe-check learn analyze [options]
--source <source> Data source (eval, jsonl, both)
# Review pending rules
vibe-check learn review
# Show learning statistics
vibe-check learn stats
Use vibe-check programmatically in your code:
import {
defineConfig,
EvalRunner,
loadConfig,
loadEvalCases,
} from "@poofnew/vibe-check";
// Load and run
const config = await loadConfig("./vibe-check.config.ts");
const runner = new EvalRunner(config);
const result = await runner.run({
categories: ["code-gen"],
tags: ["critical"],
});
console.log(`Pass rate: ${result.passRate * 100}%`);
console.log(`Duration: ${result.duration}ms`);
// Access individual results
for (const evalResult of result.results) {
if (!evalResult.success) {
console.log(`Failed: ${evalResult.evalCase.name}`);
for (const judge of evalResult.judgeResults) {
if (!judge.passed) {
console.log(` - ${judge.judgeId}: ${judge.reasoning}`);
}
}
}
}
// Configuration
export { defaultConfig, defineConfig, loadConfig };
export type {
AgentContext,
AgentFunction,
AgentResult,
ResolvedConfig,
VibeCheckConfig,
};
// Schemas
export {
isBasicEval,
isCodeGenEval,
isMultiTurnEval,
isRoutingEval,
isToolEval,
parseEvalCase,
};
export type { CodeGenEvalCase, EvalCase, EvalCategory, ToolEvalCase /* ... */ };
// Runner
export { EvalRunner };
export type { EvalSuiteResult, RunnerOptions };
// Judges
export { BaseJudge, getJudgeRegistry, JudgeRegistry, resetJudgeRegistry };
export type { ExecutionResult, Judge, JudgeContext, JudgeResult, JudgeType };
// Harness
export { TestHarness };
export type { EvalWorkspace, HarnessOptions };
// Utils
export { groupByCategory, loadEvalCase, loadEvalCases };
// Adapters (for multi-language support)
export { PythonAgentAdapter } from "@poofnew/vibe-check/adapters";
export type {
AgentRequest,
AgentResponse,
PythonAdapterOptions,
} from "@poofnew/vibe-check/adapters";
Explore complete working examples in the examples/ directory:
Simple agent integration with minimal configuration:
cd examples/basic
bun install
bun run vibe-check run
Use case: Quick start template, testing custom agents
Full-featured Claude SDK integration with tool tracking (TypeScript):
cd examples/claude-agent-sdk
bun install
export ANTHROPIC_API_KEY=your_key
bun run vibe-check run
Use case: Production Claude agents, comprehensive testing
Python SDK integration using the process-based adapter:
cd examples/python-agent
bun install
./setup.sh # Creates Python venv and installs claude-agent-sdk
export ANTHROPIC_API_KEY=your_key
bun run vibe-check run
Use case: Python-based Claude agents, multi-language support
The Python adapter uses a JSON protocol over stdin/stdout to communicate with Python agent scripts:
import { PythonAgentAdapter } from "@poofnew/vibe-check/adapters";
const adapter = new PythonAgentAdapter({
scriptPath: "./agent.py",
pythonPath: "./.venv/bin/python",
env: { ANTHROPIC_API_KEY: process.env.ANTHROPIC_API_KEY },
});
export default defineConfig({
agent: adapter.createAgent(),
agentType: "claude-code",
});
| Eval File | Category | Judges Used |
|---|---|---|
| basic.eval.json | basic | llm-code-quality |
| code-gen.eval.json | code-gen | file-existence, pattern-match, syntax-validation |
| tool-usage.eval.json | tool | tool-invocation |
| multi-turn.eval.json | multi-turn | - |
| route-to-coding.eval.json | routing | agent-routing |
| route-to-research.eval.json | routing | agent-routing |
| route-to-reviewer.eval.json | routing | agent-routing |
| route-intent-classification.eval.json | tool | tool-invocation |
| tool-chain-explore-modify.eval.json | tool | tool-invocation |
| tool-chain-search-replace.eval.json | tool | tool-invocation |
| tool-chain-bash.eval.json | tool | tool-invocation |
| tool-chain-analysis.eval.json | tool | tool-invocation |
| multi-file-feature.eval.json | code-gen | file-existence, pattern-match, syntax-validation |
| skill-invocation.eval.json | tool | tool-invocation, skill-invocation |
| code-review.eval.json | basic | llm-code-quality |
| debug-workflow.eval.json | code-gen | file-existence, pattern-match, syntax-validation |
Advanced custom validation logic:
cd examples/custom-judges
bun install
bun run vibe-check run
Use case: Domain-specific validation, custom metrics
Multi-turn conversation testing with session persistence:
cd examples/multi-turn
bun install
bun run vibe-check run
Use case: Conversational agents, iterative refinement flows
Demonstrates the learning system with a mock agent that has deliberate flaws:
cd examples/learning
bun install
bun run vibe-check run # Runs evals (some will fail by design)
bun run vibe-check learn stats # Shows learning system status
bun run vibe-check learn analyze # Analyzes failures (requires ANTHROPIC_API_KEY)
Use case: Understanding the learning system, testing failure analysis pipeline
The example includes a deliberately flawed mock agent so that failures are produced for the analysis pipeline.
Optimize your eval suite for speed and reliability:
export default defineConfig({
parallel: true,
maxConcurrency: 5, // Balance between speed and resource usage
});
Tip: Higher concurrency = faster but more memory/API usage. Start with 3-5.
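A maxConcurrency-style limit can be pictured as a simple promise pool: at most N tasks in flight at once, with the rest queued. A minimal sketch of that idea (not vibe-check's actual scheduler):

```typescript
// Minimal promise-pool sketch: at most `limit` tasks run concurrently;
// results are returned in the original task order.
async function runWithConcurrency<T>(
  tasks: (() => Promise<T>)[],
  limit: number,
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const i = next++; // claim the next task index
      results[i] = await tasks[i]();
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, tasks.length) },
    () => worker(),
  );
  await Promise.all(workers);
  return results;
}
```

Usage: `await runWithConcurrency(evalTasks, 5)` runs at most five evals at a time while preserving result order.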
# Run only critical tests during development
vibe-check run --tag critical
# Run specific categories
vibe-check run --category tool code-gen
# Run single test for debugging
vibe-check run --id my-test-id
export default defineConfig({
timeout: 60000, // Default for all tests
});
// Override per eval case
{
"id": "quick-test",
"timeout": 10000, // Fast tests
// ...
}
{
"id": "complex-generation",
"timeout": 300000, // Longer timeout for complex tasks
// ...
}
export default defineConfig({
maxRetries: 2, // Retry failed tests
retryDelayMs: 1000, // Initial delay
retryBackoffMultiplier: 2, // Exponential backoff
});
Tip: Enable retries for flaky network/API tests, disable for deterministic tests.
export default defineConfig({
trials: 3, // Run each test 3 times
trialPassThreshold: 0.67, // Pass if 2/3 succeed
});
// Or per eval
{
"id": "flaky-test",
"trials": { "count": 5, "passThreshold": 0.8 },
// ...
}
Tip: Use trials for non-deterministic agent behavior, but avoid over-reliance.
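The trial logic amounts to comparing an observed pass rate against the threshold. A sketch of the assumed semantics (a greater-than-or-equal comparison; the framework's exact rounding behavior may differ):

```typescript
// Sketch: an eval with trials passes when the fraction of passing trials
// meets or exceeds the configured threshold.
function trialsPass(outcomes: boolean[], passThreshold: number): boolean {
  const passRate = outcomes.filter(Boolean).length / outcomes.length;
  return passRate >= passThreshold;
}

console.log(trialsPass([true, true, false], 0.6)); // true  (2/3 ≈ 0.67 ≥ 0.6)
console.log(trialsPass([true, false, false], 0.6)); // false (1/3 ≈ 0.33 < 0.6)
```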
By default, vibe-check creates temporary workspaces and cleans them up after
each eval. Use preserveWorkspaces: true for debugging:
export default defineConfig({
preserveWorkspaces: true, // Keep workspaces for inspection
// ...
});
For full control over workspace lifecycle, use createWorkspace and
cleanupWorkspace hooks:
import { defineConfig, type EvalWorkspace } from "@poofnew/vibe-check";
import * as fs from "fs/promises";
import * as path from "path";
import { execFile } from "child_process";
import { promisify } from "util";
const execFileAsync = promisify(execFile);
export default defineConfig({
createWorkspace: async (): Promise<EvalWorkspace> => {
const id = `ws-${Date.now()}-${Math.random().toString(36).slice(2)}`;
const wsPath = path.join(process.cwd(), "__evals__/results/workspaces", id);
// Copy your template (including node_modules for fast setup)
await fs.cp("./template", wsPath, { recursive: true });
// Optional: install dependencies if not included in template
// await execFileAsync('npm', ['install'], { cwd: wsPath });
return { id, path: wsPath };
},
cleanupWorkspace: async (workspace: EvalWorkspace): Promise<void> => {
await fs.rm(workspace.path, { recursive: true, force: true });
},
// ...
});
Benefits of custom workspace hooks:
# Ensure config exists
ls vibe-check.config.ts
# Or specify path
vibe-check run --config ./path/to/config.ts
// Check testDir and testMatch in config
export default defineConfig({
testDir: "./__evals__", // Must exist
testMatch: ["**/*.eval.json"], // Must match file names
});
# List all detected evals
vibe-check list
// Increase timeout
export default defineConfig({
timeout: 300000, // 5 minutes
});
// Or per eval
{
"timeout": 600000 // 10 minutes for slow tests
}
# Install peer dependencies
bun add @anthropic-ai/sdk @anthropic-ai/claude-agent-sdk
# Verify installation
bun pm ls | grep anthropic
// Reduce concurrency for stability
export default defineConfig({
parallel: true,
maxConcurrency: 2, // Lower for CI environments
maxRetries: 3, // More retries for flaky CI networks
});
# Reduce concurrency
vibe-check run --config config-with-lower-concurrency.ts
# Or run tests in batches
vibe-check run --category tool
vibe-check run --category code-gen
# Enable verbose output
vibe-check run -v
# Preserve workspaces for inspection
vibe-check run --config config-with-preserve-workspaces.ts
Q: What agent frameworks does vibe-check support?
A: Any agent that can be wrapped in an async (prompt, context) => AgentResult
function. Built-in support for Claude SDK (TypeScript and Python via adapters),
but works with LangChain, custom agents, or any LLM framework.
Q: Can I use this with other LLMs (OpenAI, Gemini, etc.)?
A: Yes! The framework is LLM-agnostic. Just implement the agent function to call your preferred LLM.
Q: Do I need Bun or can I use Node/npm?
A: While optimized for Bun, vibe-check works with Node.js 18+ and npm/pnpm. Bun is recommended for best performance.
Q: Can I use Python agents with vibe-check?
A: Yes! Use the PythonAgentAdapter from @poofnew/vibe-check/adapters. It
spawns Python scripts as subprocesses and communicates via JSON over
stdin/stdout. See the Python Agent SDK Integration
example.
Q: How do I test multi-file code generation?
A: Use code-gen category with multiple targetFiles:
{
"category": "code-gen",
"targetFiles": ["src/index.ts", "src/utils.ts", "test/index.test.ts"],
"expectedPatterns": [...],
"judges": ["file-existence", "pattern-match"]
}
Q: Can I use custom validation logic?
A: Yes! Create custom judges extending BaseJudge. See
Custom Judges.
Q: How do I handle authentication/secrets in tests?
A: Use environment variables:
export default defineConfig({
agent: async (prompt, context) => {
const apiKey = process.env.ANTHROPIC_API_KEY;
// ...
},
});
Q: Does the learning system modify my prompts automatically?
A: No! It generates rule suggestions that require human review (unless
autoApprove: true). You control what gets integrated.
Q: How many failures do I need to generate useful rules?
A: Start with 10+ failures. The system works best with 20-50 failures showing clear patterns.
Q: Can I use production logs for learning?
A: Yes! Export failures to JSONL format and use --source jsonl.
Q: How fast does vibe-check run?
A: Depends on agent speed and concurrency. With maxConcurrency: 5 and Claude
SDK, expect ~10-20 evals/minute.
Q: Can I run tests in CI/CD?
A: Yes! Use exit codes for CI integration:
vibe-check run || exit 1 # Fails CI if tests fail
Q: How do I speed up slow test suites?
A: See Performance Tips. Key strategies: increase concurrency, use selective runs, optimize timeouts.
Q: Where are test artifacts stored?
A: By default in __evals__/results/. Workspaces are temporary unless
preserveWorkspaces: true.
Q: How do I debug a single failing test?
A: Run with verbose mode and workspace preservation:
vibe-check run --id failing-test -v
Set preserveWorkspaces: true to inspect the working directory.
Q: Why are my tests flaky?
A: LLMs are non-deterministic. Use trials with pass thresholds:
{ trials: 3, trialPassThreshold: 0.67 }
# Clone the repository
git clone https://github.com/poofdotnew/vibe-check.git
cd vibe-check
# Install dependencies
bun install
# Run tests
bun test
# Build
bun run build
# Type check
bun run typecheck
src/
├── bin/ # CLI entry points
│ ├── vibe-check.ts # Main executable
│ └── cli.ts # CLI commands
├── config/ # Configuration
│ ├── types.ts # Type definitions
│ ├── schemas.ts # Zod validation schemas
│ └── config-loader.ts # Config file loading
├── harness/ # Test execution
│ ├── test-harness.ts # Main execution engine
│ └── workspace-manager.ts
├── judges/ # Evaluation judges
│ ├── judge-interface.ts
│ ├── judge-registry.ts
│ └── builtin/ # Built-in judges
├── runner/ # Test orchestration
│ └── eval-runner.ts
├── learning/ # Learning system
└── utils/ # Utilities
name: Eval Tests
on: [push, pull_request]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: oven-sh/setup-bun@v1
with:
bun-version: latest
- name: Install dependencies
run: bun install
- name: Run vibe-check
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
run: bun run vibe-check run
test:
image: oven/bun:latest
script:
- bun install
- bun run vibe-check run
variables:
ANTHROPIC_API_KEY: $ANTHROPIC_API_KEY
version: 2.1
jobs:
test:
docker:
- image: oven/bun:latest
steps:
- checkout
- run: bun install
- run: bun run vibe-check run
Use maxConcurrency: 2-3 for stable CI runs.