@skilljack/evals

CLI for evaluating AI agent skill discoverability, adherence, and output quality. Runs as a standalone CLI or as a GitHub Action.

Version 1.2.1 (latest), published on npm. Maintainers: 1.

skilljack-evals

CLI for evaluating AI agent skills across multiple agent frameworks. Tests how well agents discover, load, and execute Agent Skills — measuring discoverability, instruction adherence, and output quality.

Supports the Claude Agent SDK, Vercel AI SDK, and OpenAI Agents SDK. Runs standalone or as a GitHub Action.

What are Agent Skills?

Agent Skills are a lightweight, open-source format for extending AI agent capabilities. Each skill is a folder containing a SKILL.md file with metadata and instructions that agents can discover and use. Learn more at agentskills.io.

Requirements

  • Node.js >= 20.0.0
  • API key for your chosen runner (see API Keys below)

Installation

npm install
npm run build

Quick Start

# Run the example greeting evaluation
skilljack-evals run evals/example-greeting/tasks.yaml --verbose

# Deterministic scoring only (no LLM judge, free)
skilljack-evals run evals/example-greeting/tasks.yaml --no-judge

# Validate a task file without running
skilljack-evals validate evals/example-greeting/tasks.yaml

Building Skills with Evals

Start by writing eval tasks that describe the outcomes you want, then build your skill to pass them. This eval-first approach works like TDD for agent skills:

  • Decide if a skill is the right tool — Skills are for capabilities that should only activate on demand. For instructions that always apply, use CLAUDE.md or AGENTS.md. For validation and formatting, consider static analysis, pre-commit hooks, or agent hooks instead.

  • Define desired outcomes — Write eval tasks with the prompts users will say, the markers your skill should output, and a checklist of what "good" looks like.

  • Add false-positive tests — Include prompts that are similar but should not trigger the skill. These catch over-eager activation and are just as important as positive tests.

  • Create a minimal SKILL.md — Start with basic instructions and metadata.

  • Run evals and iterate — Use skilljack-evals run to see where the skill falls short. Deterministic checks (--no-judge) are free and fast for rapid iteration. Add the LLM judge when you're ready to evaluate output quality.

  • Keep the eval suite — As you update the skill, run evals as a regression check. Add them to CI with the GitHub Action to catch regressions automatically.

# Scaffold eval tasks for a new skill
skilljack-evals create-eval my-skill -o evals/my-skill/tasks.yaml

# Fast iteration loop (deterministic only, no API cost for judging)
skilljack-evals run evals/my-skill/tasks.yaml --no-judge --verbose

# Full evaluation with LLM judge
skilljack-evals run evals/my-skill/tasks.yaml --verbose

This workflow ensures your skill is discoverable from the right prompts, doesn't activate when it shouldn't, and produces the output quality you expect.

Multi-Runner Support

Three runners are available, selected via the --runner CLI flag:

| Runner | Flag | Model Format | Example |
| --- | --- | --- | --- |
| Claude Agent SDK (default) | --runner claude-sdk | Model aliases | sonnet, haiku |
| Vercel AI SDK | --runner vercel-ai | provider:model | anthropic:claude-sonnet-4-6, google:gemini-2.5-pro, openai:gpt-5.2, openrouter:deepseek/deepseek-v3.2 |
| OpenAI Agents SDK | --runner openai-agents | Plain model name | gpt-5.2 |

# Claude SDK (default)
skilljack-evals run evals/example-greeting/tasks.yaml --model sonnet

# Vercel AI SDK with different providers
skilljack-evals run evals/example-greeting/tasks.yaml --runner vercel-ai --model "anthropic:claude-sonnet-4-6"
skilljack-evals run evals/example-greeting/tasks.yaml --runner vercel-ai --model "google:gemini-2.5-pro"
skilljack-evals run evals/example-greeting/tasks.yaml --runner vercel-ai --model "openai:gpt-5.2"
skilljack-evals run evals/example-greeting/tasks.yaml --runner vercel-ai --model "openrouter:deepseek/deepseek-v3.2"

# OpenRouter — tested models
# openrouter:deepseek/deepseek-v3.2
# openrouter:minimax/minimax-m2.5
# openrouter:moonshotai/kimi-k2.5
# openrouter:z-ai/glm-5
# openrouter:openai/gpt-oss-120b

# OpenAI Agents SDK
skilljack-evals run evals/example-greeting/tasks.yaml --runner openai-agents --model "gpt-5.2"
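
The provider:model format used by the Vercel AI runner can be illustrated with a small sketch. The README does not show how the package parses this string, so parseModelSpec and ModelSpec below are hypothetical names. The sketch assumes the split happens on the first colon only, which keeps model IDs such as deepseek/deepseek-v3.2 intact.

```typescript
// Hypothetical helper, not part of @skilljack/evals.
// Splits "provider:model" on the FIRST colon so the model ID is kept whole.
interface ModelSpec {
  provider: string;
  model: string;
}

function parseModelSpec(spec: string): ModelSpec {
  const idx = spec.indexOf(":");
  // Reject strings with no colon, an empty provider, or an empty model.
  if (idx <= 0 || idx === spec.length - 1) {
    throw new Error('Expected "provider:model", got "' + spec + '"');
  }
  return { provider: spec.slice(0, idx), model: spec.slice(idx + 1) };
}

const parsed = parseModelSpec("openrouter:deepseek/deepseek-v3.2");
console.log(parsed.provider); // openrouter
console.log(parsed.model);    // deepseek/deepseek-v3.2
```

A plain alias such as sonnet has no colon and would be rejected by this sketch, which matches the table above: aliases belong to the Claude SDK runner, not to vercel-ai.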

The Vercel AI SDK and OpenAI Agents SDK runners require their respective peer dependencies:

# Vercel AI SDK
npm install ai zod @ai-sdk/openai @ai-sdk/anthropic @ai-sdk/google @openrouter/ai-sdk-provider

# OpenAI Agents SDK
npm install @openai/agents openai

Skill Support by SDK

Each runner uses its SDK's native mechanism for skill discovery and loading.

Configuration

API Keys

Set the appropriate API key in your environment or a .env file (see .env.example):

| Runner | Required Key |
| --- | --- |
| Claude SDK | ANTHROPIC_API_KEY |
| Vercel AI (anthropic:) | ANTHROPIC_API_KEY |
| Vercel AI (openai:) | OPENAI_API_KEY |
| Vercel AI (google:) | GOOGLE_GENERATIVE_AI_API_KEY |
| Vercel AI (openrouter:) | OPENROUTER_API_KEY |
| OpenAI Agents | OPENAI_API_KEY |

Bedrock

Set these environment variables — the Agent SDK handles the rest:

CLAUDE_CODE_USE_BEDROCK=1
AWS_REGION=us-west-2
AWS_PROFILE=your-profile

Config File

Create an eval.config.yaml in your project root (all fields optional):

models:
  agent: sonnet        # EVAL_AGENT_MODEL
  judge: haiku         # EVAL_JUDGE_MODEL

scoring:
  weights:
    discovery: 0.3
    adherence: 0.4
    output: 0.3

thresholds:
  discovery_rate: 0.8  # EVAL_DISCOVERY_THRESHOLD
  avg_score: 4.0       # EVAL_SCORE_THRESHOLD

runner:
  timeout_ms: 300000   # EVAL_TASK_TIMEOUT_MS
  allowed_write_dirs:
    - ./results/
    - ./fixtures/

output:
  dir: ./results       # EVAL_OUTPUT_DIR
  judge_truncation: 5000
  report_truncation: 2000

ci:
  exit_on_failure: true
  github_summary: false

Precedence (lowest to highest): YAML defaults → eval.config.yaml → environment variables (EVAL_*) → CLI flags.
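
The precedence chain can be read as a layered merge in which each later source overrides the earlier ones. The sketch below is only an illustration of that idea: Settings, resolveSettings, and the field names are hypothetical, and the layers are passed in as plain objects rather than read from a real config file or environment.

```typescript
// Hypothetical subset of settings; names are illustrative, not the package's types.
interface Settings {
  agentModel: string;
  judgeModel: string;
  timeoutMs: number;
}

// Layers are applied left to right, so the last layer that defines a field wins.
// A field left undefined in a layer never overwrites an earlier value.
function resolveSettings(defaults: Settings, ...layers: Partial<Settings>[]): Settings {
  const resolved: Settings = { ...defaults };
  for (const layer of layers) {
    if (layer.agentModel !== undefined) resolved.agentModel = layer.agentModel;
    if (layer.judgeModel !== undefined) resolved.judgeModel = layer.judgeModel;
    if (layer.timeoutMs !== undefined) resolved.timeoutMs = layer.timeoutMs;
  }
  return resolved;
}

const builtIn: Settings = { agentModel: "sonnet", judgeModel: "haiku", timeoutMs: 300000 };
const fromConfigFile: Partial<Settings> = { judgeModel: "sonnet", timeoutMs: 120000 };
const fromEnv: Partial<Settings> = { timeoutMs: 60000 };    // e.g. EVAL_TASK_TIMEOUT_MS=60000
const fromCli: Partial<Settings> = { agentModel: "haiku" }; // e.g. --model haiku

const effective = resolveSettings(builtIn, fromConfigFile, fromEnv, fromCli);
console.log(effective);
// agentModel "haiku" (CLI), judgeModel "sonnet" (config file), timeoutMs 60000 (env beats file)
```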

CLI Commands

run — Full evaluation pipeline

Runs the agent against tasks, scores results, and generates reports.

skilljack-evals run evals/greeting/tasks.yaml \
  --runner vercel-ai --model "google:gemini-2.5-pro" \
  --judge-model haiku \
  --timeout 300000 \
  --tasks gr-001,gr-002 \
  --threshold-discovery 0.8 --threshold-score 4.0 \
  --output-dir ./results \
  --github-summary --verbose

score — Score existing results

skilljack-evals score results.json --judge-model haiku

report — Generate reports from scored results

skilljack-evals report -r results.json -o report.md --json report.json

validate — Check YAML syntax

skilljack-evals validate evals/greeting/tasks.yaml

create-eval — Generate task template

skilljack-evals create-eval greeting -o evals/greeting/tasks.yaml -n 10

parse — Parse YAML to JSON

skilljack-evals parse evals/greeting/tasks.yaml

Architecture

YAML tasks → Config → Runner (Claude SDK / Vercel AI / OpenAI Agents) → Scorer (deterministic + LLM judge) → Report

Pipeline

  • Parse — Load and validate task definitions from YAML
  • Setup — Copy skills to .claude/skills/ in the working directory
  • Run — Execute agent against each task via the selected runner
  • Score — Deterministic checks (free, fast) then optional LLM judge
  • Report — Generate markdown + JSON reports, check pass/fail thresholds
  • Cleanup — Remove copied skills

Scoring

Two scoring methods are available. They can run independently or together:

Deterministic (free, fast):

  • Checks tool calls for skill activation
  • Searches output for expected marker strings
  • Validates expected/forbidden tool usage
  • Binary pass/fail

LLM Judge (richer, ~$0.20/run with default settings):

  • Discovery (0 or 1) — Did the agent load the expected skill?
  • Adherence (1-5) — How well did the agent follow skill instructions?
  • Output Quality (1-5) — Does the output meet task requirements?
  • Failure categorization

Combined score: w_d * discovery + w_a * ((adherence-1)/4) + w_o * ((outputQuality-1)/4)
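
As a worked example of the formula, the sketch below uses the default weights from the config file (0.3, 0.4, 0.3). The function and type names are hypothetical; only the arithmetic comes from the formula above. Adherence and output quality are mapped from the 1-5 scale onto 0-1 before weighting.

```typescript
// Hypothetical names; the arithmetic follows the combined-score formula above.
interface Weights {
  discovery: number;
  adherence: number;
  output: number;
}

function combinedScore(
  discovery: 0 | 1,
  adherence: number,     // 1-5
  outputQuality: number, // 1-5
  w: Weights
): number {
  return (
    w.discovery * discovery +
    w.adherence * ((adherence - 1) / 4) +
    w.output * ((outputQuality - 1) / 4)
  );
}

const defaultWeights: Weights = { discovery: 0.3, adherence: 0.4, output: 0.3 };

// Skill loaded, adherence 4, output quality 5:
// 0.3 * 1 + 0.4 * 0.75 + 0.3 * 1 = 0.9
console.log(combinedScore(1, 4, 5, defaultWeights).toFixed(3)); // 0.900
```

With weights that sum to 1, the result falls between 0 (skill not loaded, both ratings at 1) and 1 (skill loaded, both ratings at 5).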

Failure Categories

| Category | Meaning |
| --- | --- |
| discovery_failure | Agent didn't load the skill |
| false_positive | Agent loaded a skill it shouldn't have |
| instruction_ambiguity | Agent misinterpreted instructions |
| missing_guidance | Skill didn't cover the needed case |
| agent_error | Agent made a mistake despite guidance |
| none | No failure |
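
When post-processing results, the categories above could be modeled as a string union and counted per run. This is a sketch only: FailureCategory and tallyFailures are hypothetical names, and the README does not say how the package represents categories internally.

```typescript
// Hypothetical type; the string values are the category names listed above.
type FailureCategory =
  | "discovery_failure"
  | "false_positive"
  | "instruction_ambiguity"
  | "missing_guidance"
  | "agent_error"
  | "none";

// Counts how often each category occurs across a set of judged tasks.
function tallyFailures(categories: FailureCategory[]): Record<FailureCategory, number> {
  const counts: Record<FailureCategory, number> = {
    discovery_failure: 0,
    false_positive: 0,
    instruction_ambiguity: 0,
    missing_guidance: 0,
    agent_error: 0,
    none: 0,
  };
  for (const category of categories) {
    counts[category] += 1;
  }
  return counts;
}

const tally = tallyFailures(["none", "discovery_failure", "none", "false_positive"]);
console.log(tally.none, tally.discovery_failure, tally.false_positive); // 2 1 1
```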

Task File Format

skill: greeting
version: "1.0"

defaults:
  expected_skill_load: greeting
  criteria:
    discovery: { weight: 0.3 }
    adherence: { weight: 0.4 }
    output: { weight: 0.3 }

tasks:
  - id: gr-001
    prompt: "Hello! Please greet me using the greeting skill."

    # Deterministic checks (optional, free)
    deterministic:
      expect_skill_activation: true
      expect_marker: "GREETING_SUCCESS"
      expect_tool_calls: []
      expect_no_tool_calls: []

    # LLM judge criteria (optional, costs API calls)
    criteria:
      discovery: { weight: 0.3, description: "Should load greeting skill" }
      adherence: { weight: 0.4, description: "Should follow skill format" }
      output: { weight: 0.3, description: "Greeting is friendly" }
    golden_checklist:
      - "Loaded the greeting skill"
      - "Friendly tone"

  # False positive test — skill should NOT activate
  - id: gr-fp-001
    prompt: "What are best practices for email greetings?"
    expected_skill_load: none
    deterministic:
      expect_skill_activation: false

The deterministic and criteria blocks are both optional. If both are present, the scorer runs both and merges the results.
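
To make the deterministic block concrete, here is a minimal sketch of how its four fields might be checked against a run. The expectation field names mirror the task file above. Everything else (RunResult, checkDeterministic, and the idea that activation arrives as a boolean) is an assumption for illustration and is not the package's scoreDeterministic implementation.

```typescript
// Field names mirror the `deterministic` block in the task file.
interface DeterministicExpectations {
  expect_skill_activation?: boolean;
  expect_marker?: string;
  expect_tool_calls?: string[];
  expect_no_tool_calls?: string[];
}

// Hypothetical run-result shape, invented for this sketch.
interface RunResult {
  skillActivated: boolean;
  output: string;
  toolCalls: string[];
}

function checkDeterministic(
  exp: DeterministicExpectations,
  run: RunResult
): { passed: boolean; failures: string[] } {
  const failures: string[] = [];
  if (exp.expect_skill_activation !== undefined && exp.expect_skill_activation !== run.skillActivated) {
    failures.push("skill activation mismatch");
  }
  if (exp.expect_marker !== undefined && run.output.indexOf(exp.expect_marker) === -1) {
    failures.push("missing marker: " + exp.expect_marker);
  }
  for (const tool of exp.expect_tool_calls || []) {
    if (run.toolCalls.indexOf(tool) === -1) failures.push("expected tool not called: " + tool);
  }
  for (const tool of exp.expect_no_tool_calls || []) {
    if (run.toolCalls.indexOf(tool) !== -1) failures.push("forbidden tool called: " + tool);
  }
  // Binary result: any single failed check fails the task.
  return { passed: failures.length === 0, failures };
}

// Positive case (like gr-001) and false-positive case (like gr-fp-001).
const positive = checkDeterministic(
  { expect_skill_activation: true, expect_marker: "GREETING_SUCCESS" },
  { skillActivated: true, output: "Hi there! GREETING_SUCCESS", toolCalls: [] }
);
const falsePositive = checkDeterministic(
  { expect_skill_activation: false },
  { skillActivated: true, output: "Use a polite salutation.", toolCalls: [] }
);
console.log(positive.passed, falsePositive.passed); // true false
```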

GitHub Action

- uses: olaservo/skilljack-evals@v1
  with:
    tasks: evals/commit/tasks.yaml
    threshold-discovery: '0.8'
    threshold-score: '4.0'
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Inputs

| Input | Required | Default | Description |
| --- | --- | --- | --- |
| tasks | Yes | | Path to tasks YAML file |
| runner | No | claude-sdk | Runner type: claude-sdk, vercel-ai, openai-agents |
| model | No | sonnet | Agent model |
| judge-model | No | haiku | LLM judge model |
| config | No | | Path to eval.config.yaml |
| threshold-discovery | No | 0.8 | Minimum discovery rate (0-1) |
| threshold-score | No | 4.0 | Minimum average score (1-5) |
| timeout | No | 300000 | Per-task timeout (ms) |
| tasks-filter | No | | Comma-separated task IDs |
| skills-dir | No | | Path to skills directory |
| no-judge | No | false | Skip LLM judge |
| no-deterministic | No | false | Skip deterministic scoring |

Outputs

| Output | Description |
| --- | --- |
| passed | Whether all thresholds were met |
| discovery-rate | Discovery rate achieved (0-1) |
| avg-score | Average weighted score |
| report-path | Path to markdown report |
| json-path | Path to JSON report |

The action writes a condensed summary to $GITHUB_STEP_SUMMARY and exits with code 1 if thresholds are not met.
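
The pass/fail gate can be sketched as a comparison of the two reported metrics against their thresholds. The names below are hypothetical. The sketch assumes the thresholds are inclusive minimums (the inputs table calls them "minimum" values) and that a passing run exits with code 0; the README only states the exit code for failure.

```typescript
// Hypothetical shapes for the two metrics and their thresholds.
interface RunSummary {
  discoveryRate: number; // 0-1
  avgScore: number;
}

interface Thresholds {
  discovery: number; // e.g. 0.8
  score: number;     // e.g. 4.0
}

// Assumption: a metric exactly equal to its threshold counts as passing.
function meetsThresholds(summary: RunSummary, t: Thresholds): boolean {
  return summary.discoveryRate >= t.discovery && summary.avgScore >= t.score;
}

// Exit code 1 on failure comes from the README; 0 on success is assumed.
function exitCodeFor(summary: RunSummary, t: Thresholds): number {
  return meetsThresholds(summary, t) ? 0 : 1;
}

const gate: Thresholds = { discovery: 0.8, score: 4.0 };
console.log(exitCodeFor({ discoveryRate: 0.9, avgScore: 4.2 }, gate)); // 0
console.log(exitCodeFor({ discoveryRate: 0.7, avgScore: 4.5 }, gate)); // 1
```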

Library Usage

import {
  parseEvalFile,
  SkillJudge,
  generateReport,
  runPipeline,
  scoreDeterministic,
  loadConfig,
} from '@skilljack/evals';

// Full pipeline
const result = await runPipeline({
  tasksFile: 'evals/greeting/tasks.yaml',
  configOverrides: { defaultAgentModel: 'sonnet' },
  verbose: true,
});

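// Or individual steps.
// `task`, `result`, `results`, and `scores` are placeholders that this snippet
// does not define: one task and its run result, plus the collected results and
// scores. Here `result` means a single task's run result, not the pipeline
// result above.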
const evaluation = await parseEvalFile('path/to/tasks.yaml');
const judge = new SkillJudge({ model: 'haiku' });
const score = await judge.judgeResult(task, result);
const detScore = scoreDeterministic(task, result);
const report = generateReport(evaluation, results, scores);

Development

npm run dev        # Run CLI in dev mode (tsx)
npm run build      # Compile TypeScript
npm run typecheck  # Type check without emitting
npm run start      # Run compiled CLI

Keywords

agent-skills

Package last updated on 01 Mar 2026
