
@skilljack/evals
CLI for evaluating AI agent skill discoverability, adherence, and output quality. Runs as standalone CLI or GitHub Action.
CLI for evaluating AI agent skills across multiple agent frameworks. Tests how well agents discover, load, and execute Agent Skills — measuring discoverability, instruction adherence, and output quality.
Supports the Claude Agent SDK, Vercel AI SDK, and OpenAI Agents SDK. Runs standalone or as a GitHub Action.
Agent Skills are a lightweight, open-source format for extending AI agent capabilities. Each skill is a folder containing a SKILL.md file with metadata and instructions that agents can discover and use. Learn more at agentskills.io.
npm install
npm run build
# Run the example greeting evaluation
skilljack-evals run evals/example-greeting/tasks.yaml --verbose
# Deterministic scoring only (no LLM judge, free)
skilljack-evals run evals/example-greeting/tasks.yaml --no-judge
# Validate a task file without running
skilljack-evals validate evals/example-greeting/tasks.yaml
Start by writing eval tasks that describe the outcomes you want, then build your skill to pass them. This eval-first approach works like TDD for agent skills:
Decide if a skill is the right tool — Skills are for capabilities that should only activate on demand. For instructions that always apply, use CLAUDE.md or AGENTS.md. For validation and formatting, consider static analysis, pre-commit hooks, or agent hooks instead.
Define desired outcomes — Write eval tasks with the prompts users will say, the markers your skill should output, and a checklist of what "good" looks like.
Add false-positive tests — Include prompts that are similar but should not trigger the skill. These catch over-eager activation and are just as important as positive tests.
Create a minimal SKILL.md — Start with basic instructions and metadata.
Run evals and iterate — Use skilljack-evals run to see where the skill falls short. Deterministic checks (--no-judge) are free and fast for rapid iteration. Add the LLM judge when you're ready to evaluate output quality.
Keep the eval suite — As you update the skill, run evals as a regression check. Add them to CI with the GitHub Action to catch regressions automatically.
# Scaffold eval tasks for a new skill
skilljack-evals create-eval my-skill -o evals/my-skill/tasks.yaml
# Fast iteration loop (deterministic only, no API cost for judging)
skilljack-evals run evals/my-skill/tasks.yaml --no-judge --verbose
# Full evaluation with LLM judge
skilljack-evals run evals/my-skill/tasks.yaml --verbose
This workflow ensures your skill is discoverable from the right prompts, doesn't activate when it shouldn't, and produces the output quality you expect.
Three runners are available, selected via the --runner CLI flag:
| Runner | Flag | Model Format | Example |
|---|---|---|---|
| Claude Agent SDK (default) | --runner claude-sdk | Model aliases | sonnet, haiku |
| Vercel AI SDK | --runner vercel-ai | provider:model | anthropic:claude-sonnet-4-6, google:gemini-2.5-pro, openai:gpt-5.2, openrouter:deepseek/deepseek-v3.2 |
| OpenAI Agents SDK | --runner openai-agents | Plain model name | gpt-5.2 |
# Claude SDK (default)
skilljack-evals run evals/example-greeting/tasks.yaml --model sonnet
# Vercel AI SDK with different providers
skilljack-evals run evals/example-greeting/tasks.yaml --runner vercel-ai --model "anthropic:claude-sonnet-4-6"
skilljack-evals run evals/example-greeting/tasks.yaml --runner vercel-ai --model "google:gemini-2.5-pro"
skilljack-evals run evals/example-greeting/tasks.yaml --runner vercel-ai --model "openai:gpt-5.2"
skilljack-evals run evals/example-greeting/tasks.yaml --runner vercel-ai --model "openrouter:deepseek/deepseek-v3.2"
# OpenRouter — tested models
# openrouter:deepseek/deepseek-v3.2
# openrouter:minimax/minimax-m2.5
# openrouter:moonshotai/kimi-k2.5
# openrouter:z-ai/glm-5
# openrouter:openai/gpt-oss-120b
# OpenAI Agents SDK
skilljack-evals run evals/example-greeting/tasks.yaml --runner openai-agents --model "gpt-5.2"
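The `provider:model` format used with `--runner vercel-ai` can be split on the first colon, since OpenRouter model IDs contain a slash but the provider prefix does not contain a colon. The sketch below is only an illustration: `parseModelSpec` and `ModelSpec` are hypothetical names, not part of the package's documented API.

```typescript
// Hypothetical helper (not from @skilljack/evals): splits "provider:model"
// on the first colon and rejects strings without a provider prefix.
interface ModelSpec {
  provider: string;
  model: string;
}

function parseModelSpec(spec: string): ModelSpec {
  const idx = spec.indexOf(':');
  if (idx <= 0 || idx === spec.length - 1) {
    throw new Error(`Expected "provider:model", got "${spec}"`);
  }
  return { provider: spec.slice(0, idx), model: spec.slice(idx + 1) };
}

const spec = parseModelSpec('openrouter:deepseek/deepseek-v3.2');
console.log(spec.provider, spec.model);
```

A plain alias such as `sonnet` throws here, which mirrors the table: aliases belong to the Claude Agent SDK runner, not the Vercel AI one.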
The Vercel AI SDK and OpenAI Agents SDK runners require their respective peer dependencies:
# Vercel AI SDK
npm install ai zod @ai-sdk/openai @ai-sdk/anthropic @ai-sdk/google @openrouter/ai-sdk-provider
# OpenAI Agents SDK
npm install @openai/agents openai
Each runner uses the SDK's native mechanism for skill discovery and loading:
Claude Agent SDK: .claude/skills/ and the Skill tool. See Claude Code Skills and Agent Skills format.
Vercel AI SDK: loadSkill tool defined in the runner, following the Agent Skills cookbook guide.
OpenAI Agents SDK: shellTool() with local skill bundles. See Skills in OpenAI API and the Skills cookbook.
Set the appropriate API key in your environment or a .env file (see .env.example):
| Runner | Required Key |
|---|---|
| Claude SDK | ANTHROPIC_API_KEY |
| Vercel AI (anthropic:) | ANTHROPIC_API_KEY |
| Vercel AI (openai:) | OPENAI_API_KEY |
| Vercel AI (google:) | GOOGLE_GENERATIVE_AI_API_KEY |
| Vercel AI (openrouter:) | OPENROUTER_API_KEY |
| OpenAI Agents | OPENAI_API_KEY |
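The table can be read as a lookup from runner (and, for the Vercel AI runner, the provider prefix of the model) to an environment variable. A minimal sketch of such a pre-flight check follows; `requiredApiKey` and `missingApiKey` are hypothetical helpers written for illustration, and the environment is passed in as a plain object rather than read from the process.

```typescript
type Runner = 'claude-sdk' | 'vercel-ai' | 'openai-agents';

// Provider prefix -> environment variable, copied from the table above.
const VERCEL_PROVIDER_KEYS: Record<string, string> = {
  anthropic: 'ANTHROPIC_API_KEY',
  openai: 'OPENAI_API_KEY',
  google: 'GOOGLE_GENERATIVE_AI_API_KEY',
  openrouter: 'OPENROUTER_API_KEY',
};

// Hypothetical helper: which key does this runner/model combination need?
function requiredApiKey(runner: Runner, model: string): string {
  if (runner === 'claude-sdk') return 'ANTHROPIC_API_KEY';
  if (runner === 'openai-agents') return 'OPENAI_API_KEY';
  const [provider = ''] = model.split(':');
  const key = VERCEL_PROVIDER_KEYS[provider];
  if (!key) throw new Error(`Unknown Vercel AI provider: "${provider}"`);
  return key;
}

// Hypothetical helper: returns the name of the missing key, or null if it is set.
function missingApiKey(
  runner: Runner,
  model: string,
  env: Record<string, string | undefined>,
): string | null {
  const key = requiredApiKey(runner, model);
  return env[key] ? null : key;
}

console.log(missingApiKey('vercel-ai', 'google:gemini-2.5-pro', {}));
```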
To run the Claude Agent SDK runner through Amazon Bedrock, set these environment variables; the Agent SDK handles the rest:
CLAUDE_CODE_USE_BEDROCK=1
AWS_REGION=us-west-2
AWS_PROFILE=your-profile
Create an eval.config.yaml in your project root (all fields optional):
models:
  agent: sonnet # EVAL_AGENT_MODEL
  judge: haiku # EVAL_JUDGE_MODEL
scoring:
  weights:
    discovery: 0.3
    adherence: 0.4
    output: 0.3
thresholds:
  discovery_rate: 0.8 # EVAL_DISCOVERY_THRESHOLD
  avg_score: 4.0 # EVAL_SCORE_THRESHOLD
runner:
  timeout_ms: 300000 # EVAL_TASK_TIMEOUT_MS
  allowed_write_dirs:
    - ./results/
    - ./fixtures/
output:
  dir: ./results # EVAL_OUTPUT_DIR
  judge_truncation: 5000
  report_truncation: 2000
ci:
  exit_on_failure: true
  github_summary: false
Precedence (lowest to highest): YAML defaults → eval.config.yaml → environment variables (EVAL_*) → CLI flags.
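The precedence rule amounts to layering partial settings over the defaults, with each later source overriding only the fields it actually sets. The sketch below illustrates that idea under stated assumptions: the `Settings` shape and `resolveSettings` are hypothetical and simplified, and the package's own `loadConfig` may work differently.

```typescript
// Hypothetical, simplified settings shape for illustration only.
interface Settings {
  agentModel: string;
  judgeModel: string;
  timeoutMs: number;
}

// Applies layers from lowest to highest precedence. A field left undefined
// in a layer does not overwrite the value from a lower layer.
function resolveSettings(defaults: Settings, ...layers: Partial<Settings>[]): Settings {
  const resolved: Settings = { ...defaults };
  for (const layer of layers) {
    if (layer.agentModel !== undefined) resolved.agentModel = layer.agentModel;
    if (layer.judgeModel !== undefined) resolved.judgeModel = layer.judgeModel;
    if (layer.timeoutMs !== undefined) resolved.timeoutMs = layer.timeoutMs;
  }
  return resolved;
}

const defaults: Settings = { agentModel: 'sonnet', judgeModel: 'haiku', timeoutMs: 300000 };
const fromFile: Partial<Settings> = { judgeModel: 'sonnet' }; // eval.config.yaml
const fromEnv: Partial<Settings> = { timeoutMs: 120000 };     // e.g. EVAL_TASK_TIMEOUT_MS
const fromCli: Partial<Settings> = { agentModel: 'haiku' };   // e.g. --model haiku

console.log(resolveSettings(defaults, fromFile, fromEnv, fromCli));
```

Here the CLI flag wins for the agent model, the environment variable wins for the timeout, and the config file supplies the judge model because no higher layer sets it.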
run — Full evaluation pipeline
Runs the agent against tasks, scores results, and generates reports.
skilljack-evals run evals/greeting/tasks.yaml \
--runner vercel-ai --model "google:gemini-2.5-pro" \
--judge-model haiku \
--timeout 300000 \
--tasks gr-001,gr-002 \
--threshold-discovery 0.8 --threshold-score 4.0 \
--output-dir ./results \
--github-summary --verbose
score — Score existing results
skilljack-evals score results.json --judge-model haiku
report — Generate reports from scored results
skilljack-evals report -r results.json -o report.md --json report.json
validate — Check YAML syntax
skilljack-evals validate evals/greeting/tasks.yaml
create-eval — Generate task template
skilljack-evals create-eval greeting -o evals/greeting/tasks.yaml -n 10
parse — Parse YAML to JSON
skilljack-evals parse evals/greeting/tasks.yaml
YAML tasks → Config → Runner (Claude SDK / Vercel AI / OpenAI Agents) → Scorer (deterministic + LLM judge) → Report
.claude/skills/ in the working directory
Two scoring methods that can run independently or together:
Deterministic (free, fast):
LLM Judge (richer, ~$0.20/run with default settings):
Combined score: w_d * discovery + w_a * ((adherence-1)/4) + w_o * ((outputQuality-1)/4)
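The formula rescales adherence and output quality to the 0-1 range via `(x - 1) / 4` before weighting, so all three terms are on the same scale. A minimal sketch follows, assuming discovery is already in 0-1 and the other two scores are on a 1-5 scale (inferred from the formula and the 1-5 score threshold); `combinedScore` is a hypothetical name, and the default weights are the ones shown in the example configuration.

```typescript
interface Weights {
  discovery: number;
  adherence: number;
  output: number;
}

// Default weights as listed in the example eval.config.yaml.
const DEFAULT_WEIGHTS: Weights = { discovery: 0.3, adherence: 0.4, output: 0.3 };

// discovery is assumed to be in [0, 1]; adherence and outputQuality in [1, 5].
function combinedScore(
  discovery: number,
  adherence: number,
  outputQuality: number,
  w: Weights = DEFAULT_WEIGHTS,
): number {
  return (
    w.discovery * discovery +
    w.adherence * ((adherence - 1) / 4) +
    w.output * ((outputQuality - 1) / 4)
  );
}

// Skill discovered, adherence 3/5, output quality 4/5:
// 0.3 * 1 + 0.4 * 0.5 + 0.3 * 0.75, roughly 0.725
console.log(combinedScore(1, 3, 4).toFixed(3));
```

With weights that sum to 1, the result ranges from 0 (skill not discovered, both scores at 1) to 1 (skill discovered, both scores at 5).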
| Category | Meaning |
|---|---|
| discovery_failure | Agent didn't load the skill |
| false_positive | Agent loaded a skill it shouldn't have |
| instruction_ambiguity | Agent misinterpreted instructions |
| missing_guidance | Skill didn't cover the needed case |
| agent_error | Agent made a mistake despite guidance |
| none | No failure |
skill: greeting
version: "1.0"
defaults:
  expected_skill_load: greeting
  criteria:
    discovery: { weight: 0.3 }
    adherence: { weight: 0.4 }
    output: { weight: 0.3 }
tasks:
  - id: gr-001
    prompt: "Hello! Please greet me using the greeting skill."
    # Deterministic checks (optional, free)
    deterministic:
      expect_skill_activation: true
      expect_marker: "GREETING_SUCCESS"
      expect_tool_calls: []
      expect_no_tool_calls: []
    # LLM judge criteria (optional, costs API calls)
    criteria:
      discovery: { weight: 0.3, description: "Should load greeting skill" }
      adherence: { weight: 0.4, description: "Should follow skill format" }
      output: { weight: 0.3, description: "Greeting is friendly" }
    golden_checklist:
      - "Loaded the greeting skill"
      - "Friendly tone"
  # False positive test — skill should NOT activate
  - id: gr-fp-001
    prompt: "What are best practices for email greetings?"
    expected_skill_load: none
    deterministic:
      expect_skill_activation: false
Both deterministic and criteria blocks are optional. If both are present, the scorer runs both and merges results.
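To make the `deterministic` block concrete, the sketch below shows one way the `expect_skill_activation` and `expect_marker` fields could be checked against a run. It is an illustration under stated assumptions, not the package's `scoreDeterministic`: `checkDeterministic`, `AgentRun`, and its fields are hypothetical, and only the failure labels `discovery_failure` and `false_positive` come from the categories table.

```typescript
// Field names mirror the task YAML; everything else here is hypothetical.
interface DeterministicExpectations {
  expect_skill_activation?: boolean;
  expect_marker?: string;
}

// Hypothetical summary of what the agent did for one task.
interface AgentRun {
  skillLoaded: boolean;
  output: string;
}

function checkDeterministic(
  expected: DeterministicExpectations,
  run: AgentRun,
): { passed: boolean; failures: string[] } {
  const failures: string[] = [];
  if (expected.expect_skill_activation === true && !run.skillLoaded) {
    failures.push('discovery_failure'); // skill should have loaded but did not
  }
  if (expected.expect_skill_activation === false && run.skillLoaded) {
    failures.push('false_positive'); // skill loaded when it should not have
  }
  if (expected.expect_marker !== undefined && !run.output.includes(expected.expect_marker)) {
    failures.push(`missing marker: ${expected.expect_marker}`);
  }
  return { passed: failures.length === 0, failures };
}

// The false-positive task gr-fp-001 fails if the skill loads anyway.
console.log(checkDeterministic({ expect_skill_activation: false }, { skillLoaded: true, output: '' }));
```

Checks of this kind need no model call, which is why the deterministic path is free to run on every iteration.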
- uses: olaservo/skilljack-evals@v1
  with:
    tasks: evals/commit/tasks.yaml
    threshold-discovery: '0.8'
    threshold-score: '4.0'
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
| Input | Required | Default | Description |
|---|---|---|---|
| tasks | Yes | — | Path to tasks YAML file |
| runner | No | claude-sdk | Runner type: claude-sdk, vercel-ai, openai-agents |
| model | No | sonnet | Agent model |
| judge-model | No | haiku | LLM judge model |
| config | No | — | Path to eval.config.yaml |
| threshold-discovery | No | 0.8 | Minimum discovery rate (0-1) |
| threshold-score | No | 4.0 | Minimum average score (1-5) |
| timeout | No | 300000 | Per-task timeout (ms) |
| tasks-filter | No | — | Comma-separated task IDs |
| skills-dir | No | — | Path to skills directory |
| no-judge | No | false | Skip LLM judge |
| no-deterministic | No | false | Skip deterministic scoring |
| Output | Description |
|---|---|
| passed | Whether all thresholds were met |
| discovery-rate | Discovery rate achieved (0-1) |
| avg-score | Average weighted score |
| report-path | Path to markdown report |
| json-path | Path to JSON report |
The action writes a condensed summary to $GITHUB_STEP_SUMMARY and exits with code 1 if thresholds are not met.
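The pass/fail decision can be pictured as a comparison of two aggregate numbers against their minimums. The sketch below assumes both thresholds are inclusive minimums (the inputs table calls them "minimum" values) and uses the documented defaults of 0.8 and 4.0; `thresholdExitCode` and the `RunSummary` shape are hypothetical.

```typescript
interface Thresholds {
  discoveryRate: number; // minimum discovery rate, 0-1
  avgScore: number;      // minimum average score, 1-5
}

// Hypothetical aggregate of a finished evaluation run.
interface RunSummary {
  discoveryRate: number;
  avgScore: number;
}

const DEFAULT_THRESHOLDS: Thresholds = { discoveryRate: 0.8, avgScore: 4.0 };

// Returns 0 when both thresholds are met and 1 otherwise.
// Assumption: a value exactly equal to the threshold counts as a pass.
function thresholdExitCode(
  summary: RunSummary,
  thresholds: Thresholds = DEFAULT_THRESHOLDS,
): 0 | 1 {
  const passed =
    summary.discoveryRate >= thresholds.discoveryRate &&
    summary.avgScore >= thresholds.avgScore;
  return passed ? 0 : 1;
}

console.log(thresholdExitCode({ discoveryRate: 0.9, avgScore: 3.6 }));
```

In this example the discovery rate clears its threshold but the average score does not, so the run counts as a failure.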
import {
  parseEvalFile,
  SkillJudge,
  generateReport,
  runPipeline,
  scoreDeterministic,
  loadConfig,
} from '@skilljack/evals';

// Full pipeline
const pipelineResult = await runPipeline({
  tasksFile: 'evals/greeting/tasks.yaml',
  configOverrides: { defaultAgentModel: 'sonnet' },
  verbose: true,
});

// Or individual steps.
// `task`, `result`, `results`, and `scores` are placeholders here: a single task,
// the agent's result for it, and the collected results and scores.
const evaluation = await parseEvalFile('path/to/tasks.yaml');
const judge = new SkillJudge({ model: 'haiku' });
const score = await judge.judgeResult(task, result);
const detScore = scoreDeterministic(task, result);
const report = generateReport(evaluation, results, scores);
npm run dev # Run CLI in dev mode (tsx)
npm run build # Compile TypeScript
npm run typecheck # Type check without emitting
npm run start # Run compiled CLI