
The testing framework for Agent Skills. Lint, test triggering, improve, and evaluate your SKILL.md files.
skilltest is a standalone CLI for the Agent Skills ecosystem (spec: https://agentskills.io). Think of it as pytest for skills.
The repository itself uses a fast Vitest suite for offline unit and integration coverage of the parser, linters, trigger math, config resolution, reporters, and linter orchestration.
Agent Skills are quick to write but hard to validate before deployment: descriptions may never trigger, and broken links into scripts/, references/, or assets/ fail silently. skilltest closes this gap with one CLI and five modes.
Global:
npm install -g skilltest
Without install:
npx skilltest --help
Requires Node.js >=18.
Lint a skill:
skilltest lint ./path/to/skill
Trigger test:
skilltest trigger ./path/to/skill --provider anthropic --model claude-sonnet-4-5-20250929
End-to-end eval:
skilltest eval ./path/to/skill --provider anthropic --model claude-sonnet-4-5-20250929
Run full quality gate:
skilltest check ./path/to/skill --provider anthropic --min-f1 0.8 --min-assert-pass-rate 0.9
Propose a verified rewrite without touching the source file:
skilltest improve ./path/to/skill --provider anthropic
Apply the verified rewrite in place:
skilltest improve ./path/to/skill --provider anthropic --apply
Write a self-contained HTML report:
skilltest check ./path/to/skill --html ./reports/check.html
Model-backed commands default to --concurrency 5. Use --concurrency 1 to force
the old sequential execution order. Seeded trigger runs stay deterministic regardless
of concurrency.
lint, trigger, eval, and check support --html <path> for offline reports.
improve is terminal/JSON only in v1.
Example lint summary:
skilltest lint
target: ./test-fixtures/sample-skill
summary: 29/29 checks passed, 0 warnings, 0 failures
skilltest resolves config in this order:
- .skilltestrc in the target skill root
- .skilltestrc in the current working directory
- package.json containing a skilltestrc field

CLI flags override config values.
Example .skilltestrc:
{
  "provider": "anthropic",
  "model": "claude-sonnet-4-5-20250929",
  "concurrency": 5,
  "trigger": {
    "numQueries": 20,
    "threshold": 0.8,
    "seed": 123
  },
  "eval": {
    "numRuns": 5,
    "threshold": 0.9,
    "maxToolIterations": 10
  }
}
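Per the resolution order above, config can also live in package.json under a skilltestrc key. A sketch (the surrounding package fields are illustrative):

```json
{
  "name": "my-skill-workspace",
  "private": true,
  "skilltestrc": {
    "provider": "anthropic",
    "trigger": { "seed": 123 }
  }
}
```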
skilltest lint <path-to-skill>

Static analysis only. Fast and offline.
What it checks:
- name: required, max 64 characters, lowercase/numbers/hyphens, no leading/trailing/consecutive hyphens
- description: required, non-empty, max 1024 characters
- license frontmatter field
- warns when SKILL.md exceeds 500 lines
- referenced scripts/, references/, and assets/ paths
- dangerous instruction patterns (sudo, disable approvals, require_escalated)
- warns when SKILL.md is large and no references/ exists
- allowed-tools frontmatter

Flags:
- --html <path> write a self-contained HTML report
- --plugin <path> load a custom lint plugin file (repeatable)

You can run custom lint rules alongside the built-in checks. Plugin rules use the
same LintContext and LintIssue types as the core linter, and their results
appear in the same LintReport.
Config:
{
  "lint": {
    "plugins": ["./my-rules.js"]
  }
}
CLI:
skilltest lint ./skill --plugin ./my-rules.js
Minimal plugin example:
export default {
  rules: [
    {
      checkId: "custom:no-todo",
      title: "No TODO comments",
      check(context) {
        const body = context.frontmatter.content;
        if (/\bTODO\b/.test(body)) {
          return [
            {
              id: "custom.no-todo",
              checkId: "custom:no-todo",
              title: "No TODO comments",
              status: "warn",
              message: "SKILL.md contains a TODO marker."
            }
          ];
        }
        return [
          {
            id: "custom.no-todo",
            checkId: "custom:no-todo",
            title: "No TODO comments",
            status: "pass",
            message: "No TODO markers found."
          }
        ];
      }
    }
  ]
};
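As a second, hypothetical sketch of the same plugin surface: a rule that warns when the SKILL.md body is long, echoing the built-in large-file checks. The context.frontmatter.content field follows the TODO example above; verify the shapes against your version's LintContext and LintIssue types.

```javascript
// Hypothetical rule: warn when the SKILL.md body exceeds a soft line limit.
const bodyLengthRule = {
  checkId: "custom:body-length",
  title: "Body under 300 lines",
  check(context) {
    const lines = context.frontmatter.content.split("\n").length;
    const ok = lines <= 300;
    return [
      {
        id: "custom.body-length",
        checkId: "custom:body-length",
        title: "Body under 300 lines",
        status: ok ? "pass" : "warn",
        message: ok
          ? `Body is ${lines} lines.`
          : `Body is ${lines} lines; consider moving detail into references/.`
      }
    ];
  }
};

export default { rules: [bodyLengthRule] };
```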
Notes:
- plugins are loaded via dynamic import(); .js and .mjs work directly, while .ts plugins must be precompiled by the user
- --plugin values replace config-file lint.plugins values

skilltest trigger <path-to-skill>

Measures trigger behavior for your skill description with model simulation.
Flow:
- reads name and description from frontmatter
- builds a candidate list of fake skills (plus any --compare competitors) and asks the model which skill to select for each query

For reproducible fake-skill sampling, pass --seed <number>. When a seed is used,
terminal and JSON output include it so the run can be repeated exactly. If you use
.skilltestrc, trigger.seed sets the default and the CLI flag overrides it.
The fake-skill setup is precomputed before requests begin, so the same seed produces
the same trigger cases at any concurrency level.
Flags:
- --model <model> default: claude-sonnet-4-5-20250929
- --provider <anthropic|openai> default: anthropic
- --queries <path> use custom queries JSON
- --compare <path> path to a sibling skill directory to use as a competitor (repeatable)
- --num-queries <n> default: 20 (must be even)
- --seed <number> RNG seed for reproducible fake-skill sampling
- --concurrency <n> default: 5
- --html <path> write a self-contained HTML report
- --save-queries <path> save generated query set
- --api-key <key> explicit key override
- --verbose show full model decision text

Test whether your skill is distinctive enough to be selected over similar real skills:
skilltest trigger ./my-skill --compare ../similar-skill-1 --compare ../similar-skill-2
Config:
{
  "trigger": {
    "compare": ["../similar-skill-1", "../similar-skill-2"]
  }
}
Comparative mode includes the real competitor skills in the candidate list alongside fake skills. This reveals confusion between skills with overlapping descriptions that standard trigger testing would miss.
skilltest eval <path-to-skill>

Runs full skill behavior and grades outputs against assertions.
Flow:
- runs each eval prompt against the model with SKILL.md as system instructions
- grades the outputs against the prompt's assertions

Flags:
- --prompts <path> custom prompts JSON
- --model <model> default: claude-sonnet-4-5-20250929
- --grader-model <model> default: same as --model
- --provider <anthropic|openai> default: anthropic
- --concurrency <n> default: 5
- --html <path> write a self-contained HTML report
- --save-results <path> write full JSON result
- --api-key <key> explicit key override
- --verbose show full model responses

Config-only eval setting:
- eval.maxToolIterations default: 10, safety cap for tool-aware eval loops

When an eval prompt defines tools, skilltest runs the prompt in a mock tool
environment instead of plain text-only execution. The model can call the mocked
tools during eval, and skilltest records the calls alongside the normal grader
assertions.
Tool responses are always mocked. skilltest does not execute real tools,
scripts, shell commands, MCP servers, or APIs during eval.
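The responses map in the prompt file below pairs a JSON-serialized argument string with a canned reply, with "*" as a fallback. A sketch of that lookup; the exact matching rules (e.g. key ordering, whitespace) are assumptions, so check skilltest's behavior:

```javascript
// Hypothetical sketch of the mock-response lookup: exact
// serialized-args key first, then the "*" wildcard fallback.
function mockToolResponse(responses, args) {
  const key = JSON.stringify(args); // e.g. '{"path":"checklist.md"}'
  return key in responses ? responses[key] : responses["*"];
}
```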
Example prompt file:
[
  {
    "prompt": "Parse this deployment checklist and tell me what is missing.",
    "assertions": ["output should mention the missing rollback plan"],
    "tools": [
      {
        "name": "read_file",
        "description": "Read a file from the workspace",
        "parameters": [
          { "name": "path", "type": "string", "description": "File path to read", "required": true }
        ],
        "responses": {
          "{\"path\":\"checklist.md\"}": "# Deploy Checklist\n- [x] Migrations\n- [ ] Rollback plan\n- [x] Alerting",
          "*": "[mock] File not found"
        }
      },
      {
        "name": "run_script",
        "description": "Execute a shell script",
        "parameters": [
          { "name": "command", "type": "string", "description": "Command to run", "required": true }
        ],
        "responses": {
          "*": "Script executed successfully. Output: 3 items checked, 1 missing."
        }
      }
    ],
    "toolAssertions": [
      { "type": "tool_called", "toolName": "read_file", "description": "Model should read the checklist file" },
      { "type": "tool_not_called", "toolName": "delete_file", "description": "Model should not delete any files" },
      {
        "type": "tool_argument_match",
        "toolName": "read_file",
        "expectedArgs": { "path": "checklist.md" },
        "description": "Model should read checklist.md specifically"
      }
    ]
  }
]
Run it with:
skilltest eval ./my-skill --prompts ./eval-prompts.json
skilltest check <path-to-skill>

Runs lint + trigger + eval in one command and applies quality thresholds.
Default behavior:
- runs lint, then trigger and eval; with --concurrency above 1, trigger and eval run in parallel

Flags:
- --provider <anthropic|openai> default: anthropic
- --model <model> default: claude-sonnet-4-5-20250929 (auto-switches to gpt-4.1-mini for --provider openai when unchanged)
- --grader-model <model> default: same as resolved --model
- --api-key <key> explicit key override
- --queries <path> custom trigger queries JSON
- --compare <path> path to a sibling skill directory to use as a competitor (repeatable)
- --num-queries <n> default: 20 (must be even)
- --seed <number> RNG seed for reproducible trigger sampling
- --prompts <path> custom eval prompts JSON
- --plugin <path> load a custom lint plugin file (repeatable)
- --concurrency <n> default: 5 (1 keeps the old sequential check behavior)
- --html <path> write a self-contained HTML report
- --min-f1 <n> default: 0.8
- --min-assert-pass-rate <n> default: 0.9
- --save-results <path> save combined check result JSON
- --continue-on-lint-fail continue trigger/eval even if lint fails
- --verbose include detailed trigger/eval sections

skilltest improve <path-to-skill>

Rewrites SKILL.md, verifies the rewrite on a frozen test set, and optionally applies it.
Default behavior:
- runs a baseline check with continue-on-lint-fail=true
- asks the model for a rewritten SKILL.md, preserving name and keeping license when one already exists
- verifies the rewrite by re-running check against a copied skill directory with the frozen trigger/eval inputs

Flags:
- --provider <anthropic|openai> default: anthropic
- --model <model> default: claude-sonnet-4-5-20250929 (auto-switches to gpt-4.1-mini for --provider openai when unchanged)
- --api-key <key> explicit key override
- --queries <path> custom trigger queries JSON
- --compare <path> path to a sibling skill directory to use as a competitor (repeatable)
- --num-queries <n> default: 20 (must be even when auto-generating)
- --seed <number> RNG seed for reproducible trigger sampling
- --prompts <path> custom eval prompts JSON
- --plugin <path> load a custom lint plugin file (repeatable)
- --concurrency <n> default: 5
- --output <path> write the verified candidate SKILL.md to a separate file
- --save-results <path> save full improve result JSON
- --min-f1 <n> default: 0.8
- --min-assert-pass-rate <n> default: 0.9
- --apply write the verified rewrite back to the source SKILL.md
- --verbose include full baseline and verification reports

Notes:
- improve is dry-run by default
- --apply only writes when parse, lint, trigger, and eval verification all pass

Global flags (all commands):
- --help show help
- --version show version
- --json output only valid JSON to stdout
- --no-color disable terminal colors

Trigger queries (--queries):
[
  {
    "query": "Please validate this deployment checklist and score it.",
    "should_trigger": true
  },
  {
    "query": "Write a SQL migration for adding an index.",
    "should_trigger": false
  }
]
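A custom queries file can be sanity-checked with a small script before a run. This is a hypothetical helper, not part of skilltest; the even-count check mirrors the --num-queries constraint above and may not be enforced for custom files.

```javascript
// Hypothetical pre-flight validation for a custom --queries JSON file.
function validateQueries(queries) {
  const errors = [];
  if (!Array.isArray(queries)) {
    errors.push("file must be a JSON array");
    return errors;
  }
  if (queries.length % 2 !== 0) {
    errors.push("query count should be even");
  }
  queries.forEach((q, i) => {
    if (typeof q.query !== "string") errors.push(`entry ${i}: missing "query" string`);
    if (typeof q.should_trigger !== "boolean") errors.push(`entry ${i}: missing "should_trigger" boolean`);
  });
  return errors;
}

// Example:
// validateQueries(JSON.parse(fs.readFileSync("./queries.json", "utf8")));
```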
Eval prompts (--prompts):
[
  {
    "prompt": "Validate this markdown checklist for a production release.",
    "assertions": [
      "output should include pass/warn/fail style categorization",
      "output should provide at least one remediation recommendation"
    ]
  }
]
Tool-aware eval prompts (--prompts):
[
  {
    "prompt": "Parse this deployment checklist and tell me what is missing.",
    "assertions": ["output should mention remediation steps"],
    "tools": [
      {
        "name": "read_file",
        "description": "Read a file from the workspace",
        "parameters": [
          { "name": "path", "type": "string", "description": "File path to read", "required": true }
        ],
        "responses": {
          "{\"path\":\"checklist.md\"}": "# Deploy Checklist\n- [x] Migrations\n- [ ] Rollback plan\n- [x] Alerting",
          "*": "[mock] File not found"
        }
      },
      {
        "name": "run_script",
        "description": "Execute a shell script",
        "parameters": [
          { "name": "command", "type": "string", "description": "Command to run", "required": true }
        ],
        "responses": {
          "*": "Script executed successfully. Output: 3 items checked, 1 missing."
        }
      }
    ],
    "toolAssertions": [
      { "type": "tool_called", "toolName": "read_file", "description": "Model should read the checklist file" },
      { "type": "tool_not_called", "toolName": "delete_file", "description": "Model should not delete any files" },
      {
        "type": "tool_argument_match",
        "toolName": "read_file",
        "expectedArgs": { "path": "checklist.md" },
        "description": "Model should read checklist.md specifically"
      }
    ]
  }
]
Exit codes:
- 0: success
- 1: quality gate failed (lint, check, improve blocked, or other command-specific failure conditions)
- 2: runtime/config/API/parse error

JSON mode examples:
skilltest lint ./skill --json
skilltest trigger ./skill --json
skilltest eval ./skill --json
skilltest check ./skill --json
skilltest improve ./skill --json
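In a CI wrapper, the exit codes documented above can be mapped to outcomes. A hypothetical Node sketch; the spawn invocation is illustrative:

```javascript
import { spawnSync } from "node:child_process";

// Map skilltest's documented exit codes to CI outcomes.
function interpretExitCode(code) {
  switch (code) {
    case 0: return "success";
    case 1: return "quality gate failed";
    case 2: return "runtime/config/API/parse error";
    default: return `unexpected exit code ${code}`;
  }
}

// const run = spawnSync("skilltest", ["check", "./skill", "--json"]);
// console.log(interpretExitCode(run.status));
```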
HTML report examples:
skilltest lint ./skill --html ./reports/lint.html
skilltest trigger ./skill --html ./reports/trigger.html
skilltest eval ./skill --html ./reports/eval.html
skilltest check ./skill --json --html ./reports/check.html
Seeded trigger example:
skilltest trigger ./skill --seed 123
Anthropic:
export ANTHROPIC_API_KEY=your-key
OpenAI:
export OPENAI_API_KEY=your-key
Override at runtime:
skilltest trigger ./skill --api-key your-key
Current provider status:
- anthropic: implemented
- openai: implemented

OpenAI quick example:
skilltest trigger ./path/to/skill --provider openai --model gpt-4.1-mini
skilltest eval ./path/to/skill --provider openai --model gpt-4.1-mini
Note:
If you pass --provider openai and keep the Anthropic default model value, skilltest automatically switches to gpt-4.1-mini.

GitHub Actions example to lint skills on pull requests:
name: skill-lint
on:
  pull_request:
    paths:
      - "**/SKILL.md"
      - "**/references/**"
      - "**/scripts/**"
      - "**/assets/**"
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npm run lint
      - run: npm run test
      - run: npm run build
      - run: npx skilltest lint path/to/skill --json
Optional nightly trigger/eval:
name: skill-eval-nightly
on:
  schedule:
    - cron: "0 4 * * *"
jobs:
  trigger-eval:
    runs-on: ubuntu-latest
    env:
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npm run build
      - run: npx skilltest trigger path/to/skill --num-queries 20 --json
      - run: npx skilltest eval path/to/skill --prompts path/to/prompts.json --json
      - run: npx skilltest check path/to/skill --min-f1 0.8 --min-assert-pass-rate 0.9 --json
npm install
npm run lint
npm run test
npm run build
node dist/index.js --help
npm test runs the Vitest suite. The tests are offline and do not call model
providers.
Manual CLI smoke tests:
node dist/index.js lint test-fixtures/sample-skill/
node dist/index.js lint test-fixtures/sample-skill/ --html lint-report.html
node dist/index.js trigger test-fixtures/sample-skill/ --num-queries 2
node dist/index.js trigger test-fixtures/sample-skill/ --queries path/to/queries.json --seed 123
node dist/index.js eval test-fixtures/sample-skill/ --prompts test-fixtures/eval-prompts.json
node dist/index.js check test-fixtures/sample-skill/ --num-queries 2 --prompts test-fixtures/eval-prompts.json
node dist/index.js improve test-fixtures/sample-skill/ --num-queries 2 --prompts test-fixtures/eval-prompts.json
npm run lint
npm run build
npm run test
npm pack --dry-run
npm publish --dry-run
Then publish:
npm publish
Issues and pull requests are welcome. Include:
- the SKILL.md or fixtures when relevant

License: MIT