oh-my-knowledge

Evaluation framework for LLM knowledge inputs — prompts, RAG corpora, skills, agent workflows. Fix the model, vary the artifact. Built-in statistical rigor: bootstrap CI, Krippendorff α, length-debias, saturation curves.

Version 0.42.0 (latest) · Source: npm · Weekly downloads: 587 (47.86%) · Maintainers: 1


Did your prompt actually get better? A/B test your prompts and skills with statistical rigor — bootstrap CI and length-debias on by default, Krippendorff α the moment you add a gold set.

📖 Full documentation: oh-my-knowledge.pages.dev (searchable, English / 简体中文)

[Screenshot: omk report with the verdict pill "v2 is clearly better than v1 — ready to ship"]

Quick start

npm i -g oh-my-knowledge
omk init demo && cd demo
omk eval --control code-review-v1 --treatment code-review-v2

This runs out of the box, with no edits needed first. omk init scaffolds two skill variants and three sample cases; omk eval runs the controlled A/B comparison and, in about five minutes, opens an HTML report with a one-line verdict. Once it runs, swap in your own skills and cases.

Prerequisite: the default executor and judge use the claude CLI, so install it and log in first (see Requirements). To use another model or run offline with no API key, see executors.

The first run has only 3 cases, so the verdict will usually be UNDERPOWERED (insufficient data). That is a normal starting point, not an error; grow the set to roughly 20 or more cases before trusting a ship/no-ship call.
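
To see why three cases rarely support a verdict, here is a minimal sketch of a percentile bootstrap confidence interval over per-case score differences (treatment minus control). It is not omk's implementation: the function names, the seeded generator, the resample count, and the sample scores are all assumptions made for illustration.

```typescript
// Tiny seeded generator (a linear congruential generator) so the sketch is reproducible.
function seededRandom(seed: number): () => number {
  let state = seed % 4294967296;
  return function (): number {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

interface Interval {
  mean: number;
  lo: number;
  hi: number;
}

// Percentile bootstrap CI for the mean of paired score differences.
// `resamples`, `level`, and `seed` are illustrative defaults, not omk settings.
function bootstrapMeanCI(
  diffs: number[],
  resamples: number = 2000,
  level: number = 0.95,
  seed: number = 42
): Interval {
  if (diffs.length === 0) {
    throw new Error("need at least one paired difference");
  }
  const rand = seededRandom(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    // Resample the differences with replacement and record the resample mean.
    let sum = 0;
    for (let i = 0; i < diffs.length; i++) {
      sum += diffs[Math.floor(rand() * diffs.length)];
    }
    means.push(sum / diffs.length);
  }
  means.sort(function (a, b) { return a - b; });
  const tail = (1 - level) / 2;
  let total = 0;
  for (let i = 0; i < diffs.length; i++) total += diffs[i];
  return {
    mean: total / diffs.length,
    lo: means[Math.floor(tail * (resamples - 1))],
    hi: means[Math.ceil((1 - tail) * (resamples - 1))],
  };
}

// Hypothetical per-case differences: 3 cases versus 21 cases with the same spread.
const threeCases = [0.5, -0.2, 0.9];
const manyCases: number[] = [];
for (let i = 0; i < 21; i++) manyCases.push(threeCases[i % threeCases.length]);

const smallRun = bootstrapMeanCI(threeCases);
const largeRun = bootstrapMeanCI(manyCases);
```

With only three differences the interval covers most of the observed range; with more cases of the same spread it narrows, which is the intuition behind growing the case set before trusting a call.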

The CLI notifies you when a newer version is available (at most once per 20h); set OMK_SKIP_UPDATE_CHECK=1 to silence it permanently.

Walkthrough: 5-minute quickstart guide (recommended for first-time users). More runnable examples (A/B, offline executor, batch, evolve, agent, RAG) live in the repo's example gallery.

Deeper: who omk is for · CLI reference · how it works · eval sample format · executors · artifact layout

Use inside AI Coding Agents

Install the official omk Agent Skill to let your coding agent run omk workflows from natural language:

omk install omk-agent-skill

By default, omk installs only into detected local targets it explicitly supports: Codex/AGENTS when ~/.codex or ~/.agents exists, and Claude Code when ~/.claude exists. Use --to all to force every target omk currently knows, or --dest for a custom skill root.

Use inside Claude Code

When the omk skill is available in Claude Code, you can invoke it directly:

/omk eval              # evaluate the artifact(s) in the current project
/omk evolve            # auto-iterate to improve a skill
/omk sample            # generate or fill test cases

These slash commands are natural-language entry points — the agent reads the conversation context to figure out which skill to operate on. You can also just say "compare v1 vs v2 for me" or "improve this artifact" and omk picks the right command.

Use inside Codex

Codex does not support Claude Code style /omk ... slash commands. Ask the agent to run the omk CLI directly:

omk eval
omk evolve skills/my-skill.md   # one-shot: doctor → (auto-generate samples if missing) → self-iterate
omk sample skills/my-skill.md

You can also describe the goal in natural language, such as "compare v1 vs v2" or "generate test cases for this skill".

omk evolve is a one-shot loop: it runs the doctor gate first, auto-generates eval samples when the target skill has none, then self-iterates. For a brand-new skill, just run omk evolve skills/foo.md.

Why this tool

Teams doing knowledge engineering produce lots of knowledge artifacts (skills today, but also prompts, agents, workflows…). When someone asks "why is v2 better than v1", you need objective data instead of gut feeling. oh-my-knowledge solves this with controlled experiments: same model, same test samples, only the knowledge artifact changes.
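
The controlled-experiment idea can be sketched in a few lines: every case runs against both artifacts through the same executor, and only the artifact text differs. This is a toy illustration, not omk's code; the `Executor` type, the `runControlled` function, and the keyword-counting stub that stands in for a model call are all hypothetical.

```typescript
// Hypothetical executor signature: takes an artifact and a case input, returns a score.
type Executor = (artifact: string, input: string) => number;
type Side = "control" | "treatment";

interface EvalCase {
  id: string;
  input: string;
}

interface PairedScore {
  id: string;
  control: number;
  treatment: number;
  diff: number; // treatment minus control
}

function runControlled(
  cases: EvalCase[],
  control: string,
  treatment: string,
  execute: Executor
): PairedScore[] {
  const artifacts: Record<Side, string> = { control: control, treatment: treatment };
  return cases.map(function (c, i): PairedScore {
    // Alternate which side runs first so ordering effects do not favor one variant.
    const order: Side[] = i % 2 === 0 ? ["control", "treatment"] : ["treatment", "control"];
    const scores: Record<Side, number> = { control: 0, treatment: 0 };
    for (let k = 0; k < order.length; k++) {
      const side = order[k];
      scores[side] = execute(artifacts[side], c.input);
    }
    return {
      id: c.id,
      control: scores.control,
      treatment: scores.treatment,
      diff: scores.treatment - scores.control,
    };
  });
}

// Stub executor (an assumption for the demo): counts input words the artifact mentions.
const keywordExecutor: Executor = function (artifact, input) {
  const words = input.split(" ");
  let hits = 0;
  for (let i = 0; i < words.length; i++) {
    if (artifact.indexOf(words[i]) !== -1) hits++;
  }
  return hits;
};

const pairedScores = runControlled(
  [
    { id: "c1", input: "null check" },
    { id: "c2", input: "error handling" },
  ],
  "Review code for style.",
  "Review code for style, null safety, and error handling.",
  keywordExecutor
);
```

Because both variants see identical cases and the same executor, any difference in `diff` can only come from the artifact text.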

Why omk over alternatives

omk | promptfoo | DeepEval | LangSmith
Bootstrap CI | ✓ default
Krippendorff α (judge ↔ human) | ✓ with gold set
Length-debias judge prompt | ✓ default
Saturation curve
Three-layer scoring isolation | partial
Per-variant skill isolation (construct validity) | ✓ default
Native Agent Skill
Hosted SaaS dashboard

omk's moat is its default-on safety net: bootstrap CI and length-debias aren't advanced flags, they're the default, and judge ↔ human α comes free the moment you add a gold set. Other tools let you opt into confidence intervals; omk makes them unavoidable. Need a hosted SaaS dashboard? Choose LangSmith. Want quick local prompt iteration without statistics? Choose promptfoo. Shipping to production where someone will ask "why should I trust this number?" Choose omk.
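
For readers unfamiliar with judge ↔ human α, the sketch below computes Krippendorff's α for the simplest case: two raters (an LLM judge and a human gold set), no missing ratings, and an interval distance metric (squared difference). Whether omk uses this metric or handles missing data differently is not stated here, so treat the function name and these choices as assumptions.

```typescript
// Krippendorff's alpha = 1 - (observed disagreement / expected disagreement).
// Assumes two raters, complete data, and interval-scaled scores.
function krippendorffAlphaInterval(judge: number[], human: number[]): number {
  if (judge.length === 0 || judge.length !== human.length) {
    throw new Error("both raters must score the same non-empty set of units");
  }
  const units = judge.length;

  // Observed disagreement: mean squared difference within each rated unit.
  let observed = 0;
  for (let i = 0; i < units; i++) {
    observed += (judge[i] - human[i]) * (judge[i] - human[i]);
  }
  observed /= units;

  // Expected disagreement: mean squared difference over all ordered pairs
  // of ratings pooled across both raters.
  const pooled = judge.concat(human);
  const total = pooled.length;
  let expected = 0;
  for (let i = 0; i < total; i++) {
    for (let j = 0; j < total; j++) {
      if (i !== j) {
        expected += (pooled[i] - pooled[j]) * (pooled[i] - pooled[j]);
      }
    }
  }
  expected /= total * (total - 1);

  if (expected === 0) {
    throw new Error("all ratings are identical; alpha is undefined");
  }
  return 1 - observed / expected;
}

// Hypothetical scores on a 1-4 scale.
const alphaAgree = krippendorffAlphaInterval([1, 2, 3, 4], [1, 2, 3, 4]);
const alphaReversed = krippendorffAlphaInterval([1, 2, 3, 4], [4, 3, 2, 1]);
```

Identical ratings give α = 1, while systematically reversed ratings give a negative α, meaning agreement worse than chance.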

RAG-specific evals: see RAGAS (separate niche, complementary to omk). Full comparison with 7 tools across 25+ dimensions: docs/reference/comparison.md.

Features

| Feature | What it does |
| --- | --- |
| One-line verdict | omk eval six-tier verdict + ship recommendation + exit-code routing; HTML pill shares the same rules |
| Six-dim evaluation | Fact / Behavior / LLM-judge / Cost / Efficiency / Stability shown independently |
| Multi-executor | Claude CLI / Claude SDK / Codex CLI / Codex SDK / OpenAI / Gemini / Anthropic API / any custom command |
| 30+ assertion types | substring, regex, JSON Schema, ROUGE/BLEU/Levenshtein similarity, agent tool-call assertions, semantic similarity, custom JS |
| Statistical rigor | Bootstrap CI / length-debias / saturation curve on by default; Krippendorff α auto-computed with a gold set |
| RAG metrics | faithfulness / answer_relevancy / context_recall: anti-hallucination + answer relevance + context coverage |
| LLM health audit | omk doctor grades 7 builtin dimensions; --static-only runs offline without an LLM |
| Production observability | parse Claude Code session JSONL traces; measure per-skill failure rate / latency / cost / knowledge-gap signals |
| Knowledge-gap detection | severity-weighted signals quantify risk exposure instead of claiming completeness |
| Construct-validity isolation | --strict-baseline (default ON) cuts three contamination channels so baseline doesn't silently see the skill it's being compared against |
| Git & remote sources | install / eval from a local git ref or a remote git URL (--git-url); directory-skills run in a content-addressed isolated copy so references/ assets are real measured input, not just SKILL.md |
| Evidence-gated management | omk install registers a managed record; omk eval auto-writes evidence bound by content fingerprint, moving a skill installed → measurable; omk list surfaces each managed skill's status (installed / measurable / promoted / stale); omk promote accepts a version once its evidence passes the gate (default PROGRESS only); omk rollback revokes that acceptance, returning the skill to measurable |
| Sample design science | sample schema with capability / difficulty / construct / provenance metadata (HF Dataset Cards style); studio surfaces coverage breakdown plus rubric_clarity_low / capability_thin flags. See docs/specs/sample-design-spec.md |
| Multi-judge ensemble | --judge-models claude:opus,openai-api:gpt-4o cross-vendor scoring + agreement metrics |
| Multi-run variance | --repeat N repeats the eval and computes mean / SD / CI / t-test |
| MCP URL fetching | pull content from private-doc URLs via an MCP server (SSO-protected knowledge bases, etc.) |
| Auto analysis | detects low-discrimination assertions, flat scores, all-pass / all-fail, expensive samples |
| Traceability | reports carry CLI version, Node version, artifact version fingerprint, judge prompt hash |
| EN / ZH switch | one-click language toggle in the HTML report |
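
As a rough illustration of what a multi-run summary involves, the sketch below computes the mean and sample standard deviation of repeated scores and a two-sample t statistic. The README does not say which t-test omk applies, so Welch's unequal-variance form is an assumption here, as are the function names and the scores; the confidence interval is left out because it needs a t-distribution quantile.

```typescript
interface RunSummary {
  n: number;
  mean: number;
  sd: number; // sample standard deviation (n - 1 denominator)
}

function summarizeRuns(scores: number[]): RunSummary {
  if (scores.length < 2) {
    throw new Error("need at least two runs to estimate spread");
  }
  let sum = 0;
  for (let i = 0; i < scores.length; i++) sum += scores[i];
  const mean = sum / scores.length;
  let squares = 0;
  for (let i = 0; i < scores.length; i++) {
    squares += (scores[i] - mean) * (scores[i] - mean);
  }
  return { n: scores.length, mean: mean, sd: Math.sqrt(squares / (scores.length - 1)) };
}

interface WelchResult {
  t: number;  // (treatment mean - control mean) / standard error
  df: number; // Welch-Satterthwaite degrees of freedom
}

// Welch's t statistic: does not assume the two variants have equal variance.
function welchT(control: number[], treatment: number[]): WelchResult {
  const c = summarizeRuns(control);
  const tr = summarizeRuns(treatment);
  const vc = (c.sd * c.sd) / c.n;
  const vt = (tr.sd * tr.sd) / tr.n;
  if (vc + vt === 0) {
    throw new Error("both variants have zero variance; t is undefined");
  }
  return {
    t: (tr.mean - c.mean) / Math.sqrt(vc + vt),
    df: ((vc + vt) * (vc + vt)) / ((vc * vc) / (c.n - 1) + (vt * vt) / (tr.n - 1)),
  };
}

// Hypothetical scores from five repeated runs per variant.
const controlRuns = [1, 2, 3, 4, 5];
const treatmentRuns = [2, 3, 4, 5, 6];
const controlSummary = summarizeRuns(controlRuns);
const welch = welchT(controlRuns, treatmentRuns);
```

Here both variants have the same spread, so the statistic reduces to the mean difference divided by a standard error of 1.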

Documentation

The full docs are published at oh-my-knowledge.pages.dev — searchable, with an English / 简体中文 switcher. Key pages:

  • How it works — interleaved scheduling, variant resolution, dual-channel scoring, six-dim report
  • Eval sample format — sample schema, scoring formulas, 30+ assertion types, custom JS assertions
  • CLI reference — all top-level commands with bash examples and flag tables
  • Executors & artifact layout — built-in / custom executors; how variant resolves to an artifact + runtime context
  • How-to guides: evaluate an agent (project runtime context) and use non-Claude models (GLM / Qwen / DeepSeek / Moonshot / Ollama)
  • Quickstart — first-time five-minute walkthrough
  • Example gallery — a set of runnable examples in the repo, arranged simplest-to-richest
  • Sample design spec — capability / construct / provenance metadata; industry-gap mapping
  • Statistical rigor — why bootstrap CI / α / length-debias / saturation matter
  • Comparison with 7 tools — 25+ dimensions across promptfoo / DeepEval / RAGAS / OpenAI Evals / LangSmith / lm-eval-harness / inspect-ai
  • Evidence-gated management — managed records, lifecycle states (installed / measurable / promoted / stale), install → eval → measurable → promote → rollback
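
The evidence-gated lifecycle in the last item can be pictured as a small state machine. The sketch below models only the transitions this README describes (eval, promote, rollback); the type and function names are hypothetical. It leaves `stale` without an incoming transition because the README does not say what triggers it, and it assumes a failed gate simply leaves the skill in `measurable`.

```typescript
type SkillState = "installed" | "measurable" | "promoted" | "stale";
type LifecycleEvent = "eval" | "promote" | "rollback";

// Returns the next lifecycle state. `gatePassed` matters only for "promote".
function nextState(
  state: SkillState,
  event: LifecycleEvent,
  gatePassed: boolean = false
): SkillState {
  if (state === "installed" && event === "eval") {
    // Evaluation writes evidence, which makes the skill measurable.
    return "measurable";
  }
  if (state === "measurable" && event === "promote") {
    // Assumption: a failed gate is a no-op rather than an error.
    return gatePassed ? "promoted" : "measurable";
  }
  if (state === "promoted" && event === "rollback") {
    // Rollback revokes acceptance and returns the skill to measurable.
    return "measurable";
  }
  throw new Error("no documented transition for " + event + " from " + state);
}

// Walk the documented path: install -> eval -> promote -> rollback.
const afterEval = nextState("installed", "eval");
const afterFailedGate = nextState(afterEval, "promote", false);
const afterPromote = nextState(afterEval, "promote", true);
const afterRollback = nextState(afterPromote, "rollback");
```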

Environment variables

| Variable | Description |
| --- | --- |
| CCV_PROXY_URL | proxy requests through cc-viewer for live eval-traffic visualization |
| OMK_REPORT_PORT | report server port (default: 7799) |
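
A wrapper script that wants to mirror the port setting might resolve it as sketched below. Only the variable name and the 7799 default come from the table above; the function name and the validation rules are assumptions, and omk itself may treat invalid values differently.

```typescript
const DEFAULT_REPORT_PORT = 7799;

// Resolve the report server port from an environment-like object.
// Pass `process.env` in Node.js; a plain object keeps the sketch testable.
function resolveReportPort(env: { [key: string]: string | undefined }): number {
  const raw = env["OMK_REPORT_PORT"];
  if (raw === undefined || raw === "") {
    return DEFAULT_REPORT_PORT;
  }
  const port = Number(raw);
  // Assumed validation: a whole number in the valid TCP port range.
  if (isNaN(port) || Math.floor(port) !== port || port < 1 || port > 65535) {
    throw new Error("OMK_REPORT_PORT must be an integer between 1 and 65535");
  }
  return port;
}

const portWhenUnset = resolveReportPort({});
const portWhenSet = resolveReportPort({ OMK_REPORT_PORT: "8080" });
```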

Requirements

  • Node.js >= 22
  • claude CLI (for the default executor and LLM judge; see Claude Code)
    • not needed if you use other executors (openai-api / anthropic-api / gemini) with --no-judge

Security notice

This tool is designed for local trusted environments (dev machines, CI pipelines). The following features execute local code — make sure inputs come from a trusted source:

| Feature | Risk | Scope |
| --- | --- | --- |
| Custom assertions (custom) | dynamically loads and executes user-specified .mjs files | only use assertion files you authored or reviewed |
| eval-samples.json | assertion configs can reference external file paths | don't use sample files from untrusted sources |

Recommendations:

  • Do not expose the local report server on the public internet; it has no authentication
  • Don't use third-party eval-samples you haven't vetted
  • Custom assertions have a 30-second timeout but no sandbox isolation

See GitHub Releases for release notes. Contributions welcome — see CONTRIBUTING.

Keywords

llm-evaluation

Package last updated on 18 Jun 2026