
oh-my-knowledge
Evaluation framework for LLM knowledge inputs — prompts, RAG corpora, skills, agent workflows. Fix the model, vary the artifact. Built-in statistical rigor: bootstrap CI, Krippendorff α, length-debias, saturation curves.
Did your prompt actually get better? A/B test your prompts and skills with statistical rigor — bootstrap CI and length-debias on by default, Krippendorff α the moment you add a gold set.
📖 Full documentation: oh-my-knowledge.pages.dev (searchable, English / 简体中文)

npm i -g oh-my-knowledge
omk init demo && cd demo
omk eval --control code-review-v1 --treatment code-review-v2
Runs out of the box — no edits needed first. omk init scaffolds two skill variants and three sample cases; omk eval runs the controlled A/B and opens an HTML report with a one-line verdict in about five minutes. Once it runs, swap in your own skills and cases.
Prerequisite: the default executor and judge use the claude CLI — install and log in first (see Requirements); to use another model or run offline (no API key) see executors.
The first run has only 3 cases, so the verdict will usually be UNDERPOWERED (insufficient data). That's a normal starting point, not an error; grow to ~20+ cases before trusting a ship/no-ship call.
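To see why three cases rarely support a verdict, here is a minimal sketch of a percentile bootstrap confidence interval over per-case score differences. The function name, the resample count, and the score values are all hypothetical, and this is a generic bootstrap rather than omk's actual implementation.

```python
import random
import statistics


def bootstrap_ci(diffs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-case score differences."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    means = []
    for _ in range(n_resamples):
        # Resample the cases with replacement, keeping the sample size fixed.
        resample = [rng.choice(diffs) for _ in diffs]
        means.append(statistics.fmean(resample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical per-case differences (treatment score minus control score).
three_cases = [0.5, -0.3, 0.9]
twenty_cases = [0.5, -0.3, 0.9, 0.4, 0.2, 0.6, -0.1, 0.7, 0.3, 0.5,
                0.4, 0.1, 0.8, 0.2, 0.6, 0.0, 0.5, 0.3, 0.7, 0.4]

small_lo, small_hi = bootstrap_ci(three_cases)
large_lo, large_hi = bootstrap_ci(twenty_cases)
print(f"3 cases:  [{small_lo:.2f}, {small_hi:.2f}]")
print(f"20 cases: [{large_lo:.2f}, {large_hi:.2f}]")
```

With these made-up numbers the three-case interval spans zero and is much wider than the twenty-case interval, which is the situation an "insufficient data" verdict describes.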
The CLI notifies you when a newer version is available (at most once per 20h); set OMK_SKIP_UPDATE_CHECK=1 to silence it permanently.
Walkthrough: 5-minute quickstart guide (recommended for first-time users). More runnable examples (A/B, offline executor, batch, evolve, agent, RAG) live in the repo's example gallery.
Deeper: who omk is for · CLI reference · how it works · eval sample format · executors · artifact layout
Install the official omk Agent Skill to let your coding agent run omk workflows from natural language:
omk install omk-agent-skill
By default, omk installs only into detected local targets it explicitly supports: Codex/AGENTS when ~/.codex or ~/.agents exists, and Claude Code when ~/.claude exists. Use --to all to force every target omk currently knows, or --dest for a custom skill root.
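The default target detection described above can be sketched as a few directory checks. This is only an illustration of the stated rule; `detect_targets` and the target labels `codex-agents` and `claude-code` are hypothetical names, not omk's internals.

```python
import tempfile
from pathlib import Path


def detect_targets(home: Path) -> list[str]:
    """Return the install targets whose marker directories exist under `home`."""
    targets = []
    # Codex/AGENTS is selected when either ~/.codex or ~/.agents exists.
    if (home / ".codex").exists() or (home / ".agents").exists():
        targets.append("codex-agents")  # hypothetical label
    # Claude Code is selected when ~/.claude exists.
    if (home / ".claude").exists():
        targets.append("claude-code")  # hypothetical label
    return targets


# Exercise the rule against a throwaway directory instead of the real home.
with tempfile.TemporaryDirectory() as tmp:
    fake_home = Path(tmp)
    none_found = detect_targets(fake_home)
    (fake_home / ".claude").mkdir()
    claude_only = detect_targets(fake_home)
    (fake_home / ".agents").mkdir()
    both = detect_targets(fake_home)

print(none_found, claude_only, both)
```

Nothing is installed when no marker directory is present, which matches the "detected local targets only" default; `--to all` and `--dest` bypass this detection.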
When the omk skill is available in Claude Code, you can invoke it directly:
/omk eval # evaluate the artifact(s) in the current project
/omk evolve # auto-iterate to improve a skill
/omk sample # generate or fill test cases
These slash commands are natural-language entry points — the agent reads the conversation context to figure out which skill to operate on. You can also just say "compare v1 vs v2 for me" or "improve this artifact" and omk picks the right command.
Codex does not support Claude Code style /omk ... slash commands. Ask the agent to run the omk CLI directly:
omk eval
omk evolve skills/my-skill.md # one-shot: doctor → (auto-generate samples if missing) → self-iterate
omk sample skills/my-skill.md
You can also describe the goal in natural language, such as "compare v1 vs v2" or "generate test cases for this skill".
omk evolve is a one-shot loop: it runs the doctor gate first, auto-generates eval samples when the target skill has none, then self-iterates. For a brand-new skill, just run omk evolve skills/foo.md.
Teams doing knowledge engineering produce lots of knowledge artifacts (skills today, but also prompts, agents, workflows…). When someone asks "why is v2 better than v1", you need objective data instead of gut feeling. oh-my-knowledge solves this with controlled experiments: same model, same test samples, only the knowledge artifact changes.
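The controlled-experiment idea can be illustrated with a paired comparison: both variants are scored on the same cases, and only the per-case differences are analyzed. The sketch below assumes the scores already exist; the case IDs, score values, and function name are hypothetical.

```python
from statistics import fmean


def paired_differences(control: dict, treatment: dict) -> dict:
    """Per-case treatment-minus-control differences over an identical case set."""
    if control.keys() != treatment.keys():
        # A controlled A/B needs the same samples on both sides.
        raise ValueError("control and treatment must cover the same cases")
    return {case: treatment[case] - control[case] for case in control}


# Hypothetical judge scores for two artifact versions, same model, same cases.
v1_scores = {"case-a": 3.0, "case-b": 2.5, "case-c": 4.0, "case-d": 3.5}
v2_scores = {"case-a": 4.0, "case-b": 3.0, "case-c": 4.0, "case-d": 3.0}

diffs = paired_differences(v1_scores, v2_scores)
mean_diff = fmean(diffs.values())
wins = sum(1 for d in diffs.values() if d > 0)
losses = sum(1 for d in diffs.values() if d < 0)

print(diffs)
print(f"mean difference: {mean_diff:+.2f}, wins: {wins}, losses: {losses}")
```

Pairing by case removes case-to-case difficulty from the comparison, so what remains is attributable to the artifact change (plus model noise).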
| | omk | promptfoo | DeepEval | LangSmith |
|---|---|---|---|---|
| Bootstrap CI | ✓ default | ✗ | ✗ | ✗ |
| Krippendorff α (judge ↔ human) | ✓ with gold set | ✗ | ✗ | ✗ |
| Length-debias judge prompt | ✓ default | ✗ | ✗ | ✗ |
| Saturation curve | ✓ | ✗ | ✗ | ✗ |
| Three-layer scoring isolation | ✓ | ✗ | partial | ✗ |
| Per-variant skill isolation (construct validity) | ✓ default | ✗ | ✗ | ✗ |
| Native Agent Skill | ✓ | ✗ | ✗ | ✗ |
| Hosted SaaS dashboard | ✗ | ✗ | ✓ | ✓ |
omk's moat is its default-on safety net: bootstrap CI and length-debias aren't advanced flags; they're the default, and judge ↔ human α comes free the moment you add a gold set. Other tools let you opt into confidence intervals; omk makes them unavoidable. Need a hosted SaaS dashboard? Choose LangSmith. Want quick local prompt iteration without statistics? Choose promptfoo. Shipping to production and someone will ask "why should I trust this number?" Choose omk.
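For readers unfamiliar with judge ↔ human α, the following is a minimal sketch of Krippendorff's α for the simplest case: two raters (an LLM judge and a human gold set), numeric scores treated as interval data, and no missing ratings. Which distance metric omk uses is not stated here, so the interval metric is an assumption, and the scores are made up.

```python
from itertools import combinations


def krippendorff_alpha_interval(rater_a, rater_b):
    """Krippendorff's alpha, interval metric, two raters, no missing data."""
    if len(rater_a) != len(rater_b) or len(rater_a) < 2:
        raise ValueError("need two equally long rating lists with >= 2 units")
    n_units = len(rater_a)

    # Observed disagreement: mean squared difference within each rated unit.
    d_o = sum((a - b) ** 2 for a, b in zip(rater_a, rater_b)) / n_units

    # Expected disagreement: mean squared difference over all pairs of
    # ratings, ignoring which unit or rater they came from.
    values = list(rater_a) + list(rater_b)
    n = len(values)
    d_e = 2 * sum((x - y) ** 2 for x, y in combinations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("all ratings are identical; alpha is undefined")

    return 1 - d_o / d_e


# Hypothetical scores: the judge and the human disagree on one case only.
judge_scores = [1, 2, 3, 4]
human_scores = [1, 2, 3, 5]

alpha = krippendorff_alpha_interval(judge_scores, human_scores)
perfect = krippendorff_alpha_interval(judge_scores, judge_scores)
print(f"alpha = {alpha:.3f}, perfect agreement = {perfect:.1f}")
```

An α of 1 means the judge reproduces the human scores exactly, while values near 0 mean agreement no better than chance.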
RAG-specific evals: see RAGAS (separate niche, complementary to omk). Full comparison with 7 tools across 25+ dimensions: docs/reference/comparison.md.
| Feature | What it does |
|---|---|
| One-line verdict | omk eval six-tier verdict + ship recommendation + exit-code routing; HTML pill shares the same rules |
| Six-dim evaluation | Fact / Behavior / LLM-judge / Cost / Efficiency / Stability shown independently |
| Multi-executor | Claude CLI / Claude SDK / Codex CLI / Codex SDK / OpenAI / Gemini / Anthropic API / any custom command |
| 30+ assertion types | substring, regex, JSON Schema, ROUGE/BLEU/Levenshtein similarity, agent tool-call assertions, semantic similarity, custom JS |
| Statistical rigor | Bootstrap CI / length-debias / saturation curve on by default; Krippendorff α auto-computed with a gold set. Details → |
| RAG metrics | faithfulness / answer_relevancy / context_recall — anti-hallucination + answer relevance + context coverage |
| LLM health audit | omk doctor grades 7 builtin dimensions; --static-only runs offline without an LLM |
| Production observability | parse Claude Code session JSONL traces; measure per-skill failure rate / latency / cost / knowledge-gap signals |
| Knowledge-gap detection | severity-weighted signals quantify risk exposure instead of claiming completeness |
| Construct-validity isolation | --strict-baseline (default ON) cuts three contamination channels so baseline doesn't silently see the skill it's being compared against |
| Git & remote sources | install / eval from a local git ref or a remote git URL (--git-url); directory-skills run in a content-addressed isolated copy so references/ assets are real measured input, not just SKILL.md |
| Evidence-gated management | omk install registers a managed record; omk eval auto-writes evidence bound by content fingerprint, moving a skill installed → measurable; omk list surfaces each managed skill's status (installed / measurable / promoted / stale); omk promote accepts a version once its evidence passes the gate (default PROGRESS only); omk rollback revokes that acceptance, returning the skill to measurable. spec → |
| Sample design science | sample schema with capability / difficulty / construct / provenance metadata (HF Dataset Cards style); studio surfaces coverage breakdown plus rubric_clarity_low / capability_thin flags. docs/specs/sample-design-spec.md |
| Multi-judge ensemble | --judge-models claude:opus,openai-api:gpt-4o cross-vendor scoring + agreement metrics |
| Multi-run variance | --repeat N repeats the eval and computes mean / SD / CI / t-test |
| MCP URL fetching | pull content from private-doc URLs via an MCP server (SSO-protected knowledge bases, etc.) |
| Auto analysis | detects low-discrimination assertions, flat scores, all-pass / all-fail, expensive samples |
| Traceability | reports carry CLI version, Node version, artifact version fingerprint, judge prompt hash |
| EN / ZH switch | one-click language toggle in the HTML report |
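The evidence-gated lifecycle in the table (installed → measurable → promoted, with rollback returning to measurable) can be pictured as a small state machine. The class and method names below are hypothetical and do not reflect omk's code; the `stale` status is left out because this page does not say what triggers it.

```python
from dataclasses import dataclass
from typing import Optional

# Per the table, the default gate accepts only a PROGRESS verdict.
GATE_VERDICTS = {"PROGRESS"}


@dataclass
class ManagedSkill:
    name: str
    status: str = "installed"      # set when the skill is installed
    verdict: Optional[str] = None  # latest eval verdict, if any

    def record_eval(self, verdict: str) -> None:
        """An eval writes evidence; an installed skill becomes measurable."""
        self.verdict = verdict
        if self.status == "installed":
            self.status = "measurable"

    def promote(self) -> None:
        """Accept the version only if its evidence passes the gate."""
        if self.status != "measurable" or self.verdict not in GATE_VERDICTS:
            raise ValueError(f"{self.name}: evidence does not pass the gate")
        self.status = "promoted"

    def rollback(self) -> None:
        """Revoke acceptance; the skill returns to measurable."""
        if self.status != "promoted":
            raise ValueError(f"{self.name}: nothing to roll back")
        self.status = "measurable"


skill = ManagedSkill("code-review-v2")
history = [skill.status]
skill.record_eval("PROGRESS")
history.append(skill.status)
skill.promote()
history.append(skill.status)
skill.rollback()
history.append(skill.status)
print(" -> ".join(history))
```

The point of the gate is that promotion is refused, not silently allowed, when the recorded verdict is anything other than the accepted one (for example UNDERPOWERED).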
The full docs are published at oh-my-knowledge.pages.dev — searchable, with an English / 简体中文 switcher. Key pages:
variant resolves to an artifact + runtime context
| Variable | Description |
|---|---|
| CCV_PROXY_URL | proxy requests through cc-viewer for live eval-traffic visualization |
| OMK_REPORT_PORT | report server port (default: 7799) |
claude CLI (for the default executor and LLM judge; see Claude Code)
--no-judge
This tool is designed for local trusted environments (dev machines, CI pipelines). The following features execute local code, so make sure inputs come from a trusted source:
| Feature | Risk | Scope |
|---|---|---|
| Custom assertions (custom) | dynamically loads and executes user-specified .mjs files | only use assertion files you authored or reviewed |
| eval-samples.json | assertion configs can reference external file paths | don't use sample files from untrusted sources |
Recommendations:
See GitHub Releases for release notes. Contributions welcome — see CONTRIBUTING.
FAQs
The npm package oh-my-knowledge receives a total of 448 weekly downloads. As such, oh-my-knowledge popularity was classified as not popular.
We found that oh-my-knowledge demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.