# PromptCheck

PromptCheck is a CI-first test harness for LLM prompts. Write tests in YAML, gate pull requests, and see pass/fail summaries posted as comments, so your prompts don't quietly regress. It works with any LLM provider, including OpenAI, Anthropic, and open-source models via Groq, OpenRouter, or local APIs.

## Install & Run

```bash
pip install promptcheck
promptcheck init
promptcheck run
```

Need a full example? See `example/` or the Quick-Start Guide.
## Get Started

Ready to dive in? Start with the Quick-Start Guide.
## Why Prompt Testing Matters

LLMs can break without warning: even small prompt changes or model updates can cause major regressions. PromptCheck automates prompt evaluation the way unit tests automate code quality checks.
## What Does PromptCheck Do?

When you tweak a prompt, swap models, or refactor your agent code, PromptCheck runs a battery of tests in CI (ROUGE, regex, token cost, latency, etc.) and fails the pull request if quality regresses or cost spikes. Think pytest + coverage, but for LLM output.
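To make that concrete, here is a minimal sketch of a GitHub Actions workflow that runs PromptCheck on every pull request. The file name, job layout, and the `OPENAI_API_KEY` secret are illustrative assumptions, not the project's official Action; only `pip install promptcheck` and `promptcheck run` come from this README.

```yaml
# .github/workflows/promptcheck.yml (hypothetical example, not the official Action)
name: promptcheck

on: [pull_request]

jobs:
  prompt-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # Install the CLI and run the YAML test suite; a non-zero exit fails the PR.
      - run: pip install promptcheck
      - run: promptcheck run
        env:
          # Assumed secret name; use whichever provider key your tests need.
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```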
## Key Features

- Easy setup — drop a YAML test file, add the GitHub Action, done.
- Multi-provider — works with OpenAI, Anthropic, Groq, OpenRouter, or any model you connect via API (more built-in providers coming soon).
- Metrics out of the box — exact/regex match, ROUGE-L, BLEU (optional), token count, latency, cost.
- Readable reports — Action log output, a `run.json` artifact, and (coming soon) a PR comment bot.
- Fast to extend — write your own metric in under 30 lines of standard Python.
| Feature | Free | Pro |
| --- | --- | --- |
| CLI & GitHub Action | ✅ | ✅ |
| Unlimited history & charts | — | ✅ |
| Slack alerts | — | ✅ |
## What It Looks Like

## How the YAML Works (tests/*.yaml)

A test file contains a list of test cases. Here's an example structure:

```yaml
- id: "openrouter_greet_test_001"
  name: "OpenRouter Basic Greeting Test"
  description: "Tests a basic greeting prompt."
  type: "llm_generation"
  input_data:
    prompt: "Briefly introduce yourself and greet the user."
  expected_output:
    regex_pattern: ".+"
  metric_configs:
    - metric: "regex_match"
    - metric: "token_count"
      parameters:
        count_types: ["completion", "total"]
    - metric: "latency"
      threshold:
        value: 10000
    - metric: "cost"
  model_config:
    provider: "openrouter"
    model_name: "mistralai/mistral-7b-instruct"
    parameters:
      temperature: 0.7
      max_tokens: 50
    timeout_s: 25.0
    retry_attempts: 2
  tags: ["openrouter", "greeting"]
```
Add more cases in `tests/`. Thresholds (like `value` for latency, or `f_score` for ROUGE) are defined within the `threshold` object of a `metric_config`, as in the snippet below.
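For example, a sketch of a ROUGE check with a minimum F-score. The metric name `"rouge"` and the `0.4` cutoff are illustrative assumptions inferred from the structure above, not values from the project docs:

```yaml
metric_configs:
  - metric: "rouge"   # assumed metric id; the README names the metric ROUGE-L
    threshold:
      f_score: 0.4    # assumed cutoff: fail the test if the F-score drops below 0.4
```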
## Installation Options & Development Setup

```bash
poetry install                # core dependencies
poetry install --extras bleu  # also installs the optional BLEU metric
```
### Releasing (maintainers)

```bash
poetry version <new_version>
poetry build
poetry publish -r testpypi    # dry run against TestPyPI
poetry publish
git tag v<new_version>
git push origin v<new_version>
```
## Documentation

📖 Docs: Quick-Start Guide · YAML Reference (Coming Soon!)
## Roadmap

- PR comment bot (✅/❌ matrix inline)
- Hosted dashboard (Supabase)
- Async runner for large test suites
- More metrics and LLM provider integrations
## Contributing

1. Fork & clone the repository.
2. Set up your development environment: `poetry install --extras bleu` (to include all deps).
3. Run the tests locally: `poetry run promptcheck run tests/` (or a specific file). Keep it green!
4. Make your changes and add tests for new features.
5. Open a Pull Request.
## Feedback & Questions

Found an issue or have a question? We'd love to hear from you! Please open an issue or start a discussion.
## License

License: Business Source License 1.1

PromptCheck is free to use for evaluation and non-production use. For commercial licenses, contact us.