# pytest-evals
🚀 Test your LLM outputs against examples - no more manual checking! A (minimalistic) pytest plugin that helps you evaluate whether your LLM is giving good answers.

Building LLM applications is exciting, but how do you know they're actually working well? pytest-evals helps you test your LLM against example cases, and it scales with you: cases can run in parallel with `pytest-xdist` and asynchronously with `pytest-asyncio`.

To get started, install pytest-evals and write your tests:
```bash
pip install pytest-evals
```
For example, say you're building a support ticket classifier. You want to test cases like:
| Input Text | Expected Classification |
|---|---|
| My login isn't working and I need to access my account | account_access |
| Can I get a refund for my last order? | billing |
| How do I change my notification settings? | settings |
pytest-evals helps you automatically test how your LLM performs against these cases, track accuracy, and ensure it keeps working as expected over time.
```python
import pytest

# Evaluate the LLM on each case
@pytest.mark.eval(name="my_classifier")
@pytest.mark.parametrize("case", TEST_DATA)
def test_classifier(case: dict, eval_bag, classifier):
    # Run predictions and store results
    eval_bag.prediction = classifier(case["Input Text"])
    eval_bag.expected = case["Expected Classification"]
    eval_bag.accuracy = eval_bag.prediction == eval_bag.expected


# Now let's see how our app is performing across all cases...
@pytest.mark.eval_analysis(name="my_classifier")
def test_analysis(eval_results):
    accuracy = sum(result.accuracy for result in eval_results) / len(eval_results)
    print(f"Accuracy: {accuracy:.2%}")
    assert accuracy >= 0.7  # Ensure our performance is not degrading 🫢
```
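For the snippet above to run, `TEST_DATA` and the `classifier` fixture must exist somewhere in your suite. A minimal sketch of both - the keyword rules here are only a stand-in for your real LLM call:

```python
import pytest

# The cases from the table above
TEST_DATA = [
    {"Input Text": "My login isn't working and I need to access my account",
     "Expected Classification": "account_access"},
    {"Input Text": "Can I get a refund for my last order?",
     "Expected Classification": "billing"},
    {"Input Text": "How do I change my notification settings?",
     "Expected Classification": "settings"},
]

@pytest.fixture
def classifier():
    def classify(text: str) -> str:
        # Stand-in logic so the example runs end to end;
        # replace with a call to your LLM-backed classifier
        lowered = text.lower()
        if "refund" in lowered:
            return "billing"
        if "login" in lowered or "account" in lowered:
            return "account_access"
        return "settings"
    return classify
```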
Then, run your evaluation tests:
```bash
# Run test cases
pytest --run-eval

# Analyze results
pytest --run-eval-analysis
```
Evaluations are just tests. No need for complex frameworks or DSLs. pytest-evals is minimalistic by design: it's just pytest - the tool you already know. It simply collects your results and lets you analyze them as a whole. Nothing more, nothing less.
Check out our detailed guides and examples.
Built on top of pytest-harvest, pytest-evals splits evaluation into two phases:

1. **Evaluation phase**: run every case and store each case's results in `eval_bag`. The results are saved in a temporary file to allow the analysis phase to access them.
2. **Analysis phase**: process all collected results through `eval_results` to calculate final metrics.

This split lets you run the cases and judge the aggregate outcome separately - for example, individual case failures don't have to fail the run (using the `--supress-failed-exit-code --run-eval` flags).

**Note:** When running evaluation tests, the rest of your test suite will not run. This is by design to keep the results clean and focused.
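Concretely, the two phases map to two separate pytest invocations, for example:

```bash
# Phase 1: run all cases; suppress the exit code so individual
# case failures don't fail this step
pytest --run-eval --supress-failed-exit-code

# Phase 2: aggregate the saved results and decide pass/fail
pytest --run-eval-analysis
```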
By default, pytest-evals saves the results of each case in a JSON file so the analysis phase can access them. However, this might not be a friendly format for deeper analysis. To save the results in a friendlier format, as a CSV file, use the `--save-evals-csv` flag:
```bash
pytest --run-eval --save-evals-csv
```
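Once you have the CSV, you can slice it with regular pandas. A small sketch - the file name below is a placeholder (point it at wherever the flag wrote your results), and the columns are assumed to mirror the `eval_bag` fields from the classifier example:

```python
import pandas as pd

# Placeholder path - use the CSV produced by --save-evals-csv
results = pd.read_csv("eval-results.csv")

# Example: inspect the cases the classifier got wrong
print(results[~results["accuracy"]])
```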
It's also possible to run evaluations from a notebook. To do that, simply install ipytest and load the extension:

```
%load_ext pytest_evals
```

Then, use the `%%ipytest_eval` magic in your cell to run evaluations. This will run the evaluation phase and then the analysis phase. By default, this magic runs both `--run-eval` and `--run-eval-analysis`, but you can specify your own flags by passing arguments right after the magic command (e.g., `%%ipytest_eval --run-eval`).
```python
%%ipytest_eval
import pytest

@pytest.mark.eval(name="my_eval")
@pytest.mark.parametrize("case", TEST_DATA)
def test_agent(case, eval_bag, agent):
    eval_bag.prediction = agent.run(case["input"])

@pytest.mark.eval_analysis(name="my_eval")
def test_analysis(eval_results):
    print(f"F1 Score: {calculate_f1(eval_results):.2%}")
```
You can see an example of this in the `example/example_notebook.ipynb` notebook, or look at the advanced example for a more complex setup that tracks multiple experiments.
It's recommended to use a CSV file to store test data. This makes it easier to manage large datasets and lets you collaborate with non-technical stakeholders. To do this, you can use pandas to read the CSV file and pass the test cases as parameters to your tests using `@pytest.mark.parametrize` 🙃:
```python
import pandas as pd
import pytest

test_data = pd.read_csv("tests/testdata.csv")

@pytest.mark.eval(name="my_eval")
@pytest.mark.parametrize("case", test_data.to_dict(orient="records"))
def test_agent(case, eval_bag, agent):
    eval_bag.prediction = agent.run(case["input"])
```
If you need to select a subset of the test data (e.g., a golden set), you can simply define an environment variable to indicate that and filter the data with pandas, as shown in the sketch below.
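A sketch of that pattern - the `GOLDEN_SET` environment variable and the `is_golden` column are just an illustrative convention, not something the plugin defines:

```python
import os

import pandas as pd
import pytest

test_data = pd.read_csv("tests/testdata.csv")

# Hypothetical convention: set GOLDEN_SET=1 to run only the rows
# flagged in an is_golden column of the CSV
if os.environ.get("GOLDEN_SET") == "1":
    test_data = test_data[test_data["is_golden"]]

@pytest.mark.eval(name="my_eval")
@pytest.mark.parametrize("case", test_data.to_dict(orient="records"))
def test_agent(case, eval_bag, agent):
    eval_bag.prediction = agent.run(case["input"])
```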
Run tests and analysis as separate steps:
```yaml
evaluate:
  steps:
    - run: pytest --run-eval -n auto --supress-failed-exit-code # Run cases in parallel
    - run: pytest --run-eval-analysis # Analyze results
```
Use `--supress-failed-exit-code` with `--run-eval` and let the analysis phase determine success/failure. If all your cases pass, your evaluation set is probably too small!
As your evaluation set grows, you may want to run your test cases in parallel. To do this, install `pytest-xdist`; pytest-evals supports it out of the box 🚀.

```yaml
run: pytest --run-eval -n auto
```
Contributions make the open-source community a fantastic place to learn, inspire, and create. Any contributions you make are greatly appreciated (not only code! documentation, blog posts, and feedback count too) 😍.
Please fork the repo and create a pull request if you have a suggestion. You can also simply open an issue to give us some feedback.
Don't forget to give the project a star! ⭐️
For more information about contributing code to the project, read the CONTRIBUTING.md guide.
This project is licensed under the MIT License - see the LICENSE file for details.