
Build monitoring & evaluation pipelines for complex agents
Judgeval offers robust tooling for evaluating and tracing LLM agent systems. It is dev-friendly and open-source (licensed under Apache 2.0).
Judgeval gets you started in five minutes, after which you'll be ready to use all of its features as your agent becomes more complex. Judgeval is natively connected to the Judgment Platform for free, and you can export your data and self-host at any time.
We support tracing agents built with LangGraph, OpenAI SDK, Anthropic, ... and allow custom eval integrations for any use case. Check out our quickstarts below or our setup guide to get started.
Judgeval is created and maintained by Judgment Labs.
Tracing: Automatic agent tracing integrated with common frameworks (LangGraph, OpenAI, Anthropic), tracking inputs/outputs, latency, and cost at every step. Online evals can be applied to traces to measure quality on production data in real time. Export trace data to the Judgment Platform or your own S3 buckets, {Parquet, JSON, YAML} files, or data warehouse. Useful for: • Debugging agent runs • Tracking user activity • Pinpointing performance bottlenecks
Evals: 15+ research-backed metrics, including tool call accuracy, hallucinations, instruction adherence, and retrieval context recall. Build custom evaluators that connect with our metric-tracking infrastructure. Useful for: • Unit testing • Experimental prompt testing • Online guardrails
Monitoring: Real-time performance tracking of your agents in production environments. Track all your metrics in one place. Set up Slack/email alerts for critical metrics and receive notifications when thresholds are exceeded. Useful for: • Identifying degradation early • Visualizing performance trends across versions and time
Datasets: Export trace data or import external test cases to datasets hosted on Judgment's Platform. Move datasets to/from Parquet, S3, etc. Run evals on datasets as unit tests or to A/B test different agent configurations. Useful for: • Scaled analysis for A/B tests • Filtered collections of agent runtime data
Insights: Cluster on your data to reveal common use cases and failure modes. Trace failures to their exact source with Judgment's Osiris agent, which localizes errors to specific components for precise fixes. Useful for: • Surfacing common inputs that lead to errors • Investigating agent/user behavior for optimization
Get started with Judgeval by installing our SDK using pip:
pip install judgeval
Ensure you have your JUDGMENT_API_KEY and JUDGMENT_ORG_ID environment variables set to connect to the Judgment platform.
If you don't have keys, create an account on the platform!
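If you'd rather configure the keys in code than in your shell, a minimal sketch looks like the following (the values are placeholders; set them before creating any Judgeval client or tracer):

import os

# Placeholder values; use the API key and organization ID from your Judgment account
os.environ["JUDGMENT_API_KEY"] = "your-api-key"
os.environ["JUDGMENT_ORG_ID"] = "your-org-id"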
Here's how you can quickly start using Judgeval:
Track your agent execution with full observability in just a few lines of code.
Create a file named traces.py with the following code:
from judgeval.common.tracer import Tracer, wrap
from openai import OpenAI

# Wrap the OpenAI client so every LLM call is captured in the trace
client = wrap(OpenAI())
judgment = Tracer(project_name="my_project")

@judgment.observe(span_type="tool")
def my_tool():
    return "What's the capital of the U.S.?"

@judgment.observe(span_type="function")
def main():
    task_input = my_tool()
    res = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": f"{task_input}"}]
    )
    return res.choices[0].message.content

main()
Click here for a more detailed explanation.
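The same decorator pattern scales to agents with several steps: every function you wrap with judgment.observe appears as its own span in the trace, nested under its caller. A minimal sketch along those lines (get_capital and answer are hypothetical names, not part of the SDK):

from judgeval.common.tracer import Tracer, wrap
from openai import OpenAI

client = wrap(OpenAI())
judgment = Tracer(project_name="my_project")

@judgment.observe(span_type="tool")
def get_capital(country):
    # Stand-in for a real lookup tool, kept static for illustration
    return {"U.S.": "Washington, D.C."}.get(country, "unknown")

@judgment.observe(span_type="function")
def answer(country):
    capital = get_capital(country)
    res = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": f"Write one sentence about {capital}."}]
    )
    return res.choices[0].message.content

answer("U.S.")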
You can evaluate your agent's execution to measure quality metrics such as hallucination.
Create a file named evaluate.py with the following code:
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()

# Pair the agent's output with the retrieval context it should stay faithful to
example = Example(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
)

scorer = FaithfulnessScorer(threshold=0.5)
results = client.run_evaluation(
    examples=[example],
    scorers=[scorer],
    model="gpt-4.1",
)
print(results)
Click here for a more detailed explanation.
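run_evaluation takes lists of examples and scorers, so a single call can score a whole batch against several metrics at once. A sketch building on evaluate.py (the second example is invented for illustration):

from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer, AnswerRelevancyScorer

client = JudgmentClient()

examples = [
    Example(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost.",
        retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
    ),
    Example(
        input="Do you ship internationally?",
        actual_output="Yes, we ship to over 50 countries.",
        retrieval_context=["International shipping is available to more than 50 countries."],
    ),
]

results = client.run_evaluation(
    examples=examples,
    scorers=[FaithfulnessScorer(threshold=0.5), AnswerRelevancyScorer(threshold=0.5)],
    model="gpt-4.1",
)
print(results)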
Attach performance monitoring to traces to measure the quality of your systems in production.
Using the same traces.py file we created earlier, modify the main function:
from judgeval.common.tracer import Tracer, wrap
from judgeval.scorers import AnswerRelevancyScorer
from openai import OpenAI

client = wrap(OpenAI())
judgment = Tracer(project_name="my_project")

@judgment.observe(span_type="tool")
def my_tool():
    return "Hello world!"

@judgment.observe(span_type="function")
def main():
    task_input = my_tool()
    res = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": f"{task_input}"}]
    ).choices[0].message.content

    # Attach an online evaluation to the current trace span
    judgment.async_evaluate(
        scorers=[AnswerRelevancyScorer(threshold=0.5)],
        input=task_input,
        actual_output=res,
        model="gpt-4.1"
    )
    print("Online evaluation submitted.")
    return res

main()
Click here for a more detailed explanation.
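The same hook can carry richer context: if your traced function already has its retrieval context in hand, you can monitor faithfulness online alongside relevancy. A sketch, assuming async_evaluate accepts the same fields as Example (retrieval_context below; check the docs for the exact signature), with answer_with_context as a hypothetical name:

from judgeval.common.tracer import Tracer, wrap
from judgeval.scorers import AnswerRelevancyScorer, FaithfulnessScorer
from openai import OpenAI

client = wrap(OpenAI())
judgment = Tracer(project_name="my_project")

@judgment.observe(span_type="function")
def answer_with_context(question, context):
    res = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": f"Answer using this context: {context}\n\n{question}"}]
    ).choices[0].message.content
    judgment.async_evaluate(
        scorers=[AnswerRelevancyScorer(threshold=0.5), FaithfulnessScorer(threshold=0.5)],
        input=question,
        actual_output=res,
        retrieval_context=context,  # assumption: async_evaluate forwards Example-style fields
        model="gpt-4.1"
    )
    return res

answer_with_context(
    "What if these shoes don't fit?",
    ["All customers are eligible for a 30 day full refund at no extra cost."],
)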
Run Judgment on your own infrastructure: we provide comprehensive self-hosting capabilities that give you full control over the backend and data plane that Judgeval interfaces with.
Ensure your JUDGMENT_API_URL environment variable is set to your self-hosted backend endpoint.
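As with the API keys above, the endpoint can also be pointed at your deployment from code before any Judgeval client is created (a minimal sketch; the URL is a placeholder):

import os

# Placeholder URL; use your self-hosted backend endpoint
os.environ["JUDGMENT_API_URL"] = "https://judgment.internal.example.com"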
You can access our repo of cookbooks here. Here are some highlights:
A LangGraph-based agent for financial queries, featuring RAG capabilities with a vector database for contextual data retrieval and evaluation of its reasoning and data accuracy.
A travel planning agent using OpenAI API calls, custom tool functions, and RAG with a vector database for up-to-date and contextual travel information. Evaluated for itinerary quality and information relevance.
Detecting and evaluating Personally Identifiable Information (PII) leakage.
Evaluates whether a cold email generator properly utilizes all relevant information about the target recipient.
Have your own? We're happy to feature it if you create a PR or message us on Discord.
If you find Judgeval useful, please consider giving us a star on GitHub! Your support helps us grow our community and continue improving the product.
There are many ways to contribute to Judgeval: