Build monitoring & evaluation pipelines for complex agents

Judgment Platform Experiments Page

๐ŸŒ Landing Page โ€ข Twitter/X โ€ข ๐Ÿ’ผ LinkedIn โ€ข ๐Ÿ“š Docs โ€ข ๐Ÿš€ Demos โ€ข ๐ŸŽฎ Discord

Judgeval: open-source testing, monitoring, and optimization for AI agents

Judgeval offers robust tooling for evaluating and tracing LLM agent systems. It is dev-friendly and open-source (licensed under Apache 2.0).

You can get started with Judgeval in five minutes and grow into its full feature set as your agent becomes more complex. Judgeval connects natively to the Judgment Platform for free, and you can export your data and self-host at any time.

We support tracing agents built with LangGraph, the OpenAI SDK, Anthropic, and more, and allow custom eval integrations for any use case. Check out our quickstarts below or our setup guide to get started.

Judgeval is created and maintained by Judgment Labs.

📋 Table of Contents

• ✨ Features
• 🛠️ Installation
• 🏁 Get Started
• 🏢 Self-Hosting
• 📚 Cookbooks
• ⭐ Star Us on GitHub
• ❤️ Contributors

✨ Features

๐Ÿ” Tracing

Automatic agent tracing integrated with common frameworks (LangGraph, OpenAI, Anthropic): tracking inputs/outputs, latency, and cost at every step.

Online evals can be applied to traces to measure quality on production data in real time.

Export trace data to the Judgment Platform, your own S3 buckets, Parquet/JSON/YAML files, or a data warehouse.

Useful for:
• 🐛 Debugging agent runs
• 👤 Tracking user activity
• 🔬 Pinpointing performance bottlenecks

Tracing visualization

🧪 Evals

15+ research-backed metrics, including tool call accuracy, hallucination, instruction adherence, and retrieval context recall.

Build custom evaluators that connect with our metric-tracking infrastructure; see the sketch below for what one could look like.

Useful for:
• ⚠️ Unit-testing
• 🔬 Experimental prompt testing
• 🛡️ Online guardrails

Evaluation metrics
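
As a sketch of the custom-evaluator hook mentioned above: the base class name JudgevalScorer, its constructor arguments, and the a_score_example method here are assumptions modeled on the built-in scorer pattern, so treat this as illustrative and check the judgeval docs for the exact interface.

from judgeval.data import Example
from judgeval.scorers import JudgevalScorer  # assumed base class; confirm against the docs

class RefundPolicyScorer(JudgevalScorer):
    """Toy evaluator: passes when the answer cites the 30-day refund window."""

    def __init__(self, threshold: float = 1.0):
        # score_type/threshold arguments mirror the built-in scorer pattern (assumption)
        super().__init__(score_type="Refund Policy Mention", threshold=threshold)

    async def a_score_example(self, example: Example) -> float:
        # A real scorer might call an LLM judge; a string check keeps the sketch short
        self.score = 1.0 if "30-day" in (example.actual_output or "") else 0.0
        self.success = self.score >= self.threshold
        return self.score

Once defined, such a scorer should be usable anywhere the built-in scorers are accepted, e.g. in the scorers list passed to run_evaluation below.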

📡 Monitoring

Real-time performance tracking of your agents in production environments. Track all your metrics in one place.

Set up Slack/email alerts for critical metrics and receive notifications when thresholds are exceeded.

Useful for:
• 📉 Identifying degradation early
• 📈 Visualizing performance trends across versions and time

Monitoring Dashboard

📊 Datasets

Export trace data or import external test cases to datasets hosted on Judgment's Platform. Move datasets to/from Parquet, S3, etc.; a usage sketch follows below.

Run evals on datasets as unit tests or to A/B test different agent configurations.

Useful for:
• 🔄 Scaled analysis for A/B tests
• 🗃️ Filtered collections of agent runtime data

Dataset management
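
As referenced above, a dataset is at heart a collection of examples you run evals over. Here's a minimal sketch using only the documented run_evaluation API, with a plain list of Examples standing in for a dataset pulled from the platform (the hosted import/export calls are covered in the datasets docs):

from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()

# Stand-in for a dataset pulled from the platform: a list of Examples
dataset = [
    Example(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost.",
        retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
    ),
    Example(
        input="Do you ship internationally?",
        actual_output="Yes, we ship to over 50 countries.",
        retrieval_context=["We ship to 50+ countries worldwide."],
    ),
]

# Evaluate the whole dataset in one call (model here is the judge model)
results = client.run_evaluation(
    examples=dataset,
    scorers=[FaithfulnessScorer(threshold=0.5)],
    model="gpt-4.1",
)
print(results)

For an A/B test, you would build two such datasets, one per agent configuration, and compare their scores.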

💡 Insights

Cluster your data to reveal common use cases and failure modes.

Trace failures to their exact source with Judgment's Osiris agent, which localizes errors to specific components for precise fixes.

Useful for:
• 🔮 Surfacing common inputs that lead to errors
• 🤖 Investigating agent/user behavior for optimization

Insights dashboard

๐Ÿ› ๏ธ Installation

Get started with Judgeval by installing our SDK using pip:

pip install judgeval

Ensure you have your JUDGMENT_API_KEY and JUDGMENT_ORG_ID environment variables set to connect to the Judgment platform.
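
For instance, you can fail fast in Python if either variable is missing (the variable names are exactly those documented above):

import os

# judgeval reads these environment variables to authenticate with the Judgment platform
for var in ("JUDGMENT_API_KEY", "JUDGMENT_ORG_ID"):
    if not os.environ.get(var):
        raise RuntimeError(f"Missing {var}; set it before running any judgeval code.")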

If you don't have keys, create an account on the platform!

๐Ÿ Get Started

Here's how you can quickly start using Judgeval:

๐Ÿ›ฐ๏ธ Tracing

Track your agent's execution with full observability in just a few lines of code. Create a file named traces.py with the following code:

from judgeval.common.tracer import Tracer, wrap
from openai import OpenAI

client = wrap(OpenAI())  # wrap() instruments the client so every LLM call is captured in the trace
judgment = Tracer(project_name="my_project")

# observe() records each decorated function as a span in the trace
@judgment.observe(span_type="tool")
def my_tool():
    return "What's the capital of the U.S.?"

@judgment.observe(span_type="function")
def main():
    task_input = my_tool()
    res = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": f"{task_input}"}]
    )
    return res.choices[0].message.content

main()

Click here for a more detailed explanation.

๐Ÿ“ Offline Evaluations

You can evaluate your agent's execution to measure quality metrics such as hallucination. Create a file named evaluate.py with the following code:

from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()

example = Example(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
)

scorer = FaithfulnessScorer(threshold=0.5)
results = client.run_evaluation(
    examples=[example],
    scorers=[scorer],
    model="gpt-4.1",
)
print(results)

Click here for a more detailed explanation.
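
Because run_evaluation returns plain Python objects, you can also gate a test suite on the output. A minimal sketch appended to the evaluate.py above, assuming each result exposes a boolean success flag (verify the exact result shape against the docs):

# Continuing from evaluate.py: fail loudly if any example misses its threshold.
# `success` is assumed here to be a per-result pass/fail flag.
failures = [r for r in results if not getattr(r, "success", False)]
assert not failures, f"{len(failures)} example(s) scored below threshold"

Dropped into pytest or CI, this turns the threshold=0.5 quality bar into a failing build instead of just a dashboard metric.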

📡 Online Evaluations

Attach performance monitoring on traces to measure the quality of your systems in production.

Using the same traces.py file we created earlier, modify the main function:

from judgeval.common.tracer import Tracer, wrap
from judgeval.scorers import AnswerRelevancyScorer
from openai import OpenAI

client = wrap(OpenAI())
judgment = Tracer(project_name="my_project")

@judgment.observe(span_type="tool")
def my_tool():
    return "Hello world!"

@judgment.observe(span_type="function")
def main():
    task_input = my_tool()
    res = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": f"{task_input}"}]
    ).choices[0].message.content

    # Attach an online eval to this trace; scoring runs asynchronously on the platform
    judgment.async_evaluate(
        scorers=[AnswerRelevancyScorer(threshold=0.5)],
        input=task_input,
        actual_output=res,
        model="gpt-4.1"
    )
    print("Online evaluation submitted.")
    return res

main()

Click here for a more detailed explanation.

๐Ÿข Self-Hosting

Run Judgment on your own infrastructure: we provide comprehensive self-hosting capabilities that give you full control over the backend and data plane that Judgeval interfaces with.

Key Features

  • Deploy Judgment on your own AWS account
  • Store data in your own Supabase instance
  • Access Judgment through your own custom domain

Getting Started

  • Check out our self-hosting documentation for detailed setup instructions, including how to access your self-hosted instance
  • Use the Judgment CLI to deploy your self-hosted environment
  • After your self-hosted instance is set up, make sure the JUDGMENT_API_URL environment variable points to your self-hosted backend endpoint

📚 Cookbooks

You can access our repo of cookbooks here. Have your own? We're happy to feature it if you create a PR or message us on Discord. Here are some highlights:

Sample Agents

💰 LangGraph Financial QA Agent

A LangGraph-based agent for financial queries, featuring RAG with a vector database for contextual data retrieval, evaluated on its reasoning and data accuracy.

โœˆ๏ธ OpenAI Travel Agent

A travel planning agent using OpenAI API calls, custom tool functions, and RAG with a vector database for up-to-date and contextual travel information. Evaluated for itinerary quality and information relevance.

Custom Evaluators

๐Ÿ” PII Detection

Detecting and evaluating Personally Identifiable Information (PII) leakage.

📧 Cold Email Generation

Evaluates if a cold email generator properly utilizes all relevant information about the target recipient.

โญ Star Us on GitHub

If you find Judgeval useful, please consider giving us a star on GitHub! Your support helps us grow our community and continue improving the product.

โค๏ธ Contributors

There are many ways to contribute to Judgeval:

Contributors
