
Build monitoring & evaluation pipelines for complex agents
Judgeval offers robust tooling for evaluating and tracing LLM agent systems. It is dev-friendly and open-source (licensed under Apache 2.0).
Judgeval gets you started in five minutes, after which you'll be ready to use all of its features as your agent becomes more complex. Judgeval is natively connected to the Judgment Platform for free, and you can export your data and self-host at any time.
We support tracing agents built with LangGraph, OpenAI SDK, Anthropic, ... and allow custom eval integrations for any use case. Check out our quickstarts below or our setup guide to get started.
Judgeval is created and maintained by Judgment Labs.
Tracing: Automatic agent tracing integrated with common frameworks (LangGraph, OpenAI, Anthropic), tracking inputs/outputs, latency, and cost at every step. Online evals can be applied to traces to measure quality on production data in real time. Export trace data to the Judgment Platform or your own S3 buckets, {Parquet, JSON, YAML} files, or data warehouse. Useful for: • Debugging agent runs • Tracking user activity • Pinpointing performance bottlenecks
Evals: 15+ research-backed metrics, including tool call accuracy, hallucinations, instruction adherence, and retrieval context recall. Build custom evaluators that connect with our metric-tracking infrastructure. Useful for: • Unit testing • Experimental prompt testing • Online guardrails
Monitoring: Real-time performance tracking of your agents in production environments. Track all your metrics in one place. Set up Slack/email alerts for critical metrics and receive notifications when thresholds are exceeded. Useful for: • Identifying degradation early • Visualizing performance trends across versions and time
Datasets: Export trace data or import external test cases to datasets hosted on Judgment's Platform. Move datasets to/from Parquet, S3, etc. Run evals on datasets as unit tests or to A/B test different agent configurations. Useful for: • Scaled analysis for A/B tests • Filtered collections of agent runtime data
Insights: Cluster on your data to reveal common use cases and failure modes. Trace failures to their exact source with Judgment's Osiris agent, which localizes errors to specific components for precise fixes. Useful for: • Surfacing common inputs that lead to errors • Investigating agent/user behavior for optimization
Get started with Judgeval by installing our SDK using pip:
pip install judgeval
Ensure you have your JUDGMENT_API_KEY and JUDGMENT_ORG_ID environment variables set to connect to the Judgment platform.
If you don't have keys, create an account on the platform!
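If you'd rather configure the keys in code than in your shell, a minimal sketch looks like the following (the values are placeholders; set them before creating any Judgeval client or tracer):

import os

# Placeholder values; use the API key and organization ID from your Judgment account
os.environ["JUDGMENT_API_KEY"] = "your-api-key"
os.environ["JUDGMENT_ORG_ID"] = "your-org-id"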
Here's how you can quickly start using Judgeval:
Track your agent execution with full observability in just a few lines of code.
Create a file named traces.py with the following code:
from judgeval.common.tracer import Tracer, wrap
from openai import OpenAI

# Wrap the OpenAI client so every LLM call is captured in the trace
client = wrap(OpenAI())
judgment = Tracer(project_name="my_project")

@judgment.observe(span_type="tool")
def my_tool():
    return "What's the capital of the U.S.?"

@judgment.observe(span_type="function")
def main():
    task_input = my_tool()
    res = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": f"{task_input}"}]
    )
    return res.choices[0].message.content

main()
Click here for a more detailed explanation.
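The same decorator pattern scales to agents with several steps: every function you wrap with judgment.observe appears as its own span in the trace, nested under its caller. A minimal sketch along those lines (get_capital and answer are hypothetical names, not part of the SDK):

from judgeval.common.tracer import Tracer, wrap
from openai import OpenAI

client = wrap(OpenAI())
judgment = Tracer(project_name="my_project")

@judgment.observe(span_type="tool")
def get_capital(country):
    # Stand-in for a real lookup tool, kept static for illustration
    return {"U.S.": "Washington, D.C."}.get(country, "unknown")

@judgment.observe(span_type="function")
def answer(country):
    capital = get_capital(country)
    res = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": f"Write one sentence about {capital}."}]
    )
    return res.choices[0].message.content

answer("U.S.")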
You can evaluate your agent's execution to measure quality metrics such as hallucination.
Create a file named evaluate.py with the following code:
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()

# Pair the agent's output with the retrieval context it should stay faithful to
example = Example(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
)

scorer = FaithfulnessScorer(threshold=0.5)
results = client.run_evaluation(
    examples=[example],
    scorers=[scorer],
    model="gpt-4.1",
)
print(results)
Click here for a more detailed explanation.
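run_evaluation takes lists of examples and scorers, so a single call can score a whole batch against several metrics at once. A sketch building on evaluate.py (the second example is invented for illustration):

from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer, AnswerRelevancyScorer

client = JudgmentClient()

examples = [
    Example(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost.",
        retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
    ),
    Example(
        input="Do you ship internationally?",
        actual_output="Yes, we ship to over 50 countries.",
        retrieval_context=["International shipping is available to more than 50 countries."],
    ),
]

results = client.run_evaluation(
    examples=examples,
    scorers=[FaithfulnessScorer(threshold=0.5), AnswerRelevancyScorer(threshold=0.5)],
    model="gpt-4.1",
)
print(results)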
Attach performance monitoring to traces to measure the quality of your systems in production.
Using the same traces.py file we created earlier, modify the main function:
from judgeval.common.tracer import Tracer, wrap
from judgeval.scorers import AnswerRelevancyScorer
from openai import OpenAI

client = wrap(OpenAI())
judgment = Tracer(project_name="my_project")

@judgment.observe(span_type="tool")
def my_tool():
    return "Hello world!"

@judgment.observe(span_type="function")
def main():
    task_input = my_tool()
    res = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": f"{task_input}"}]
    ).choices[0].message.content

    # Attach an online evaluation to the current trace span
    judgment.async_evaluate(
        scorers=[AnswerRelevancyScorer(threshold=0.5)],
        input=task_input,
        actual_output=res,
        model="gpt-4.1"
    )
    print("Online evaluation submitted.")
    return res

main()
Click here for a more detailed explanation.
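The same hook can carry richer context: if your traced function already has its retrieval context in hand, you can monitor faithfulness online alongside relevancy. A sketch, assuming async_evaluate accepts the same fields as Example (retrieval_context below; check the docs for the exact signature), with answer_with_context as a hypothetical name:

from judgeval.common.tracer import Tracer, wrap
from judgeval.scorers import AnswerRelevancyScorer, FaithfulnessScorer
from openai import OpenAI

client = wrap(OpenAI())
judgment = Tracer(project_name="my_project")

@judgment.observe(span_type="function")
def answer_with_context(question, context):
    res = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": f"Answer using this context: {context}\n\n{question}"}]
    ).choices[0].message.content
    judgment.async_evaluate(
        scorers=[AnswerRelevancyScorer(threshold=0.5), FaithfulnessScorer(threshold=0.5)],
        input=question,
        actual_output=res,
        retrieval_context=context,  # assumption: async_evaluate forwards Example-style fields
        model="gpt-4.1"
    )
    return res

answer_with_context(
    "What if these shoes don't fit?",
    ["All customers are eligible for a 30 day full refund at no extra cost."],
)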
Run Judgment on your own infrastructure: we provide comprehensive self-hosting capabilities that give you full control over the backend and data plane that Judgeval interfaces with.
Ensure your JUDGMENT_API_URL environment variable is set to your self-hosted backend endpoint.
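As with the API keys above, the endpoint can also be pointed at your deployment from code before any Judgeval client is created (a minimal sketch; the URL is a placeholder):

import os

# Placeholder URL; use your self-hosted backend endpoint
os.environ["JUDGMENT_API_URL"] = "https://judgment.internal.example.com"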
You can access our repo of cookbooks here. Here are some highlights:
A LangGraph-based agent for financial queries, featuring RAG capabilities with a vector database for contextual data retrieval and evaluation of its reasoning and data accuracy.
A travel planning agent using OpenAI API calls, custom tool functions, and RAG with a vector database for up-to-date and contextual travel information. Evaluated for itinerary quality and information relevance.
Detecting and evaluating Personally Identifiable Information (PII) leakage.
Evaluates whether a cold email generator properly utilizes all relevant information about the target recipient.
Have your own? We're happy to feature it if you create a PR or message us on Discord.
If you find Judgeval useful, please consider giving us a star on GitHub! Your support helps us grow our community and continue improving the product.
There are many ways to contribute to Judgeval: