# evalz

> Structured evaluation toolkit for LLM outputs

## Overview

`evalz` provides structured evaluation tools for assessing LLM outputs across multiple dimensions. Built with TypeScript and integrated with OpenAI and Instructor, it enables both automated evaluation and human-in-the-loop assessment workflows.

### Key Capabilities
- 🎯 Model-Graded Evaluation: Leverage LLMs to assess response quality
- 📊 Accuracy Measurement: Compare outputs using semantic and lexical similarity
- 🔍 Context Validation: Evaluate responses against source materials
- ⚖️ Composite Assessment: Combine multiple evaluation types with custom weights
## Installation

Install `evalz` with your preferred package manager:

```bash
# npm
npm install evalz openai zod @instructor-ai/instructor

# bun
bun add evalz openai zod @instructor-ai/instructor

# pnpm
pnpm add evalz openai zod @instructor-ai/instructor
```
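## Quick Start

A minimal end-to-end example using the `createEvaluator` API documented in the API reference below:

```typescript
import { createEvaluator } from "evalz";
import OpenAI from "openai";

const oai = new OpenAI({ apiKey: process.env["OPENAI_API_KEY"] });

// Model-graded evaluator that scores responses from 0 to 1.
const relevanceEval = createEvaluator({
  client: oai,
  model: "gpt-4-turbo",
  evaluationDescription: "Rate the relevance and clarity of the response from 0 to 1"
});

const result = await relevanceEval({
  data: [{
    prompt: "What is TypeScript?",
    completion: "TypeScript is a typed superset of JavaScript."
  }]
});

console.log(result.scoreResults);
```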
## When to Use evalz

### Model-Graded Evaluation

Provides human-like judgment for subjective criteria that can't be measured through pure text comparison.

Use when you need qualitative assessment of responses:

- Evaluating RAG system output quality
- Assessing chatbot response appropriateness
- Validating content generation
- Measuring response coherence and fluency

```typescript
const relevanceEval = createEvaluator({
  client: oai,
  model: "gpt-4-turbo",
  evaluationDescription: "Rate relevance and quality from 0-1"
});
```
### Accuracy Evaluation

Gives objective measurements for cases where exact or semantic matching is important.

Use for comparing outputs against known correct answers:

- Question-answering system validation
- Translation accuracy measurement
- Fact-checking systems
- Test case validation

```typescript
const accuracyEval = createAccuracyEvaluator({
  weights: {
    factual: 0.6,
    semantic: 0.4
  }
});
```
### Context Evaluation

Measures how well outputs utilize and stay faithful to the provided context.

Use for assessing responses against source materials:

- RAG system faithfulness
- Document summarization accuracy
- Knowledge extraction validation
- Information retrieval quality

```typescript
const contextEval = createContextEvaluator({
  type: "precision"
});
```
### Composite Evaluation

Provides balanced assessment across multiple dimensions of quality.

Use for comprehensive system assessment:

- Production LLM monitoring
- A/B testing prompts and models
- Quality assurance pipelines
- Multi-factor response validation

```typescript
// Reuses the evaluators defined in the previous snippets.
const compositeEval = createWeightedEvaluator({
  evaluators: {
    relevance: relevanceEval,
    accuracy: accuracyEval,
    context: contextEval
  },
  weights: {
    relevance: 0.4,
    accuracy: 0.4,
    context: 0.2
  }
});
```
## Evaluator Types and Data Requirements

### Context Evaluator Types

```typescript
type ContextEvaluatorType = "entities-recall" | "precision" | "recall" | "relevance";
```

- `entities-recall`: Measures how well the completion captures named entities from the context (a rough sketch of this idea follows the list)
- `precision`: Evaluates how accurate the completion is compared to the context
- `recall`: Measures how much relevant information from the context is included
- `relevance`: Assesses how well the completion relates to the context
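For intuition, here is a rough, illustrative sketch of the `entities-recall` idea: the share of entities found in the contexts that also appear in the completion. This is not evalz's implementation (the library computes this internally), and the naive capitalized-token matcher below is only a stand-in for real entity extraction:

```typescript
// Illustrative only; not evalz's implementation. A naive stand-in for entity
// extraction: treat runs of capitalized tokens as "entities".
function extractEntities(text: string): string[] {
  return text.match(/\b[A-Z][A-Za-z0-9]*(?:\s[A-Z][A-Za-z0-9]*)*\b/g) ?? [];
}

// entities-recall is roughly the share of context entities that appear in the completion.
function entitiesRecall(completion: string, contexts: string[]): number {
  const contextEntities = new Set(
    contexts.flatMap(extractEntities).map(entity => entity.toLowerCase())
  );
  if (contextEntities.size === 0) return 1; // nothing to recall

  const completionText = completion.toLowerCase();
  let found = 0;
  for (const entity of contextEntities) {
    if (completionText.includes(entity)) found++;
  }
  return found / contextEntities.size;
}

// "CEO Jane Smith" and "Q3" are extracted from the context and both appear in
// the completion, so the naive recall here is 1.
console.log(
  entitiesRecall("CEO Jane Smith reported 15% growth in Q3 2023", [
    "CEO Jane Smith presented Q3 results"
  ])
);
```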
### Data Requirements by Evaluator Type

#### Model-Graded Evaluator

```typescript
type ModelGradedData = {
  prompt: string;
  completion: string;
  expectedCompletion?: string;
};

const modelEval = createEvaluator({
  client: oai,
  model: "gpt-4-turbo",
  evaluationDescription: "Rate the response"
});

await modelEval({
  data: [{
    prompt: "What is TypeScript?",
    completion: "TypeScript is a typed superset of JavaScript"
  }]
});
```
#### Accuracy Evaluator

```typescript
type AccuracyData = {
  completion: string;
  expectedCompletion: string;
};

const accuracyEval = createAccuracyEvaluator({
  weights: { factual: 0.5, semantic: 0.5 }
});

await accuracyEval({
  data: [{
    completion: "TypeScript adds types to JavaScript",
    expectedCompletion: "TypeScript is JavaScript with type support"
  }]
});
```
#### Context Evaluator

```typescript
type ContextData = {
  prompt: string;
  completion: string;
  groundTruth: string;
  contexts: string[];
};

const entitiesEval = createContextEvaluator({ type: "entities-recall" });
const precisionEval = createContextEvaluator({ type: "precision" });
const recallEval = createContextEvaluator({ type: "recall" });
const relevanceEval = createContextEvaluator({ type: "relevance" });

const data = {
  prompt: "What did the CEO say about Q3?",
  completion: "CEO Jane Smith reported 15% growth in Q3 2023",
  groundTruth: "The CEO announced strong Q3 performance",
  contexts: [
    "CEO Jane Smith presented Q3 results",
    "Company saw 15% revenue growth in Q3 2023"
  ]
};

await entitiesEval({ data: [data] });
await precisionEval({ data: [data] });
await recallEval({ data: [data] });
await relevanceEval({ data: [data] });
```
#### Composite Evaluation

```typescript
const compositeEval = createWeightedEvaluator({
  evaluators: {
    entities: createContextEvaluator({ type: "entities-recall" }),
    accuracy: createAccuracyEvaluator({
      weights: { factual: 0.9, semantic: 0.1 }
    }),
    quality: createEvaluator({
      client: oai,
      model: "gpt-4-turbo",
      evaluationDescription: "Rate quality"
    })
  },
  weights: {
    entities: 0.3,
    accuracy: 0.4,
    quality: 0.3
  }
});

await compositeEval({
  data: [{
    prompt: "Summarize the earnings call",
    completion: "CEO Jane Smith announced 15% growth",
    expectedCompletion: "The CEO reported strong growth",
    groundTruth: "CEO discussed Q3 performance",
    contexts: [
      "CEO Jane Smith presented Q3 results",
      "Company saw 15% growth in Q3 2023"
    ]
  }]
});
```
## Cookbook

### RAG System Evaluation

Evaluate RAG responses for relevance to source documents and factual accuracy.

```typescript
const ragEvaluator = createWeightedEvaluator({
  evaluators: {
    entities: createContextEvaluator({ type: "entities-recall" }),
    precision: createContextEvaluator({ type: "precision" }),
    recall: createContextEvaluator({ type: "recall" }),
    relevance: createEvaluator({
      client: oai,
      model: "gpt-4-turbo",
      evaluationDescription: "Rate how well the response uses the context"
    })
  },
  weights: {
    entities: 0.2,
    precision: 0.3,
    recall: 0.3,
    relevance: 0.2
  }
});

const result = await ragEvaluator({
  data: [{
    prompt: "What are the key financial metrics?",
    completion: "Revenue grew 25% to $10M in Q3 2023",
    groundTruth: "Q3 2023 saw 25% revenue growth to $10M",
    contexts: [
      "In Q3 2023, company revenue increased 25% to $10M",
      "Operating margins improved to 15%"
    ]
  }]
});
```
### Content Moderation Evaluation

Binary evaluation for content policy compliance, useful for automated content filtering.

```typescript
const moderationEvaluator = createEvaluator({
  client: oai,
  model: "gpt-4-turbo",
  resultsType: "binary",
  evaluationDescription: "Score 1 if content follows all policies (safe, respectful, appropriate), 0 if any violation exists"
});

const moderationResult = await moderationEvaluator({
  data: [
    {
      prompt: "Describe our product benefits",
      completion: "Our product helps improve productivity",
      expectedCompletion: "Professional product description"
    },
    {
      prompt: "Respond to negative review",
      completion: "Your complaint is totally wrong...",
      expectedCompletion: "Professional response to feedback"
    }
  ]
});
```
### Student Answer Evaluation

Demonstrates weighted evaluation combining exact matching, semantic understanding, and qualitative assessment.

```typescript
const gradingEvaluator = createWeightedEvaluator({
  evaluators: {
    keyTerms: createAccuracyEvaluator({
      weights: { factual: 0.9, semantic: 0.1 }
    }),
    understanding: createAccuracyEvaluator({
      weights: { factual: 0.2, semantic: 0.8 }
    }),
    quality: createEvaluator({
      client: oai,
      model: "gpt-4-turbo",
      evaluationDescription: "Rate answer completeness and clarity 0-1"
    })
  },
  weights: {
    keyTerms: 0.3,
    understanding: 0.4,
    quality: 0.3
  }
});

const gradingResult = await gradingEvaluator({
  data: [{
    prompt: "Explain how photosynthesis works",
    completion: "Plants convert sunlight into chemical energy through chlorophyll",
    expectedCompletion: "Photosynthesis is the process where plants use chlorophyll to convert sunlight, water, and CO2 into glucose and oxygen"
  }]
});
```
### Chatbot Quality Assessment

Monitor chatbot response quality across multiple dimensions.

```typescript
const chatbotEvaluator = createWeightedEvaluator({
  evaluators: {
    relevance: createEvaluator({
      client: oai,
      model: "gpt-4-turbo",
      evaluationDescription: "Rate how well the response addresses the user's query"
    }),
    tone: createEvaluator({
      client: oai,
      model: "gpt-4-turbo",
      evaluationDescription: "Rate the professionalism and friendliness of the response"
    }),
    accuracy: createAccuracyEvaluator({
      weights: { semantic: 0.8, factual: 0.2 }
    })
  },
  weights: {
    relevance: 0.4,
    tone: 0.3,
    accuracy: 0.3
  }
});

const result = await chatbotEvaluator({
  data: [{
    prompt: "How do I reset my password?",
    completion: "You can reset your password by clicking the 'Forgot Password' link on the login page.",
    expectedCompletion: "To reset your password, use the 'Forgot Password' option at login.",
    contexts: ["Previous support interactions"]
  }]
});
```
### Content Generation Pipeline

Evaluate generated content for quality and accuracy.

```typescript
const contentEvaluator = createWeightedEvaluator({
  evaluators: {
    quality: createEvaluator({
      client: oai,
      model: "gpt-4-turbo",
      evaluationDescription: "Rate clarity, structure, and engagement"
    }),
    factCheck: createAccuracyEvaluator({
      weights: { factual: 1.0 }
    }),
    citations: createContextEvaluator({
      type: "entities-recall"
    })
  },
  weights: {
    quality: 0.4,
    factCheck: 0.4,
    citations: 0.2
  }
});

const result = await contentEvaluator({
  data: [{
    prompt: "Write an article about renewable energy trends",
    completion: "Solar and wind power installations increased by 30% in 2023...",
    contexts: [
      "Global renewable energy deployment grew by 30% year-over-year",
      "Solar and wind remained the fastest-growing sectors"
    ],
    groundTruth: "Renewable energy saw significant growth, led by solar and wind"
  }]
});
```
### Document Processing System

Evaluate document extraction and summarization quality.

```typescript
const documentEvaluator = createWeightedEvaluator({
  evaluators: {
    extraction: createContextEvaluator({
      type: "recall"
    }),
    summary: createEvaluator({
      client: oai,
      model: "gpt-4-turbo",
      evaluationDescription: "Rate conciseness and completeness"
    }),
    accuracy: createAccuracyEvaluator({
      weights: { semantic: 0.6, factual: 0.4 }
    })
  },
  weights: {
    extraction: 0.4,
    summary: 0.3,
    accuracy: 0.3
  }
});

const result = await documentEvaluator({
  data: [{
    prompt: "Summarize the quarterly report",
    completion: "Q3 revenue grew 25% YoY, driven by new product launches...",
    contexts: [
      "Revenue increased 25% compared to Q3 2022",
      "Growth primarily attributed to successful product launches"
    ],
    groundTruth: "Q3 saw 25% YoY revenue growth due to new products"
  }]
});
```
## API Reference

### createEvaluator

Creates a basic evaluator for assessing AI-generated content based on custom criteria.

#### Parameters

- `client`: OpenAI instance.
- `model`: OpenAI model to use (e.g., `"gpt-4o"`).
- `evaluationDescription`: Description guiding the evaluation criteria.
- `resultsType`: Type of results to return (`"score"` or `"binary"`).
- `messages`: Additional messages to include in the OpenAI API call.

#### Example

```typescript
import { createEvaluator } from "evalz";
import OpenAI from "openai";

const oai = new OpenAI({
  apiKey: process.env["OPENAI_API_KEY"],
  organization: process.env["OPENAI_ORG_ID"]
});

const evaluator = createEvaluator({
  client: oai,
  model: "gpt-4-turbo",
  evaluationDescription: "Rate the relevance from 0 to 1."
});

const result = await evaluator({
  data: [{
    prompt: "Discuss the importance of AI.",
    completion: "AI is important for future technology.",
    expectedCompletion: "AI is important for future technology."
  }]
});

console.log(result.scoreResults);
```
### createAccuracyEvaluator

Creates an evaluator that assesses string similarity using a hybrid approach: Levenshtein distance (factual similarity) and semantic embeddings (semantic similarity), with customizable weights.

#### Parameters

- `model` (optional): `OpenAI.Embeddings.EmbeddingCreateParams["model"]` - The OpenAI embedding model to use. Defaults to `"text-embedding-3-small"`.
- `weights` (optional): An object specifying the weights for factual and semantic similarities. Defaults to `{ factual: 0.5, semantic: 0.5 }`.

#### Example

```typescript
import { createAccuracyEvaluator } from "evalz";

const evaluator = createAccuracyEvaluator({
  model: "text-embedding-3-small",
  weights: { factual: 0.4, semantic: 0.6 }
});

const data = [
  {
    completion: "Einstein was born in Germany in 1879.",
    expectedCompletion: "Einstein was born in 1879 in Germany."
  }
];

const result = await evaluator({ data });
console.log(result.scoreResults);
```
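For intuition, the hybrid score can be thought of as a weighted blend of a string-distance signal and an embedding-similarity signal. The sketch below is illustrative only, not evalz's internal implementation, and assumes the OpenAI embeddings API for the semantic part:

```typescript
// Illustrative only; not evalz's internal implementation.
import OpenAI from "openai";

const oai = new OpenAI({ apiKey: process.env["OPENAI_API_KEY"] });

// Normalized Levenshtein similarity: the "factual" signal.
function levenshteinSimilarity(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,
        dp[i][j - 1] + 1,
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)
      );
    }
  }
  return 1 - dp[a.length][b.length] / Math.max(a.length, b.length, 1);
}

// Cosine similarity of embeddings: the "semantic" signal.
async function semanticSimilarity(a: string, b: string): Promise<number> {
  const { data } = await oai.embeddings.create({
    model: "text-embedding-3-small",
    input: [a, b]
  });
  const [x, y] = [data[0].embedding, data[1].embedding];
  const dot = x.reduce((sum, xi, i) => sum + xi * y[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, vi) => s + vi * vi, 0));
  return dot / (norm(x) * norm(y));
}

// Weighted blend mirroring `weights: { factual: 0.4, semantic: 0.6 }`.
async function hybridScore(completion: string, expectedCompletion: string): Promise<number> {
  const factual = levenshteinSimilarity(completion, expectedCompletion);
  const semantic = await semanticSimilarity(completion, expectedCompletion);
  return 0.4 * factual + 0.6 * semantic;
}
```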
### createWeightedEvaluator

Combines multiple evaluators with specified weights for a comprehensive assessment.

#### Parameters

- `evaluators`: An object mapping evaluator names to evaluator functions.
- `weights`: An object mapping evaluator names to their corresponding weights.

#### Example

```typescript
import { createWeightedEvaluator } from "evalz";

const weightedEvaluator = createWeightedEvaluator({
  evaluators: {
    relevance: relevanceEval(),
    fluency: fluencyEval(),
    completeness: completenessEval()
  },
  weights: {
    relevance: 0.25,
    fluency: 0.25,
    completeness: 0.5
  }
});

const result = await weightedEvaluator({ data: yourResponseData });
console.log(result.scoreResults);
```
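Conceptually, the combined result is a weighted average of the individual evaluator scores. A minimal illustration of that arithmetic, assuming each evaluator yields a score between 0 and 1:

```typescript
// Illustrative arithmetic only; the per-evaluator scores here are made-up 0-1 values.
const scores = { relevance: 0.9, fluency: 0.8, completeness: 0.7 };
const weights = { relevance: 0.25, fluency: 0.25, completeness: 0.5 };

const combined = (Object.keys(scores) as (keyof typeof scores)[]).reduce(
  (total, name) => total + scores[name] * weights[name],
  0
);

console.log(combined); // 0.9 * 0.25 + 0.8 * 0.25 + 0.7 * 0.5 = 0.775
```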
### Create a Composite Weighted Evaluation

A weighted evaluator that incorporates several evaluation types:

#### Example

```typescript
import { createEvaluator, createAccuracyEvaluator, createContextEvaluator, createWeightedEvaluator } from "evalz";
import OpenAI from "openai";

const oai = new OpenAI({
  apiKey: process.env["OPENAI_API_KEY"],
  organization: process.env["OPENAI_ORG_ID"]
});

const relevanceEval = () => createEvaluator({
  client: oai,
  model: "gpt-4-turbo",
  evaluationDescription: "Please rate the relevance of the response from 0 (not at all relevant) to 1 (highly relevant), considering whether the AI stayed on topic and provided a reasonable answer."
});

const distanceEval = () => createAccuracyEvaluator({
  weights: { factual: 0.5, semantic: 0.5 }
});

const semanticEval = () => createAccuracyEvaluator({
  weights: { factual: 0.0, semantic: 1.0 }
});
const fluencyEval = () => createEvaluator({
  client: oai,
  model: "gpt-4-turbo",
  evaluationDescription: "Please rate the fluency of the response from 0 (not at all fluent) to 1 (highly fluent), considering grammar, word choice, and readability."
});
const completenessEval = () => createEvaluator({
  client: oai,
  model: "gpt-4-turbo",
  evaluationDescription: "Please rate the completeness of the response from 0 (not at all complete) to 1 (completely answered), considering whether the AI addressed all parts of the prompt."
});

const contextEntitiesRecallEval = () => createContextEvaluator({ type: "entities-recall" });
const contextPrecisionEval = () => createContextEvaluator({ type: "precision" });
const contextRecallEval = () => createContextEvaluator({ type: "recall" });
const contextRelevanceEval = () => createContextEvaluator({ type: "relevance" });

const compositeWeightedEvaluator = createWeightedEvaluator({
  evaluators: {
    relevance: relevanceEval(),
    fluency: fluencyEval(),
    completeness: completenessEval(),
    accuracy: createAccuracyEvaluator({ weights: { factual: 0.6, semantic: 0.4 } }),
    contextPrecision: contextPrecisionEval()
  },
  weights: {
    relevance: 0.2,
    fluency: 0.2,
    completeness: 0.2,
    accuracy: 0.2,
    contextPrecision: 0.2
  }
});

const data = [
  {
    prompt: "When was the first super bowl?",
    completion: "The first super bowl was held on January 15, 1967.",
    expectedCompletion: "The first superbowl was held on January 15, 1967.",
    contexts: ["The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles."],
    groundTruth: "The first superbowl was held on January 15, 1967."
  }
];

const result = await compositeWeightedEvaluator({ data });
console.log(result.scoreResults);
```
### createContextEvaluator

Creates an evaluator that assesses context-based criteria such as relevance, precision, recall, and entities recall.

#### Parameters

- `type`: `"entities-recall" | "precision" | "recall" | "relevance"` - The type of context evaluation to perform.
- `model` (optional): `OpenAI.Embeddings.EmbeddingCreateParams["model"]` - The OpenAI embedding model to use. Defaults to `"text-embedding-3-small"`.

#### Example

```typescript
import { createContextEvaluator } from "evalz";

const entitiesRecallEvaluator = createContextEvaluator({ type: "entities-recall" });
const precisionEvaluator = createContextEvaluator({ type: "precision" });
const recallEvaluator = createContextEvaluator({ type: "recall" });
const relevanceEvaluator = createContextEvaluator({ type: "relevance" });

const data = [
  {
    prompt: "When was the first super bowl?",
    completion: "The first superbowl was held on January 15, 1967.",
    groundTruth: "The first superbowl was held on January 15, 1967.",
    contexts: [
      "The First AFL–NFL World Championship Game was an American football game played on January 15, 1967 at the Los Angeles Memorial Coliseum in Los Angeles.",
      "This first championship game is retroactively referred to as Super Bowl I."
    ]
  }
];

const result1 = await entitiesRecallEvaluator({ data });
console.log(result1.scoreResults);

const result2 = await precisionEvaluator({ data });
console.log(result2.scoreResults);

const result3 = await recallEvaluator({ data });
console.log(result3.scoreResults);

const result4 = await relevanceEvaluator({ data });
console.log(result4.scoreResults);
```
## Integration with Island AI

Part of the Island AI toolkit.

## Contributing

We welcome contributions!

## License

MIT © hack.dance