# @mastra/evals
A comprehensive evaluation framework for assessing AI model outputs across multiple dimensions.
## Installation

```bash
npm install @mastra/evals
```
## Overview

`@mastra/evals` provides a suite of evaluation metrics for assessing AI model outputs. The package includes both LLM-based and NLP-based metrics, supporting automated as well as model-assisted evaluation of AI responses.
## Features

### LLM-Based Metrics

Model-graded metrics, such as the Toxicity and Faithfulness metrics used in the examples below, that rely on an LLM judge to score responses.

### NLP-Based Metrics
- **Completeness**
  - Analyzes structural completeness of responses
  - Identifies missing elements from input requirements
  - Provides detailed element coverage analysis
  - Tracks input-output element ratios
- **Content Similarity**
  - Measures text similarity between inputs and outputs
  - Configurable for case and whitespace sensitivity
  - Returns normalized similarity scores
  - Uses string comparison algorithms for accuracy
- **Keyword Coverage**
  - Tracks presence of key terms from input in output
  - Provides detailed keyword matching statistics
  - Calculates coverage ratios
  - Useful for ensuring comprehensive responses (a usage sketch follows this list)
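The NLP-based metrics run locally and need no model. The snippet below is a minimal sketch, assuming the `CompletenessMetric` and `KeywordCoverageMetric` classes exported from `@mastra/evals/nlp` with argument-free constructors:

```typescript
import { CompletenessMetric, KeywordCoverageMetric } from '@mastra/evals/nlp';

// No model configuration is needed: these metrics compare the two strings directly.
const completeness = new CompletenessMetric();
const keywordCoverage = new KeywordCoverageMetric();

const input = 'List the primary colors: red, blue, and yellow.';
const output = 'The primary colors are red and blue.';

const completenessResult = await completeness.measure(input, output);
const coverageResult = await keywordCoverage.measure(input, output);

console.log('Completeness Score:', completenessResult.score);
console.log('Keyword Coverage Score:', coverageResult.score);
// `info` carries metric-specific details such as matched and missing terms.
console.log('Coverage Details:', coverageResult.info);
```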
## Usage

### Basic Example

```typescript
import { openai } from '@ai-sdk/openai';
import { ContentSimilarityMetric, ToxicityMetric } from '@mastra/evals';

// NLP-based metric: compares the input and output strings directly.
const similarityMetric = new ContentSimilarityMetric({
  ignoreCase: true,
  ignoreWhitespace: true,
});

// LLM-based metric: uses the configured model as a judge.
const toxicityMetric = new ToxicityMetric({
  model: openai('gpt-4'),
  scale: 1,
});

const input = 'What is the capital of France?';
const output = 'Paris is the capital of France.';

const similarityResult = await similarityMetric.measure(input, output);
const toxicityResult = await toxicityMetric.measure(input, output);

console.log('Similarity Score:', similarityResult.score);
console.log('Toxicity Score:', toxicityResult.score);
```
### Context-Aware Evaluation

```typescript
import { openai } from '@ai-sdk/openai';
import { FaithfulnessMetric } from '@mastra/evals';

const faithfulnessMetric = new FaithfulnessMetric({
  model: openai('gpt-4'),
  context: ['Paris is the capital of France', 'Paris has a population of 2.2 million'],
  scale: 1,
});

const result = await faithfulnessMetric.measure(
  'Tell me about Paris',
  'Paris is the capital of France with 2.2 million residents',
);

console.log('Faithfulness Score:', result.score);
console.log('Reasoning:', result.reason);
```
## Metric Results

Each metric returns a standardized result object containing:

- `score`: Normalized score (typically 0-1)
- `info`: Detailed information about the evaluation
  - Additional metric-specific data (e.g., matched keywords, missing elements)

Some metrics also provide:

- `reason`: Detailed explanation of the score
- `verdicts`: Individual judgments that contributed to the final score
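These fields make it straightforward to gate outputs in tests or pipelines. The helper below is a minimal sketch based only on the result shape described above; the `EvalResult` type and `assertScoreAtLeast` name are local illustrations, not package exports:

```typescript
// Minimal structural type matching the fields described above;
// a local definition for illustration, not an export of @mastra/evals.
type EvalResult = {
  score: number;
  info?: Record<string, unknown>;
  reason?: string;
};

// Hypothetical helper: fail a check when the normalized score drops below a threshold.
function assertScoreAtLeast(result: EvalResult, threshold: number, label: string): void {
  if (result.score < threshold) {
    const suffix = result.reason ? `: ${result.reason}` : '';
    throw new Error(`${label} scored ${result.score.toFixed(2)} (< ${threshold})${suffix}`);
  }
}

// e.g. require at least 0.8 similarity from the Basic Example above:
// assertScoreAtLeast(similarityResult, 0.8, 'Content similarity');
```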
## Telemetry and Logging

The package includes built-in telemetry and logging capabilities:

- Automatic evaluation tracking through Mastra Storage
- Integration with OpenTelemetry for performance monitoring
- Detailed evaluation traces for debugging
```typescript
import { attachListeners } from '@mastra/evals';

// Capture evaluation results; pass a Mastra instance to persist them in Mastra Storage.
await attachListeners();
await attachListeners(mastra);
```
## Environment Variables

Required for LLM-based metrics:

- `OPENAI_API_KEY`: For OpenAI model access
- Additional provider keys as needed (Cohere, Anthropic, etc.)
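When using the AI SDK's OpenAI provider as in the snippets above, the default `openai` instance reads `OPENAI_API_KEY` from the environment automatically. The key can also be wired explicitly; a sketch, assuming the `createOpenAI` factory from `@ai-sdk/openai`:

```typescript
import { createOpenAI } from '@ai-sdk/openai';
import { ToxicityMetric } from '@mastra/evals';

// Explicitly supply the API key instead of relying on the default
// OPENAI_API_KEY lookup performed by the `openai` provider instance.
const openai = createOpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

const toxicityMetric = new ToxicityMetric({
  model: openai('gpt-4'),
  scale: 1,
});
```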
## Package Exports

```typescript
import { evaluate } from '@mastra/evals';
import { ContentSimilarityMetric } from '@mastra/evals/nlp';
```
## Related Packages

- `@mastra/core`: Core framework functionality
- `@mastra/engine`: LLM execution engine
- `@mastra/mcp`: Model Context Protocol integration