
@caleblawson/evals
A comprehensive evaluation framework for assessing AI model outputs across multiple dimensions.
npm install @mastra/evals
@mastra/evals provides a suite of evaluation metrics for assessing AI model outputs. The package includes both LLM-based and NLP-based metrics, enabling model-assisted as well as fully automated evaluation of AI responses.

LLM-based metrics:
Answer Relevancy
Bias Detection
Context Precision & Relevancy
Faithfulness
Prompt Alignment
Toxicity

NLP-based metrics:
Completeness
Content Similarity
Keyword Coverage
import { openai } from '@ai-sdk/openai';
import { ContentSimilarityMetric, ToxicityMetric } from '@mastra/evals';

// Initialize metrics
const similarityMetric = new ContentSimilarityMetric({
  ignoreCase: true,
  ignoreWhitespace: true,
});

const toxicityMetric = new ToxicityMetric({
  model: openai('gpt-4'),
  scale: 1, // Optional: adjust scoring scale
});

// Evaluate outputs
const input = 'What is the capital of France?';
const output = 'Paris is the capital of France.';

const similarityResult = await similarityMetric.measure(input, output);
const toxicityResult = await toxicityMetric.measure(input, output);

console.log('Similarity Score:', similarityResult.score);
console.log('Toxicity Score:', toxicityResult.score);
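The NLP-based metrics follow the same measure() pattern but run locally and require no model. As an illustration, a keyword-coverage check might look like the following (a minimal sketch assuming KeywordCoverageMetric is exported from the @mastra/evals/nlp subpath shown below and takes no constructor options):

import { KeywordCoverageMetric } from '@mastra/evals/nlp';

// Assumed no-argument constructor; measures how many keywords
// from the input also appear in the output
const coverageMetric = new KeywordCoverageMetric();

const coverageResult = await coverageMetric.measure(
  'What is the capital of France?',
  'Paris is the capital of France.',
);

console.log('Keyword Coverage Score:', coverageResult.score);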
import { openai } from '@ai-sdk/openai';
import { FaithfulnessMetric } from '@mastra/evals';

// Initialize with context
const faithfulnessMetric = new FaithfulnessMetric({
  model: openai('gpt-4'),
  context: ['Paris is the capital of France', 'Paris has a population of 2.2 million'],
  scale: 1,
});

// Evaluate response against context
const result = await faithfulnessMetric.measure(
  'Tell me about Paris',
  'Paris is the capital of France with 2.2 million residents',
);

console.log('Faithfulness Score:', result.score);
console.log('Reasoning:', result.reason);
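The remaining LLM-based metrics are configured in the same way. For example, an answer-relevancy check might look like this (a sketch that assumes AnswerRelevancyMetric accepts the same { model, scale } options object shown for ToxicityMetric and FaithfulnessMetric above):

import { openai } from '@ai-sdk/openai';
import { AnswerRelevancyMetric } from '@mastra/evals';

// Scores how directly the output addresses the input question
const relevancyMetric = new AnswerRelevancyMetric({
  model: openai('gpt-4'),
  scale: 1,
});

const relevancyResult = await relevancyMetric.measure(
  'What is the capital of France?',
  'Paris is the capital of France.',
);

console.log('Relevancy Score:', relevancyResult.score);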
Each metric returns a standardized result object containing:
score: Normalized score (typically 0-1)
info: Detailed information about the evaluation

Some metrics also provide:
reason: Detailed explanation of the score
verdicts: Individual judgments that contributed to the final score
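Put together, a result has roughly this shape (a hypothetical TypeScript sketch inferred from the fields described above, not the package's actual exported types):

// Hypothetical result shape, inferred from the documented fields
interface MetricResult {
  score: number;                   // normalized score, typically 0-1
  info?: Record<string, unknown>;  // detailed evaluation information
  reason?: string;                 // explanation of the score (some metrics)
  verdicts?: unknown[];            // individual judgments behind the score (some metrics)
}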
The package includes built-in telemetry and logging capabilities:

import { attachListeners } from '@mastra/evals';

// Enable basic evaluation tracking
await attachListeners();

// Store evals in Mastra Storage (if storage is enabled)
await attachListeners(mastra);

// Note: When using in-memory storage, evaluations are isolated to the test process.
// When using file storage, evaluations are persisted and can be queried later.
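Here, mastra is a configured instance of the core framework. A minimal sketch, assuming the Mastra class from @mastra/core (listed under related packages below) with its storage configuration omitted:

import { Mastra } from '@mastra/core';
import { attachListeners } from '@mastra/evals';

// Hypothetical minimal setup: construct a Mastra instance
// (storage config omitted) and attach eval listeners so
// results are written to Mastra Storage
const mastra = new Mastra({});

await attachListeners(mastra);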
Required for LLM-based metrics:
OPENAI_API_KEY: For OpenAI model access

// Main package exports
import { evaluate } from '@mastra/evals';
// NLP-specific metrics
import { ContentSimilarityMetric } from '@mastra/evals/nlp';
Related packages:
@mastra/core: Core framework functionality
@mastra/engine: LLM execution engine
@mastra/mcp: Model Context Protocol integration