@caleblawson/evals

A comprehensive evaluation framework for assessing AI model outputs across multiple dimensions.

Version: 0.10.4 (latest, npm)
Maintainers: 1

@mastra/evals

A comprehensive evaluation framework for assessing AI model outputs across multiple dimensions.

Installation

npm install @mastra/evals

Overview

@mastra/evals provides a suite of evaluation metrics for assessing AI model outputs. The package includes both LLM-based and NLP-based metrics, enabling both automated and model-assisted evaluation of AI responses.

Features

LLM-Based Metrics

  • Answer Relevancy

    • Evaluates how well an answer addresses the input question
    • Considers uncertainty weighting for more nuanced scoring
    • Returns detailed reasoning for scores
  • Bias Detection

    • Identifies potential biases in model outputs
    • Analyzes opinions and statements for bias indicators
    • Provides explanations for detected biases
    • Configurable scoring scale
  • Context Precision & Relevancy

    • Assesses how well responses use provided context
    • Evaluates accuracy of context usage
    • Measures relevance of context to the response
    • Analyzes context positioning in responses
  • Faithfulness

    • Verifies that responses are faithful to provided context
    • Detects hallucinations or fabricated information
    • Evaluates claims against provided context
    • Provides detailed analysis of faithfulness breaches
  • Prompt Alignment

    • Measures how well responses follow given instructions
    • Evaluates adherence to multiple instruction criteria
    • Provides per-instruction scoring
    • Supports custom instruction sets (see the sketch after this list)
  • Toxicity

    • Detects toxic or harmful content in responses
    • Provides detailed reasoning for toxicity verdicts
    • Configurable scoring thresholds
    • Considers both input and output context
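
As a rough sketch of how these metrics are constructed: the class names below are assumed to mirror the metric names above, the constructor shape follows the Usage examples further down, and the instructions option for prompt alignment is an assumption rather than a confirmed signature.

import { AnswerRelevancyMetric, BiasMetric, PromptAlignmentMetric } from '@mastra/evals';
import { openai } from '@ai-sdk/openai';

// Answer relevancy: how well the answer addresses the question
const relevancy = new AnswerRelevancyMetric({
  model: openai('gpt-4'),
  scale: 1,
});

// Bias detection with the default scoring scale
const bias = new BiasMetric({
  model: openai('gpt-4'),
});

// Prompt alignment against a custom instruction set
// (the `instructions` option name is an assumption)
const alignment = new PromptAlignmentMetric({
  model: openai('gpt-4'),
  instructions: ['Answer in a single sentence', 'Do not include opinions'],
});

const alignmentResult = await alignment.measure(
  'What is the capital of France?',
  'Paris is the capital of France.',
);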

NLP-Based Metrics

  • Completeness

    • Analyzes structural completeness of responses
    • Identifies missing elements from input requirements
    • Provides detailed element coverage analysis
    • Tracks input-output element ratios
  • Content Similarity

    • Measures text similarity between inputs and outputs
    • Configurable for case and whitespace sensitivity
    • Returns normalized similarity scores
    • Uses string comparison algorithms for accuracy
  • Keyword Coverage

    • Tracks presence of key terms from input in output
    • Provides detailed keyword matching statistics
    • Calculates coverage ratios
    • Useful for ensuring comprehensive responses (see the sketch after this list)
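
The NLP-based metrics can be sketched the same way; a minimal example, assuming the class names mirror the metric names above, the /nlp subpath shown under Package Exports, and zero-argument constructors:

import { CompletenessMetric, KeywordCoverageMetric } from '@mastra/evals/nlp';

// NLP-based metrics need no model; they operate directly on the text
const completeness = new CompletenessMetric();
const keywords = new KeywordCoverageMetric();

const input = 'Explain the water cycle: evaporation, condensation, precipitation.';
const output = 'Water evaporates, condenses into clouds, and falls as precipitation.';

const completenessResult = await completeness.measure(input, output);
const keywordResult = await keywords.measure(input, output);

console.log('Element coverage:', completenessResult.score);
console.log('Keyword coverage:', keywordResult.score);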

Usage

Basic Example

import { ContentSimilarityMetric, ToxicityMetric } from '@mastra/evals';
import { openai } from '@ai-sdk/openai';

// Initialize metrics
const similarityMetric = new ContentSimilarityMetric({
  ignoreCase: true,
  ignoreWhitespace: true,
});

const toxicityMetric = new ToxicityMetric({
  model: openai('gpt-4'),
  scale: 1, // Optional: adjust scoring scale
});

// Evaluate outputs
const input = 'What is the capital of France?';
const output = 'Paris is the capital of France.';

const similarityResult = await similarityMetric.measure(input, output);
const toxicityResult = await toxicityMetric.measure(input, output);

console.log('Similarity Score:', similarityResult.score);
console.log('Toxicity Score:', toxicityResult.score);

Context-Aware Evaluation

import { FaithfulnessMetric } from '@mastra/evals';
import { openai } from '@ai-sdk/openai';

// Initialize with context
const faithfulnessMetric = new FaithfulnessMetric({
  model: openai('gpt-4'),
  context: ['Paris is the capital of France', 'Paris has a population of 2.2 million'],
  scale: 1,
});

// Evaluate response against context
const result = await faithfulnessMetric.measure(
  'Tell me about Paris',
  'Paris is the capital of France with 2.2 million residents',
);

console.log('Faithfulness Score:', result.score);
console.log('Reasoning:', result.reason);

Metric Results

Each metric returns a standardized result object containing:

  • score: Normalized score (typically 0-1)
  • info: Detailed information about the evaluation
  • Additional metric-specific data (e.g., matched keywords, missing elements)

Some metrics also provide:

  • reason: Detailed explanation of the score
  • verdicts: Individual judgments that contributed to the final score
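
A minimal sketch of reading these fields, reusing the toxicityMetric from the Basic Example above (the exact shape of info and verdicts varies per metric and is assumed here):

const result = await toxicityMetric.measure(input, output);

console.log(result.score); // normalized score, typically 0-1
console.log(result.info);  // metric-specific details (matched keywords, missing elements, ...)

// LLM-based metrics may additionally expose:
// result.reason   - explanation of the score (see the Faithfulness example above)
// result.verdicts - individual judgments behind the final score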

Telemetry and Logging

The package includes built-in telemetry and logging capabilities:

  • Automatic evaluation tracking through Mastra Storage
  • Integration with OpenTelemetry for performance monitoring
  • Detailed evaluation traces for debugging

import { attachListeners } from '@mastra/evals';

// Enable basic evaluation tracking
await attachListeners();

// Store evals in Mastra Storage (if storage is enabled);
// `mastra` here is your configured Mastra instance
await attachListeners(mastra);
// Note: When using in-memory storage, evaluations are isolated to the test process.
// When using file storage, evaluations are persisted and can be queried later.

Environment Variables

Required for LLM-based metrics:

  • OPENAI_API_KEY: For OpenAI model access
  • Additional provider keys as needed (Cohere, Anthropic, etc.)
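
A minimal sketch, assuming the openai() helper used in the examples above comes from the @ai-sdk/openai provider: the provider reads OPENAI_API_KEY from the environment, so no key has to be passed to the metric itself.

import { openai } from '@ai-sdk/openai';
import { ToxicityMetric } from '@mastra/evals';

// openai() picks up OPENAI_API_KEY from the environment,
// so the metric only needs a model reference, not an explicit key
const toxicity = new ToxicityMetric({
  model: openai('gpt-4'),
});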

Package Exports

// Main package exports
import { evaluate } from '@mastra/evals';
// NLP-specific metrics
import { ContentSimilarityMetric } from '@mastra/evals/nlp';

Related Packages

  • @mastra/core: Core framework functionality
  • @mastra/engine: LLM execution engine
  • @mastra/mcp: Model Context Protocol integration

Package last updated on 20 Jun 2025
