# NovaEval by Noveum.ai

A comprehensive, extensible AI model evaluation framework designed for production use. NovaEval provides a unified interface for evaluating language models across various datasets, metrics, and deployment scenarios.
We're looking for contributors! See the Contributing section below for ways to help.
## 🤝 We Need Your Help!
NovaEval is an open-source project that thrives on community contributions. Whether you're a seasoned developer or just getting started, there are many ways to contribute:
### 🎯 High-Priority Contribution Areas
We're actively looking for contributors in these key areas:
- 🧪 Unit Tests: Help us improve our test coverage (currently 23% overall, 90%+ for core modules)
- 📚 Examples: Create real-world evaluation examples and use cases
- 📝 Guides & Notebooks: Write evaluation guides and interactive Jupyter notebooks
- 📖 Documentation: Improve API documentation and user guides
- 🔍 RAG Metrics: Add more metrics specifically for Retrieval-Augmented Generation evaluation
- 🤖 Agent Evaluation: Build frameworks for evaluating AI agents and multi-turn conversations
### 🚀 Getting Started as a Contributor
- Start Small: Pick up issues labeled `good first issue` or `help wanted`
- Join Discussions: Share your ideas in GitHub Discussions
- Review Code: Help review pull requests and provide feedback
- Report Issues: Found a bug? Report it in GitHub Issues
- Spread the Word: Star the repository and share with your network
## 🚀 Features
- Multi-Model Support: Evaluate models from OpenAI, Anthropic, AWS Bedrock, and custom providers
- Extensible Scoring: Built-in scorers for accuracy, semantic similarity, code evaluation, and custom metrics
- Dataset Integration: Support for MMLU, HuggingFace datasets, custom datasets, and more
- Production Ready: Docker support, Kubernetes deployment, and cloud integrations
- Comprehensive Reporting: Detailed evaluation reports, artifacts, and visualizations
- Secure: Built-in credential management and secret store integration
- Scalable: Designed for both local testing and large-scale production evaluations
- Cross-Platform: Tested on macOS, Linux, and Windows with comprehensive CI/CD
## 📦 Installation
### From PyPI (Recommended)
```bash
pip install novaeval
```
### From Source
```bash
git clone https://github.com/Noveum/NovaEval.git
cd NovaEval
pip install -e .
```
### Docker
```bash
docker pull noveum/novaeval:latest
```
## 🏃 Quick Start
### Basic Evaluation
```python
from novaeval import Evaluator
from novaeval.datasets import MMLUDataset
from novaeval.models import OpenAIModel
from novaeval.scorers import AccuracyScorer

MAX_TOKENS = 100

# Load a small slice of MMLU for a quick smoke test
dataset = MMLUDataset(
    subset="elementary_mathematics",
    num_samples=10,
    split="test"
)

model = OpenAIModel(
    model_name="gpt-4o-mini",
    temperature=0.0,
    max_tokens=MAX_TOKENS
)

scorer = AccuracyScorer(extract_answer=True)

evaluator = Evaluator(
    dataset=dataset,
    models=[model],
    scorers=[scorer],
    output_dir="./results"
)

results = evaluator.run()

# Print the mean score and sample count per scorer, per model
for model_name, model_results in results["model_results"].items():
    for scorer_name, score_info in model_results["scores"].items():
        if isinstance(score_info, dict):
            mean_score = score_info.get("mean", 0)
            count = score_info.get("count", 0)
            print(f"{scorer_name}: {mean_score:.4f} ({count} samples)")
```
### Configuration-Based Evaluation
```python
from novaeval import Evaluator

evaluator = Evaluator.from_config("evaluation_config.yaml")
results = evaluator.run()
```
### Command Line Interface
NovaEval provides a comprehensive CLI for running evaluations:
```bash
# Run an evaluation from a config file
novaeval run config.yaml

# Quick evaluation with minimal options
novaeval quick -d mmlu -m gpt-4 -s accuracy

# Discover available components
novaeval list-datasets
novaeval list-models
novaeval list-scorers

# Generate a sample configuration file
novaeval generate-config sample-config.yaml
```
📖 Complete CLI Reference - Detailed documentation for all CLI commands and options
### Example Configuration
```yaml
dataset:
  type: "mmlu"
  subset: "abstract_algebra"
  num_samples: 500

models:
  - type: "openai"
    model_name: "gpt-4"
    temperature: 0.0
  - type: "anthropic"
    model_name: "claude-3-opus"
    temperature: 0.0

scorers:
  - type: "accuracy"
  - type: "semantic_similarity"
    threshold: 0.8

output:
  directory: "./results"
  formats: ["json", "csv", "html"]
  upload_to_s3: true
  s3_bucket: "my-eval-results"
```
## 🌐 HTTP API
NovaEval provides a FastAPI-based HTTP API for programmatic access to evaluation capabilities. This enables easy integration with web applications, microservices, and CI/CD pipelines.
### Quick API Start
```bash
# Install API dependencies
pip install -e ".[api]"

# Start the API server
uvicorn app.main:app --host 0.0.0.0 --port 8000

# Browse the interactive OpenAPI docs
open http://localhost:8000/docs
```
### Core API Endpoints
- Health Check: `GET /health` - Service health status
- Component Discovery: `GET /api/v1/components/` - List available models, datasets, scorers
- Model Operations: `POST /api/v1/models/{model}/predict` - Generate predictions
- Dataset Operations: `POST /api/v1/datasets/{dataset}/load` - Load and query datasets
- Scorer Operations: `POST /api/v1/scorers/{scorer}/score` - Score predictions
- Evaluation Jobs: `POST /api/v1/evaluations/submit` - Submit async evaluation jobs
### Example API Usage
```python
import requests

# Submit an asynchronous evaluation job
evaluation_config = {
    "name": "api_evaluation",
    "models": [{"provider": "openai", "identifier": "gpt-3.5-turbo"}],
    "datasets": [{"name": "mmlu", "split": "test", "limit": 10}],
    "scorers": [{"name": "accuracy"}]
}

response = requests.post(
    "http://localhost:8000/api/v1/evaluations/submit",
    json=evaluation_config
)
task_id = response.json()["task_id"]
print(f"Evaluation started: {task_id}")
```
### Deployment Options
- Docker: `docker run -p 8000:8000 novaeval-api:latest`
- Kubernetes: Full manifests provided in `kubernetes/`
- Cloud Platforms: Supports AWS, GCP, Azure with environment variable configuration
📖 Complete API Documentation - Detailed API reference, examples, and deployment guide
## 🌐 Noveum Platform Integration
NovaEval ships with comprehensive integration for the Noveum Platform API: the NoveumClient exposes 26 methods covering traces, datasets, and scorer results through a unified interface, with complete type safety and error handling.
📖 Complete Platform API Documentation - Detailed API reference, examples, and usage patterns
## 🏗️ Architecture
NovaEval is built with extensibility and modularity in mind:
```
src/novaeval/
├── datasets/         # Dataset loaders and processors
├── evaluators/       # Core evaluation logic
├── integrations/     # External service integrations
├── models/           # Model interfaces and adapters
├── noveum_platform/  # Noveum Platform API client and integration
├── reporting/        # Report generation and visualization
├── scorers/          # Scoring mechanisms and metrics
└── utils/            # Utility functions and helpers
```
### Core Components
- Datasets: Standardized interface for loading evaluation datasets
- Models: Unified API for different AI model providers
- Scorers: Pluggable scoring mechanisms for various evaluation metrics
- Evaluators: Orchestrates the evaluation process
- Reporting: Generates comprehensive reports and artifacts
- Integrations: Handles external services (S3, credential stores, etc.)
- Noveum Platform: Complete API client for traces, datasets, and scorer results
## 📊 Supported Datasets
- MMLU: Massive Multitask Language Understanding
- HuggingFace: Any dataset from the HuggingFace Hub
- Custom: JSON, CSV, or programmatic dataset definitions
- Code Evaluation: Programming benchmarks and code generation tasks
- Agent Traces: Multi-turn conversation and agent evaluation
## 🤖 Supported Models
- OpenAI: GPT-3.5, GPT-4, and newer models
- Anthropic: Claude family models
- AWS Bedrock: Amazon's managed AI services
- Noveum AI Gateway: Integration with Noveum's model gateway
- Custom: Extensible interface for any API-based model
## 📏 Built-in Scorers & Metrics
NovaEval provides a comprehensive suite of scorers organized by evaluation domain. All scorers implement the `BaseScorer` interface and support both synchronous and asynchronous evaluation.
### 🎯 Accuracy & Classification Metrics
#### ExactMatchScorer
- Purpose: Performs exact string matching between prediction and ground truth
- Features:
  - Case-sensitive/insensitive matching options
  - Whitespace normalization and stripping
  - Perfect for classification tasks with exact expected outputs
- Use Cases: Multiple choice questions, command validation, exact answer matching
- Configuration: `case_sensitive`, `strip_whitespace`, `normalize_whitespace`
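To make the options concrete, the core comparison can be sketched in a few lines of plain Python (an illustrative sketch of the technique, not NovaEval's implementation; the parameter names simply mirror the configuration keys above):

```python
import re

def exact_match(prediction: str, ground_truth: str,
                case_sensitive: bool = True,
                strip_whitespace: bool = True,
                normalize_whitespace: bool = False) -> bool:
    """Exact string match after the configured normalization steps."""
    if strip_whitespace:
        prediction, ground_truth = prediction.strip(), ground_truth.strip()
    if normalize_whitespace:
        # Collapse internal runs of whitespace into single spaces
        prediction = re.sub(r"\s+", " ", prediction)
        ground_truth = re.sub(r"\s+", " ", ground_truth)
    if not case_sensitive:
        prediction, ground_truth = prediction.lower(), ground_truth.lower()
    return prediction == ground_truth

print(exact_match("  Paris ", "paris", case_sensitive=False))  # True
```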
#### AccuracyScorer
- Purpose: Advanced classification accuracy with answer extraction capabilities
- Features:
  - Intelligent answer extraction from model responses using multiple regex patterns
  - Support for MMLU-style multiple choice questions (A, B, C, D)
  - Letter-to-choice text conversion
  - Robust parsing of various answer formats
- Use Cases: MMLU evaluations, multiple choice tests, classification benchmarks
- Configuration: `extract_answer`, `answer_pattern`, `choices`
#### F1Scorer
- Purpose: Token-level F1 score for partial matching scenarios
- Features:
  - Calculates precision, recall, and F1 score
  - Configurable tokenization (word-level or character-level)
  - Case-sensitive/insensitive options
- Use Cases: Question answering, text summarization, partial credit evaluation
- Returns: Dictionary with `precision`, `recall`, `f1`, and `score` values
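Token-level F1 is precise to state; a word-level version (an illustrative sketch of the metric itself, not NovaEval's code) looks like this:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> dict:
    """Word-level precision/recall/F1, counting shared tokens with multiplicity."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection: each token counts as many times as it appears in both
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

print(token_f1("the cat sat", "the cat sat on the mat"))
```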
### 💬 Conversational AI Metrics
#### KnowledgeRetentionScorer
- Purpose: Evaluates whether the LLM retains information provided by users throughout conversations
- Features:
  - Sophisticated knowledge extraction from conversation history
  - Sliding window approach for relevant context (configurable window size)
  - Detects when the LLM asks for previously provided information
  - Tracks knowledge items with confidence scores
- Use Cases: Chatbots, virtual assistants, multi-turn conversations
- Requirements: LLM model for knowledge extraction, conversation context
#### ConversationRelevancyScorer
- Purpose: Measures response relevance to recent conversation context
- Features:
  - Sliding window context analysis
  - LLM-based relevance assessment (1-5 scale)
  - Context coherence evaluation
  - Conversation flow maintenance tracking
- Use Cases: Dialogue systems, context-aware assistants
- Configuration: `window_size` for context scope
#### ConversationCompletenessScorer
- Purpose: Assesses whether user intentions and requests are fully addressed
- Features:
  - Extracts user intentions from conversation history
  - Evaluates fulfillment level of each intention
  - Comprehensive coverage analysis
  - Outcome-based evaluation
- Use Cases: Customer service bots, task-oriented dialogue systems
#### RoleAdherenceScorer
- Purpose: Evaluates consistency with assigned persona or role
- Features:
  - Role consistency tracking throughout conversations
  - Character maintenance assessment
  - Persona adherence evaluation
  - Customizable role expectations
- Use Cases: Character-based chatbots, role-playing AI, specialized assistants
- Configuration: `expected_role` parameter
#### ConversationalMetricsScorer
- Purpose: Comprehensive conversational evaluation combining multiple metrics
- Features:
  - Combines knowledge retention, relevancy, completeness, and role adherence
  - Configurable metric inclusion/exclusion
  - Weighted aggregation of individual scores
  - Detailed per-metric breakdown
- Use Cases: Holistic conversation quality assessment
- Configuration: Enable/disable individual metrics, window sizes, role expectations
### 🔍 RAG (Retrieval-Augmented Generation) Metrics
#### AnswerRelevancyScorer
- Purpose: Evaluates how relevant answers are to given questions
- Features:
  - Generates questions from answers using an LLM
  - Semantic similarity comparison using embeddings (SentenceTransformers)
  - Multiple question generation for robust evaluation
  - Cosine similarity scoring
- Use Cases: RAG systems, Q&A applications, knowledge bases
- Configuration: `threshold`, `embedding_model`
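The underlying similarity computation is plain cosine similarity between embedding vectors. Conceptually (using toy 2-d vectors in place of real SentenceTransformers embeddings, so this is illustrative only):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Relevancy ≈ average similarity between the original question's embedding
# and the embeddings of questions generated from the answer.
original = [1.0, 0.0]
generated = [[1.0, 0.0], [0.6, 0.8]]
score = sum(cosine_similarity(original, g) for g in generated) / len(generated)
print(round(score, 2))  # 0.8
```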
#### FaithfulnessScorer
- Purpose: Measures whether responses are faithful to the provided context, without hallucinations
- Features:
  - Extracts factual claims from responses
  - Verifies each claim against the source context
  - Three-tier verification: SUPPORTED/PARTIALLY_SUPPORTED/NOT_SUPPORTED
  - Detailed claim-by-claim analysis
- Use Cases: RAG faithfulness, fact-checking, source attribution
- Configuration: `threshold` for pass/fail determination
#### ContextualPrecisionScorer
- Purpose: Evaluates precision of retrieved context relevance
- Features:
  - Splits context into chunks for granular analysis
  - Relevance scoring per chunk (1-5 scale)
  - Intelligent context segmentation
  - Average relevance calculation
- Use Cases: Retrieval system evaluation, context quality assessment
- Requirements: Context must be provided for evaluation
#### ContextualRecallScorer
- Purpose: Measures whether all information necessary for answering is present in the context
- Features:
  - Extracts key information from expected outputs
  - Checks presence of each key fact in the provided context
  - Three-tier presence detection: PRESENT/PARTIALLY_PRESENT/NOT_PRESENT
  - Comprehensive information coverage analysis
- Use Cases: Retrieval completeness, context sufficiency evaluation
- Requirements: Both context and expected output required
#### RAGASScorer
- Purpose: Composite RAGAS methodology combining multiple RAG metrics
- Features:
  - Integrates Answer Relevancy, Faithfulness, Contextual Precision, and Contextual Recall
  - Configurable weighted aggregation
  - Parallel execution of individual metrics
  - Comprehensive RAG pipeline evaluation
- Use Cases: Complete RAG system assessment, benchmark evaluation
- Configuration: Custom weights for each metric component
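The weighted aggregation step is simple to sketch. The component scores below are made-up numbers and the function is illustrative, not NovaEval's implementation; the weights follow the same `{metric_name: weight}` shape used elsewhere in this README:

```python
def weighted_ragas(scores: dict, weights: dict) -> float:
    """Weighted mean of component metric scores.

    Weights are normalized by their sum, so they need not add up to 1.
    """
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total

components = {"faithfulness": 0.9, "answer_relevancy": 0.8, "contextual_precision": 0.6}
weights = {"faithfulness": 0.4, "answer_relevancy": 0.3, "contextual_precision": 0.3}
print(round(weighted_ragas(components, weights), 2))  # 0.78
```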
### 🤖 LLM-as-Judge Metrics
#### GEvalScorer
- Purpose: Uses LLMs with chain-of-thought reasoning for custom evaluation criteria
- Features:
  - Based on the G-Eval research paper methodology
  - Configurable evaluation criteria and steps
  - Chain-of-thought reasoning support
  - Multiple evaluation iterations for consistency
  - Custom score ranges and thresholds
- Use Cases: Custom evaluation criteria, human-aligned assessment, complex judgments
- Configuration: `criteria`, `use_cot`, `num_iterations`, `threshold`
#### CommonGEvalCriteria (Predefined Criteria)
- Correctness: Factual accuracy and completeness assessment
- Relevance: Topic adherence and query alignment evaluation
- Coherence: Logical flow and structural consistency analysis
- Helpfulness: Practical value and actionability assessment
#### PanelOfJudgesScorer
- Purpose: Multi-LLM evaluation with diverse perspectives and aggregation
- Features:
  - Multiple LLM judges with individual weights and specialties
  - Configurable aggregation methods (mean, median, weighted, consensus, etc.)
  - Consensus requirement and threshold controls
  - Parallel judge evaluation for efficiency
  - Detailed individual and aggregate reasoning
- Use Cases: High-stakes evaluation, bias reduction, robust assessment
- Configuration: Judge models, weights, specialties, aggregation method
#### SpecializedPanelScorer (Panel Configurations)
- Diverse Panel: Different models with varied specialties (accuracy, clarity, completeness)
- Consensus Panel: High-consensus requirement for agreement-based decisions
- Weighted Expert Panel: Domain experts with expertise-based weighting
### 🎭 Agent Evaluation Metrics
#### Tool Relevancy Scoring
- Purpose: Evaluates appropriateness of tool calls given available tools
- Features: Compares selected tools against the available tool catalog
- Use Cases: Agent tool selection assessment, action planning evaluation
#### Tool Correctness Scoring
- Purpose: Compares actual tool calls against expected tool calls
- Features: Detailed tool call comparison and correctness assessment
- Use Cases: Agent behavior validation, expected action verification
#### Parameter Correctness Scoring
- Purpose: Evaluates correctness of parameters passed to tool calls
- Features: Parameter validation against tool call results and expectations
- Use Cases: Tool usage quality, parameter selection accuracy
#### Task Progression Scoring
- Purpose: Measures agent progress toward assigned tasks
- Features: Analyzes task completion status and advancement quality
- Use Cases: Agent effectiveness measurement, task completion tracking
#### Context Relevancy Scoring
- Purpose: Assesses response appropriateness given the agent's role and task
- Features: Role-task-response alignment evaluation
- Use Cases: Agent behavior consistency, contextual appropriateness
#### Role Adherence Scoring
- Purpose: Evaluates consistency with the assigned agent role across actions
- Features: Comprehensive role consistency across tool calls and responses
- Use Cases: Agent persona maintenance, role-based behavior validation
#### Goal Achievement Scoring
- Purpose: Measures overall goal accomplishment using complete interaction traces
- Features: End-to-end goal evaluation with G-Eval methodology
- Use Cases: Agent effectiveness assessment, outcome-based evaluation
#### Conversation Coherence Scoring
- Purpose: Evaluates logical flow and context maintenance in agent conversations
- Features: Conversational coherence and context tracking analysis
- Use Cases: Agent dialogue quality, conversation flow assessment
#### AgentScorers (Convenience Class)
- Purpose: Unified interface for all agent evaluation metrics
- Features: A single class providing access to all agent scorers with a consistent LLM model
- Methods: Individual scoring methods plus `score_all()` for comprehensive evaluation
## 🔧 Advanced Features
### BaseScorer Interface
All scorers inherit from `BaseScorer`, which provides:
- Statistics Tracking: Automatic score history and statistics
- Batch Processing: Efficient batch scoring capabilities
- Input Validation: Robust input validation and error handling
- Configuration Support: Flexible configuration from dictionaries
- Metadata Reporting: Detailed scoring metadata and information
### ScoreResult Model
Comprehensive scoring results include:
- Numerical Score: Primary evaluation score
- Pass/Fail Status: Threshold-based binary result
- Detailed Reasoning: Human-readable evaluation explanation
- Rich Metadata: Additional context and scoring details
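For orientation, such a result can be sketched as a simple record type. The field names below are illustrative assumptions, not NovaEval's exact model:

```python
from dataclasses import dataclass, field

@dataclass
class ScoreResult:
    """Illustrative shape of a scoring result (field names are assumptions)."""
    score: float                 # primary numerical score
    passed: bool                 # threshold-based pass/fail
    reasoning: str = ""          # human-readable explanation
    metadata: dict = field(default_factory=dict)  # extra scoring context

result = ScoreResult(score=0.92, passed=True, reasoning="High overlap with reference")
print(result.passed)  # True
```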
## 📊 Usage Examples
```python
# Accuracy scoring with answer extraction
scorer = AccuracyScorer(extract_answer=True)
score = scorer.score("The answer is B", "B")

# Conversational metrics over a multi-turn context
conv_scorer = ConversationalMetricsScorer(
    model=your_llm_model,
    include_knowledge_retention=True,
    include_relevancy=True,
    window_size=10
)
result = await conv_scorer.evaluate(input_text, output_text, context=conv_context)

# Composite RAG evaluation with custom metric weights
ragas = RAGASScorer(
    model=your_llm_model,
    weights={"faithfulness": 0.4, "answer_relevancy": 0.3, "contextual_precision": 0.3}
)
result = await ragas.evaluate(question, answer, context=retrieved_context)

# Panel of judges with diverse specialties
panel = SpecializedPanelScorer.create_diverse_panel(
    models=[model1, model2, model3],
    evaluation_criteria="overall quality and helpfulness"
)
result = await panel.evaluate(input_text, output_text)

# Agent evaluation across all agent metrics
agent_scorers = AgentScorers(model=your_llm_model)
all_scores = agent_scorers.score_all(agent_data)
```
## 🚀 Deployment
### Local Development
```bash
pip install -e ".[dev]"
pytest
python examples/basic_evaluation.py
```
### Docker
```bash
docker build -t nova-eval .
docker run -v $(pwd)/config:/config -v $(pwd)/results:/results nova-eval --config /config/eval.yaml
```
### Kubernetes
```bash
kubectl apply -f kubernetes/
kubectl get pods -l app=nova-eval
```
## 🔧 Configuration
NovaEval supports configuration through:
- YAML/JSON files: Declarative configuration
- Environment variables: Runtime configuration
- Python code: Programmatic configuration
- CLI arguments: Command-line overrides
### Environment Variables
```bash
export NOVA_EVAL_OUTPUT_DIR="./results"
export NOVA_EVAL_LOG_LEVEL="INFO"
export OPENAI_API_KEY="your-api-key"
export AWS_ACCESS_KEY_ID="your-aws-key"
```
### CI/CD Integration
NovaEval includes optimized GitHub Actions workflows:
- Unit tests run on all PRs and pushes for quick feedback
- Integration tests run on main branch only to minimize API costs
- Cross-platform testing on macOS, Linux, and Windows
## 📈 Reporting and Artifacts
NovaEval generates comprehensive evaluation reports:
- Summary Reports: High-level metrics and insights
- Detailed Results: Per-sample predictions and scores
- Visualizations: Charts and graphs for result analysis
- Artifacts: Model outputs, intermediate results, and debug information
- Export Formats: JSON, CSV, HTML, PDF
### Example Report Structure
```
results/
├── summary.json              # High-level metrics
├── detailed_results.csv      # Per-sample results
├── artifacts/
│   ├── model_outputs/        # Raw model responses
│   ├── intermediate/         # Processing artifacts
│   └── debug/                # Debug information
├── visualizations/
│   ├── accuracy_by_category.png
│   ├── score_distribution.png
│   └── confusion_matrix.png
└── report.html               # Interactive HTML report
```
## 🔌 Extending NovaEval
### Custom Datasets
```python
from novaeval.datasets import BaseDataset

class MyCustomDataset(BaseDataset):
    def load_data(self):
        # Load and return all samples for this dataset
        return samples

    def get_sample(self, index):
        # Return the sample at the given index
        return sample
```
### Custom Scorers
```python
from novaeval.scorers import BaseScorer

class MyCustomScorer(BaseScorer):
    def score(self, prediction, ground_truth, context=None):
        # Compare the prediction against the ground truth and return a score
        return score
```
### Custom Models
```python
from novaeval.models import BaseModel

class MyCustomModel(BaseModel):
    def generate(self, prompt, **kwargs):
        # Call your model or API and return the generated text
        return response
```
## 🤝 Contributing
We welcome contributions! NovaEval is actively seeking contributors to help build a robust AI evaluation framework. Please see our Contributing Guide for detailed guidelines.
### 🎯 Priority Contribution Areas
As mentioned in the We Need Your Help section, we're particularly looking for help with:
- Unit Tests - Expand test coverage beyond the current 23%
- Examples - Real-world evaluation scenarios and use cases
- Guides & Notebooks - Interactive evaluation tutorials
- Documentation - API docs, user guides, and tutorials
- RAG Metrics - Specialized metrics for retrieval-augmented generation
- Agent Evaluation - Frameworks for multi-turn and agent-based evaluations
### Development Setup
```bash
git clone https://github.com/Noveum/NovaEval.git
cd NovaEval

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate

# Install dev dependencies and pre-commit hooks
pip install -e ".[dev]"
pre-commit install

# Run the test suite (optionally with coverage)
pytest
pytest --cov=src/novaeval --cov-report=html
```
### 🏗️ Contribution Workflow
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes following our coding standards
4. Add tests for your changes
5. Commit your changes (`git commit -m 'Add amazing feature'`)
6. Push to the branch (`git push origin feature/amazing-feature`)
7. Open a Pull Request
### 📋 Contribution Guidelines
- Code Quality: Follow PEP 8 and use the provided pre-commit hooks
- Testing: Add unit tests for new features and bug fixes
- Documentation: Update documentation for API changes
- Commit Messages: Use conventional commit format
- Issues: Reference relevant issues in your PR description
### 🎉 Recognition
Contributors will be:
- Listed in our contributors page
- Mentioned in release notes for significant contributions
- Invited to join our contributor Discord community
## 📄 License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
## 🙏 Acknowledgments
- Inspired by evaluation frameworks like DeepEval, Confident AI, and Braintrust
- Built with modern Python best practices and industry standards
- Designed for the AI evaluation community
## 📞 Support
Made with ❤️ by the Noveum.ai team