Bundle of Perceval backends for Weblate.
LangEvals boilerplate example evaluator for LLMs.
LangEvals OpenAI moderation evaluator for LLM outputs.
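The LangEvals wrapper sits on top of OpenAI's moderation endpoint; as a hedged sketch of the underlying call with the official openai SDK (the model name and response handling here are assumptions to illustrate the flow, not the LangEvals evaluator API):

    # Sketch: calling OpenAI's moderation endpoint directly with the openai SDK.
    # Illustrates the underlying service, not the LangEvals evaluator interface.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    response = client.moderations.create(
        model="omni-moderation-latest",  # assumed model name; check current docs
        input="Some LLM output to screen for unsafe content.",
    )

    result = response.results[0]
    print("flagged:", result.flagged)
    print("categories:", result.categories)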
Python evaluator for jQuery-QueryBuilder rules
Retrieve and evaluate with any (X) models
LangEvals Azure Content Safety evaluator for LLM outputs.
Evaluate async code from synchronous code
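As a generic illustration of that pattern (standard-library only, not this particular package's API), see the sketch below of driving a coroutine from synchronous code:

    # Generic sketch: evaluating async code from a sync caller with asyncio.
    # Not tied to any specific package.
    import asyncio

    async def fetch_answer() -> str:
        await asyncio.sleep(0.1)  # stand-in for real async work
        return "42"

    def sync_entry_point() -> str:
        # asyncio.run creates an event loop, runs the coroutine, and closes the loop.
        return asyncio.run(fetch_answer())

    print(sync_entry_point())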
To the moon!
Amazon Foundation Model Evaluations
Easily benchmark Machine Learning models on selected tasks and datasets
A package for evaluating the performance of language models with Prometheus
Evaluation metric codes for various vision tasks.
LangEvals lingua evaluator for language detection.
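This evaluator wraps the lingua language detector; a minimal sketch of the underlying detection call, following lingua-py's documented builder interface (treated here as an assumption, and distinct from the LangEvals evaluator API):

    # Sketch of language detection with lingua-py, which the LangEvals
    # evaluator builds on; not the LangEvals evaluator API itself.
    from lingua import Language, LanguageDetectorBuilder

    detector = (
        LanguageDetectorBuilder
        .from_languages(Language.ENGLISH, Language.FRENCH, Language.GERMAN)
        .build()
    )

    print(detector.detect_language_of("Les modèles de langage sont utiles."))
    # Expected: Language.FRENCH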
An evaluation package for LLM inputs and outputs
Query metadata from sdists / bdists / installed packages. A safer fork of pkginfo that avoids arbitrary imports and eval()
Evaluates global-level Ellipsis into useful code.
This package is written for the evaluation of speech super-resolution algorithms.
A simple, safe single expression evaluator library.
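To show the general technique such a library relies on (an illustrative sketch of AST whitelisting, not the package's actual implementation), consider:

    # Illustrative sketch: safe single-expression evaluation via an AST whitelist.
    # Not the implementation of the package described above.
    import ast
    import operator

    _BIN_OPS = {
        ast.Add: operator.add,
        ast.Sub: operator.sub,
        ast.Mult: operator.mul,
        ast.Div: operator.truediv,
    }

    def safe_eval(expression: str) -> float:
        tree = ast.parse(expression, mode="eval")

        def _eval(node: ast.AST) -> float:
            if isinstance(node, ast.Expression):
                return _eval(node.body)
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
                return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
            raise ValueError("disallowed expression element")

        return _eval(tree)

    print(safe_eval("2 + 3 * 4"))  # 14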
Reco evaluation tool
An evaluation abstraction for Keras models.
In-loop evaluation tasks for language modeling
Evaluate expressions
Library for validation and formatting of Brazilian data
A package for benchmarking time series machine learning tools.
Python code evaluation system and submissions server capable of unit tests, tracing, and AST inspection. Server can run on Python 2.7 but evaluation requires 3.7+.
ranx: A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion
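A minimal usage sketch following ranx's documented Qrels/Run/evaluate interface (exact metric names and signatures should be checked against the current documentation):

    # Sketch of ranking evaluation with ranx; names follow its documented
    # Qrels/Run/evaluate interface and may change between versions.
    from ranx import Qrels, Run, evaluate

    qrels = Qrels({"q_1": {"doc_a": 1}})
    run = Run({"q_1": {"doc_a": 0.9, "doc_b": 0.6, "doc_c": 0.3}})

    scores = evaluate(qrels, run, metrics=["ndcg@3", "mrr"])
    print(scores)  # e.g. {"ndcg@3": ..., "mrr": ...}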
PICAI Evaluation
A package for COIR evaluations
Evaluation framework for DataBench
A simple tool to automate assessment processes.
Visualize OpenAI evals with Zeno
Evaluate shell commands or Python code in Sphinx and MyST
A small example package
Finetune_Eval_Harness
Automatic lyrics transcription evaluation toolkit
(Threshold-Independent) Evaluation of Sound Event Detection Scores
Package for Evaluation of Synthetic Tabular Data Quality
Evaluation and benchmarking for Generative AI
xaif package
A redis semaphore implementation using eval scripts
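As an illustration of the eval-script approach in general (a sketch only, not this package's actual script or interface), a counting-semaphore acquire can be expressed as an atomic Lua script executed through redis-py:

    # Sketch: counting-semaphore "acquire" as an atomic Redis Lua script,
    # executed via redis-py; illustrative, not the package's own script.
    import redis

    ACQUIRE_LUA = """
    local current = tonumber(redis.call('GET', KEYS[1]) or '0')
    local limit = tonumber(ARGV[1])
    if current < limit then
        redis.call('INCR', KEYS[1])
        return 1
    end
    return 0
    """

    r = redis.Redis()
    acquire = r.register_script(ACQUIRE_LUA)

    if acquire(keys=["my_semaphore"], args=[3]):  # allow at most 3 holders
        try:
            pass  # protected work goes here
        finally:
            r.decr("my_semaphore")  # release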
Basic KGE and NSE evaluation for the NGIAB project
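For reference, both scores reduce to short formulas; a generic NumPy sketch of Kling-Gupta Efficiency (KGE) and Nash-Sutcliffe Efficiency (NSE), not the NGIAB package's own code:

    # Generic NumPy sketch of KGE and NSE; not the NGIAB package's code.
    import numpy as np

    def nse(sim: np.ndarray, obs: np.ndarray) -> float:
        return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

    def kge(sim: np.ndarray, obs: np.ndarray) -> float:
        r = np.corrcoef(sim, obs)[0, 1]   # linear correlation
        alpha = sim.std() / obs.std()     # variability ratio
        beta = sim.mean() / obs.mean()    # bias ratio
        return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

    obs = np.array([1.0, 2.0, 3.0, 4.0])
    sim = np.array([1.1, 1.9, 3.2, 3.8])
    print(nse(sim, obs), kge(sim, obs))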
A tool to quantify the replicability and reproducibility of system-oriented IR experiments.
An API for using metric models (either provided by default or fine-tuned yourself) to evaluate LLMs.
Evaluate Digitalization Data