
Security News
NVD Concedes Inability to Keep Pace with Surging CVE Disclosures in 2025
Security experts warn that recent classification changes obscure the true scope of the NVD backlog as CVE volume hits all-time highs.
ARES is an advanced evaluation framework for Retrieval-Augmented Generation (RAG) systems,
Table of Contents: Installation | Requirements | Quick Start | Citation
ARES is a groundbreaking framework for evaluating Retrieval-Augmented Generation (RAG) models. The automated process combines synthetic data generation with fine-tuned classifiers to efficiently assess context relevance, answer faithfulness, and answer relevance, minimizing the need for extensive human annotations. ARES employs synthetic query generation and Prediction-Powered Inference (PPI), providing accurate evaluations with statistical confidence.
What does ARES assess in RAG models?
ARES conducts a comprehensive evaluation of Retrieval-Augmented Generation (RAG) models, assessing the systems for context relevance, answer faithfulness, and answer relevance. This thorough assessment ensures a complete understanding of the performance of the RAG system.
How does ARES automate the evaluation process?
ARES minimizes the need for human labeling by leveraging fine-tuned classifiers and synthetic data. Its PPI component, Prediction-Powered inference, refines evaluations considering model response variability and provides statistical confidence in the results. By using fine-tuned classifiers and synthetically generated data, ARES cuts down on human labeling needs while providing accurate assessments.
Can ARES handle my custom RAG model?
Yes, ARES is a model-agnostic tool that enables you to generate synthetic queries and answers from your documents. With ARES, you can evaluate these generated queries and answers from your RAG model.
pip install ares-ai
Optional: Initalize OpenAI or TogetherAI API key with the following command:
export OPENAI_API_KEY=<your key here>
export TOGETHER_API_KEY=<your key here>
To implement ARES for scoring your RAG system and comparing to other RAG configurations, you need three components:
To get started with ARES, you'll need to set up your configuration. Below is an example of a configuration for ARES!
Copy-paste each step to see ARES in action!
Use the following command to quickly obtain the necessary files for getting started! This includes the 'few_shot_prompt' file for judge scoring and synthetic query generation, as well as both labeled and unlabeled datasets.
wget https://raw.githubusercontent.com/stanford-futuredata/ARES/main/datasets/example_files/nq_few_shot_prompt_for_judge_scoring.tsv
wget https://raw.githubusercontent.com/stanford-futuredata/ARES/main/datasets/example_files/nq_few_shot_prompt_for_synthetic_query_generation.tsv
wget https://raw.githubusercontent.com/stanford-futuredata/ARES/main/datasets/example_files/nq_labeled_output.tsv
wget https://raw.githubusercontent.com/stanford-futuredata/ARES/main/datasets/example_files/nq_unlabeled_output.tsv
OPTIONAL: You can run the following command to get the full NQ dataset! (347 MB)
from ares import ARES
ares = ARES()
ares.KILT_dataset("nq")
# Fetches NQ datasets with ratios including 0.5, 0.6, 0.7, etc.
# For purposes of our quick start guide, we rename nq_ratio_0.5 to nq_unlabeled_output and nq_labeled_output.
To get started with ARES's PPI, you'll need to set up your configuration. Below is an example of a configuration for ARES!
Just copy-paste as you go to see ARES in action!
from ares import ARES
ues_idp_config = {
"in_domain_prompts_dataset": "nq_few_shot_prompt_for_judge_scoring.tsv",
"unlabeled_evaluation_set": "nq_unlabeled_output.tsv",
"model_choice" : "gpt-3.5-turbo-0125"
}
ares = ARES(ues_idp=ues_idp_config)
results = ares.ues_idp()
print(results)
# {'Context Relevance Scores': [Score], 'Answer Faithfulness Scores': [Score], 'Answer Relevance Scores': [Score]}
ppi_config = {
"evaluation_datasets": ['nq_unlabeled_output.tsv'],
"few_shot_examples_filepath": "nq_few_shot_prompt_for_judge_scoring.tsv",
"llm_judge": "gpt-3.5-turbo-1106",
"labels": ["Context_Relevance_Label"],
"gold_label_path": "nq_labeled_output.tsv",
}
ares = ARES(ppi=ppi_config)
results = ares.evaluate_RAG()
print(results)
from ares import ARES
ues_idp_config = {
"in_domain_prompts_dataset": "nq_few_shot_prompt_for_judge_scoring.tsv",
"unlabeled_evaluation_set": "nq_unlabeled_output.tsv",
"model_choice" : "gpt-3.5-turbo-0125"
}
ares = ARES(ues_idp=ues_idp_config)
results = ares.ues_idp()
print(results)
# {'Context Relevance Scores': [Score], 'Answer Faithfulness Scores': [Score], 'Answer Relevance Scores': [Score]}
from ares import ARES
synth_config = {
"document_filepaths": ["nq_labeled_output.tsv"] ,
"few_shot_prompt_filename": "nq_few_shot_prompt_for_synthetic_query_generation.tsv",
"synthetic_queries_filenames": ["synthetic_queries_1.tsv"],
"documents_sampled": 6189
}
ares_module = ARES(synthetic_query_generator=synth_config)
results = ares_module.generate_synthetic_data()
print(results)
from ares import ARES
classifier_config = {
"training_dataset": ["synthetic_queries_1.tsv"],
"validation_set": ["nq_labeled_output.tsv"],
"label_column": ["Context_Relevance_Label"],
"num_epochs": 10,
"patience_value": 3,
"learning_rate": 5e-6,
"assigned_batch_size": 1,
"gradient_accumulation_multiplier": 32,
}
ares = ARES(classifier_model=classifier_config)
results = ares.train_classifier()
print(results)
Note: This code creates a checkpoint for the trained classifier. Training may take some time. You can download our jointly trained checkpoint on context relevance here!: Download Checkpoint
from ares import ARES
ppi_config = {
"evaluation_datasets": ['nq_unlabeled_output.tsv'],
"checkpoints": ["Context_Relevance_Label_nq_labeled_output_date_time.pt"],
"rag_type": "question_answering",
"labels": ["Context_Relevance_Label"],
"gold_label_path": "nq_labeled_output.tsv",
}
ares = ARES(ppi=ppi_config)
results = ares.evaluate_RAG()
print(results)
# Output Should be:
"""
Context_Relevance_Label Scoring
ARES Ranking
ARES Prediction: [0.6056978059262574]
ARES Confidence Interval: [[0.547, 0.664]]
Number of Examples in Evaluation Set: [4421]
Ground Truth Performance: [0.6]
ARES LLM Judge Accuracy on Ground Truth Labels: [0.789]
Annotated Examples used for PPI: 300
"""
ARES supports vLLM, allowing for local execution of LLM models, offering enhanced privacy and the ability to operate ARES offline. Below are steps to vLLM for ARES's UES/IDP and PPI!
from ares import ARES
ues_idp_config = {
"in_domain_prompts_dataset": "nq_few_shot_prompt_for_judge_scoring.tsv",
"unlabeled_evaluation_set": "nq_unlabeled_output.tsv",
"model_choice": "meta-llama/Llama-2-13b-hf", # Specify vLLM model
"vllm": True, # Toggle vLLM to True
"host_url": "http://0.0.0.0:8000/v1" # Replace with server hosting model followed by "/v1"
}
ares = ARES(ues_idp=ues_idp_config)
results = ares.ues_idp()
print(results)
from ares import ARES
ppi_config = {
"evaluation_datasets": ['nq_unabeled_output.tsv'],
"few_shot_examples_filepath": "nq_few_shot_prompt_for_judge_scoring.tsv",
"llm_judge": "meta-llama/Llama-2-13b-hf", # Specify vLLM model
"labels": ["Context_Relevance_Label"],
"gold_label_path": "nq_labeled_output.tsv",
"vllm": True, # Toggle vLLM to True
"host_url": "http://0.0.0.0:8000/v1" # Replace with server hosting model followed by "/v1"
}
ares = ARES(ppi=ppi_config)
results = ares.evaluate_RAG()
print(results)
For more details, refer to our documentation.
We include synthetic datasets for key experimental results in synthetic_datasets
. The few-shot prompts used for generation and evaluation are included in datasets
. We also include instructions for fine-tuning LLM judges in the paper itself. Please reach out to jonsaadfalcon@stanford.edu or manihani@stanford.edu if you have any further questions.
To cite our work, please use the following Bibtex:
@misc{saadfalcon2023ares,
title={ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems},
author={Jon Saad-Falcon and Omar Khattab and Christopher Potts and Matei Zaharia},
year={2023},
eprint={2311.09476},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Machine requirements
Standard_NC24ads_A100_v4
on Azure)Standard_NC6s_v3
and Standard_NC12s_v3
, and ran into this error: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 160.00 MiB (GPU 0; 15.77 GiB total capacity; 15.12 GiB already allocated; 95.44 MiB free; 15.12 GiB reserved in total by PyTorch)
Machine setup
For example, on an Azure VM running Linux (ubuntu 20.04), you will need to do the following:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh -b
export PATH="~/miniconda3/bin:$PATH"
conda init
sudo apt-get -y update
sudo apt-get -y upgrade
sudo apt-get -y install build-essential
sudo apt-get -y install libpcre3-dev
sudo apt install ubuntu-drivers-common -y
sudo ubuntu-drivers autoinstall
sudo reboot
nvidia-smi
cd
to ARES folder and follow the rest of the READMEFAQs
ARES is an advanced evaluation framework for Retrieval-Augmented Generation (RAG) systems,
We found that ares-ai demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Security News
Security experts warn that recent classification changes obscure the true scope of the NVD backlog as CVE volume hits all-time highs.
Security Fundamentals
Attackers use obfuscation to hide malware in open source packages. Learn how to spot these techniques across npm, PyPI, Maven, and more.
Security News
Join Socket for exclusive networking events, rooftop gatherings, and one-on-one meetings during BSidesSF and RSA 2025 in San Francisco.