
Evaluate and improve your Retrieval-Augmented Generation (RAG) pipelines with open-rag-eval, an open-source Python evaluation toolkit.
Evaluating RAG quality can be complex. open-rag-eval provides a flexible and extensible framework to measure the performance of your RAG system, helping you identify areas for improvement. Its modular design allows easy integration of custom metrics and connectors for various RAG implementations.
Importantly, open-rag-eval's metrics do not require golden chunks or golden answers, making RAG evaluation easy and scalable. This is achieved by using UMBRELA and AutoNuggetizer, techniques originating from research in Jimmy Lin's lab at UWaterloo.
Out of the box, the toolkit includes connectors for Vectara, LlamaIndex, and LangChain, the TRECRAG evaluator, and metrics based on UMBRELA and AutoNuggetizer.
This guide walks you through an end-to-end evaluation using the toolkit. We'll use Vectara as the example RAG platform and the TRECRAG evaluator.
You will also need an OpenAI API key available in your environment:
export OPENAI_API_KEY='your-api-key'
To build the library from source, which is the recommended method if you want to follow the sample instructions below, run:
$ git clone https://github.com/vectara/open-rag-eval.git
$ cd open-rag-eval
$ pip install -e .
If you want to install directly from pip, which is the more common method when using the library in your own pipeline rather than running the samples, run:
pip install open-rag-eval
After installing the library, you can follow the instructions below to run a sample evaluation and test the library end to end.
Create a CSV file that contains the queries (for example queries.csv), with a single column named query, where each row is a query you want to test against your RAG system.
Example queries file:
query
What is a blackhole?
How big is the sun?
How many moons does jupiter have?
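If you prefer to generate the queries file programmatically, here is a minimal sketch using Python's csv module, writing the same example queries shown above:

import csv

# The same example queries as above.
queries = [
    "What is a blackhole?",
    "How big is the sun?",
    "How many moons does jupiter have?",
]

# The toolkit expects a single column named "query".
with open("queries.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["query"])
    writer.writerows([q] for q in queries)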
Edit the eval_config_vectara.yaml file. This file controls the evaluation process, including connector options, evaluator choices, and metric settings.
Set input_queries to point to your queries file, and fill in the correct values for generated_answers, eval_results_file, and results_folder. Then update the connector section (under options/query_config) with your Vectara corpus_key. In addition, make sure you have VECTARA_API_KEY and OPENAI_API_KEY available in your environment. For example:
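export VECTARA_API_KEY='your-vectara-api-key'
export OPENAI_API_KEY='your-api-key'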
With everything configured, it's time to run the evaluation! Run the following command from the root folder of open-rag-eval:
python open_rag_eval/run_eval.py --config config_examples/eval_config_vectara.yaml
You should see the evaluation progress on your command line. Once it's done, detailed results will be saved to a local CSV file (the file listed under eval_results_file), where you can see the score assigned to each sample along with intermediate output useful for debugging and explainability.
Note that a local plot for each evaluation is also stored in the output folder, under the filename listed as metrics_file.
You can use the plot_results.py script to plot results from your eval runs. Multiple runs can be plotted on the same plot, allowing for easy comparison of different configurations or RAG providers:
To plot one result:
python open_rag_eval/plot_results.py results.csv
Or to plot multiple results:
python open_rag_eval/plot_results.py results_1.csv results_2.csv results_3.csv
By default the run_eval.py script will plot metrics and save them to the results folder.
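If you prefer to inspect the results programmatically rather than with the plotting script, here is a minimal sketch using pandas; the exact column names depend on the evaluator and metrics you configured:

import pandas as pd

# Load the per-sample results written by run_eval.py (the path is whatever
# you configured as eval_results_file; results.csv is used here as a stand-in).
results = pd.read_csv("results.csv")

# Inspect which score and intermediate-output columns were produced.
print(results.columns.tolist())

# Summary statistics for the numeric metric columns.
print(results.describe())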
If you are using RAG outputs from your own pipeline, make sure to put your RAG output in a format that is readable by the toolkit (see data/test_csv_connector.csv as an example).
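As a rough sketch of that format, the snippet below writes one row per retrieved passage using the fields described in the connectors section further down; the exact column names and ordering are assumptions, so verify them against data/test_csv_connector.csv:

import csv

# Hypothetical output from your own RAG pipeline for a single query.
query_id, query = "q1", "What is a blackhole?"
passages = ["A black hole is a region of spacetime ...", "The event horizon marks ..."]
answer = "A black hole is a region where gravity is so strong that nothing escapes [1]."

# One row per retrieved passage; column names follow the fields listed in the
# connectors section below -- check data/test_csv_connector.csv for the exact layout.
with open("my_rag_results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["query_id", "query", "passage_id", "passage", "generated_answer"])
    for passage_id, passage in enumerate(passages, start=1):
        writer.writerow([query_id, query, str(passage_id), passage, answer])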
Copy vectara_eval_config.yaml to xxx_eval_config.yaml (where xxx is the name of your RAG pipeline) as follows:
Make sure input_queries, results_folder, generated_answers, and eval_results_file are properly configured. Specifically, the generated answers need to exist in the results folder.
With everything configured, run the evaluation with the following command:
python open_rag_eval/run_eval.py --config xxx_eval_config.yaml
You should see the evaluation progress on your command line. Once it's done, detailed results will be saved to a local CSV file where you can see the score assigned to each sample along with intermediate output useful for debugging and explainability.
You can use the open_rag_eval/plot_results.py script to plot results from your eval runs. Multiple runs can be plotted on the same plot, allowing for easy comparison of different configurations or RAG providers. For example, if the evaluation results from two runs are saved in open_eval_results_1.csv and open_eval_results_2.csv, you can plot both of them as follows:
python open_rag_eval/plot_results.py open_eval_results_1.csv open_eval_results_2.csv
The visualization in the steps above shows you the aggregated metrics across one or more runs of the evaluation on several queries. If you want to dive deeper into the results, we have a results viewer that enables easy viewing of the produced metrics CSV, where you can look at the intermediate results and a detailed breakdown of scores and metrics on a per-query basis. To do this:
cd open_rag_eval/viz/
streamlit run visualize.py
Note that you will need to have streamlit installed in your environment (which should be the case if you've installed open-rag-eval). Once you upload your evaluation results CSV (results.csv by default), you can select a query to view detailed metrics, such as the nuggets produced by the AutoNuggetizer, the UMBRELA scores assigned to each retrieved result, and so on.
The open-rag-eval framework follows these general steps during an evaluation: it reads the input queries (or, if generated answers already exist, loads them from the file specified as input_results), uses the configured connector to collect retrieved passages and generated answers from your RAG system as RAGResults, and then runs the configured Evaluator and its Metrics over each result. The output is the RAGResults plus the scores assigned by the Evaluator and its Metrics. These are typically collected and saved to the output report file.
For programmatic integration, the framework provides a Flask-based web server.
Endpoints:
/api/v1/evaluate: Evaluate a single RAG output provided in the request body.
/api/v1/evaluate_batch: Evaluate multiple RAG outputs in a single request.
Run the server:
python open_rag_eval/run_server.py
See the API README for detailed documentation.
Open-RAG-Eval uses a plug-in connector architecture to enable testing various RAG platforms. Out of the box it includes connectors for Vectara, LlamaIndex and Langchain.
Here's how connectors work:
Connectors inherit from the Connector class and need to define the fetch_data method.
The toolkit provides read_queries, which is helpful for reading the input queries.
In fetch_data you simply go through all the queries, one by one, and call the RAG system with that query.
The responses are written to the results file, with N rows per query, where N is the number of passages (or chunks), including these fields:
query_id: a unique ID for the query
query: the actual query text string
passage: the passage (aka chunk)
passage_id: a unique ID for this passage (you can use just the passage number as a string)
generated_answer: text of the generated response or answer from your RAG pipeline, including citations in [N] format.
See data/test_csv_connector.csv for an example results file.
All three existing connectors (Vectara, LangChain, and LlamaIndex) provide a good reference for how to implement a connector.
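As a starting point, here is a rough sketch of what a custom connector might look like based on the description above; the import path, constructor arguments, and method signatures are assumptions, so check the existing connectors for the actual base-class API:

import csv

from open_rag_eval.connectors.connector import Connector  # import path is an assumption


def my_rag_system(query):
    # Placeholder for your own RAG pipeline: return retrieved passages and a
    # generated answer with citations in [N] format.
    return ["<retrieved passage 1>", "<retrieved passage 2>"], "<generated answer [1]>"


class MyRAGConnector(Connector):
    def __init__(self, queries_file, results_file):
        self.queries_file = queries_file
        self.results_file = results_file

    def fetch_data(self):
        # read_queries is described above as a helper for reading the input
        # queries; its exact location (base class vs. utility module) may differ.
        queries = self.read_queries(self.queries_file)

        # Write one row per retrieved passage, as described above.
        with open(self.results_file, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["query_id", "query", "passage_id", "passage", "generated_answer"])
            for query_id, query in enumerate(queries, start=1):
                passages, answer = my_rag_system(query)
                for passage_id, passage in enumerate(passages, start=1):
                    writer.writerow([str(query_id), query, str(passage_id), passage, answer])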
👤 Vectara
Contributions, issues and feature requests are welcome and appreciated!
Feel free to check the issues page. You can also take a look at the contributing guide.
Give a ⭐️ if this project helped you!
Copyright © 2025 Vectara.
This project is Apache 2.0 licensed.