vidore-benchmark

Vision Document Retrieval (ViDoRe): Benchmark. Evaluation code for the ColPali paper.

5.0.0

PyPI

Maintainers: 2

Vision Document Retrieval (ViDoRe): Benchmark 👀

[Model card] [ViDoRe Leaderboard] [Demo] [Blog Post]

Approach

The Visual Document Retrieval Benchmark (ViDoRe) was introduced to evaluate the performance of document retrieval systems on visually rich documents across various tasks, domains, languages, and settings. It was used to evaluate the ColPali model, a VLM-powered retriever that efficiently retrieves documents based on their visual content and textual queries using a late-interaction mechanism.
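
As a rough illustration of late interaction, the sketch below scores a query against a page with a ColBERT-style MaxSim operator: each query token embedding is matched with its most similar page patch embedding, and the maxima are summed. It uses plain Python lists and tiny made-up vectors. `maxsim_score` is an illustrative name, not part of this package, and real models work on batched tensors.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], page_patches: List[Vector]) -> float:
    """Sum, over query tokens, of the best dot product against any page patch."""
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)


# Toy 2-dimensional embeddings (made up for illustration).
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
page_b = [[0.5, 0.5], [0.2, 0.1]]

scores = {
    "page_a": maxsim_score(query, page_a),
    "page_b": maxsim_score(query, page_b),
}
best_page = max(scores, key=scores.get)  # page with the highest late-interaction score
print(scores, best_page)
```

Because every query token keeps its own embedding until scoring time, a page is rewarded when each part of the query finds a matching patch somewhere on the page.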

Usage

This package comes with a Python API and a CLI to evaluate your own retriever on the ViDoRe benchmark. Both are compatible with Python>=3.9.

CLI mode

pip install vidore-benchmark

To keep this package lightweight, only the essential packages are installed by default. You must therefore specify the dependency groups for the models you want to evaluate with the CLI (see the list in pyproject.toml). For instance, if you are going to evaluate the ColVision models (e.g. ColPali, ColQwen2, ColSmol, ...), you should run:

pip install "vidore-benchmark[colpali-engine]"

[!WARNING] If possible, do not pip install colpali-engine directly in the environment dedicated to the CLI.

In particular, make sure not to install both vidore-benchmark[colpali-engine] and colpali-engine[train] simultaneously, as this will lead to a circular dependency conflict.

If you want to install all the dependencies for all the models, you can run:

pip install "vidore-benchmark[all-retrievers]"

Note that in order to use BM25Retriever, you will need to download the nltk resources too:

pip install "vidore-benchmark[bm25]"
python -m nltk.downloader punkt punkt_tab stopwords

Library mode

Install the base package using pip:

pip install vidore-benchmark

Command-line usage

Evaluate a retriever on ViDoRe

You can evaluate any off-the-shelf retriever on the ViDoRe benchmark. For instance, you can evaluate the ColPali model on the ViDoRe benchmark to reproduce the results from our paper.

vidore-benchmark evaluate-retriever \
    --model-class colpali \
    --model-name vidore/colpali-v1.3 \
    --collection-name vidore/vidore-benchmark-667173f98e70a1c0fa4db00d \
    --dataset-format qa \
    --split test

Alternatively, you can evaluate your model on a single dataset. If your retriever uses visual embeddings, you can use any dataset path from the ViDoRe Benchmark collection, e.g.:

vidore-benchmark evaluate-retriever \
    --model-class colpali \
    --model-name vidore/colpali-v1.3 \
    --dataset-name vidore/docvqa_test_subsampled \
    --dataset-format qa \
    --split test

If you want to evaluate a retriever that relies on pure-text retrieval (no visual embeddings), you should use the datasets from the ViDoRe Chunk OCR (baseline) collection instead:

vidore-benchmark evaluate-retriever \
    --model-class bge-m3 \
    --model-name BAAI/bge-m3 \
    --dataset-name vidore/docvqa_test_subsampled_tesseract \
    --dataset-format qa \
    --split test

Each of the above commands generates a JSON file at outputs/{model_id}_metrics.json. Follow the instructions on the ViDoRe Leaderboard to learn how to publish your results on the leaderboard too!
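
To inspect the generated files programmatically, a minimal sketch along these lines may help. It only relies on the `outputs/{model_id}_metrics.json` naming pattern mentioned above and makes no assumption about the JSON schema. `load_metric_files` is a hypothetical helper, not part of the package.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_metric_files(output_dir: str = "outputs") -> Dict[str, Any]:
    """Map each model id to the parsed content of its `<model_id>_metrics.json` file."""
    suffix = "_metrics.json"
    directory = Path(output_dir)
    results: Dict[str, Any] = {}
    if not directory.is_dir():
        # Nothing has been evaluated yet (or the directory lives elsewhere).
        return results
    for path in sorted(directory.glob(f"*{suffix}")):
        model_id = path.name[: -len(suffix)]  # strip the fixed file-name suffix
        with path.open(encoding="utf-8") as f:
            results[model_id] = json.load(f)
    return results
```

Calling `load_metric_files()` from the directory where the CLI was run returns one entry per evaluated model, which you can then explore with the usual dictionary tools.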

[!NOTE] The vidore-benchmark package supports two formats of datasets:

QA: The dataset is formatted as a question-answering task, where the queries are questions and the passages are the image pages that provide the answers.

BEIR: Following the BEIR paper, the dataset is split into 3 sub-datasets: corpus, queries, and qrels. The corpus contains the documents, queries contains the queries, and qrels contains the relevance scores between the queries and the documents.

In the first iteration of the ViDoRe benchmark, we arbitrarily chose to deduplicate the queries for the QA datasets. While this made sense given our data generation process, it wasn't suited to our ViDoRe benchmark v2, which aims to be broader and multilingual. We will release the ViDoRe benchmark v2 soon.

Dataset	Dataset format	Deduplicate queries
ViDoRe benchmark v1	QA	✅
ViDoRe benchmark v2 (harder/multilingual, not released yet)	BEIR	❌
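
To make the difference between the two formats concrete, here is a small sketch that turns QA-style rows into a BEIR-style corpus/queries/qrels triple, with identical question strings sharing a single query id. The field names (`query`, `page`), the id scheme, and `qa_to_beir` are hypothetical and chosen for illustration only. This is one plausible reading of query deduplication, not the package's own conversion code.

```python
from typing import Dict, List, Tuple

# Hypothetical QA-style rows: each row pairs a question with the page that answers it.
qa_rows: List[Dict[str, str]] = [
    {"query": "What is the total revenue?", "page": "report_p3.png"},
    {"query": "Who signed the contract?", "page": "contract_p1.png"},
    {"query": "What is the total revenue?", "page": "report_p7.png"},  # repeated question
]

Corpus = Dict[str, str]            # doc id -> document (here, a page file name)
Queries = Dict[str, str]           # query id -> query text
Qrels = Dict[str, Dict[str, int]]  # query id -> {doc id: relevance score}


def qa_to_beir(rows: List[Dict[str, str]]) -> Tuple[Corpus, Queries, Qrels]:
    """Build corpus, queries and qrels, merging rows that share the same question."""
    corpus: Corpus = {}
    queries: Queries = {}
    qrels: Qrels = {}
    doc_ids: Dict[str, str] = {}
    query_ids: Dict[str, str] = {}
    for row in rows:
        # Reuse the existing id when the page or question was already seen.
        doc_id = doc_ids.setdefault(row["page"], f"d{len(doc_ids)}")
        query_id = query_ids.setdefault(row["query"], f"q{len(query_ids)}")
        corpus[doc_id] = row["page"]
        queries[query_id] = row["query"]
        qrels.setdefault(query_id, {})[doc_id] = 1
    return corpus, queries, qrels


corpus, queries, qrels = qa_to_beir(qa_rows)
print(queries)
print(qrels)
```

In this toy example the repeated question ends up as a single query with two relevant pages, which is the kind of many-to-many relevance that the qrels structure can express directly.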

Documentation

To have more control over the evaluation process (e.g. the batch size used at inference), read the CLI documentation using:

vidore-benchmark evaluate-retriever --help

In particular, feel free to play with the --batch-query, --batch-passage, --batch-score, and --num-workers inputs to speed up the evaluation process.

Python usage

Quickstart example

While the CLI can be used to evaluate a fixed list of models, you can also use the Python API to evaluate your own retriever. Here is an example of how to evaluate the ColPali model on the ViDoRe benchmark. Note that your processor must implement the process_images and process_queries methods, similarly to the ColVision processors.

from typing import Dict, Optional

import torch
from colpali_engine.models import ColIdefics3, ColIdefics3Processor
from datasets import load_dataset
from tqdm import tqdm

from vidore_benchmark.evaluation.vidore_evaluators import ViDoReEvaluatorQA
from vidore_benchmark.retrievers import VisionRetriever
from vidore_benchmark.utils.data_utils import get_datasets_from_collection

model_name = "vidore/colSmol-256M"
processor = ColIdefics3Processor.from_pretrained(model_name)
model = ColIdefics3.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
).eval()

# Get retriever instance
vision_retriever = VisionRetriever(
    model=model,
    processor=processor,
)
vidore_evaluator = ViDoReEvaluatorQA(vision_retriever)

# Evaluate on a single dataset
dataset_name = "vidore/tabfquad_test_subsampled"

ds = load_dataset(dataset_name, split="test")
metrics_dataset = vidore_evaluator.evaluate_dataset(
    ds=ds,
    batch_query=4,
    batch_passage=4,
    batch_score=4,
)

# Evaluate on a local directory or a HuggingFace collection
collection_name = "vidore/vidore-benchmark-667173f98e70a1c0fa4db00d"  # ViDoRe Benchmark

dataset_names = get_datasets_from_collection(collection_name)
metrics_collection: Dict[str, Dict[str, Optional[float]]] = {}
for dataset_name in tqdm(dataset_names, desc="Evaluating dataset(s)"):
    metrics_collection[dataset_name] = vidore_evaluator.evaluate_dataset(
        ds=load_dataset(dataset_name, split="test"),
        batch_query=4,
        batch_passage=4,
        batch_score=4,
    )
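
The quickstart relies on the processor exposing `process_images` and `process_queries`. One way to check a custom processor for these two methods before running an evaluation is a structural type, sketched below. The protocol name, the argument types, and the return types are assumptions made for illustration; the real ColVision processors return model-ready batches.

```python
from typing import Any, List, Protocol, runtime_checkable


@runtime_checkable
class RetrieverProcessor(Protocol):
    """Hypothetical interface: the two methods a processor is expected to provide."""

    def process_images(self, images: List[Any]) -> Any: ...

    def process_queries(self, queries: List[str]) -> Any: ...


class DummyProcessor:
    """Toy processor returning plain Python data instead of tensors."""

    def process_images(self, images: List[Any]) -> Any:
        return {"num_images": len(images)}

    def process_queries(self, queries: List[str]) -> Any:
        return {"queries": [q.strip().lower() for q in queries]}


class IncompleteProcessor:
    """Lacks process_queries, so it does not satisfy the protocol."""

    def process_images(self, images: List[Any]) -> Any:
        return {"num_images": len(images)}


# isinstance() on a runtime-checkable protocol only verifies that the methods exist,
# not that their signatures or return values are compatible.
dummy_ok = isinstance(DummyProcessor(), RetrieverProcessor)
incomplete_ok = isinstance(IncompleteProcessor(), RetrieverProcessor)
print(dummy_ok, incomplete_ok)
```

A check like this catches a missing method early, before any model weights or datasets are loaded.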

Implement your own retriever

If you want to evaluate your own retriever with the CLI, you should clone the repository and add your own class that inherits from BaseVisionRetriever. You can find the detailed instructions here.

Compare retrievers using the EvalManager

To easily process, visualize and compare the evaluation metrics of multiple retrievers, you can use the EvalManager class. Assume you have a list of previously generated JSON metric files, e.g.:

data/metrics/
├── bisiglip.json
└── colpali.json

The data is stored in eval_manager.data as a multi-column DataFrame. Use the get_df_for_metric, get_df_for_dataset, and get_df_for_model methods to get the subset of the data you are interested in. For instance:

from vidore_benchmark.evaluation import EvalManager

eval_manager = EvalManager.from_dir("data/metrics/")
df = eval_manager.get_df_for_metric("ndcg_at_5")
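
For readers unfamiliar with the `ndcg_at_5` metric name used above, the following sketch computes a textbook nDCG@k with linear gains for a single query. It is meant only to show what the number measures. The function names are illustrative, and the package's own metric computation may differ in details such as gain function or tie handling.

```python
import math
from typing import Dict, List


def dcg_at_k(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k relevance scores."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids: List[str], qrels: Dict[str, float], k: int = 5) -> float:
    """nDCG@k for one query, given a ranking and its relevance judgments."""
    gains = [qrels.get(doc_id, 0.0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(qrels.values(), reverse=True)  # best possible ordering
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy example: one relevant page, retrieved at rank 2.
toy_qrels = {"d1": 1.0}
toy_ranking = ["d3", "d1", "d2"]
toy_ndcg = ndcg_at_k(toy_ranking, toy_qrels, k=5)
print(round(toy_ndcg, 4))
```

A score of 1.0 means the relevant pages are ranked first; the score decays logarithmically as they move down the ranking, and pages beyond rank k contribute nothing.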

Citation

ColPali: Efficient Document Retrieval with Vision Language Models

Authors: Manuel Faysse*, Hugues Sibille*, Tony Wu*, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo (* denotes equal contribution)

@misc{faysse2024colpaliefficientdocumentretrieval,
      title={ColPali: Efficient Document Retrieval with Vision Language Models}, 
      author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
      year={2024},
      eprint={2407.01449},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2407.01449}, 
}

If you want to reproduce the results from the ColPali paper, please read the REPRODUCIBILITY.md file for more information.
