goldenretriever-core
Install the library from PyPI:
pip install goldenretriever-core
or from source:
git clone https://github.com/Riccorl/golden-retriever.git
cd golden-retriever
pip install -e .
Install with optional dependencies for FAISS
The FAISS PyPI package is only available for CPU. To use a GPU, install FAISS from source or use the conda package.
For CPU:
pip install goldenretriever-core[faiss]
For GPU:
conda create -n goldenretriever python=3.11
conda activate goldenretriever
# install pytorch
conda install -y pytorch=2.3.1 pytorch-cuda=12.1 -c pytorch -c nvidia
# GPU
conda install -y -c pytorch -c nvidia faiss-gpu=1.8.0
# or GPU with NVIDIA RAFT
conda install -y -c pytorch -c nvidia -c rapidsai -c conda-forge faiss-gpu-raft=1.8.0
pip install goldenretriever-core
Golden Retriever is built on top of PyTorch Lightning and Hydra. To run an experiment, create a configuration file and pass it to the golden-retriever command. A few examples are provided in the conf folder.
Here is a simple example of how to train a DPR-like retriever on the NQ dataset. First download the dataset from DPR. Then run the following command:
golden-retriever train conf/nq-dpr.yaml
# Evaluate a retriever on a test set from Python
from goldenretriever.trainer import Trainer
from goldenretriever import GoldenRetriever
from goldenretriever.data.datasets import InBatchNegativesDataset

# load the retriever (fill in the encoder and index paths)
retriever = GoldenRetriever(
    question_encoder="",
    document_index="",
    device="cuda",
    precision="16",
)

# build the test dataset with the retriever's own tokenizer
test_dataset = InBatchNegativesDataset(
    name="test",
    path="",
    tokenizer=retriever.question_tokenizer,
    question_batch_size=64,
    passage_batch_size=400,
    max_passage_length=64,
)

trainer = Trainer(
    retriever=retriever,
    test_dataset=test_dataset,
    log_to_wandb=False,
    top_k=[20, 100],
)

trainer.test()
Golden Retriever supports distributed training. For the moment, training is limited to a single node with multiple GPUs and without model sharding, i.e. only DDP and FSDP with the NO_SHARD strategy are supported.
To run distributed training, add the following keys to the configuration file:
devices: 4 # number of GPUs
# strategy: "ddp_find_unused_parameters_true" # DDP
# FSDP with NO_SHARD
strategy:
  _target_: lightning.pytorch.strategies.FSDPStrategy
  sharding_strategy: "NO_SHARD"
from goldenretriever import GoldenRetriever

retriever = GoldenRetriever(
    question_encoder="path/to/question/encoder",
    passage_encoder="path/to/passage/encoder",
    document_index="path/to/document/index"
)

# retrieve documents
retriever.retrieve("What is the capital of France?", k=5)
The retriever expects a jsonl file similar to DPR:
[
  {
    "question": "....",
    "answers": ["...", "...", "..."],
    "positive_ctxs": [{
      "title": "...",
      "text": "...."
    }],
    "negative_ctxs": ["..."],
    "hard_negative_ctxs": ["..."]
  },
  ...
]
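Before training, it can help to sanity-check samples against the layout shown above. The following is a minimal sketch using only the Python standard library. The helper name check_sample is hypothetical, and treating "question" and "positive_ctxs" as required while the other keys are optional is an assumption of this sketch, not a rule documented by Golden Retriever.

```python
from typing import List

# Assumption: only these two keys are treated as mandatory in this sketch.
REQUIRED_KEYS = ("question", "positive_ctxs")
# Assumption: these keys are optional but, when present, must be lists.
OPTIONAL_LIST_KEYS = ("answers", "negative_ctxs", "hard_negative_ctxs")


def check_sample(sample: dict) -> List[str]:
    """Return a list of human-readable problems found in one sample."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in sample:
            problems.append(f"missing key: {key}")
    # Each positive context is expected to be an object with a "text" field.
    for ctx in sample.get("positive_ctxs", []):
        if not isinstance(ctx, dict) or "text" not in ctx:
            problems.append("positive context without a 'text' field")
    for key in OPTIONAL_LIST_KEYS:
        if key in sample and not isinstance(sample[key], list):
            problems.append(f"{key} should be a list")
    return problems


good_sample = {
    "question": "What is the capital of France?",
    "answers": ["Paris"],
    "positive_ctxs": [{"title": "Paris", "text": "Paris is the capital of France."}],
}
print(check_sample(good_sample))
print(check_sample({"question": "What is the capital of France?"}))
```

An empty list means the sample matches the layout this sketch assumes. Any stricter rule (for example, requiring at least one positive context) would need to be confirmed against the library itself.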
The document to index can be either a jsonl file or a tsv file similar to DPR:
jsonl: each line is a json object with the following keys: id, text, metadata
tsv: each line is a tab-separated string with the id and text columns, followed by any other column that will be stored in the metadata field
jsonl example:
[
  {
    "id": "...",
    "text": "...",
    "metadata": ["{...}"]
  },
  ...
]
tsv example:
id \t text \t any other column
...
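The relationship between the two document formats can be illustrated with a small conversion sketch. It uses only the Python standard library, and the function name tsv_to_jsonl is hypothetical. Two assumptions are made here: the tsv file starts with a header row naming its columns (as DPR's files do), and the extra columns are stored in metadata as a dictionary keyed by column name. The jsonl example above shows metadata in a different shape, so the exact structure the library expects should be checked before relying on this.

```python
import csv
import io
import json


def tsv_to_jsonl(tsv_text: str) -> str:
    """Convert tab-separated documents into one JSON object per line."""
    # Assumption: the first row is a header containing "id" and "text".
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    lines = []
    for row in reader:
        doc = {
            "id": row.pop("id"),
            "text": row.pop("text"),
            # Assumption: every remaining column goes into metadata as a dict.
            "metadata": dict(row),
        }
        lines.append(json.dumps(doc, ensure_ascii=False))
    return "\n".join(lines)


example = "id\ttext\ttitle\n1\tParis is the capital of France.\tParis\n"
print(tsv_to_jsonl(example))
```

For real files, the same function can be fed the contents of the tsv file and its result written out with a .jsonl extension.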
FAQs
Dense Retriever
We found that goldenretriever-core demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.