rerankers

A lightweight unified API for various reranking models. Developed by @bclavie as a member of answer.ai.

Welcome to rerankers! Our goal is to provide users with a simple API to use any reranking model.
A longer release history can be found in the Release History section of this README.
Recent updates:
- rerankers goes multi-modal, with support for the first MonoVLMRanker model, MonoQwen2-VL-v0.1! Plus many QoL fixes.
- Use top_k() to get sorted results!
- BGE layerwise LLM rerankers, based on Gemma and MiniCPM. These are different from RankGPT, as they're not listwise: the models are repurposed as "cross-encoders", and do output logit scores.

Why rerankers?

Rerankers are an important part of any retrieval architecture, but they're also often more obscure than other parts of the pipeline.
Sometimes, it can be hard to even know which one to use. Every problem is different, and the best model for use case X is not necessarily the same one as for use case Y.
Moreover, new reranking methods keep popping up: for example, RankGPT, using LLMs to rerank documents, appeared just last year, with very promising zero-shot benchmark results.
All the different reranking approaches tend to be done in their own library, with varying levels of documentation. This results in an even higher barrier to entry. New users are required to swap between multiple unfamiliar input/output formats, all with their own quirks!
rerankers seeks to address this problem by providing a simple API for all popular rerankers, no matter the architecture.
rerankers aims to be simple to use: a single rank() function call maps a (query, [documents]) input to a RankedResults output.

Installation is very simple. The core package ships with just two dependencies, tqdm and pydantic, so as to avoid any conflict with your current environment.
You may then install only the dependencies required by the models you want to try out:
# Core package only, will require other dependencies already installed
pip install rerankers
# All transformers-based approaches (cross-encoders, t5, colbert)
pip install "rerankers[transformers]"
# RankGPT
pip install "rerankers[gpt]"
# API-based rerankers (Cohere, Jina, MixedBread, Pinecone)
pip install "rerankers[api]"
# FlashRank rerankers (ONNX-optimised, very fast on CPU)
pip install "rerankers[flashrank]"
# RankLLM rerankers (better RankGPT + support for local models such as RankZephyr and RankVicuna)
# Note: RankLLM is only supported on Python 3.10+! This will not work with Python 3.9
pip install "rerankers[rankllm]"
# To support Multi-Modal rerankers such as MonoQwen2-VL and other MonoVLM models, which require flash-attention, peft, accelerate, and recent versions of `transformers`
pip install "rerankers[monovlm]"
# To support LLM-Layerwise rerankers (which need flash-attention installed)
pip install "rerankers[llmlayerwise]"
# All of the above
pip install "rerankers[all]"
Load any supported reranker in a single line, regardless of the architecture:
from rerankers import Reranker
# Cross-encoder default. You can specify a 'lang' parameter to load a multilingual version!
ranker = Reranker('cross-encoder')
# Specific cross-encoder
ranker = Reranker('mixedbread-ai/mxbai-rerank-large-v1', model_type='cross-encoder')
# FlashRank default. You can specify a 'lang' parameter to load a multilingual version!
ranker = Reranker('flashrank')
# Specific flashrank model.
ranker = Reranker('ce-esci-MiniLM-L12-v2', model_type='flashrank')
# Default T5 Seq2Seq reranker
ranker = Reranker("t5")
# Specific T5 Seq2Seq reranker
ranker = Reranker("unicamp-dl/InRanker-base", model_type = "t5")
# API (Cohere)
ranker = Reranker("cohere", lang='en', api_key=API_KEY)  # lang can be 'en' or 'other'
# Custom Cohere model? No problem!
ranker = Reranker("my_model_name", api_provider = "cohere", api_key = API_KEY)
# API (Pinecone)
ranker = Reranker("pinecone", api_key = API_KEY)
# API (Jina)
ranker = Reranker("jina", api_key = API_KEY)
# RankGPT4-turbo
ranker = Reranker("rankgpt", api_key = API_KEY)
# RankGPT3-turbo
ranker = Reranker("rankgpt3", api_key = API_KEY)
# RankGPT with another LLM provider
ranker = Reranker("MY_LLM_NAME", model_type="rankgpt", api_key=API_KEY)  # any model name supported by litellm (check the litellm docs)
# RankLLM with default GPT (GPT-4o)
ranker = Reranker("rankllm", api_key = API_KEY)
# RankLLM with specified GPT models
ranker = Reranker('gpt-4-turbo', model_type="rankllm", api_key = API_KEY)
# ColBERTv2 reranker
ranker = Reranker("colbert")
# LLM Layerwise Reranker
ranker = Reranker('llm-layerwise')
# ... Or a non-default colbert model:
ranker = Reranker(model_name_or_path, model_type = "colbert")
Rerankers will always try to infer the model you're trying to use based on its name, but it's always safer to pass a model_type argument if you can!
Then, regardless of which reranker is loaded, use the loaded model to rank a query against documents:
> results = ranker.rank(query="I love you", docs=["I hate you", "I really like you"], doc_ids=[0,1])
> results
RankedResults(results=[Result(document=Document(text='I really like you', doc_id=1), score=-2.453125, rank=1), Result(document=Document(text='I hate you', doc_id=0), score=-4.14453125, rank=2)], query='I love you', has_scores=True)
You don't need to pass doc_ids! If not provided, they'll be auto-generated as integers corresponding to the index of a document in docs.
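When doc_ids is omitted, the generated IDs are conceptually just list indices. A minimal sketch of that fallback (not the library's actual code):

```python
# Sketch only: when doc_ids is omitted, each document's ID is
# simply its position in the input list.
docs = ["I hate you", "I really like you"]
doc_ids = list(range(len(docs)))
print(doc_ids)  # [0, 1]
```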
You're free to pass metadata too, and it'll be stored with the documents. It'll also be accessible in the results object:
> results = ranker.rank(query="I love you", docs=["I hate you", "I really like you"], doc_ids=[0,1], metadata=[{'source': 'twitter'}, {'source': 'reddit'}])
> results
RankedResults(results=[Result(document=Document(text='I really like you', doc_id=1, metadata={'source': 'reddit'}), score=-2.453125, rank=1), Result(document=Document(text='I hate you', doc_id=0, metadata={'source': 'twitter'}), score=-4.14453125, rank=2)], query='I love you', has_scores=True)
If you'd like your code to be a bit cleaner, you can also directly construct Document objects yourself and pass those instead. In that case, you don't need to pass separate doc_ids and metadata:
> from rerankers import Document
> docs = [Document(text="I really like you", doc_id=0, metadata={'source': 'twitter'}), Document(text="I hate you", doc_id=1, metadata={'source': 'reddit'})]
> results = ranker.rank(query="I love you", docs=docs)
> results
RankedResults(results=[Result(document=Document(text='I really like you', doc_id=0, metadata={'source': 'twitter'}), score=-2.453125, rank=1), Result(document=Document(text='I hate you', doc_id=1, metadata={'source': 'reddit'}), score=-4.14453125, rank=2)], query='I love you', has_scores=True)
You can also use rank_async, which is essentially just a wrapper that turns rank() into a coroutine. The result will be the same:
> results = await ranker.rank_async(query="I love you", docs=["I hate you", "I really like you"], doc_ids=[0,1])
> results
RankedResults(results=[Result(document=Document(text='I really like you', doc_id=1), score=-2.453125, rank=1), Result(document=Document(text='I hate you', doc_id=0), score=-4.14453125, rank=2)], query='I love you', has_scores=True)
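The general pattern behind such a wrapper is to run the blocking call off the event loop. A hypothetical sketch of the idea, with a stand-in rank() rather than the library's implementation:

```python
import asyncio

def rank(query, docs):
    # stand-in for a blocking, synchronous rank() call:
    # here we just pretend shorter documents score higher
    return sorted(docs, key=len)

async def rank_async(query, docs):
    # run the blocking call on a worker thread so the event loop stays free
    return await asyncio.to_thread(rank, query, docs)

results = asyncio.run(rank_async("q", ["longer document", "short"]))
print(results)  # ['short', 'longer document']
```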
All rerankers will return a RankedResults object, which is a pydantic object containing a list of Result objects and some other useful information, such as the original query. You can retrieve the top k results from it by running top_k():
> results.top_k(1)
[Result(document=Document(text='I really like you', doc_id=1), score=-2.453125, rank=1)]
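Since results arrive already sorted by rank, top_k amounts to taking the first k entries. An illustrative sketch with plain tuples (not the library's code):

```python
# Results kept sorted by descending score, best first.
results = [("I really like you", -2.453125), ("I hate you", -4.14453125)]

def top_k(results, k):
    # top_k on a pre-sorted list is just a slice
    return results[:k]

print(top_k(results, 1))  # [('I really like you', -2.453125)]
```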
The Result objects are transparent when accessing the documents they store, as Document objects simply exist as an easy way to store IDs and metadata. If you want to access a given result's text or metadata, you can access it directly as a property:
> results.top_k(1)[0].text
'I really like you'
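The delegation pattern behind this transparency is straightforward. A hypothetical minimal sketch (these are not the library's actual class definitions):

```python
from dataclasses import dataclass, field

# Hypothetical minimal Document/Result classes, showing how a Result
# can expose its stored Document's fields as its own properties.
@dataclass
class Document:
    text: str
    doc_id: int
    metadata: dict = field(default_factory=dict)

@dataclass
class Result:
    document: Document
    score: float
    rank: int

    @property
    def text(self) -> str:
        # delegate attribute access to the wrapped Document
        return self.document.text

r = Result(Document("I really like you", doc_id=1), score=-2.453125, rank=1)
print(r.text)  # I really like you
```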
And that's all you need to know to get started quickly! Check out the overview notebook for more information on the API and the different models, or the langchain example to see how to integrate this in your langchain pipeline.
Performance on scifact matches the literature for any given model implementation (except RankGPT, where results are harder to reproduce).

If rerankers has been useful to you in academic work, please do feel free to cite the work below!
@misc{clavié2024rerankers,
title={rerankers: A Lightweight Python Library to Unify Ranking Methods},
author={Benjamin Clavié},
year={2024},
eprint={2408.17344},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2408.17344},
}
New Document object, courtesy of joint work by @bclavie and Anmol6. This object is transparent, but now offers support for metadata stored alongside each document. Many small QoL changes (RankedResults can be iterated on directly...).