embed
A stable, blazing-fast, and easy-to-use inference library with a focus on a sync-to-async API
Installation
pip install embed
Why embed?
Embed makes it easy to load any embedding, classification, or reranking model from the Hugging Face Hub.
It uses Infinity as its backend for asynchronous computation, batching, and Flash Attention 2.
Benchmarks were run on an Nvidia L4 instance; the CPU benchmark uses bert-small and the CUDA benchmark uses bert-large (see Methodology).
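The quick-start below registers several Hugging Face models on a single device and runs text embedding, reranking, classification, and image embedding through one BatchedInference handle: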
from embed import BatchedInference
from concurrent.futures import Future

# Register all models once; requests against them are batched in the background.
register = BatchedInference(
    model_id=[
        # text embeddings
        "michaelfeil/bge-small-en-v1.5",
        # text and image embeddings
        "jinaai/jina-clip-v1",
        # classification
        "philschmid/tiny-bert-sst2-distilled",
        # reranking
        "mixedbread-ai/mxbai-rerank-xsmall-v1",
    ],
    engine="torch",
    device="cpu",
)

sentences = ["Paris is in France.", "Berlin is in Germany.", "An image of two cats."]
images = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
question = "Where is Paris?"

# Each call returns a Future; .result() blocks until the batch has been processed.
future: "Future" = register.embed(
    sentences=sentences, model_id="michaelfeil/bge-small-en-v1.5"
)
future.result()

register.rerank(
    query=question, docs=sentences, model_id="mixedbread-ai/mxbai-rerank-xsmall-v1"
)
register.classify(model_id="philschmid/tiny-bert-sst2-distilled", sentences=sentences)
register.image_embed(model_id="jinaai/jina-clip-v1", images=images)

# Stop the register on exit to free model memory.
register.stop()
All functions return a Future holding (vector_embedding, token_usage). This lets you wait for results when you need them and keeps batching logic out of your code.
>>> embedding_fut = register.embed(sentences=sentences, model_id="michaelfeil/bge-small-en-v1.5")
>>> print(embedding_fut)
<Future at 0x7fa0e97e8a60 state=pending>
>>> import time; time.sleep(1); print(embedding_fut)
<Future at 0x7fa0e97e8a60 state=finished returned tuple>
>>> embedding_fut.result()
([array([-3.35943862e-03, ..., -3.22808176e-02], dtype=float32)], 19)
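Because every call returns a standard concurrent.futures.Future, you can submit requests against several registered models up front and only block when you need the results. A minimal sketch, reusing the register, sentences, question, and model IDs from the quick-start above (the *_fut variable names are illustrative):

# Submit work for several models; batching happens in the background.
embed_fut = register.embed(sentences=sentences, model_id="michaelfeil/bge-small-en-v1.5")
rerank_fut = register.rerank(query=question, docs=sentences, model_id="mixedbread-ai/mxbai-rerank-xsmall-v1")
classify_fut = register.classify(sentences=sentences, model_id="philschmid/tiny-bert-sst2-distilled")

# Block only when a result is actually needed.
embeddings, token_usage = embed_fut.result()  # list of vectors plus token count, as shown above
rankings = rerank_fut.result()
labels = classify_fut.result()
print(len(embeddings), token_usage)

Calling .result() waits on just that request; everything submitted earlier continues to be batched and computed concurrently.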
Licence and Contributions
embed is licensed under the MIT License. All contributions must adhere to the MIT License. Contributions are welcome.