⚡️ Introduction
indxr is a Python utility for indexing long files that lets you read specific lines dynamically and quickly, without hogging your RAM.
For example, given a 10M-line JSONL file and a 2018 MacBook Pro: reading any specific line takes less than 10 µs, reading 1k non-contiguous lines takes less than 10 ms, reading 1k contiguous lines takes less than 2 ms, and iterating over the entire file in batches of 32 lines takes less than 20 s (64 µs per batch). In other words, indxr lets you use your disk as a RAM extension without noticeable slowdowns, especially on SSDs and NVMe drives.
indxr can be particularly useful for dynamically loading data from large datasets with a low memory footprint and without slowing down downstream tasks, such as data processing and neural network training.
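indxr's speed comes from a simple idea: record the byte offset at which every line starts once, then `seek` straight to a line when it is requested. Here is a minimal sketch of this general technique in plain Python (illustrative only, not indxr's actual implementation):

```python
import os
import tempfile

def build_offset_index(path):
    """Record the byte offset at which every line starts."""
    offsets = []
    with open(path, "rb") as f:
        position = 0
        for line in f:
            offsets.append(position)
            position += len(line)
    return offsets

def read_line(path, offsets, i):
    """Read only line i by seeking directly to its offset."""
    with open(path, "rb") as f:
        f.seek(offsets[i])
        return f.readline().decode().rstrip("\n")

# Demo on a small temporary file
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    f.write("first\nsecond\nthird\n")

offsets = build_offset_index(path)
print(read_line(path, offsets, 2))  # → third
```

Reading any line this way costs a single disk seek, independent of file size, which is why lookups stay in the microsecond range on SSDs.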
For an overview, follow the Usage section.
🔌 Installation
```shell
pip install indxr
```
💡 Usage
TXT
```python
from indxr import Indxr

index = Indxr("sample.txt")

index[0]                # First line of the file
index[1:3]              # Second and third lines of the file
index.get("0")          # Line with key "0" (line numbers are used as keys by default)
index.mget(["2", "1"])  # Lines with keys "2" and "1"
```
JSONL
```python
from indxr import Indxr

index = Indxr("sample.jsonl", key_id="id")

index[42]                         # Line 43, parsed into a Python dict
index[42:46]                      # Lines 43 to 46, parsed into Python dicts
index.get("id_123")               # Line whose "id" field is "id_123"
index.mget(["id_123", "id_321"])  # Lines whose "id" fields are "id_123" and "id_321"
```
CSV / TSV / ...
```python
from indxr import Indxr

index = Indxr(
    "sample.csv",
    delimiter=",",      # Use "\t" for TSV
    fieldnames=None,    # Column names; inferred from the header when None
    has_header=True,    # Whether the first line is a header
    return_dict=True,   # Return rows as Python dicts
    key_id="id",        # Column used as key for get / mget
)

index[42]                         # Line 43 (header excluded)
index[42:46]                      # Lines 43 to 46 (header excluded)
index.get("id_123")               # Line whose "id" field is "id_123"
index.mget(["id_123", "id_321"])  # Lines whose "id" fields are "id_123" and "id_321"
```
Custom
```python
from indxr import Indxr

# Any other file type is indexed line by line
index = Indxr("sample.something")

index[0]                # First line of the file
index[1:3]              # Second and third lines of the file
index.get("0")          # Line with key "0" (line numbers are used as keys by default)
index.mget(["2", "1"])  # Lines with keys "2" and "1"
```
Callback (works with every file type)
```python
from indxr import Indxr

index = Indxr("sample.txt", callback=lambda x: x.split())

index.get("0")  # First line of the file, split on whitespace
```
Write / Read Index
```python
from indxr import Indxr

index = Indxr("sample.txt", callback=lambda x: x.split())

index.write(path)  # Write the index to disk

# Read the index from disk; the callback must be passed again,
# as it cannot be serialized
index = Indxr.read(path, callback=lambda x: x.split())
```
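Persisting the index pays off because building it requires a full scan of the source file, while loading previously saved offsets is nearly instantaneous. The idea can be sketched in plain Python (hypothetical helper names, not indxr's internals):

```python
import json
import os
import tempfile

def save_index(offsets, index_path):
    """Persist line offsets so the source file need not be re-scanned."""
    with open(index_path, "w") as f:
        json.dump(offsets, f)

def load_index(index_path):
    """Load previously saved line offsets."""
    with open(index_path) as f:
        return json.load(f)

index_path = os.path.join(tempfile.mkdtemp(), "offsets.json")
save_index([0, 6, 13], index_path)   # offsets from a previous scan
print(load_index(index_path))        # → [0, 6, 13]
```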
Usage example with PyTorch Dataset
In this example, we build a PyTorch Dataset that returns a query and two documents, one positive and one negative, for training a neural retriever. The data is stored in two files, queries.jsonl and documents.jsonl. The first contains queries, the second contains documents. Each query has a list of associated positive and negative document IDs. Using Indxr, we avoid loading the entire dataset into memory and load data dynamically, without slowing down the training process.
```python
import random

from indxr import Indxr
from torch.utils.data import DataLoader, Dataset


class CustomDataset(Dataset):
    def __init__(self):
        # key_id must match the key field used in each file
        self.queries = Indxr("queries.jsonl", key_id="q_id")
        self.documents = Indxr("documents.jsonl", key_id="doc_id")

    def __getitem__(self, index: int):
        # Get the query at the given position
        query = self.queries[index]

        # Sample a positive and a negative document ID for the query
        pos_doc_id = random.choice(query["pos_doc_ids"])
        neg_doc_id = random.choice(query["neg_doc_ids"])

        # Fetch the sampled documents by ID
        pos_doc = self.documents.get(pos_doc_id)
        neg_doc = self.documents.get(neg_doc_id)

        return query["text"], pos_doc["text"], neg_doc["text"]

    def __len__(self):
        return len(self.queries)
```
```python
def collate_fn(batch):
    queries = [x[0] for x in batch]
    pos_docs = [x[1] for x in batch]
    neg_docs = [x[2] for x in batch]

    # `tokenizer` is assumed to be defined elsewhere,
    # e.g., a Hugging Face tokenizer
    return tokenizer(queries), tokenizer(pos_docs), tokenizer(neg_docs)


dataloader = DataLoader(
    dataset=CustomDataset(),
    collate_fn=collate_fn,
    batch_size=32,
    shuffle=True,
    num_workers=4,
)
```
Each line of queries.jsonl is as follows:

```json
{
  "q_id": "q321",
  "text": "lorem ipsum",
  "pos_doc_ids": ["d2789822", "d2558037", "d2594098"],
  "neg_doc_ids": ["d3931445", "d4652233", "d191393", "d3692918", "d3051731"]
}
```
Each line of documents.jsonl is as follows:

```json
{
  "doc_id": "d123",
  "text": "Lorem ipsum dolor sit amet, consectetuer adipiscing elit."
}
```
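To try the example end to end, you can generate miniature versions of the two files (the records below are invented for illustration):

```python
import json

queries = [
    {
        "q_id": "q321",
        "text": "lorem ipsum",
        "pos_doc_ids": ["d1"],
        "neg_doc_ids": ["d2"],
    },
]
documents = [
    {"doc_id": "d1", "text": "a relevant document"},
    {"doc_id": "d2", "text": "a non-relevant document"},
]

# Write one JSON object per line, as expected for JSONL files
with open("queries.jsonl", "w") as f:
    for q in queries:
        f.write(json.dumps(q) + "\n")

with open("documents.jsonl", "w") as f:
    for d in documents:
        f.write(json.dumps(d) + "\n")
```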
🎁 Feature Requests
Would you like to see other features implemented? Please open a feature request.
🤘 Want to contribute?
Would you like to contribute? Please drop me an e-mail.
📄 License
indxr is open-source software licensed under the MIT license.