Security News
Fluent Assertions Faces Backlash After Abandoning Open Source Licensing
Fluent Assertions is facing backlash after dropping the Apache license for a commercial model, leaving users blindsided and questioning contributor rights.
High-performance MinHash implementation in Rust with Python bindings for efficient similarity estimation and deduplication of large datasets
Rensa (Swedish for "clean") is a high-performance MinHash implementation written in Rust with Python bindings. It's designed for efficient similarity estimation and deduplication of large datasets.
Rensa implements a variant of the MinHash algorithm that combines ideas from traditional MinHash and the C-MinHash algorithm proposed in the paper C-MinHash: Rigorously Reducing K Permutations to Two to create a novel MinHash implementation that I call R-MinHash
.
Rensa is particularly useful in scenarios where you need to:
Use cases include:
Key aspects of Rensa's implementation include:
Efficient permutation generation: Instead of storing full permutations or using k independent hash functions, Rensa uses a pair of random numbers (a, b) to generate permutations on-the-fly. This approach significantly reduces memory usage while maintaining the algorithm's effectiveness.
Simplified C-MinHash: While inspired by C-MinHash, Rensa's implementation differs in a few key ways:
Trade-off between memory and variance reduction: Rensa's approach trades some of the variance reduction benefits of full C-MinHash for improved memory efficiency and simplicity. While it may not achieve the same level of variance reduction as C-MinHash, it still offers better performance than traditional MinHash in many scenarios.
Fast hash function: Rensa uses the rustc-hash crate which implements the FxHash algorithm, a fast, non-cryptographic hash function, to further optimize performance.
Vectorized operations: The R-MinHash computation is optimized using vector operations, allowing for efficient parallel processing of multiple hash values.
Memory-efficient data structures: The implementation uses compact data structures to minimize memory usage while maintaining fast access times.
Efficient LSH implementation: The LSH index uses a band-based approach with optimized data structures for fast insertion and query operations.
These design choices result in a MinHash implementation that is fast, memory-efficient, and suitable for large-scale similarity estimation and deduplication tasks. While Rensa may not provide the same theoretical guarantees as full C-MinHash, our benchmarks show that it offers significant performance improvements over traditional MinHash implementations like datasketch
.
You can install Rensa using pip
. It's available in all platforms:
pip install rensa
Here's an example of how to use Rensa to deduplicate a dataset:
from datasets import load_dataset
from rensa import RMinHash
from tqdm import tqdm
def rensa_minhash(text, num_perm=128):
m = RMinHash(num_perm=num_perm, seed=42)
m.update(text.split())
return m
def deduplicate_dataset(dataset, num_perm=128):
unique_hashes = set()
deduplicated_indices = []
for idx, example in tqdm(enumerate(dataset), total=len(dataset), desc="Deduplicating"):
minhash = rensa_minhash(example["sql"], num_perm)
hash_tuple = tuple(minhash.digest())
if hash_tuple not in unique_hashes:
unique_hashes.add(hash_tuple)
deduplicated_indices.append(idx)
return deduplicated_indices
def main():
print("Loading dataset...")
sql_dataset = load_dataset("gretelai/synthetic_text_to_sql", split="train")
print("Deduplicating dataset...")
deduplicated_indices = deduplicate_dataset(sql_dataset)
deduplicated_dataset = sql_dataset.select(deduplicated_indices)
print("\nDeduplication Results:")
print(f"Original dataset size: {len(sql_dataset)}")
print(f"Deduplicated dataset size: {len(deduplicated_dataset)}")
print(f"Rows removed: {len(sql_dataset) - len(deduplicated_dataset)}")
if __name__ == "__main__":
main()
Rensa significantly outperforms datasketch
in terms of speed. The table below provides a detailed comparison of execution times for different numbers of permutations:
Permutations | Datasketch Time (s) | Rensa Time (s) | Speedup |
---|---|---|---|
64 | 34.48 | 4.89 | 7.05x faster |
128 | 49.62 | 5.21 | 9.52x faster |
256 | 84.76 | 6.39 | 13.26x faster |
Memory usage is comparable between Rensa and datasketch
, with Rensa showing slightly better performance for smaller numbers of permutations. The table below provides the details:
Permutations | Datasketch Memory (MB) | Rensa Memory (MB) | Difference (MB) |
---|---|---|---|
64 | 265.75 | 242.36 | 23.39 less |
128 | 487.02 | 472.97 | 14.05 less |
256 | 811.64 | 774.49 | 37.15 less |
Despite the simplified implementation, Rensa achieves the same deduplication results as datasketch
. The Jaccard similarity between the deduplicated sets produced by both libraries is 1.0000, indicating identical results.
To run the benchmarks yourself, follow these steps:
Clone the repository:
git clone https://github.com/beowolx/rensa.git
cd rensa
Create a virtual environment:
python3 -m venv venv
source venv/bin/activate
Install the required dependencies:
pip install -r requirements.txt
Run the simple benchmark:
python benchmarks/simple_benchmark.py
Run the advanced benchmark:
python benchmarks/advanced_benchmark.py
The simple_benchmark.py
script provides a basic comparison of deduplication performance between Rensa and datasketch
. The advanced_benchmark.py
script offers a more comprehensive analysis, including multiple runs with different numbers of permutations, memory usage tracking, and detailed profiling information.
While Rensa offers significant performance improvements, it has some limitations compared to datasketch
:
Feature set: Rensa currently implements only the core MinHash and LSH functionality. It doesn't include some of the advanced features found in datasketch
.
Customization: datasketch
offers more options for customizing the hash functions and other parameters, while Rensa currently has a more fixed implementation.
Theoretical guarantees: Due to the simplified C-MinHash implementation, Rensa may not provide the same level of variance reduction as the full C-MinHash algorithm in all scenarios.
Future work on Rensa may include:
Despite these limitations, Rensa's performance benefits make it an excellent choice for applications where speed and efficiency are critical, especially when working with large datasets.
Contributions to Rensa are welcome! Please feel free to submit pull requests, report bugs, or suggest features through the GitHub issue tracker.
Rensa is released under the MIT License. See the LICENSE file for details.
FAQs
High-performance MinHash implementation in Rust with Python bindings for efficient similarity estimation and deduplication of large datasets
We found that rensa demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Security News
Fluent Assertions is facing backlash after dropping the Apache license for a commercial model, leaving users blindsided and questioning contributor rights.
Research
Security News
Socket researchers uncover the risks of a malicious Python package targeting Discord developers.
Security News
The UK is proposing a bold ban on ransomware payments by public entities to disrupt cybercrime, protect critical services, and lead global cybersecurity efforts.