text-dedup

All-in-one text deduplication tools

Version: 0.4.1 (PyPI)
Maintainers: 1

Python 3.12+

Installation

git clone https://github.com/ChenghaoMou/text-dedup
cd text-dedup
uv sync

Documentation

GitHub Pages

Features

This repository contains a collection of text deduplication scripts that are ready to use as-is or to modify to suit your needs:

  • MinHash + MinHashLSH for near-duplicate detection
  • 64-bit or 128-bit SimHash
  • Suffix array exact substring deduplication
  • Bloom Filter exact deduplication
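
To make the first item in the list concrete, here is a minimal sketch of how MinHash signatures approximate Jaccard similarity. It is an illustration only, not this repository's implementation: the whitespace tokenizer, the `(a*x + b) mod p` hash family, and the helper names `shingles`, `minhash_signature`, and `estimate_jaccard` are all assumptions made for the example.

```python
import hashlib
import random

MERSENNE_PRIME = (1 << 61) - 1  # modulus for the (a*x + b) % p hash family


def shingles(text: str, ngram_size: int = 5) -> set[str]:
    """Word n-grams of the text (hypothetical whitespace tokenizer)."""
    tokens = text.split()
    if len(tokens) < ngram_size:
        return {" ".join(tokens)}
    return {
        " ".join(tokens[i:i + ngram_size])
        for i in range(len(tokens) - ngram_size + 1)
    }


def stable_hash(shingle: str) -> int:
    """Deterministic 64-bit hash of a shingle."""
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items: set[str], num_perm: int = 240, seed: int = 42) -> list[int]:
    """For each of num_perm hash functions, keep the minimum hashed value."""
    rng = random.Random(seed)
    params = [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
        for _ in range(num_perm)
    ]
    hashes = [stable_hash(item) for item in items]
    return [min((a * h + b) % MERSENNE_PRIME for h in hashes) for a, b in params]


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions on which the two documents agree."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


a = shingles("the quick brown fox jumps over the lazy dog near the river bank", 3)
b = shingles("the quick brown fox jumps over the lazy dog near the river", 3)
exact = len(a & b) / len(a | b)
approx = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(f"exact={exact:.3f} approx={approx:.3f}")
```

The estimate gets closer to the exact value as `num_perm` grows, at the cost of longer signatures.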

All algorithms use a config-based approach with TOML files for easy customization.

Quick Start

All deduplication scripts read from a config.toml file in the project root.

1. Configure your settings

Edit config.toml with your input data and algorithm settings:

MinHash Near Deduplication
[input]
input_type = "local_files"
file_type = "parquet"

[input.read_arguments]
path = "data/your_data"
split = "train"

[algorithm]
algorithm_name = "minhash"
text_column = "text"
seed = 42
batch_size = 10000
num_perm = 240
threshold = 0.7
false_positive_weight = 0.5
false_negative_weight = 0.5
hash_bits = 64
ngram_size = 5
check_false_positive = true

[output]
output_dir = "output"
clean_cache = false
save_clusters = true

[debug]
enable_profiling = false
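
The `threshold`, `num_perm`, `false_positive_weight`, and `false_negative_weight` settings suggest the usual LSH banding trade-off. The sketch below follows the approach popularized by the datasketch library: choose the number of bands and rows per band that minimizes a weighted sum of false-positive and false-negative areas under the collision-probability curve. Whether this project uses exactly this procedure is an assumption, and the function names are hypothetical.

```python
def integrate(f, lo: float, hi: float, steps: int = 100) -> float:
    """Midpoint-rule numerical integration of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


def collision_probability(s: float, bands: int, rows: int) -> float:
    """Probability that two documents with Jaccard similarity s share a band."""
    return 1.0 - (1.0 - s ** rows) ** bands


def optimal_bands_rows(threshold: float, num_perm: int,
                       fp_weight: float = 0.5, fn_weight: float = 0.5) -> tuple[int, int]:
    """Search all (bands, rows) with bands * rows <= num_perm."""
    best, best_error = (1, 1), float("inf")
    for bands in range(1, num_perm + 1):
        for rows in range(1, num_perm // bands + 1):
            # Pairs below the threshold that still collide.
            fp = integrate(lambda s: collision_probability(s, bands, rows), 0.0, threshold)
            # Pairs above the threshold that fail to collide.
            fn = integrate(lambda s: 1.0 - collision_probability(s, bands, rows), threshold, 1.0)
            error = fp_weight * fp + fn_weight * fn
            if error < best_error:
                best, best_error = (bands, rows), error
    return best


bands, rows = optimal_bands_rows(threshold=0.7, num_perm=240)
print(f"bands={bands} rows={rows}")
```

Raising `fp_weight` relative to `fn_weight` pushes the search toward fewer spurious candidate pairs, at the cost of missing more true near-duplicates.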

SimHash Near Deduplication
[input]
input_type = "local_files"
file_type = "parquet"

[input.read_arguments]
path = "data/your_data"
split = "train"

[algorithm]
algorithm_name = "simhash"
text_column = "text"
hash_bits = 64
ngram_size = 3
bit_diff = 3

[output]
output_dir = "output"
clean_cache = false

[debug]
enable_profiling = false
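
For intuition about `hash_bits`, `ngram_size`, and `bit_diff`, here is a minimal SimHash sketch: each n-gram votes on every bit of the fingerprint, and two documents count as near-duplicates when their fingerprints differ in at most `bit_diff` bits. The character n-grams, the BLAKE2b hash, and the function names are assumptions for the example, not necessarily what this project does.

```python
import hashlib


def char_ngrams(text: str, n: int) -> list[str]:
    """Character n-grams; a short text yields itself as a single gram."""
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]


def simhash(text: str, hash_bits: int = 64, ngram_size: int = 3) -> int:
    """Build a fingerprint by per-bit majority vote over hashed n-grams."""
    counts = [0] * hash_bits
    for gram in char_ngrams(text, ngram_size):
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=hash_bits // 8).digest()
        value = int.from_bytes(digest, "big")
        for bit in range(hash_bits):
            counts[bit] += 1 if (value >> bit) & 1 else -1
    fingerprint = 0
    for bit, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a: str, text_b: str, bit_diff: int = 3) -> bool:
    return hamming_distance(simhash(text_a), simhash(text_b)) <= bit_diff


doc_a = "Text deduplication removes repeated documents from a corpus."
doc_b = "Text deduplication removes repeated documents from a corpus!"
print(hamming_distance(simhash(doc_a), simhash(doc_b)))
```

A larger `bit_diff` tolerates more differences between documents but also admits more false matches.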

Bloom Filter Exact Deduplication
[input]
input_type = "local_files"
file_type = "parquet"

[input.read_arguments]
path = "data/your_data"
split = "train"

[algorithm]
algorithm_name = "bloom_filter"
text_column = "text"
error_rate = 1e-5
expected_elements = 100000

[output]
output_dir = "output"
clean_cache = false

[debug]
enable_profiling = false
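
The `error_rate` and `expected_elements` settings correspond to the standard Bloom filter sizing formulas: m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The sketch below applies them and uses the filter to drop exact repeats. The `BloomFilter` class and its double-hashing scheme are illustrative assumptions, not this project's implementation.

```python
import hashlib
import math


def bloom_parameters(expected_elements: int, error_rate: float) -> tuple[int, int]:
    """Return (number of bits, number of hash functions) for the target error rate."""
    num_bits = math.ceil(-expected_elements * math.log(error_rate) / (math.log(2) ** 2))
    num_hashes = max(1, round(num_bits / expected_elements * math.log(2)))
    return num_bits, num_hashes


class BloomFilter:
    def __init__(self, expected_elements: int, error_rate: float) -> None:
        self.num_bits, self.num_hashes = bloom_parameters(expected_elements, error_rate)
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, text: str) -> list[int]:
        # Double hashing: derive all positions from two 64-bit values.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, text: str) -> bool:
        """Insert text; return True if it was (probably) seen before."""
        seen = True
        for pos in self._positions(text):
            byte, mask = pos // 8, 1 << (pos % 8)
            if not self.bits[byte] & mask:
                seen = False
                self.bits[byte] |= mask
        return seen


num_bits, num_hashes = bloom_parameters(expected_elements=100_000, error_rate=1e-5)
print(num_bits, num_hashes)

bloom = BloomFilter(expected_elements=100_000, error_rate=1e-5)
docs = ["first document", "second document", "first document"]
unique_docs = [doc for doc in docs if not bloom.add(doc)]
print(unique_docs)
```

A Bloom filter never misses a true repeat, but it can occasionally flag a new document as seen; `error_rate` bounds how often that happens while the filter holds at most `expected_elements` items.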

Suffix Array Substring Exact Deduplication
[input]
input_type = "local_files"
file_type = "parquet"

[input.read_arguments]
path = "data/your_data"
split = "train"

[algorithm]
algorithm_name = "suffix_array"
text_column = "text"
google_repo_path = "third_party/deduplicate-text-datasets"
merge_strategy = "longest"
length_threshold = 100
cache_dir = ".cache"

[output]
output_dir = "output"
clean_cache = false

[debug]
enable_profiling = false
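
The suffix array config points at an external tool through `google_repo_path`, so the heavy lifting does not happen in Python. Purely to illustrate the idea behind `length_threshold`, the naive sketch below sorts all suffixes of a small string, compares neighbours in that order, and reports spans whose shared prefix is at least the threshold long. It runs in quadratic time or worse, suits toy inputs only, and its function names are hypothetical. The semantics of `merge_strategy` are not modelled here.

```python
def merge_spans(spans: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Merge overlapping or touching half-open spans."""
    merged: list[tuple[int, int]] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def duplicate_spans(text: str, length_threshold: int) -> list[tuple[int, int]]:
    """Find spans of text that occur more than once with length >= threshold."""
    # Naive suffix array: suffix start positions sorted by suffix content.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    spans = []
    for a, b in zip(suffix_array, suffix_array[1:]):
        # Longest common prefix of two neighbouring suffixes.
        lcp = 0
        while (a + lcp < len(text) and b + lcp < len(text)
               and text[a + lcp] == text[b + lcp]):
            lcp += 1
        if lcp >= length_threshold:
            spans.append((a, a + lcp))
            spans.append((b, b + lcp))
    return merge_spans(spans)


sample = "abcXYZabc"
print(duplicate_spans(sample, length_threshold=3))  # [(0, 3), (6, 9)]
```

Each reported span is a half-open `(start, end)` range into the input; a deduplication step would then decide which occurrences to keep.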

2. Run the deduplication

# MinHash
python -m text_dedup.minhash

# SimHash
python -m text_dedup.simhash

# Bloom Filter
python -m text_dedup.bloom_filter

# Suffix Array
python -m text_dedup.suffix_array

Benchmarks

pinecone/core-2020-05-10-deduplication
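| Algorithm | Precision (Duplicates) | Recall (Duplicates) | Precision (Non Duplicates) | Recall (Non Duplicates) | Macro F1 score | Accuracy | Time |
|---|---|---|---|---|---|---|---|
| MinHash | 0.9587 | 0.9416 | 0.9450 | 0.9611 | 0.9518 | 0.9277 | 11.09s |
| SimHash | 0.9038 | 0.7323 | 0.7993 | 0.9318 | 0.8515 | 0.8375 | 626.11s |
| Exact Title Matching [1] | 0.830 | 0.50 | 0.709 | 0.992 | 0.757 | 0.746 | - |
| Simhash Matching [1] | 0.697 | 0.247 | 0.598 | 0.985 | 0.631 | 0.616 | - |
| Document Vector Similarity [1] | 0.912 | 0.779 | 0.861 | 0.986 | 0.885 | 0.883 | - |
| Hybrid Method [1] | 0.908 | 0.828 | 0.899 | 0.979 | 0.904 | 0.903 | - |
| LaBSE [2] | 0.937 | 0.923 | 0.930 | 0.943 | 0.933 | 0.919 | - |
| Multilingual USE [2] | 0.917 | 0.907 | 0.918 | 0.927 | 0.917 | 0.909 | - |
| Multilingual E5-Base [2] | 0.931 | 0.908 | 0.919 | 0.939 | 0.924 | 0.920 | - |
| MinHash + LSH [2] | 0.929 | 0.902 | 0.915 | 0.938 | 0.921 | 0.918 | - |
| RETSim Partial-Dup [2] | 0.945 | 0.941 | 0.945 | 0.949 | 0.945 | 0.928 | - |
| RETSim Near-Dup [2] | 0.928 | 0.937 | 0.942 | 0.934 | 0.935 | 0.926 | - |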
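
For readers unfamiliar with the column names, the sketch below computes per-class precision and recall, macro F1, and accuracy from binary labels using the standard definitions. It is not the benchmark's own evaluation code, which may differ in details such as averaging or rounding; the function names and the toy labels are assumptions.

```python
def precision_recall(y_true: list[int], y_pred: list[int], positive: int) -> tuple[float, float]:
    """Precision and recall when `positive` is treated as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def evaluate(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """1 marks a duplicate, 0 a non-duplicate."""
    p_dup, r_dup = precision_recall(y_true, y_pred, positive=1)
    p_non, r_non = precision_recall(y_true, y_pred, positive=0)
    return {
        "precision_duplicates": p_dup,
        "recall_duplicates": r_dup,
        "precision_non_duplicates": p_non,
        "recall_non_duplicates": r_non,
        # Macro F1: unweighted mean of the two per-class F1 scores.
        "macro_f1": (f1_score(p_dup, r_dup) + f1_score(p_non, r_non)) / 2,
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
    }


# Toy labels, not taken from the benchmark data.
metrics = evaluate([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
print(metrics)
```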

NEWS-COPY

Adjusted Rand Index (ARI) on NEWS-COPY dataset:

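| Model/Algorithm | ARI | Time |
|---|---|---|
| MinHash | 0.7293 | 3.01s |
| SimHash | 0.6463 | 140.03s |
| n-gram [3] | 0.440 | - |
| SimHash [2] | 0.695 | - |
| MinHash [3] | 0.737 | - |
| MinHash [2] | 0.783 | - |
| Multilingual USE [2] | 0.730 | - |
| Multilingual E5-Base [2] | 0.742 | - |
| S-BERT [3] | 0.700 | - |
| RETSim Partial-Dup [2] | 0.831 | - |
| RETSim Near-Dup [2] | 0.704 | - |
| Re-ranking [3] | 0.937 | - |
| Bi-encoder [3] | 0.915 | - |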
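
The Adjusted Rand Index compares a predicted clustering with a reference clustering and corrects for chance agreement, so 1.0 means identical groupings and values near 0 mean agreement no better than random. The sketch below implements the standard contingency-table formula. It is not the benchmark's evaluation code, and the toy labels are made up for the example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true: list[int], labels_pred: list[int]) -> float:
    """ARI between two clusterings given as parallel lists of cluster labels."""
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0
    # Pairs that share a cluster in both clusterings (contingency cells),
    # in the reference clustering (rows), and in the predicted one (columns).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case, e.g. both clusterings put everything in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Same grouping under different label names scores 1.0.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# A grouping that splits every reference cluster scores below zero.
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```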

Running Benchmarks

You can reproduce the benchmark results using the provided benchmark suite.

Quick Start with Just

# Run all benchmarks (both datasets, all algorithms)
just benchmark-all

# Run only CORE dataset benchmarks
just benchmark-core

# Run only NEWS-COPY dataset benchmarks
just benchmark-news

# Run specific algorithm on specific dataset
just benchmark-core-minhash
just benchmark-core-simhash
just benchmark-news-minhash
just benchmark-news-simhash

Configuration Files

Benchmark configuration files are located in configs/:

  • benchmark_core_minhash.toml - MinHash on CORE dataset
  • benchmark_core_simhash.toml - SimHash on CORE dataset
  • benchmark_news_minhash.toml - MinHash on NEWS-COPY dataset
  • benchmark_news_simhash.toml - SimHash on NEWS-COPY dataset

To customize benchmark parameters, edit the config files and adjust hyperparameters like num_perm, threshold, ngram_size, or bit_diff.

License

Apache 2.0

Citations

Generally, you can cite this repository as:

@software{chenghao_mou_2023_8364980,
  author       = {Chenghao Mou and
                  Chris Ha and
                  Kenneth Enevoldsen and
                  Peiyuan Liu},
  title        = {ChenghaoMou/text-dedup: Reference Snapshot},
  month        = sep,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {2023.09.20},
  doi          = {10.5281/zenodo.8364980},
  url          = {https://doi.org/10.5281/zenodo.8364980}
}

Acknowledgements

This repository is inspired by several other projects and is heavily influenced by lessons learned from my own participation in BigScience (Apache 2.0) and BigCode (Apache 2.0). There is a blog post about the journey. Feedback is welcome!

Footnotes
