nlp-dedup

Remove duplicates and near-duplicates from text corpora, no matter the scale.

Version 0.1.2 on PyPI

NLPDedup

Installation

The package is available on PyPI, so you can install it with your favourite package manager, for instance pip install nlp_dedup or poetry add nlp_dedup.

Quick Start

If the corpus is stored as corpus.txt (both txt and jsonl files are supported), the following command deduplicates it and stores the deduplicated corpus in the folder deduplicated:

$ dedup corpus.txt deduplicated

By default, documents are compared on blocks of 13 consecutive words, and two documents are considered near-duplicates if they have more than 80% of these blocks in common. Both settings can be changed to suit your needs. See $ dedup --help for more information on all the settings.
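To make the default criterion concrete, here is a minimal sketch of comparing two documents by their blocks of consecutive words. It is an illustration only, not the package's implementation: the helper names (word_shingles, block_overlap, is_near_duplicate) are hypothetical, and measuring "blocks in common" as the Jaccard similarity of the two block sets is an assumption. An exact pairwise comparison like this would not scale to large corpora.

```python
def word_shingles(text: str, size: int = 13) -> set[tuple[str, ...]]:
    """Return the set of all blocks of `size` consecutive words in `text`."""
    words = text.split()
    if len(words) < size:
        # Assumption: a document shorter than one block counts as a single block.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def block_overlap(a: str, b: str, size: int = 13) -> float:
    """Share of blocks in common, measured here as Jaccard similarity (assumed)."""
    blocks_a, blocks_b = word_shingles(a, size), word_shingles(b, size)
    if not blocks_a and not blocks_b:
        return 1.0  # two empty documents are treated as identical
    return len(blocks_a & blocks_b) / len(blocks_a | blocks_b)


def is_near_duplicate(a: str, b: str, size: int = 13, threshold: float = 0.8) -> bool:
    """True if the documents share more than `threshold` of their blocks."""
    return block_overlap(a, b, size) > threshold
```

With a smaller block size for readability, is_near_duplicate("the quick brown fox jumps over the lazy dog", "the quick brown fox jumps over the lazy dog again", size=3) compares 7 shared blocks against 8 distinct blocks overall, so the pair counts as near-duplicate under the 80% threshold.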

Deduplication can also be done directly from Python:

>>> from nlp_dedup import Deduper
>>> deduper = Deduper()
>>> corpus = ["Test", "Another test", "Test"]
>>> deduper.deduplicate(corpus=corpus)

Here corpus does not have to be a list. It can be any iterable or generator of strings, which is useful when the corpus is too big to be stored in memory. Dictionaries are also supported instead of strings, in which case the text entry of each dictionary will be used (change this with the text_column argument when calling deduplicate).
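As a sketch of the streaming case, the generator below reads a JSON Lines file one document at a time, so the whole corpus never has to sit in memory. The function name iter_jsonl, the file name corpus.jsonl, and the "content" key are hypothetical choices for illustration. Only Deduper, deduplicate, corpus and text_column come from the description above.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Lazily yield one dictionary per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Assumed usage, with documents shaped like {"content": "..."}:
#
#     from nlp_dedup import Deduper
#     deduper = Deduper()
#     deduper.deduplicate(corpus=iter_jsonl("corpus.jsonl"), text_column="content")
```

Because the generator is consumed once, create a fresh one with iter_jsonl(...) if the corpus needs to be read again.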

See more in the documentation.
