NLPDedup

Remove duplicates and near-duplicates from text corpora, no matter the scale.


Installation

The package is available on PyPI, so you can install it with your favourite package manager, for instance pip install nlp_dedup or poetry add nlp_dedup.

Quick Start

If the corpus is stored as corpus.txt (both txt and jsonl files are supported), the following deduplicates the corpus and stores the deduplicated corpus in the folder deduplicated:

$ dedup corpus.txt deduplicated

By default, deduplication is based on blocks of 13 consecutive words, and two documents are considered near-duplicates if they have more than 80% of these blocks in common. All of this can be changed to suit your needs; run $ dedup --help for the full list of settings.
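To make the default criterion concrete, here is a minimal sketch of the idea in plain Python. It is an illustration only, not the package's implementation: the function names shingles, overlap and is_near_duplicate are hypothetical, and measuring "blocks in common" as the Jaccard similarity of exact block sets is an assumption. Comparing documents pairwise like this would also not scale to large corpora.

```python
from __future__ import annotations


def shingles(text: str, size: int = 13) -> set[tuple[str, ...]]:
    """Return the set of blocks of `size` consecutive words in `text`."""
    words = text.split()
    if len(words) < size:
        # Assumption: a short document yields one block holding all its words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def overlap(a: str, b: str, size: int = 13) -> float:
    """Jaccard similarity between the word-block sets of two documents."""
    blocks_a, blocks_b = shingles(a, size), shingles(b, size)
    if not blocks_a and not blocks_b:
        return 1.0  # two empty documents are treated as identical
    return len(blocks_a & blocks_b) / len(blocks_a | blocks_b)


def is_near_duplicate(a: str, b: str, size: int = 13, threshold: float = 0.8) -> bool:
    """True if the two documents share more than `threshold` of their blocks."""
    return overlap(a, b, size) > threshold
```

With these hypothetical helpers, is_near_duplicate("Test", "Test") is true, while "Another test" and "Test" share no blocks and are kept as distinct documents.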

Deduplication can also be done directly from Python:

>>> from nlp_dedup import Deduper
>>> deduper = Deduper()
>>> corpus = ["Test", "Another test", "Test"]
>>> deduper.deduplicate(corpus=corpus)

Here corpus does not have to be a list: it can be any iterable of strings, including a generator, which helps when the corpus is too big to fit in memory. Dictionaries are also supported in place of strings, in which case the text entry of each dictionary is used (change this with the text_column argument when calling deduplicate).
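For a corpus that does not fit in memory, the iterable can be a generator that reads the file lazily. The sketch below assumes a JSON Lines file with one JSON object per line; the helper name iter_jsonl is hypothetical and not part of the package.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def iter_jsonl(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one dictionary per non-empty line of a JSON Lines file.

    Hypothetical helper: lines are read one at a time, so the whole
    corpus is never held in memory at once.
    """
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Such a generator could then be passed as the corpus argument of deduplicate, together with text_column if the text is stored under a key other than text.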

See more in the documentation.
