# NLPDedup

Remove duplicates and near-duplicates from text corpora, no matter the scale.
## Developers
## Installation

The package is available on PyPI, so you can install it with your favourite package
manager, for instance `pip install nlp_dedup` or `poetry add nlp_dedup`.
## Quick Start

If the corpus is stored as `corpus.txt` (both `txt` and `jsonl` files are supported),
the following deduplicates the corpus and stores the deduplicated corpus in the
folder `deduplicated`:

```shell
$ dedup corpus.txt deduplicated
```
By default, documents are split into blocks of 13 consecutive words, and two documents
are considered near-duplicates if they share more than 80% of these blocks. All of
this can be adjusted to your needs; see `dedup --help` for more information on the
available settings.
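To make the default criterion concrete, here is a minimal sketch of the idea of
comparing word blocks. NLPDedup itself is built for scale and likely uses approximate
(MinHash-style) matching under the hood; the helper names below (`shingles`,
`is_near_duplicate`) are purely illustrative and not part of the library's API.

```python
def shingles(text: str, n: int = 13) -> set[str]:
    """Split a document into its blocks of n consecutive words."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def is_near_duplicate(a: str, b: str, n: int = 13, threshold: float = 0.8) -> bool:
    """Two documents are near-duplicates if they share more than
    `threshold` of their word blocks."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return False
    overlap = len(sa & sb) / min(len(sa), len(sb))
    return overlap > threshold
```

With the defaults (`n=13`, `threshold=0.8`) this mirrors the behaviour described
above: documents that differ only slightly still share most 13-word blocks and are
flagged as near-duplicates.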
Deduplication can also be done directly from Python:

```python
>>> from nlp_dedup import Deduper
>>> deduper = Deduper()
>>> corpus = ["Test", "Another test", "Test"]
>>> deduper.deduplicate(corpus=corpus)
```
Here `corpus` does not have to be a list; it can also be an iterable or generator of
strings if the corpus is too big to be stored in memory. Dictionaries are also
supported instead of strings, in which case the `text` entry of each dictionary is
used (change this with the `text_column` argument when calling `deduplicate`).
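For a corpus too large for memory, the generator-of-dictionaries pattern might look
like the sketch below. The `stream_docs` helper is hypothetical (not part of the
library), and a `StringIO` stands in for a real open `jsonl` file handle; the actual
`Deduper` call is shown as a comment since it mirrors the snippet above.

```python
import io
import json

# Simulate a large .jsonl corpus; in practice this would be an open file handle.
raw = io.StringIO(
    '{"text": "Test"}\n'
    '{"text": "Another test"}\n'
    '{"text": "Test"}\n'
)

def stream_docs(handle):
    """Lazily yield one dictionary per JSONL line, so the whole corpus
    never has to fit in memory at once."""
    for line in handle:
        yield json.loads(line)

docs = stream_docs(raw)

# The generator can be passed straight to the deduper; the "text" entry is
# read by default, or pointed elsewhere via text_column:
# from nlp_dedup import Deduper
# Deduper().deduplicate(corpus=docs, text_column="text")
```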
See more in the documentation.