Remove duplicates and near-duplicates from text corpora, no matter the scale.
The package is available on PyPI, so you can install it using your favourite
package manager. For instance, pip install nlp_dedup or poetry add nlp_dedup.
If the corpus is stored as corpus.txt (both txt and jsonl files are supported),
the following deduplicates the corpus and stores the deduplicated corpus in the
folder deduplicated:
$ dedup corpus.txt deduplicated
This defaults to deduplicating based on blocks of 13 consecutive words, where two
documents are considered near-duplicate if they have more than 80% of these blocks in
common. This can all be changed to your specific needs, however. See $ dedup --help
for more information on all the settings.
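To make the default criterion concrete, here is a minimal sketch of comparing two documents by their blocks of 13 consecutive words. It is only an illustration of the idea, not the package's implementation: the helper names (word_shingles, shingle_overlap, is_near_duplicate) are hypothetical, and reading "80% of blocks in common" as a Jaccard overlap above 0.8 is an assumption.

```python
from typing import Set, Tuple


def word_shingles(text: str, size: int = 13) -> Set[Tuple[str, ...]]:
    """Return the set of blocks of `size` consecutive words in `text`."""
    words = text.split()
    if not words:
        return set()
    if len(words) <= size:
        # Short documents yield a single block containing all their words.
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def shingle_overlap(a: str, b: str, size: int = 13) -> float:
    """Jaccard overlap of the two documents' word blocks (assumed measure)."""
    blocks_a = word_shingles(a, size)
    blocks_b = word_shingles(b, size)
    if not blocks_a or not blocks_b:
        return 0.0
    return len(blocks_a & blocks_b) / len(blocks_a | blocks_b)


def is_near_duplicate(a: str, b: str, size: int = 13, threshold: float = 0.8) -> bool:
    """Flag a pair as near-duplicate when the overlap exceeds the threshold."""
    return shingle_overlap(a, b, size) > threshold
```

Comparing every pair this way scales quadratically with the number of documents, so a tool meant for large corpora would typically rely on an approximate scheme instead. The sketch only shows what the block size and threshold settings mean.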
Deduplication can also be done directly from Python:
>>> from nlp_dedup import Deduper
>>> deduper = Deduper()
>>> corpus = ["Test", "Another test", "Test"]
>>> deduper.deduplicate(corpus=corpus)
Here corpus does not have to be a list, but can also be an iterable or generator of
strings, if the corpus is too big to be stored in memory. Dictionaries are also
supported instead of strings, in which case the text entry in the dictionaries will
be used (change this with the text_column argument when calling deduplicate).
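As a sketch of the generator case, the helper below lazily yields one dictionary per line of a JSONL file, so the whole corpus never has to sit in memory. The name stream_jsonl, the file name and the "body" key are hypothetical. The commented-out call follows the usage described above and has not been checked against the package.

```python
import json
from typing import Any, Dict, Iterator


def stream_jsonl(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line, reading lazily."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Assumed usage, based on the description above ("body" is a hypothetical key):
# deduper.deduplicate(corpus=stream_jsonl("corpus.jsonl"), text_column="body")
```

Because the generator is consumed once, create a fresh one for each pass over the corpus.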
See more in the documentation.
FAQs
We found that nlp-dedup demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.