
Remove duplicates and near-duplicates from text corpora, no matter the scale.
Developers:
The package is available on PyPI, so you can install it with your favourite
package manager, for instance pip install nlp_dedup or poetry add nlp_dedup.
If the corpus is stored as corpus.txt (both txt and jsonl files are supported),
the following deduplicates the corpus and stores the deduplicated corpus in the
folder deduplicated:
$ dedup corpus.txt deduplicated
This defaults to deduplicating based on blocks of 13 consecutive words, where two
documents are considered near-duplicate if they have more than 80% of these blocks in
common. This can all be changed to your specific needs, however. See $ dedup --help
for more information on all the settings.
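To make the default criterion concrete, here is a minimal sketch of comparing two documents by their blocks of consecutive words. It is an illustration only, not the package's implementation: the function names (word_blocks, block_overlap, is_near_duplicate) are hypothetical, and it assumes that "blocks in common" is measured as the Jaccard similarity of the two block sets, which the text above does not state.

```python
from typing import Set, Tuple


def word_blocks(text: str, block_size: int = 13) -> Set[Tuple[str, ...]]:
    """Return the set of all blocks of `block_size` consecutive words."""
    words = text.split()
    if len(words) < block_size:
        # Assumption: a document shorter than one block counts as a single block.
        return {tuple(words)} if words else set()
    return {
        tuple(words[i : i + block_size])
        for i in range(len(words) - block_size + 1)
    }


def block_overlap(a: str, b: str, block_size: int = 13) -> float:
    """Share of blocks the two documents have in common (Jaccard similarity)."""
    blocks_a = word_blocks(a, block_size)
    blocks_b = word_blocks(b, block_size)
    if not blocks_a and not blocks_b:
        return 1.0  # two empty documents are treated as identical
    return len(blocks_a & blocks_b) / len(blocks_a | blocks_b)


def is_near_duplicate(
    a: str, b: str, block_size: int = 13, threshold: float = 0.8
) -> bool:
    """True if the documents share more than `threshold` of their blocks."""
    return block_overlap(a, b, block_size) > threshold
```

Comparing every pair of documents this way scales quadratically with corpus size, so a tool aimed at large corpora would need a faster approximate scheme; the sketch only illustrates what the two default numbers mean.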
Deduplication can also be done directly from Python:
>>> from nlp_dedup import Deduper
>>> deduper = Deduper()
>>> corpus = ["Test", "Another test", "Test"]
>>> deduper.deduplicate(corpus=corpus)
Here corpus does not have to be a list, but can also be an iterable or generator of
strings, if the corpus is too big to be stored in memory. Dictionaries are also
supported instead of strings, in which case the text entry in the dictionaries will
be used (change this with the text_column argument when calling deduplicate).
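As a sketch of what such a lazily evaluated corpus might look like, the two generators below stream documents from disk one line at a time. The helper names (iter_text_documents, iter_jsonl_documents) and the one-document-per-line layout are assumptions made for illustration; they are not part of the package.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_text_documents(path: Path) -> Iterator[str]:
    """Yield one document per non-empty line, without loading the whole file."""
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield line


def iter_jsonl_documents(path: Path) -> Iterator[dict]:
    """Yield one dictionary per non-empty line of a JSON Lines file."""
    with path.open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

Either generator could then be passed as the corpus argument, for example deduper.deduplicate(corpus=iter_jsonl_documents(Path("corpus.jsonl")), text_column="body"), where body stands in for whichever key holds the text in your own records.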
See more in the documentation.
FAQs
We found that nlp-dedup demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.