content-deduplicate
Advanced tools
Calculates distances between texts to prevent duplicate content. DEPRECATED.
This module is deprecated.
You can still use it, but node-simhash provides a better approach and more value.
See https://medium.com/@jonathankoren/near-duplicate-detection-b6694e807f7a and https://moz.com/devblog/near-duplicate-detection.
This module provides a distance function and various helpers to calculate the distance between strings. It is designed to prevent "duplicate content": avoiding having 2 texts which are too close.
It can be used for SEO or any other purpose.
It is tailored for middle-sized strings, roughly 30 to 300 words.
Provided functions are:
getDistanceRaw
getDistancePourcentage
getDistanceReport
getClusters
All functions are language dependent. Supported languages are French, English, Italian and German.
The main function calculates the distance between two strings. It is tailored for middle-sized strings (30 to 300 words). It is less "strict" than Levenshtein distance, as the idea is to see how similar two strings look.
The algorithm is the following:
dist(a,a) = 0
For example, "AAA BBB CCC KKK PPP OOO" to "ZZZ AAA BBB CCC PPP OOO" = 3:
"AAA BBB CCC" is common => +0 (as it is the first match)
"KKK PPP OOO" vs "ZZZ PPP OOO": "PPP OOO" is common => +1
"KKK" vs "ZZZ" => +2
Mathematically it is almost a real distance:
d(a,b) = d(b,a) is true
d(a,b) = 0 <=> a = b is not really true, as 2 strings can have a 0 distance even when they are different: for instance "he has 5" and "it has 6" have a 0 distance
d(a,c) <= d(a,b) + d(b,c) is true
Use getDistanceRaw to get that distance.
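The worked example above suggests one way to read the algorithm: repeatedly remove the longest run of words common to both strings, charging 0 for the first run and 1 for each later run, then charge 1 for every word left over. The sketch below is a reconstruction from that example only, not the module's actual code. sketchDistance and longestCommonRun are hypothetical names, and the language-dependent processing is skipped entirely, so results on real sentences will differ from getDistanceRaw.

```javascript
// Hypothetical reconstruction from the README's worked example; not the module's code.

// Finds the longest contiguous run of words present in both arrays.
function longestCommonRun(a, b) {
  let best = { len: 0, ai: 0, bi: 0 };
  let prev = new Array(b.length + 1).fill(0);
  for (let i = 1; i <= a.length; i++) {
    const cur = new Array(b.length + 1).fill(0);
    for (let j = 1; j <= b.length; j++) {
      if (a[i - 1] === b[j - 1]) {
        cur[j] = prev[j - 1] + 1;
        if (cur[j] > best.len) {
          best = { len: cur[j], ai: i - cur[j], bi: j - cur[j] };
        }
      }
    }
    prev = cur;
  }
  return best;
}

function sketchDistance(textA, textB) {
  let a = textA.split(/\s+/).filter(Boolean);
  let b = textB.split(/\s+/).filter(Boolean);
  let distance = 0;
  let matches = 0;
  for (;;) {
    const run = longestCommonRun(a, b);
    if (run.len === 0) break;
    // The first common run is free; every later one costs 1.
    if (matches > 0) distance += 1;
    matches += 1;
    // Remove the matched run from both word lists and continue on the rest.
    a = a.slice(0, run.ai).concat(a.slice(run.ai + run.len));
    b = b.slice(0, run.bi).concat(b.slice(run.bi + run.len));
  }
  // Every word without a counterpart costs 1.
  return distance + a.length + b.length;
}

console.log(sketchDistance('AAA BBB CCC KKK PPP OOO', 'ZZZ AAA BBB CCC PPP OOO')); // 3
```

Identical strings give 0 in this sketch, which matches dist(a,a) = 0.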
It is sometimes more useful to get a relative distance: a percentage of how different two strings are. It uses the same algorithm but divides the result by the sum of the number of words of both strings.
Use getDistancePourcentage to get that distance.
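As a small illustration of that division, here is a sketch with a hypothetical helper, relativeDistance. The word counts of 4 are an assumption, chosen so that a raw distance of 2 gives 0.25 as in the module's own example; how the module actually counts words (for instance, whether common words are skipped) is not stated here.

```javascript
// Hypothetical helper: turns a raw distance into a relative one.
function relativeDistance(rawDistance, wordCountA, wordCountB) {
  const totalWords = wordCountA + wordCountB;
  // Assumption: two empty strings are treated as identical to avoid dividing by zero.
  if (totalWords === 0) return 0;
  return rawDistance / totalWords;
}

console.log(relativeDistance(2, 4, 4)); // 0.25
```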
Example:
const contentDeduplicate = require('./dist/index.js');

// should be 2
console.log(
  contentDeduplicate.getDistanceRaw('I eat huge quantities of vegetables', 'he eats huge quantities of meat', 'en_US'),
);

// should be 0.25
console.log(
  contentDeduplicate.getDistancePourcentage(
    'I eat huge quantities of vegetables',
    'he eats huge quantities of meat',
    null,
    'en_US',
  ),
);
In getDistancePourcentage, the 3rd parameter is a threshold. If, while being calculated, the distance becomes greater than this threshold, the calculation stops and 1 (100% difference) is returned. This improves speed: usually we care about close strings, not about the exact distance between distant strings.
console.log(
  contentDeduplicate.getDistancePourcentage(
    'I eat huge quantities of vegetables and I love wine, beer and pineapples',
    'he eats huge quantities of meat and I love wine, coca-cola, and pineapples',
    0.1,
    'en_US',
  ),
);
This will output 1: the actual distance is not 1, but it is greater than the 0.1 threshold.
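The visible effect of the threshold can be mimicked with a few lines. This is only a sketch of the returned value: applyThreshold is a hypothetical helper, and it does not reproduce the early stop that makes the real function faster. It assumes, as in the earlier example where null was passed, that a null threshold disables the check.

```javascript
// Hypothetical helper: reproduces the value returned when a threshold is set.
function applyThreshold(relativeDistance, threshold) {
  // Assumption: null or undefined means "no threshold".
  if (threshold !== null && threshold !== undefined && relativeDistance > threshold) {
    return 1; // reported as 100% difference
  }
  return relativeDistance;
}

console.log(applyThreshold(0.25, null)); // 0.25
console.log(applyThreshold(0.25, 0.1)); // 1
console.log(applyThreshold(0.05, 0.1)); // 0.05
```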
Often you have a list of strings and want to check how close they are to each other.
getDistanceReport will calculate all distances and produce a JSON report containing, for each text, the closest ones and also the most distant one.
Computation time can become quite long: about 1 minute for a few hundred strings.
Parameters are the following:
Each element of the input must have a text property containing its string; feel free to add other properties, typically an ID.
The output is an array of objects:
for: reference to the text object
closestOnes: an array with the closest elements; each object points to an element (with property) and gives the distance (difference property)
mostDifferent: the most distant text (with and difference properties)
Example:
const contentDeduplicate = require('./dist/index.js');

const toCompare = [
  {
    id: 1,
    text: 'I eat huge quantities of vegetables',
  },
  {
    id: 2,
    text: 'he eats huge quantities of meat',
  },
  {
    id: 3,
    text: 'she is vegan',
  },
];

console.log(JSON.stringify(contentDeduplicate.getDistanceReport(toCompare, 0.3, 5, 'en_US'), null, 1));
will output:
[
 {
  "for": {
   "id": 1,
   "text": "I eat huge quantities of vegetables"
  },
  "closestOnes": [
   {
    "difference": 0.25,
    "with": {
     "id": 2,
     "text": "he eats huge quantities of meat"
    }
   }
  ],
  "mostDifferent": {
   "with": {
    "id": 3,
    "text": "she is vegan"
   },
   "difference": 1
  }
 },
 {
  "for": {
   "id": 2,
   "text": "he eats huge quantities of meat"
  },
  "closestOnes": [
   {
    "difference": 0.25,
    "with": {
     "id": 1,
     "text": "I eat huge quantities of vegetables"
    }
   }
  ],
  "mostDifferent": {
   "with": {
    "id": 3,
    "text": "she is vegan"
   },
   "difference": 1
  }
 },
 {
  "for": {
   "id": 3,
   "text": "she is vegan"
  },
  "closestOnes": [],
  "mostDifferent": {
   "with": {
    "id": 2,
    "text": "he eats huge quantities of meat"
   },
   "difference": 1
  }
 }
]
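To make the shape of this report easier to follow, here is a rough sketch of how such a report could be assembled from any pairwise distance function. It is not the module's implementation: buildReport and toyDistance are hypothetical, and reading the second and third arguments as "maximum difference for a close match" and "maximum number of close matches" is an assumption based on the example call and its output.

```javascript
// Hypothetical sketch; distanceFn(textA, textB) is supplied by the caller.
function buildReport(items, distanceFn, threshold, maxClosest) {
  return items.map((item) => {
    // Distance from this item to every other item.
    const others = items
      .filter((other) => other !== item)
      .map((other) => ({ difference: distanceFn(item.text, other.text), with: other }));

    // Assumption: only items within the threshold count as "close".
    const closestOnes = others
      .filter((entry) => entry.difference <= threshold)
      .sort((x, y) => x.difference - y.difference)
      .slice(0, maxClosest);

    // The entry with the largest difference (first one wins on ties).
    let mostDifferent = null;
    for (const entry of others) {
      if (mostDifferent === null || entry.difference > mostDifferent.difference) {
        mostDifferent = entry;
      }
    }
    return { for: item, closestOnes, mostDifferent };
  });
}

// Toy distance for the demo only: texts sharing their first word are "close".
const toyDistance = (a, b) => (a.split(' ')[0] === b.split(' ')[0] ? 0.25 : 1);

const toyReport = buildReport(
  [
    { id: 1, text: 'huge vegetables' },
    { id: 2, text: 'huge meat' },
    { id: 3, text: 'vegan' },
  ],
  toyDistance,
  0.3,
  5,
);
console.log(toyReport[0].closestOnes[0].with.id); // 2
console.log(toyReport[2].closestOnes.length); // 0
```

Ties for mostDifferent are broken arbitrarily in this sketch, so the entry it picks may differ from the one in the module's output above.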
Use getClusters to cluster your texts, using a k-medoids library.
Input:
Text objects; each element must have a text property, and you can also add an ID or similar to identify the texts
Output: array of clusters.
Example:
const contentDeduplicate = require('./dist/index.js');

const toCompare = [
  {
    id: 1,
    text: 'I eat huge quantities of vegetables',
  },
  {
    id: 2,
    text: 'he eats huge quantities of meat',
  },
  {
    id: 3,
    text: 'she is vegan',
  },
];

console.log(JSON.stringify(contentDeduplicate.getClusters(toCompare, 2, 'en_US'), null, 1));
will output 2 clusters:
[
 [
  {
   "id": 3,
   "text": "she is vegan"
  }
 ],
 [
  {
   "id": 1,
   "text": "I eat huge quantities of vegetables"
  },
  {
   "id": 2,
   "text": "he eats huge quantities of meat"
  }
 ]
]
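For readers unfamiliar with k-medoids, the idea can be sketched in a few dozen lines: pick k items as cluster centres (medoids), assign every item to its nearest medoid, move each medoid to the member with the lowest total distance to the others, and repeat until nothing changes. The code below is a minimal illustration under stated assumptions, not the library the module relies on: sketchClusters and firstWordDistance are hypothetical, the farthest-first initialisation is an arbitrary choice, and the order of the returned clusters may differ from getClusters.

```javascript
// Hypothetical sketch of k-medoids clustering; not the module's implementation.
// distanceFn(textA, textB) is supplied by the caller and must return a number >= 0.
function sketchClusters(items, k, distanceFn) {
  const n = items.length;
  if (k < 1 || k > n) throw new RangeError('k must be between 1 and items.length');

  // Pairwise distance matrix, computed once.
  const d = items.map((a) => items.map((b) => (a === b ? 0 : distanceFn(a.text, b.text))));

  // Farthest-first initialisation (an assumption): start with the first item,
  // then keep adding the item farthest from the medoids chosen so far.
  const medoids = [0];
  while (medoids.length < k) {
    let bestIdx = -1;
    let bestGap = -1;
    for (let i = 0; i < n; i++) {
      if (medoids.includes(i)) continue;
      const gap = Math.min(...medoids.map((m) => d[i][m]));
      if (gap > bestGap) {
        bestGap = gap;
        bestIdx = i;
      }
    }
    medoids.push(bestIdx);
  }

  let groups = [];
  for (let round = 0; round < 100; round++) {
    // Assignment step: each item joins its nearest medoid.
    groups = medoids.map(() => []);
    for (let i = 0; i < n; i++) {
      let best = 0;
      for (let c = 1; c < k; c++) {
        if (d[i][medoids[c]] < d[i][medoids[best]]) best = c;
      }
      groups[best].push(i);
    }
    // Update step: move each medoid to the member with the lowest total distance.
    let changed = false;
    groups.forEach((members, c) => {
      const cost = (candidate) => members.reduce((sum, m) => sum + d[candidate][m], 0);
      let bestCost = cost(medoids[c]);
      for (const m of members) {
        if (cost(m) < bestCost) {
          bestCost = cost(m);
          medoids[c] = m;
          changed = true;
        }
      }
    });
    if (!changed) break;
  }
  return groups.map((members) => members.map((i) => items[i]));
}

// Toy distance for the demo only: texts sharing their first word are "close".
const firstWordDistance = (a, b) => (a.split(' ')[0] === b.split(' ')[0] ? 0.25 : 1);

const toyClusters = sketchClusters(
  [
    { id: 1, text: 'huge vegetables' },
    { id: 2, text: 'huge meat' },
    { id: 3, text: 'vegan' },
  ],
  2,
  firstWordDistance,
);
console.log(toyClusters.map((cluster) => cluster.map((t) => t.id))); // [ [ 1, 2 ], [ 3 ] ]
```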
When using getDistanceReport and getClusters, 2 caches are used to avoid:
FAQs
We found that content-deduplicate demonstrated an unhealthy version release cadence and project activity because the last version was released a year ago. It has 1 open source maintainer collaborating on the project.