Content Deduplicate
This module is deprecated.
You can still use it, but node-simhash provides a better approach and more value.
See https://medium.com/@jonathankoren/near-duplicate-detection-b6694e807f7a and https://moz.com/devblog/near-duplicate-detection.
This module provides a distance function and various helpers to calculate the distance between strings. It is designed to prevent "duplicate content", i.e. to avoid having two texts that are too close to each other.
It can be used for SEO or for any other purpose.
It is tailored for medium-sized strings, roughly 30 to 300 words.
Provided functions are:
- a specific distance function - it is the main added value of the module
- a helper function that finds out which texts are too close to a given text
- a clustering function, which groups close texts together
All functions are language dependent. Supported languages are French, English, Italian and German.
Distance calculation
How it works
The main function calculates the distance between two strings. It is tailored for medium-sized strings, roughly 30 to 300 words. It is less "strict" than the Levenshtein distance, as the idea is to see how similar two strings look.
The algorithm is the following:
- stemming of both strings (which also removes punctuation and numbers)
- remove stop words
- each longest common substring counts for 1 (except the very first common one, so that dist(a,a) = 0)
- each remaining element counts for 1
For example, the distance from AAA BBB CCC KKK PPP OOO to ZZZ AAA BBB CCC PPP OOO is 3:
- AAA BBB CCC is common => +0 (as it is the first match)
- left: KKK PPP OOO vs ZZZ PPP OOO
- PPP OOO is common => +1
- left: KKK vs ZZZ => +2
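The steps above can be sketched as follows. This is a minimal sketch, not the module's actual code: it assumes both strings have already been stemmed, stripped of stop words and split into tokens, and the names longestCommonRun and sketchDistance are hypothetical. It also assumes that, once a common run is removed, the remaining tokens on each side are simply joined together, as the example suggests.

```javascript
// Hypothetical helper: finds the longest run of consecutive tokens
// that appears in both arrays (classic dynamic programming).
function longestCommonRun(a, b) {
  let best = { length: 0, startA: 0, startB: 0 };
  // prev[j] = length of the common run ending at a[i - 2] and b[j - 1]
  let prev = new Array(b.length + 1).fill(0);
  for (let i = 1; i <= a.length; i++) {
    const curr = new Array(b.length + 1).fill(0);
    for (let j = 1; j <= b.length; j++) {
      if (a[i - 1] === b[j - 1]) {
        curr[j] = prev[j - 1] + 1;
        if (curr[j] > best.length) {
          best = { length: curr[j], startA: i - curr[j], startB: j - curr[j] };
        }
      }
    }
    prev = curr;
  }
  return best;
}

// Hypothetical sketch of the raw distance on pre-tokenized input.
function sketchDistance(tokensA, tokensB) {
  const a = [...tokensA];
  const b = [...tokensB];
  let matches = 0;
  for (;;) {
    const run = longestCommonRun(a, b);
    if (run.length === 0) break;
    // Remove the common run from both sides; the rest is joined (assumption).
    a.splice(run.startA, run.length);
    b.splice(run.startB, run.length);
    matches += 1;
  }
  // The first common run is free; every other run and every leftover token costs 1.
  return Math.max(matches - 1, 0) + a.length + b.length;
}
```

On the example above, sketchDistance('AAA BBB CCC KKK PPP OOO'.split(' '), 'ZZZ AAA BBB CCC PPP OOO'.split(' ')) returns 3.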
Mathematically it is almost a real distance:
- d(a,b) = d(b,a) is true
- d(a,b) = 0 <=> a = b is not really true, as 2 strings can have a 0 distance even when they are different: for instance, "he has 5" and "it has 6" have a 0 distance
- d(a,c) <= d(a,b) + d(b,c) is true
Raw distance
Use getDistanceRaw to get that distance.
Relative distance
It is sometimes more useful to get a relative distance: a percentage of how different two strings are. It uses the same algorithm but divides the result by the sum of the number of words of both strings:
- 1 => both strings are 100% different
- 0 => both strings are the same
- 0.3 => both strings are 30% different
Use getDistancePourcentage to get that distance.
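The normalization can be illustrated with a small sketch. The function name sketchRelativeDistance is hypothetical, and the sketch assumes that the word counts are taken after stemming and stop word removal; the module may count words differently.

```javascript
// Hypothetical sketch: turns a raw distance into a relative one
// by dividing it by the total number of words of both strings.
function sketchRelativeDistance(rawDistance, wordCountA, wordCountB) {
  const total = wordCountA + wordCountB;
  // Two empty strings: treated as identical (assumption of this sketch).
  if (total === 0) return 0;
  return rawDistance / total;
}
```

For instance, a raw distance of 2 between two strings of 4 words each gives 2 / (4 + 4) = 0.25, i.e. the strings are 25% different.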
Example
Example:
const contentDeduplicate = require('./dist/index.js');
console.log(
contentDeduplicate.getDistanceRaw('I eat huge quantities of vegetables', 'he eats huge quantities of meat', 'en_US'),
);
console.log(
contentDeduplicate.getDistancePourcentage(
'I eat huge quantities of vegetables',
'he eats huge quantities of meat',
null,
'en_US',
),
);
In getDistancePourcentage, the 3rd parameter is a threshold. If the distance becomes greater than this threshold while it is being calculated, the calculation stops and 1 (100% difference) is returned. This improves speed: usually we care about close strings, not about the exact distance between distant ones.
console.log(
contentDeduplicate.getDistancePourcentage(
'I eat huge quantities of vegetables and I love wine, beer and pineapples',
'he eats huge quantities of meat and I love wine, coca-cola, and pineapples',
0.1,
'en_US',
),
);
will output 1: the actual distance is not 1, but it is greater than the 0.1 threshold, so the calculation stops early.
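The effect of the threshold on the returned value can be sketched as below. The name sketchWithThreshold is hypothetical, and the sketch only mimics the result: the module stops during the calculation, whereas this version is applied to an already computed relative distance.

```javascript
// Hypothetical sketch: returns 1 as soon as the relative distance
// exceeds the threshold, and the exact distance otherwise.
function sketchWithThreshold(relativeDistance, threshold) {
  const hasThreshold = threshold !== null && threshold !== undefined;
  if (hasThreshold && relativeDistance > threshold) {
    return 1; // reported as 100% different, whatever the exact value
  }
  return relativeDistance;
}
```

With a null threshold, as in the first example, the exact relative distance is always returned.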
Distance report
Often you have a list of strings and want to check how close they are to each other.
getDistanceReport will calculate all distances and produce a JSON report containing, for each text, the closest ones, but also the most distant one.
Computation time can become quite long: about 1 minute for a few hundred strings.
Parameters are the following:
- an array of textual objects; each object must have a text property containing its string; feel free to add other properties, typically an ID
- the maximal acceptable distance: if the distance between two strings is higher than this threshold, the pair will not be added to the list of the closest ones; use 0.2, for instance, to only trigger when texts are 20% different or less
- the maximum number of closest strings to be given in the output (only the closest ones will be given)
- the language of the strings
The output is an array of objects:
- for: reference to the textual object
- closestOnes: an array with the closest elements; each object points to an element (with property) and gives the distance (difference property)
- mostDifferent: the most distant text (with and difference properties)
Example:
const contentDeduplicate = require('./dist/index.js');
const toCompare = [
{
id: 1,
text: 'I eat huge quantities of vegetables',
},
{
id: 2,
text: 'he eats huge quantities of meat',
},
{
id: 3,
text: 'she is vegan',
},
];
console.log(JSON.stringify(contentDeduplicate.getDistanceReport(toCompare, 0.3, 5, 'en_US'), null, 1));
will output:
[
{
"for": {
"id": 1,
"text": "I eat huge quantities of vegetables"
},
"closestOnes": [
{
"difference": 0.25,
"with": {
"id": 2,
"text": "he eats huge quantities of meat"
}
}
],
"mostDifferent": {
"with": {
"id": 3,
"text": "she is vegan"
},
"difference": 1
}
},
{
"for": {
"id": 2,
"text": "he eats huge quantities of meat"
},
"closestOnes": [
{
"difference": 0.25,
"with": {
"id": 1,
"text": "I eat huge quantities of vegetables"
}
}
],
"mostDifferent": {
"with": {
"id": 3,
"text": "she is vegan"
},
"difference": 1
}
},
{
"for": {
"id": 3,
"text": "she is vegan"
},
"closestOnes": [],
"mostDifferent": {
"with": {
"id": 2,
"text": "he eats huge quantities of meat"
},
"difference": 1
}
}
]
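To make the shape of this report more concrete, here is a minimal sketch of how such a report could be assembled from any distance function. It is not the module's implementation: sketchReport and toyDistance are hypothetical names, toyDistance is a simple Jaccard distance on words used only for illustration, and the way ties are resolved for mostDifferent is an assumption.

```javascript
// Toy distance used only in this sketch (NOT the module's algorithm):
// Jaccard distance between the sets of whitespace-separated words.
function toyDistance(a, b) {
  const setA = new Set(a.split(/\s+/));
  const setB = new Set(b.split(/\s+/));
  let shared = 0;
  for (const word of setA) {
    if (setB.has(word)) shared += 1;
  }
  const union = setA.size + setB.size - shared;
  return union === 0 ? 0 : 1 - shared / union;
}

// Hypothetical sketch: builds one report entry per item.
function sketchReport(items, maxDistance, maxClosest, distance) {
  return items.map((item) => {
    // Distance from this item to every other item.
    const others = items
      .filter((other) => other !== item)
      .map((other) => ({ difference: distance(item.text, other.text), with: other }));
    // Keep only pairs within the threshold, closest first, capped at maxClosest.
    const closestOnes = others
      .filter((entry) => entry.difference <= maxDistance)
      .sort((x, y) => x.difference - y.difference)
      .slice(0, maxClosest);
    // The first entry with the largest difference wins (tie handling is an assumption).
    let mostDifferent = null;
    for (const entry of others) {
      if (mostDifferent === null || entry.difference > mostDifferent.difference) {
        mostDifferent = entry;
      }
    }
    return { for: item, closestOnes, mostDifferent };
  });
}
```

Every pair is compared, so the cost grows with the square of the number of texts, which is consistent with the long computation times mentioned above.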
Clustering
Use getClusters to cluster your texts, thanks to the k-medoids lib.
Input:
- an array of Text objects; each element must have a text property, and you can also add an ID or something similar to tell the texts apart
- the number of clusters (it is not discovered automatically by this method)
- a language
Output: array of clusters.
Example:
const contentDeduplicate = require('./dist/index.js');
const toCompare = [
{
id: 1,
text: 'I eat huge quantities of vegetables',
},
{
id: 2,
text: 'he eats huge quantities of meat',
},
{
id: 3,
text: 'she is vegan',
},
];
console.log(JSON.stringify(contentDeduplicate.getClusters(toCompare, 2, 'en_US'), null, 1));
will output 2 clusters:
[
[
{
"id": 3,
"text": "she is vegan"
}
],
[
{
"id": 1,
"text": "I eat huge quantities of vegetables"
},
{
"id": 2,
"text": "he eats huge quantities of meat"
}
]
]
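For readers unfamiliar with k-medoids, the idea can be sketched in a few lines. This is a generic, simplified version written for illustration, not the code of the k-medoids lib used by the module: the name sketchKMedoids, the choice of the first k items as initial medoids and the iteration limit are all assumptions.

```javascript
// Hypothetical sketch of k-medoids: each cluster is represented by one
// of its own items (the medoid), and items join the nearest medoid.
function sketchKMedoids(items, k, distance, maxIterations = 20) {
  // Start with the first k items as medoids (assumption of this sketch).
  let medoids = items.slice(0, k).map((_, index) => index);
  let assignment = [];
  for (let iteration = 0; iteration < maxIterations; iteration++) {
    // Assignment step: attach every item to its nearest medoid.
    assignment = items.map((item) => {
      let best = 0;
      for (let c = 1; c < medoids.length; c++) {
        if (distance(item, items[medoids[c]]) < distance(item, items[medoids[best]])) {
          best = c;
        }
      }
      return best;
    });
    // Update step: in each cluster, pick the member with the lowest
    // total distance to the other members as the new medoid.
    const next = medoids.map((current, c) => {
      const members = items.map((_, index) => index).filter((index) => assignment[index] === c);
      let bestIndex = current;
      let bestCost = Infinity;
      for (const candidate of members) {
        const cost = members.reduce((sum, m) => sum + distance(items[candidate], items[m]), 0);
        if (cost < bestCost) {
          bestCost = cost;
          bestIndex = candidate;
        }
      }
      return bestIndex;
    });
    // Stop when the medoids no longer change.
    if (next.every((m, c) => m === medoids[c])) break;
    medoids = next;
  }
  return medoids.map((_, c) => items.filter((_, index) => assignment[index] === c));
}
```

The sketch works with any items and any distance function; for example, sketchKMedoids([1, 2, 10, 11], 2, (a, b) => Math.abs(a - b)) groups the numbers into [1, 2] and [10, 11].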
Performance and cache
When using getDistanceReport and getClusters, 2 caches are used to avoid:
- preparing (stemming, stop words, etc.) the same string multiple times
- recalculating already computed distances
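The two caches can be pictured with the following sketch. It is only an illustration of the idea, not the module's code: withCaches, prepare and distanceOnPrepared are hypothetical names, and using a single cache entry for both (a, b) and (b, a) relies on the symmetry d(a,b) = d(b,a) stated earlier.

```javascript
// Hypothetical sketch: wraps a preparation step and a distance function
// with two caches, one per prepared string and one per pair of strings.
function withCaches(prepare, distanceOnPrepared) {
  const preparedCache = new Map();
  const distanceCache = new Map();

  // Cache 1: each string is prepared (stemmed, cleaned, etc.) only once.
  const getPrepared = (text) => {
    if (!preparedCache.has(text)) {
      preparedCache.set(text, prepare(text));
    }
    return preparedCache.get(text);
  };

  return (a, b) => {
    // Cache 2: the pair is ordered so that (a, b) and (b, a) share one entry.
    const key = a < b ? JSON.stringify([a, b]) : JSON.stringify([b, a]);
    if (!distanceCache.has(key)) {
      distanceCache.set(key, distanceOnPrepared(getPrepared(a), getPrepared(b)));
    }
    return distanceCache.get(key);
  };
}
```

Both caches live as long as the returned function, so they grow with the number of distinct strings and pairs that have been compared.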