Winnow
A tiny Ruby library for document fingerprinting.
What is document fingerprinting?
Document fingerprinting converts a document (e.g. a book, a piece of code, or
any other string) into a much smaller number of hashes called fingerprints. If
two documents share any fingerprints, then this means there is an exact
substring match between the two documents.
Document fingerprinting has many applications, but the most obvious one is for
plagiarism detection. By taking fingerprints of two documents, you can detect if
parts of one document were copied from another.
This library implements a fingerprinting algorithm called winnowing, described
by Saul Schleimer, Daniel S. Wilkerson, and Alex Aiken in a paper titled
Winnowing: Local Algorithms for Document Fingerprinting.
Usage
The Fingerprinter
class takes care of fingerprinting documents. To create a
fingerprint, you need to provide two parameters, called the noise threshold
and the guarantee threshold. When comparing two documents' fingerprints, no
match shorter than the noise threshold will be detected, but any match at least
as long as the guarantee threshold is guaranteed to be found.
The proper values for your noise and guarantee thresholds varies by context.
Experiment with the data you're looking at until you're happy with the results.
Creating a fingerprinter is easy:
fingerprinter = Winnow::Fingerprinter.new(noise_threshold: 10, guarantee_threshold: 18)
Then, use #fingerprints
get the fingerprints. Optionally, pass :source
(default is nil
) so that Winnow can later report which document a match is
from.
document = File.new('hamlet.txt')
fingerprints = fingerprinter.fingerprints(document.read, source: document)
#fingerprints
just returns a plain-old Ruby Hash
. The keys of the hash are
generated from substrings of the document being fingerprinted. Finding shared
substrings between documents is as simple as seeing if they share any of the
keys in their #fingerprints
hash.
To keep things easier for you, Winnow comes with a Matcher
class that will
find matches between two documents.
Here's an example that puts everything together:
require 'winnow'
str_a = <<-EOF
'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
This is copied.
All mimsy were the borogoves,
And the mome raths outgrabe.
EOF
str_b = <<-EOF
"Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious -- This is copied. -- Bandersnatch!"
EOF
fprinter = Winnow::Fingerprinter.new(
guarantee_threshold: 13, noise_threshold: 9)
f1 = fprinter.fingerprints(str_a, source: "Stanza 1")
f2 = fprinter.fingerprints(str_b, source: "Stanza 2")
matches = Winnow::Matcher.find_matches(f1, f2)
match = matches.first
match_a = match.matches_from_a.first
match_b = match.matches_from_b.first
p match_a.index, match_b.index
match_context_a = str_a[match_a.index - 10 .. match_a.index + 20]
match_context_b = str_b[match_b.index - 10 .. match_b.index + 20]
puts "Match from #{match_a.source}: #{match_context_a.inspect}"
puts "Match from #{match_b.source}: #{match_context_b.inspect}"
You may find that Matcher
doesn't handle your exact use-case. That's not a
problem. The built-in matcher.rb file
is only about 10 lines of code, so you could easily make your own.
:boom: :bomb: A major caveat with String#hash
:bomb: :boom:
In order to avoid algorithmic complexity attacks, the value returned
from Ruby's String#hash
method changes every time you restart the
interpreter:
$ irb
2.0.0p353 :001 > "hello".hash
=> 482951767139383391
2.0.0p353 :002 > exit
$ irb
2.0.0p353 :001 > "hello".hash
=> 3216751850140847920
2.0.0p353 :002 > exit
(This is the case even if you're using JRuby.)
This means that although the winnowing algorithm should allow you to
precalculate a document's fingerprints and store them somewhere, doing so in
Ruby will not work unless you're careful to make sure you never restart your
Ruby runtime.
A workaround
Winnow looks for the presence of a String#consistent_hash
method. If it finds
one, it'll call that rather than call String#hash
. You can therefore describe
your own hash function if you want to precalculate fingerprint data.
I've put together a super-simple (but effective) gem called
consistent_hash that implements exactly this. It's about a
dozen lines of MRI C code and it'll probably work for you as well.