hyperloglog-redis
This gem is a pure Ruby implementation of the HyperLogLog algorithm for estimating
cardinalities of sets observed via a stream of events. A Redis
instance is used for storing the counters. A minimal example:
require 'redis'
require 'hyperloglog-redis'
counter = HyperLogLog::Counter.new(Redis.new)
['john', 'paul', 'george', 'ringo', 'john', 'paul'].each do |beatle|
counter.add('beatles', beatle)
end
puts "There are approximately #{counter.count('beatles')} distinct Beatles"
Each HyperLogLog counter uses a small, fixed amount of space but can
estimate the cardinality of any set of up to around a billion values with
relative error of 1.04 / Math.sqrt(2 ** b) with high probability, where b is a
parameter passed to the HyperLogLog::Counter
initializer that defaults to 10.
With b = 10, each counter is represented by a 1 KB string in Redis and we get
an expected relative error of 3%. Contrast this with the amount of space needed
to compute set cardinality exactly, which is over 100 MB for a even a bit vector
representing a set with a billion values.
The basic idea of HyperLogLog (and its predecessors PCSA, LogLog, and others) is
to apply a good hash function to each value observed in the stream and record the longest
run of zeros seen as a prefix of any hashed value. If the hash
function is good, the bits in any hashed value should be close to statistically independent,
so seeing a value that starts with exactly X zeros should happen with probability close to
2 ** -(X + 1). So, if you've seen a run of 5 zeros in one of your hash values,
you're likely to have around 2 ** 6 = 64 values in the underlying set. The actual
implementation and analysis are much more advanced than this, but that's the idea.
This gem implements a few useful extensions to the basic HyperLogLog algorithm
which allow you to estimate unions and intersections of counters as well as
counts within specific time ranges. These extensions are described in detail below.
The HyperLogLog algorithm is described and analyzed in the paper
"HyperLogLog: the analysis of a near-optimal cardinality estimation
algorithm"
by Flajolet, Fusy, Gandouet, and Meunier. Our implementation closely
follows the program described in Section 4 of that paper.
Unions and intersections
You can also ask for an estimate of the union from multiple counters:
['joe', 'denny', 'linda', 'jimmy', 'paul'].each do |wings_member|
counter.add('wings', wings_member)
end
puts "There are approximately #{counter.union(['beatles', 'wings'])} people who were in the Beatles or Wings"
The same relative error guarantee above applies to unions: a union of
size N can be estimated to within +/- N * (1.04 / Math.sqrt(2 ** b)) elements,
regardless of how many HyperLogLog counters that union spans. You can store
a unioned counter for querying or combining later with union_store
:
counter.union_store('all_beatles_and_wings_members', ['beatles', 'wings'])
puts "There are approximately #{counter.count('all_beatles_and_wings_members'}} people who were in the Beatles or Wings"
Intersections can also be estimated:
puts "There are approximately #{counter.intersection(['beatles', 'wings'])} people who were in both the Beatles and Wings"
However, intersections of HyperLogLog counters are calculated indirectly via the
inclusion/exclusion principle
as a sum of unions and there aren't good theoretical bounds on the error of that sum. In
practice, the estimates that come out of small intersections tend to follow the
same relative error patterns, but beware using this type of estimation on intersections
of large numbers of sets, both because the errors can be much larger than those guaranteed
for unions and the complexity of computing intersections grows exponentially with
the number of sets in the intersection.
Set cardinality within a time interval
All examples up until now use HyperLogLog::Counter
, which stores HyperLogLog
counters as (2 ** b)-byte Redis strings. hyperloglog-redis also contains the counter implementation
HyperLogLog::TimeSeriesCounter
, which uses a little more space (Redis strings of up to
4 * (32 - b) * (2 ** b) bytes) but allows you to estimate the cardinality of sets during
certain time windows.
Using HyperLogLog::TimeSeriesCounter
, you can get estimates of the number of distinct
elements added to a set in the past X seconds, for any value of X. A HyperLogLog::TimeSeriesCounter
is initialized with the same arguments as a regular Counter
but implements a
superset of HyperLogLog::Counter
's interface. Namely, each of the methods add
,
count
, union
, intersection
, and union_store
take an optional final time argument,
either a Ruby Time
or an integer representing seconds since the epoch.
When passed a time argument t, add
registers an addition to the set at time t. When no
time is passed, the current system time is used. The methods count
, union
,
intersection
, and union_store
all estimate set cardinality for the time interval
consisting of all events that happened after time t when t is passed as a final argument.
For example, to get the number of distinct user logins within the
past week, we might call:
one_week = 60 * 60 * 24 * 7
logins_in_past_week = counter.count('user_logins', Time.now - one_week)
A note about relative errors
With a parameter b
in the range [4..16], HyperLogLog counters provide a relative
error of 1.04/sqrt(2 ** b) with high probability. When unions, intersections, and
time range queries are involved, it's sometimes not clear what the relative error
is relative to, so here is some clarification:
-
For a union of counters, the relative error applies to the size of the union. Taking
the union of counters is lossless in the sense that you end up with the same counter
you would have arrived at had you observed the union of all of the individual events.
-
For an intersection of counters, there's no good theoretical bound on the relative
error. In practice, and especially for intersections involving a small number of sets,
the relative error you obtain tends to be in relation to the size of the union of the
sets involved. For example, if you have two sets, each of cardinality 5000 and observe
both sets through HyperLogLog counters with parameter b=10 (3% relative error), you can
expect the intersection estimate to be within 10000 * 0.03 = 300 of the actual intersection
size.
-
For time queries, the relative error applies to the size of the set within the time
range you've queried. For example, given a set of cardinality 1,000,000 that has had
100 distinct additions within the last 10 minutes, if you observe such a set with a
HyperLogLog counter with parameter b=10 (3% relative error), you can expect the count
returned from a query about the last 10 minutes to be within 3 of 100.
Comparison to other approaches
When trying to optimize for space, two well-known alternatives to HyperLogLog exist:
- Bit vectors: you provide some near-perfect hash function between keys in your domain
and an interval of integers, then represent that interval of integers with bits.
- Bloom filters with counters: use a Bloom filter
to keep track of items seen; on insert, when the Bloom filter tells you that the item
seen is not in the set, increment the counter.
Both bit vectors and bloom filters can be augmented to hold timestamps for entries in the
data structures and simulate counters for time-ranges like HyperLogLog::TimeSeriesCounter
.
Bit vectors give exact counts, but the space complexity is linear with the size of
the set, and you must either allocate a large bit vector upfront or cope with the complexity
of dynamically resizing your bit vector as the set grows. Providing a manual mapping from
members of your set to an interval of integers is sometimes a non-trivial task. Counts,
unions, and intersections are all linear-time operations in the size of the universe of
the set being represented.
Bloom filters can be much more compact than bit vectors, but the actual count associated
with a Bloom filter is an artifact of the construction of the data structure, so the cost
of estimating a union or intersection is linear in the size of the Bloom filter. Getting
high probability guarantees on the quality of the estimate of Bloom filter counts requires
several "good" hash functions that have some degree of independence from each other; in
practice, coming up with several independent implementations of good hash functions is
difficult. Bloom filters require that all of their space be allocated upfront (re-hashing
isn't possible without replaying all events), so in practice you need some estimate of
how large the counters are going to be before allocating the counter.
HyperLogLog counters take up less space than either of the above approaches and provide
constant-time implementations (in the size of the sets being represented) of unions,
intersections, and time range queries. A HyperLogLog::Counter
with parameter b will
be stored in a Redis string of length at most 2 ** b bytes, whereas a HyperLogLog::TimeSeriesCounter
with parameter
b will be stored in a Redis string of length at most 4 * (32 - b) * (2 ** b) bytes. For counters representing smaller sets,
the size taken up by a HyperLogLog::TimeSeriesCounter
can be significantly less. Here
are some examples for specific values of b:
- With b = 7, a
HyperLogLog::Counter
uses at most 128 bytes and a HyperLogLog::TimeSeriesCounter
uses at most 13 KB while providing a relative error of 9%. - With b = 11, a
HyperLogLog::Counter
uses at most 2 KB and a HyperLogLog::TimeSeriesCounter
uses at most 168 KB while providing a relative error of 2% - With b = 16, a
HyperLogLog::Counter
uses at most 64 KB and a HyperLogLog::TimeSeriesCounter
uses at most 4 MB while providing a relative error of less than half a percent.
Installation
gem install hyperloglog-redis