Package lunk provides a set of tools for structured logging in the style of
Google's Dapper or Twitter's Zipkin.
When we consider a complex event in a distributed system, we're actually
considering a partially-ordered tree of events from various services,
libraries, and modules.
Consider a user-initiated web request. The user's browser sends an HTTP request to
an edge server, which extracts the credentials (e.g., OAuth token) and
authenticates the request by communicating with an internal authentication
service, which returns a signed set of internal credentials (e.g., signed
user ID). The edge web server then proxies the request to a cluster of web
servers, each running a PHP application. The PHP application loads some data
from several databases, places the user in a number of treatment groups for
running A/B experiments, writes some data to a Dynamo-style distributed
database, and returns an HTML response. The edge server receives this
response and proxies it to the user's browser.
In this scenario we have a number of infrastructure-specific events:
This scenario also involves a number of events which have little to do with
the infrastructure, but are still critical information for the business the
There are a number of different teams all trying to monitor and improve
aspects of this system. Operational staff need to know if a particular host
or service is experiencing a latency spike or drop in throughput. Development
staff need to know if their application's response times have regressed as a
result of a recent deploy. Customer support staff need to know if the system
is operating nominally as a whole, and for particular customers. Product
designers and managers need to know the effect of an A/B test on user
behavior. But the fact that these teams will be consuming the data in
different ways for different purposes does not mean that they are working on
In order to instrument the various components of the system, we need a common
We adopt Dapper's notion of a tree to mean a partially-ordered tree of events
from a distributed system. A tree in Lunk is identified by its root ID, which
is the unique ID of its root event. All events in a common tree share a root
ID. In our photo example, we would assign a unique root ID as soon as the
edge server received the request.
Events inside a tree are causally ordered: each event has a unique ID, and an
optional parent ID. By passing the IDs across systems, we establish causal
ordering between events. In our photo example, the two database queries from
the app would share the same parent ID--the ID of the event corresponding to
the app handling the request which caused those queries.
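
The ID scheme described above can be sketched as follows. This is a minimal illustration, not Lunk's actual API: the ID and EventID types and the NewRootEventID and NewChildEventID functions are hypothetical names, and we assume 64-bit random IDs with zero reserved to mean "no parent".

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
)

// ID is a hypothetical 64-bit identifier for a tree or an event.
type ID uint64

// EventID is a hypothetical triple that locates one event within its tree.
type EventID struct {
	Root   ID // shared by every event in the tree
	ID     ID // unique to this event
	Parent ID // zero when the event has no parent
}

// newID returns a random non-zero ID, so that zero can mean "no parent".
func newID() ID {
	var b [8]byte
	for {
		if _, err := rand.Read(b[:]); err != nil {
			panic(err) // no usable entropy source
		}
		if id := ID(binary.BigEndian.Uint64(b[:])); id != 0 {
			return id
		}
	}
}

// NewRootEventID starts a new tree. Following the text, the tree's root ID
// is the ID of its root event.
func NewRootEventID() EventID {
	id := newID()
	return EventID{Root: id, ID: id}
}

// NewChildEventID creates an event caused by parent: it inherits the root ID
// and records the parent's event ID.
func NewChildEventID(parent EventID) EventID {
	return EventID{Root: parent.Root, ID: newID(), Parent: parent.ID}
}
```

Under these assumptions, two database queries issued while handling the same request would each come from NewChildEventID called with the request's EventID, so they share both a root ID and a parent ID but have distinct event IDs.
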
Each event has a schema of properties, which allows us to record specific
pieces of information about each event. For HTTP requests, we can record the
method, the request URI, the elapsed time to handle the request, etc.
Lunk is agnostic about aggregation technology, but two use cases seem
clear: real-time process monitoring and offline causal analysis.
For real-time process monitoring, events can be streamed to an aggregation
service like Riemann (http://riemann.io) or Storm
(http://storm.incubator.apache.org), which can calculate process statistics
(e.g., the 95th percentile latency for the edge server responses) in
real time. This allows for adaptive monitoring of all services, with the
option of including example root IDs in the alerts (e.g., 95th percentile
latency is over 300ms, mostly as a result of requests like those in tree
For offline causal analysis, events can be written in batches to batch
processing systems like Hadoop or OLAP databases like Vertica. These
aggregates can be queried to answer questions traditionally reserved for A/B
testing systems. "Did users who were shown the new navbar view more photos?"
"Did the new image optimization algorithm we enabled for 1% of views run
faster? Did it produce smaller images? Did it have any effect on user
engagement?" "Did any services have increased exception rates after any
recent deploys?" And so on.
By capturing the root ID of a particular web request, we can assemble a
partially-ordered tree of events which were involved in the handling of that
request. All events with a common root ID are in a common tree, which allows
for O(M) retrieval for a tree of M events.
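
To illustrate the linear-time claim, here is a sketch of reassembling a tree from the M events that share a root ID. The Event and Node types and the BuildTree function are hypothetical, and we assume the events have already been fetched by root ID and that a zero parent ID marks the root event. Two passes over the slice with a hash map give expected O(M) work.

```go
package main

// Event is a hypothetical flattened record as it might come back from an
// aggregation store that was queried by root ID.
type Event struct {
	Root, ID, Parent uint64 // Parent is zero for the root event
	Name             string
}

// Node is one event plus the events it directly caused.
type Node struct {
	Event    Event
	Children []*Node
}

// BuildTree links events that share a root ID into a tree. It returns nil
// if the slice contains no root event. Events whose parent is missing from
// the slice are left unattached.
func BuildTree(events []Event) *Node {
	// First pass: index every event by its own ID.
	nodes := make(map[uint64]*Node, len(events))
	for _, e := range events {
		nodes[e.ID] = &Node{Event: e}
	}
	// Second pass: attach each event to its parent.
	var root *Node
	for _, e := range events {
		n := nodes[e.ID]
		if e.Parent == 0 {
			root = n
			continue
		}
		if p, ok := nodes[e.Parent]; ok {
			p.Children = append(p.Children, n)
		}
	}
	return root
}
```

The events may arrive in any order, which matters because the tree is only partially ordered: siblings carry no ordering relative to each other beyond sharing a parent.
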
To send a request with a root ID and a parent ID, use the Event-ID HTTP
The header value is simply the root ID and event ID, hex-encoded and
separated with a slash. If the event has a parent ID, that may be included as
an optional third parameter. A server that receives a request with this
header can use this to properly parent its own events.
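
A sketch of encoding and decoding that header value follows. It is an illustration rather than Lunk's own implementation: the EventID type and the FormatEventID and ParseEventID functions are hypothetical names, and zero-padding each ID to 16 hex digits is our assumption (the text only says the IDs are hex-encoded).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// EventID is a hypothetical in-memory form of the header value.
type EventID struct {
	Root, ID, Parent uint64 // Parent is zero when absent
}

// FormatEventID renders "root/id" or, when a parent is set, "root/id/parent".
// Padding to 16 hex digits is an assumption made for this sketch.
func FormatEventID(e EventID) string {
	s := fmt.Sprintf("%016x/%016x", e.Root, e.ID)
	if e.Parent != 0 {
		s += fmt.Sprintf("/%016x", e.Parent)
	}
	return s
}

// ParseEventID is the inverse of FormatEventID. It accepts two or three
// slash-separated hex fields and rejects anything else.
func ParseEventID(s string) (EventID, error) {
	parts := strings.Split(s, "/")
	if len(parts) != 2 && len(parts) != 3 {
		return EventID{}, fmt.Errorf("event ID %q: want 2 or 3 fields, got %d", s, len(parts))
	}
	var ids [3]uint64 // ids[2] stays zero when no parent is given
	for i, p := range parts {
		v, err := strconv.ParseUint(p, 16, 64)
		if err != nil {
			return EventID{}, fmt.Errorf("event ID %q: field %d: %w", s, i, err)
		}
		ids[i] = v
	}
	return EventID{Root: ids[0], ID: ids[1], Parent: ids[2]}, nil
}
```

A receiving server would parse the header and then create its own events with the same root ID and with the received event ID as their parent ID.
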
Each event has a set of named properties, the keys and values of which are
strings. This allows aggregation layers to take advantage of simplifying
assumptions and either store events in normalized form (with event data
separate from property data) or in denormalized form (essentially
pre-materializing an outer join of the normalized relations). Durations are
always recorded as fractional milliseconds.
Lunk currently provides two formats for log entries: text and
JSON. Text-based logs encode each entry as a single line of text, using
key="value" formatting for all properties. Event property keys are scoped to
avoid collisions. JSON logs encode each entry as a single JSON object.