bipf
Binary In-Place Format. A binary format designed for in-place (without parsing) reads,
with schemaless json-like semantics.
motivation
in-place reads
In a database there are many cases where you need to read a bunch of records,
filter out most of it (if one or two fields do not match) and then immediately write
whats left to a network socket. With json, this means parsing possibly hundreds of thousands
of json objects (which is suprisingly slow), and then reserializing whats left.
An inplace format doesn't actually require parsing as a whole at all. You only need to parse
the fields you actually read, and using length delimited fields instead of escapes, means
you do not have to look at every byte to parse a field.
length delimited collections
Unfortunately, most binary json-like formats (such as msgpack and cbor) use element counts
on collections (objects and arrays, in json-land) this means to find the end of a collection,
you have to step past each item in it (including the fields in any object contained inside of it).
However, if the collections are length delimited, meaning marked by the encoded byte length of the object,
not the number of items inside it, then it's easy to jump right to the end of the object in one go.
For this reason, databases (for example, mongodb, and couchdb) use length delimited collections.
format
Every type of field is encoded with a type tag and a length packed into a varint.
This means that short types have a one byte tag, and one byte value.
The type is stored in the lowest 3 bits, and the length the higher bits.
Since a varint stores values up to 128 bits in a single byte, values less than 16 bytes
long have a one byte tag, and values up to 8k long have a two byte tag, values up to 1048576 bytes
have a 3 byte tag, and so on.
<tag: varint(encoding_length(value) << 3 | type)><value>
the type indicates the encoding of the value.
valid types are:
STRING : 0 // utf8 encoded string
BUFFER : 1 // raw binary buffer
INT : 2 // little endian 32 bit integer
DOUBLE : 3 // little endian 64 bit float
ARRAY : 4 // array of any other value
OBJECT : 5 // list of string: value pairs
BOOLNULL: 6 // a boolean, or null.
EXTENDED: 7 // custom type. specific type should be indicated by varint at start of buffer.
All values must have a correct length field. This makes it possible
to traverse all fields without looking at the values. Theirfor it is possible
to quickly jump to any subvalue if you know it's path. If you are looking for
a particular string, you can also skip any with the wrong length!
Since object and array fields also begin with a length, you can jump past
them if you know the do not contain the value you are looking for.
This means that seeking inside a more tree like object is more efficient
than seeking into a more list like object!
performance
This design is optimized for the performance of in-place
reads. Encoding is expected to be slower because
of the need to calculate the length of collections
before encoding them. If encoding is within half as fast
as a format intended for encoding perf, that is good.
Of course, the intention with an in-place read system
is that you encode once and then never decode. Just
pass around the binary object, reading fields out when
necessary.
Because of the length encoding, the ability to update
in-place is very limited (not recommended actualy)
but if you are building a system around immutable data,
that is not much of a problem. Although, since subobjects
are fully valid as an encoded value, you can easily
copy a subobject into a new object, etc, without re-encoding.
benchmark
I did a simple benchmark, where I encoded and decoded
this module's package.json file in various ways. Please
not that I am comparing the performance of code written
in C with code written in javascript. If the javascript
is within 10x the performance of the C then we are doing
well! (and a C implementation would likely close that gap)
The measurement is run 10k operations, then divide by number of
ms taken, higher number means more faster!
benchmark code is in ./test/perf.js
operation, ops/ms
binary.encode 62.61740763932373
JSON.stringify 325.7328990228013
binary.decode 83.40283569641367
JSON.parse 242.13075060532688
JSON.parse(buffer) 198.4126984126984
JSON.stringify(JSON.parse()) 127.55102040816327
binary.seek(string) 500
binary.seek2(encoded) 1219.5121951219512
binary.seek(buffer) 1333.3333333333333
binary.seekPath(encoded) 558.659217877095
binary.seekPath(compiled) 1265.8227848101267
binary.compare() 1785.7142857142858
As expected, binary.encode
is much slower than JSON.stringify
,
but it's only 6 times worse.
But the interesting comparison is JSON.stringify(JSON.parse())
and binary.seek(buffer)
. Often, in implementing
a database, you need to read something from disk,
examine one or two fields (to check if it matches a query)
and then write it to network.
(note: the binary.seek
operation is fairly realistic,
we seek to the "dependencies" object, then look up "varint"
inside of that, then decode the version range of "varint".
So it's two comparisons and decoding a string out)
So, in JSON land, that usually means reading it,
parsing it, checking it, stringifying it again.
This involves reading each byte in the input and
allocating memory for the parsed object.
Then traversing that object in memory and writing
something to a string (more memory allocation,
and all this memory allocation
means the garbage collector needs to handle it too)
But if we have in-place reads, we just read raw binary,
seek into the appropiate places to check wether it's
the objects we want, and then write it to the network
directly. We don't allocate any new memory after
reading it.
Further benchmarks and tests are necessary, but that
it can be this fast using a javascript implementation
is impressive.
cannonicisity
For a system with signatures, it's highly important
that data is cannonical. There should be exactly
one way to encode a given data structure. There are
a few edge cases here that need to be checked for.
(not implemented yet)
- varints must not be zero padded
- chrome and firefox preserve order of object keys,
but any integer keys greater than zero come first,
and are in increasing order.
- the length of subfields must be checked to not excede their container's length. (This is a security issue)
These properties can all be checked by traversing the
tags but without reading the keys or values.
I will not consider this module ready until there
are tests that cover these invalid cases, to ensure
that implementations throw an error.
api
encode, decode, encodingLength
follow the interface
specified by abstract-encoding
encode (value, buffer, start) => length
write value
to buffer
from start.
returns the number of bytes used.
decode (buffer, start) => value
read the next value from buffer
at start
.
returns the value, and sets decode.bytes
to number
of bytes used.
encodingLength (value) => length
returns the length needed to encode value
getValueType (value) => type
returns the type tag that will be used to encode this type.
getEncodedType (buffer, start) => type
get the type
tag at start
types.{string,buffer,int,double,array,object,boolnull,reserved}
an object containing the type tags.
iterate (buffer, start, fn) => void
If the field at start
is an object or array, then iterate
will call the fn
with arguments fn(buffer, pointer, key)
for each subfield. If the field at
start
is not an array or object, this returns -1
. You can stop/abort the
iteration by making fn
return any truthy value.
seekKey (buffer, start, target) => pointer
seek for a key target
within an object. If
getEncodedType(buffer, start) !== types.object
then will return
-1
. Otherwise, seekKey will iterate over the encoding object
and return a pointer to where it starts.
Since this defines a recursive encoding, a pointer
to any valid sub-encoding is a valid start value.
var obj = {
foo: 1,
bar: true,
baz: 'hello'
}
var length = b.encodingLength(obj)
var buffer = Buffer.alloc(length)
b.encode(obj, buffer, 0)
console.log(b.decode(buffer, 0).baz)
console.log(b.decode(buffer, b.seekKey(buffer, 0, 'baz')))
see performance section for discussion on the performance
of seek - if it's only needed to parse a couple of elements,
it can be significantly faster than parsing.
seekPath (buffer, start, array_of_buffers) => pointer
The same as seekKey
, except for a recursive path.
path
should be an array of node buffers, just holding the key values,
not encoded as bipf
.
createSeekPath(path) => seekPath(buffer, start)
compiles a javascript function that does a seekPath.
this is significantly faster than iterating over a javascript
array and then looking for each thing, because it will get optimized
by the js engine's jit compiler.
License
MIT