@fluidframework/merge-tree
MergeTree is not a complete DDS by itself, but provides a reusable data structure for DDSes that must maintain a
sequence of collaboratively edited items. MergeTree is used in both SharedSequence and SharedMatrix.
See GitHub for more details on the Fluid Framework and packages within.
Operations
The three basic operations provided by MergeTree are:
insert(start, segment)
remove(start, end)
annotate(start, end, propertySet)
Implementation
MergeTrees represent a sequence as an ordered list of segments. Each segment contains one or more consecutive values in
the sequence. For example, a SharedString contains segments of characters:
["The cat"], [" sat on the mat."]
Traversing all segments in order produces the current sequence as understood by the local client.
"The cat sat on the mat."
(Note that how the items contained in the MergeTree are grouped into segments is a MergeTree implementation detail and
changes over time.)
Local Operations
To process operations like insertion and removal, the MergeTree maps positions in the sequence to the containing segment
and offset of the position within the segment. While the MergeTree implementation uses a B+Tree to accelerate this
mapping, to understand the semantics of the MergeTree it is easier to consider a naive implementation that searches
for the containing (segment, offset) by walking all segments in order. This naïve search subtracts the length of each
segment from the desired position until it reaches the segment that contains the remaining offset.
position 10 -> { segment: [" sat on the mat."], offset: 2 }
Initially considering only local edit operations, insertion and deletion work by inserting new segments or tombstoning
removed segments. Tombstoned segments retain their position in the sequence, but have a length of zero when traversing
the tree.
When an insertion/deletion occurs at a position contained within an existing segment the original segment is "split".
In the case of insertion, the newly inserted segment is inserted between the two halves of the original. In the case of
removal, the removed part of the subdivided segment is tombstoned.
insert(12, "quietly") -> ["The cat"], [" sat "], ["quietly "], ["on the mat."]
remove(19, 30) -> ["The cat"], [" sat "], ["quietly"], [del: " "], [del: "on the mat"], ["."]
Remote Operations
To support merging edit operations from remote clients, we need to extend our original search function
(position) -> (segment, offset)
to account for the state of a remote client's MergeTree at the time the
remote client performed the operation on its MergeTree.
Conceptually, this is done by adjusting our naive linear search for the (segment, offset) in the following way:
- Segments inserted "after" the remote client's operation are skipped (i.e., have length 0)
- Segments tombstoned "after" the remote client's operation, but were inserted "prior" are included
(i.e., have their original length prior to tombstoning.)
...where "after" means the remote client's MergeTree had not yet applied the operation that inserted and/or
tombstoned the segment.
For clients to be able to reason about which segment insertions/removals other clients have processed the
MergeTree we do two things:
- The MergeTree tracks which client inserted/removed each segment and the seq# assigned by the Fluid service to the
insertion/removal operation.
- When sending a MergeTree op, the client includes the last seq# it has processed from the Fluid service. This number
is known as an op's "reference sequence number" or "refSeq#"
The 'client' and 'refSeq' become new arguments to our search function:
(client, refSeq, position) -> (segment, offset)
A segment was inserted and/or removed on the remote client at the time client sent the operation if either:
- The referenced sequence number is greater than or equal the server-assigned sequence number of the operation
that inserted/removed the segment.
- The client sent the operation that resulted in insertion/removal. (In which case, the client hadn't yet recieved
their sequenced op from the server but was aware of the insertion/removal because the client produced it locally.)
If both above conditions are false, then the insertion/removal happened "after" the remote operation, and
consequently should be ignored during the search.
Note that any locally applied operations that are still pending sequencing by the Fluid service are unknown to
remote clients and should be ignored when processing remote ops.