Willow Data Model

Status: Final

In this document, we define the core data model of Willow.

Willow is a system for giving meaningful, hierarchical names to arbitrary sequences of bytes (called payloads), not unlike a filesystem. For example, you might give the name blogidea1 to the bytestring Dear reader, I've got a great idea.

You also give the name blogidea2 to the bytestring (watch this space).

A (one-dimensional) list containing the two paths "blog/idea/1" and "blog/idea/2", with a stylised file next to each path. Idea 1 shows a lightbulb, idea 2 shows a deeply smug expression.

A little later you overwrite the existing entry at path blogidea2 with I've made a mistake. Willow tracks the timestamp of each assignment, and the new entry overwrites the old one.

The same visualization of paths as before, but now with second dimension, a time axis. The smug-faced file disappears in the first time step, being replaced by a sweating face at a second timestep.

That night you decide it would be best if everyone forgot about the whole thing. By writing a new entry at blogidea, our previous entries are deleted. Think of it as overwriting a directory in a file system with an empty file. We call this mechanism prefix pruning.

The same visualization as before, but both paths and files got deleted by adding the third path "blog/idea" with a stylised file whistling in a totally inconspicuous way. — The entries *prefixed* by blogidea are deleted by a newer entry at that prefix.

Things would be rather chaotic if everyone wrote to the same blog. Instead, entries live in separate subspaces — intuitively, each user writes to their own, separate universe of data. Willow allows for various ways of controlling who gets to write to which subspace, from simple per-user access control to sophisticated capability systems.

Stylised files with friendly icons arranged in a now three-dimensional space. Adding to the path and time dimensions of the preceeding drawings, a depth dimension shows three different people to signify different subspaces. They look happy, one waves to the viewer, good vibes all around. — The three dimensions of Willow’s data model: paths, timestamps, and subspaces.

Willow further allows the aggregation of subspaces into completely independent namespaces. Data from a public wiki should live in a separate namespace than data from a photo-sharing application for my family. Some namespaces should allow anyone to set up subspaces within them, others might require authorisation from a trusted manager. Willow offers a flexible mechanism for using different policies on a per-namespace basis.

Three simplified three-dimensional visualisations of namespaces, each shaded in a different color. One is labeled "family", one "wiki", and one "project". — Three completely independent namespaces.

This constitutes a full overview of the data model of Willow. Applications read and write payloads from and to subspaces, addressing via hierarchical paths. Willow tracks timestamps of write operations, newer writes replace older writes in the manner of a traditional file system. These data collections live in namespaces; read and write access to both namespaces and subspaces can be controlled through a variety of policies.

Now we can almost delve into the precise definition of these concepts.

Parameters

Some questions in protocol design have no clear-cut answer. Should namespaces be identified via human-readable strings, or via the public keys of some digital signature scheme? That depends entirely on the use-case. To sidestep such questions, the Willow data model is generic over certain choices of parameters. You can instantiate Willow to use strings as the identifiers of namespaces, or you could have it use 256 bit integers, or urls, or iris scans, etc.

This makes Willow a higher-order protocol: you supply a set of specific choices for its parameters, and in return you get a concrete protocol that you can then use. If different systems instantiate Willow with non-equal parameters, the results will not be interoperable, even though both systems use Willow.

We give precise semantics to these parameters in the spec proper, the list need not fully make sense on the first read-through.An instantiation of Willow must define concrete choices of the following parameters:

A type NamespaceId for identifying namespaces.
A type SubspaceId for identifying subspaces.
A natural number max_component_length for limiting the length of path components.
A natural number max_component_count for limiting the number of path components.
A natural number max_path_length for limiting the overall size of paths.
A totally ordered type PayloadDigest for content-addressing the data that Willow stores.
Since this function provides the only way in which Willow tracks payloads, you probably want to use a secure hash function.A function hash_payload that maps bytestrings (of length at most $2^{64} - 1)$ into PayloadDigest.
A type AuthorisationToken for proving write permission.
Meadowcap is our bespoke capability system for handling authorisation. But any system works, as long as it defines a type of AuthorisationTokens and an is_authorised_write function.A function is_authorised_write that maps an Entry (defined later) and an AuthorisationToken to a Bool, indicating whether the AuthorisationToken does prove write permission for the Entry.

Concepts

Willow can store arbitrary bytestrings of at most $2^{64} - 1$ bytes. We call such a bytestring a Payload.

A Path is a sequence of at most max_component_count many bytestrings, each of at most max_component_length bytes, and whose total number of bytes is at most max_path_length. The bytestrings that make up a Path are called its Components.

A Timestamp is a 64-bit unsigned integer, that is, a natural number between zero (inclusive) and 2^64 (exclusive). Timestamps are to be interpreted as a time in microseconds since the Unix epoch.

The metadata associated with each Payload is called an Entry:


The metadata for storing a Payload.
struct Entry
The identifier of the namespace to which the Entry belongs.
namespace_id: NamespaceId
The identifier of the subspace to which the Entry belongs.
subspace_id: SubspaceId
The Path to which the Entry was written.
path: Path
Wall-clock timestamps may come as a surprise. We are cognisant of their limitations, and use them anyway. To learn why, please see Timestamps, really?The claimed creation time of the Entry.
timestamp: Timestamp
The length of the Payload in bytes.
payload_length: U64
The result of applying hash_payload to the Payload.
payload_digest: PayloadDigest

A PossiblyAuthorisedEntry is a pair of an Entry and an AuthorisationToken. An AuthorisedEntry is a PossiblyAuthorisedEntry for which is_authorised_write returns true.

a is a prefix of a and of ab, but not of ab.A Path s is a prefix of a Path t if the first Components of t are exactly the Components of s.

We can now formally define which Entries overwrite each other and which can coexist. An Entry e1 is newer than another Entry e2 if

e2.timestamp < e1.timestamp, or
e2.timestamp == e1.timestamp and e2.payload_digest < We require PayloadDigest to be totally ordered because of this comparison.e1.payload_digest, or
e2.timestamp == e1.timestamp and e2.payload_digest == e1.payload_digest and e2.payload_length < e1.payload_length.

A store is a set of AuthorisedEntries such that

all its Entries have the same namespace_id, and
there are no two of its Entries old and new such that
- old.subspace_id == new.subspace_id, and
- new.path is a prefixThis is where we formally define prefix pruning. of old.path, and
- new is newer than old.

When two peers connect and wish to update each other, they compute the joins of all their stores with equal NamespaceIds. Doing so efficiently can be quite challenging, we recommend our Willow General Purpose Sync protocol.Formally, adding a new Entry to a store consists of computing the join of the original store and a singleton store containing only the new Entry.The join of two stores that store Entries of the same namespace_id is the store obtained as follows:

Starts with the union of the two stores.
Then, remove all Entries with a path p whose timestamp is strictly less than the timestamp of any other Entry of the same subspace_id whose path is a prefix of p.
Then, for each subset of Entries with equal subspace_ids, equal paths, and equal timestamps, remove all but those with the greatest payload_digest.
Then, for each subset of Entries with equal subspace_ids, equal paths, equal timestamps, and equal payload_digests, remove all but those with the greatest payload_length.

A namespace is the join over all11No matter in which order and groupings the stores are joined the result is always the same. Stores form a join semi-lattice (also known as a state-based CRDT) under the join operation. stores with Entries of a given NamespaceId. Note that this concept only makes sense as an abstract notion, since no participant in a distributed system can ever be certain that it has (up-to-date) information about all existing stores. A subspace is the set of all Entries of a given SubspaceId in a given namespace.

Willow Data Model

Parameters

Concepts

Further Reading