Three-Dimensional Data Storage

Author’s note: this is one of the more esoteric documents on this site, and presumes a cosy familiarity with the Willow specifications for any of it to be useful.

Most implementations of Willow will centre around a (persistent) database of Entries. In this document, we give some context on building such a data store. We cannot possibly give a complete account, think of this document more as a tour guide to an extensive, partially unexplored rabbit hole.

Requirements

The difficult part in storing any kind of data is to enable efficient access and manipulation. The patterns of data access that a Willow implementation needs to support are primarily determined by two factors: how do applications access the data, and how do several data stores sync their contents?

The bare minimum functionality that a Willow database has to provide to applications is the creation of new Entries, and retrieval of Payloads by Entry. The three-dimensionality of Willow suggests a natural way for applications to access data in bulk: by querying for all Entries in an Area11Queries for 3dRanges are not particularly meaningful in the face of end-to-end encrypted data; we recommend to always use Areas in human-facing components such as programming APIs..

As for the requirements of syncing, we shall use the WGPS as a (somewhat demanding) baseline. The WGPS needs to compute Fingerprints for arbitrary 3dRanges, to split arbitrary 3dRanges into multiple, roughly equally-sized subranges, and to constrain Areas to a number of newest Entries (to work with AreaOfInterests).

This gives us a fairly compact feature set, revolving around spatially constrained access to a three-dimensional arrangement of Entries.

Multidimensional Data Structures

Multidimensional search data structures have been studied for decades, whether in database research, computer graphics, geoinformatics, or several other fields of computer science. The bad news is that — unlike one-dimensional search data structures such as self-balancing search trees — no one-size-fits-all data structures with acceptable worst-case complexity profiles are known. The good news, however, is that this has not stopped decades worth of engineering power to develop systems that easily satisfy all practical requirements.

Unfortunately, Willow requires some features not found in off-the-shelf solutions. While there exist several solutions for managing and querying multidimensional data indexed by fixed-width integers or floating point numbers, lexicographically sorted strings of strings (i.e., Paths) are a different beast. Fingerprint aggregation is another feature you will not find in SQLite.

Hence, we now give pointers to some standard implementation techniques for multi-dimensional data structures. For more in-depth introductory coverage, we recommend part four of the Handbook of Data Structures and Applications22Mehta, Dinesh P., and Sartaj Sahni. Handbook of data structures and applications. Chapman and Hall/CRC, 2004..

All of these data structures are tree-based, so they can readily be adapted to store the Fingerprints of the items of each subtree in every tree node, hence allowing for efficient Fingerprint computation. Similarly, storing the total number of items in each node allows for efficiently working with AreaOfInterests, and can guide the splitting of 3dRanges for set reconciliation. Adopting the corresponding one-dimensional algorithms to the multidimensional tree structures is far from trivial, but ultimately doable. At the end of the day, a Willow implementation will thus rise and fall with the quality of its three-dimensional range queries.

Engineering a three-dimensional data store is no easy feat, but the existence of numerous production-quality systems for working with multidimensional data shows that it is feasible. The whole design of Willow rests on this assumption, as well as on the assumption that three-dimensional data organisation suffices for building useful applications. While knowing full well that most contemporary peer-to-peer projects settle for more simple and less expressive data models, we believe that Willow hits a sweet(er) spot between expressivity and implementability.

Just the Beginning...

Three-dimensional storage lies at the heart of a performant Willow database, but any implementation effort must ask several further questions. How do applications interface with the data store? Can they subscribe to updates in real time? Or even request replays of changes after reconnecting to the database after a longer time span? Who controls which payloads should be requested, persisted, deleted? Can applications issue atomic transactions? Should there be indexes for efficient access beyond the three-dimensional data model, based on payload contents?

Each of these questions is an exciting project on its own, and we are committed to exploring them further.