RESEARCH
Luis Correa de León @luyzdeleon Version 1.0.1
Pocket Network aims to provide decentralized access to Web3 data via a blockchain network composed of multiple actors. This document provides a decision-making framework to support the specification of a Persistence Module that satisfies the storage, retrieval, update, and query needs of every actor in the Pocket Network, based on an analysis of the network's persistence needs.
By persistence we mean any set of data that is necessary for a Pocket Network actor's continuous participation in the network. The guiding philosophy behind this effort is to replicate existing production-grade architectures and technologies that have scaled persistent data services to billions of users throughout the world.
The persistence module of Pocket Network 1.0 touches on three key areas: System Deployment, Middleware Persistence Logic, and Blockchain State Validation Strategy.
The status quo in blockchain systems architecture is to use file-system databases, with the database engine running in-process. This keeps the system self-contained, but is otherwise not the most scalable architecture. To allow a separation of "middleware" and "database engine", the Persistence Module should allow a "Client-Server" architecture. The advantage of the Client-Server architecture is that it separates the database engine logic from the system and gives system administrators the flexibility to deploy both to the same machine, to separate machines, or to any configuration in between.
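To make the distinction concrete, below is a minimal Go sketch of the Client-Server approach. It assumes a PostgreSQL database engine reachable over the network and the github.com/lib/pq driver; neither choice is prescribed by this document. The point is only that the database engine runs as a separate process that the middleware connects to.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // illustrative driver choice; any client-server engine works
)

func main() {
	// The database engine lives behind a network address (same machine or a
	// separate one) instead of being linked into the middleware process.
	dsn := "host=db.internal port=5432 user=pocket dbname=persistence sslmode=disable"

	db, err := sql.Open("postgres", dsn)
	if err != nil {
		log.Fatalf("open connection pool: %v", err)
	}
	defer db.Close()

	// Ping verifies the middleware can reach the independently deployed engine.
	if err := db.Ping(); err != nil {
		log.Fatalf("database engine unreachable: %v", err)
	}
	log.Println("connected to external database engine")
}
```

Because the connection is just a network address, the same middleware binary can point at a co-located engine, a remote one, or a managed cluster without code changes.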
This also conforms to the "cattle, not pets" approach to infrastructure: by treating infrastructure as cattle rather than pets, you gain multiple properties that are desirable for production-grade infrastructure, which is what we are trying to achieve with Pocket Network's node operations. Find below a comparative list of desired properties between both approaches:
Property | In-Process | Client-Server |
---|---|---|
Portability: Is the persistent data portable? | TRUE | TRUE |
Individual Scalability: Can the middleware be independently scaled from the database engine? | FALSE | TRUE |
Fault Tolerance: Can failures be isolated between the middleware and database engine? | FALSE | TRUE |
Multi-Process Concurrency: Can multiple processes access the database engine concurrently? | FALSE | TRUE |
When we talk about middleware, we refer to the business logic run by the Pocket Network software. We want to make sure the middleware persistence logic is optimized for Pocket Network's specific needs, not for the status quo in other blockchain networks. By doing this, we believe we can achieve the best possible result, trading off less important attributes for those that rank higher in the Pocket Network "pyramid of persistence needs".
To understand Pocket Network's persistence needs, we first need to understand the main **datasets** that will be persisted, which are described below:
**Block store dataset:** Contains all the blocks, transactions, and quorum certificates of the Pocket Network blockchain.
**Mempool dataset:** Contains a list of all transactions submitted to the Pocket Network.
**State dataset:** Contains the Pocket Network-specific state (nodes, apps, params, accounts, etc.). This dataset has the particularity that each copy of it is versioned at each **height** of the **Block store dataset**. The proof of each version of this dataset is what we will call from now on the state hash, a hashed representation that is included at each height of the Block store dataset.
**Utility dataset:** Contains all the utility-specific data needed for the different actors of the network to perform their functions. This dataset is local and only affects the individual node.
1. The **State dataset** cannot be "completely hashed" at each round, as the size of the state is expected to grow with each height.
2. The **State dataset** must leave a trail of "verifications", meaning every verified version of the database must persist.
3. All **datasets** must be exportable to a standardized format, and importable from that same format.
4. The schema of each **dataset** must be versioned and migratable between versions.
5. All datasets must be verifiable for data integrity and be tolerant of, and self-healing from, corruption.
6. All writes to persistence must be deterministic and achieve "byte-perfect consistency".
7. Writes and updates to the database must be idempotent: applying the same operation multiple times must yield the same dataset state (see the sketch after the table below).

These constraints translate into the following desired properties, each listed with the resources it impacts and the constraint(s) it answers:

Property | Description | Impacted Resource | Answer to Constraint |
---|---|---|---|
State dataset versioning | The way in which we verify every version of the State dataset. | Compute, Memory, Storage | 1, 2 |
"Byte-perfect consistency" data encoding | The encoding mechanism of the persistent data. | Compute and Memory | 6 |
Schema definition mechanism | The mechanism used to define the persistent data schema. | Storage | 4 |
Deterministic Write Mechanism | A mechanism that allows rolling back faulty writes that might compromise the data integrity of any given dataset. | Compute and Storage | 5, 6 |
Idempotent Dataset Updates | The same update operation to a dataset, applied multiple times, must yield the same dataset state. | Storage | 7 |
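To illustrate the deterministic and idempotent write constraints (5 through 7 above), here is a hedged Go sketch that applies an account balance write as an upsert inside a transaction: replaying the same write yields the same dataset state, and a failed write is rolled back rather than leaving partial data behind. The table name, columns, and PostgreSQL-style ON CONFLICT syntax are illustrative assumptions, not part of the specification.

```go
package persistence

import (
	"context"
	"database/sql"
)

// SetBalance writes the balance for an address at a given height. Applying the
// same (address, height, balance) write any number of times yields the same
// dataset state (idempotent), and a failed write is rolled back instead of
// leaving partial data behind (deterministic).
func SetBalance(ctx context.Context, db *sql.DB, addr string, height int64, balance uint64) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op if the transaction was already committed

	// Hypothetical schema: accounts(address, height, balance) keyed on (address, height).
	_, err = tx.ExecContext(ctx,
		`INSERT INTO accounts (address, height, balance)
		 VALUES ($1, $2, $3)
		 ON CONFLICT (address, height) DO UPDATE SET balance = EXCLUDED.balance`,
		addr, height, balance)
	if err != nil {
		return err
	}
	return tx.Commit()
}
```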
In the previous section we referred to a desired property called "State dataset versioning". To satisfy this property we require a data structure that allows us to represent a particular state of any given dataset and compare it against copies of that same dataset; this is one of the core pillars of blockchain technology. Section 7 of the Bitcoin whitepaper references a data structure called a Merkle Tree, which allows nodes in the Bitcoin network to include only the hash of the root node of the Merkle Tree in the block header; the state can then be recalculated by any third party and compared against that root.
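As a minimal illustration of this principle, the Go sketch below hashes a set of leaves into a Merkle root with SHA-256; only this root needs to be included in a block header, and any party holding the leaves can recompute and compare it. This is a toy binary tree (no trie structure, no inclusion proofs), intended only to show the root-hash idea, not the structure ultimately chosen for Pocket Network.

```go
package merkle

import "crypto/sha256"

// Root computes a Merkle root over the given leaves with SHA-256.
// Odd nodes are promoted unchanged to the next level; production structures
// (e.g. Patricia tries) differ, but the root-hash principle is the same.
func Root(leaves [][]byte) [32]byte {
	if len(leaves) == 0 {
		return sha256.Sum256(nil)
	}
	level := make([][32]byte, len(leaves))
	for i, leaf := range leaves {
		level[i] = sha256.Sum256(leaf)
	}
	for len(level) > 1 {
		var next [][32]byte
		for i := 0; i < len(level); i += 2 {
			if i+1 == len(level) {
				next = append(next, level[i]) // promote odd node
				continue
			}
			pair := make([]byte, 0, 64)
			pair = append(pair, level[i][:]...)
			pair = append(pair, level[i+1][:]...)
			next = append(next, sha256.Sum256(pair))
		}
		level = next
	}
	return level[0]
}
```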
This section compares different implementations and variations of the Merkle Tree on equal terms, to find the one that best satisfies the properties desirable for the Pocket Network implementation. The properties considered, with their relative priorities, are:

Property | Description | Priority |
---|---|---|
Active Open-Source Golang Implementation | Actively maintained Golang implementation of the data-structure. | 1 |
Insertion and update performance | The theoretical and practical throughput capabilities for the insert and update operations. | 2 |
Query performance | The theoretical and practical throughput capabilities for the lookup operations. | 2 |
Integrity Verification Performance | The theoretical and practical throughput capabilities of the data integrity check operation. | 3 |
Data deduplication | The rate at which data is de-duplicated across multiple versions of the same dataset. | 3 |
**A note on datasets:** For Merkle Tree validation to be accepted, all leaf nodes in the tree must be homogeneous in structure, so for this research we consider every homogeneous collection of data to be a tree of its own.
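Following that note, here is a hedged Go sketch of how per-dataset tree roots might be folded into a single state hash at each height. The dataset names and the simple concatenate-and-hash combination are assumptions for illustration only, not the specified mechanism.

```go
package state

import (
	"crypto/sha256"
	"sort"
)

// StateHash folds the Merkle roots of each homogeneous dataset tree
// (e.g. "accounts", "apps", "nodes", "params") into one hash for the height.
// Sorting by dataset name keeps the combination deterministic.
func StateHash(roots map[string][32]byte) [32]byte {
	names := make([]string, 0, len(roots))
	for name := range roots {
		names = append(names, name)
	}
	sort.Strings(names)

	h := sha256.New()
	for _, name := range names {
		h.Write([]byte(name))
		root := roots[name]
		h.Write(root[:])
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}
```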
A formal experiment was conducted and documented in the paper Analysis of Indexing Structures for Immutable Data which, after theoretical review, fits within our framework of desired properties for this data structure and compares these three implementations in great detail. We summarize its findings in the comparative table below, where a simple letter-grade (A-F) scoring system expresses each property's value for each implementation, creating a frame of reference upon which they can easily be compared.

Property | Description | Patricia Merkle Tries | Merkle Bucket Trees | Pattern-Oriented-Split Trees |
---|---|---|---|---|
Active Open-Source Golang Implementation | Actively maintained Golang implementation of the data-structure. |  | N/A (implementation is closed source) | N/A (implementation is closed source) |
Insertion and update performance | The theoretical and practical throughput capabilities for the insert and update operations. | B | C | A |
Query performance | The theoretical and practical throughput capabilities for the lookup operations. | A- | B | A+ |
Integrity Verification Performance | The theoretical and practical throughput capabilities of the data integrity check operation. | B | C | A |
Data deduplication | The rate at which data is de-duplicated across multiple versions of the same dataset. | A- | C | A+ |
In conclusion, from both the paper and our comparison table, Pattern-Oriented-Split Trees (specifically the Forkbase implementation) have the most desirable properties in pure empirical benchmarking. However, their implementation is closed source and therefore not viable for the Pocket Network Persistence Module, which makes the Patricia Merkle Trie the most suitable candidate among these three implementations.
The Jellyfish Merkle Tree (JMT) is a data structure created for the Diem project (formerly Libra) that aims to find the ideal balance between data structure complexity, storage, I/O overhead, and computation efficiency, catering to the needs of the Diem Blockchain.
**Note on the table below:** There are no available benchmarking experiments on the JMT itself; however, the paper On Performance Stability in LSM-based Storage Systems gives some insight into it, given that the JMT is theoretically superior to standard LSM implementations.

Property | Description | Score |
---|---|---|
Active Open-Source Golang Implementation | Actively maintained Golang implementation of the data-structure. | Only exists in Rust (https://github.com/diem/diem/tree/master/storage/jellyfish-merkle) |
Insertion and update performance | The theoretical and practical throughput capabilities for the insert and update operations. | WIP |
Query performance | The theoretical and practical throughput capabilities for the lookup operations. | WIP |
Integrity Verification Performance | The theoretical and practical throughput capabilities of the data integrity check operation. | WIP |
Data deduplication | The rate at which data is de-duplicated across multiple versions of the same dataset. | WIP |
https://bitcoin.org/bitcoin.pdf
https://medium.com/ontologynetwork/everything-you-need-to-know-about-merkle-trees-82b47da0634a
https://developers.diem.com/papers/jellyfish-merkle-tree/2021-01-14.pdf