As we mentioned in the first part [0] of this blog, Aerospike 4.0 read/write operations on a single record can be made strictly linearizable [3] (preserving the order of read/write accesses to a record across all concurrent threads of execution in the entire system) or session consistent, also called sequentially consistent [4] (preserving the order of read/write accesses to a record within a single user or client session). In addition to providing correctness guarantees on single-record operations, Aerospike 4.0 works to preserve as much availability as possible during failures. We addressed these goals in two phases.
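To make the two guarantees concrete, here is a hedged sketch of selecting between them from an application, written against the Aerospike Java client of the 4.0 era; the linearizeRead flag name and the connection details are assumptions about that client version, not something this post specifies.

    import com.aerospike.client.AerospikeClient;
    import com.aerospike.client.Key;
    import com.aerospike.client.Record;
    import com.aerospike.client.policy.Policy;

    public class ReadModes {
        public static void main(String[] args) {
            // Assumed seed node and port, for illustration only.
            AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
            Key key = new Key("test", "demo", "user-1");

            // Strict linearizability: reads are globally ordered across clients.
            Policy linearizable = new Policy();
            linearizable.linearizeRead = true; // assumed field name for this client era
            Record strict = client.get(linearizable, key);

            // Session (sequential) consistency: ordering within this client session.
            Policy session = new Policy();
            session.linearizeRead = false;
            Record relaxed = client.get(session, key);

            client.close();
        }
    }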
Challenges and Solutions
In the first phase, we improved consistency as much as possible in AP mode by adding schemes that enable Aerospike to linearize read/write operations in all except two situations: when a split-brain partition occurs on the cluster, and when a number of nodes equal to or greater than the replication factor are lost or become unavailable simultaneously. Specifically, we did the following (available in Aerospike version 3.13 and later):
We improved consistency during failures by ensuring there are always enough copies of data in the system to satisfy the configured replication factor. Partition copies are retained on node additions until the new location of the partition is fully available. While this uses more storage and memory during cluster changes, the improvement in correctness in this highly available mode more than makes up for it.
We improved durability by implementing Durable Deletes, which use “tombstone” entries to persistently store metadata about recently deleted records; applications can choose between a policy of “expunging” data immediately from the index and the tombstone-based methodology (a toy sketch of the difference follows this list). Durable Deletes deliver atomicity and durability guarantees on all changes to database records, including inserts, updates, and deletes.
We improved clustering by using a more deterministic scheme, in which cluster membership and cluster formation are managed through a much more optimized and precise protocol among the nodes. This enables more controlled handling of splits, single-node failures, and similar events, and reduces by an order of magnitude the amount of data transferred between nodes during cluster change events.
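The following is a minimal, hedged sketch of the expunge-versus-tombstone distinction over a toy in-memory index; it illustrates the idea only and is not Aerospike's actual index or storage code.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public final class TombstoneStore {
        // A toy index entry: a tombstone keeps metadata (e.g. last-update
        // time) but no value, so the fact of the delete is durable.
        private record Entry(byte[] value, long lastUpdateTime, boolean tombstone) {}

        private final Map<String, Entry> index = new ConcurrentHashMap<>();

        void put(String key, byte[] value, long lut) {
            index.put(key, new Entry(value, lut, false));
        }

        // Durable delete: replace the record with a tombstone carrying the
        // delete's metadata, so a replica that missed the delete cannot
        // later resurrect the old value.
        void durableDelete(String key, long lut) {
            index.put(key, new Entry(null, lut, true));
        }

        // Expunge: drop the entry immediately; cheaper, but a stale copy
        // elsewhere could "win" later because no trace of the delete remains.
        void expunge(String key) {
            index.remove(key);
        }

        byte[] get(String key) {
            Entry e = index.get(key);
            return (e == null || e.tombstone) ? null : e.value;
        }
    }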
Note that, in Availability mode, availability of the system for writes is never compromised, since at least one master for every data item is available in the cluster at all times, even during network partitions. In choosing Availability mode, one accepts that conflicting writes may occur during certain failure situations. However, when the system is not undergoing a split-brain, Aerospike provides a very high level of consistency in Availability mode.
In the second phase, we enabled Aerospike to operate in strong consistency mode, as follows:
Atomic transfer of master: In a distributed system, the issue of transfer of master from one node to another is critical during failure situations. Our algorithm ensures that there is, at most, one master for a specific data item at all times.
Master restriction: To guarantee that no more than one master is ever available, information about the nodes participating in the cluster must be maintained so that a subset of nodes can determine with certainty whether it may maintain or promote masters of the data (a sketch of one such rule appears below).
Hybrid Clock: In the case of a master transition, to ensure that the handoff is atomic and ordered through network interaction, a clock is required that is both high performance and results in ordered writes. The system constructs such a hybrid clock by combining three components: (i) an element called regime, which changes as master ownership is transferred, (ii) the local clock value on the node, and (iii) a sub-millisecond counter.
The hybrid clock allows – at the current level of granularity – 30 seconds of clock drift and a capacity of a million writes per second per key (an illustrative packing is sketched below). The Aerospike heartbeat system has also been improved to detect clock drift, so the system does not depend entirely on external clock synchronization mechanisms.
Replication integrity: To guarantee strong consistency, the algorithm maintains replication integrity, using a redo mechanism, to ensure that no reads/writes are allowed when replicas are in an unknown state.
Client Intelligence: To ensure that reads and writes stay linearizable (both session-level and global), the client participates by keeping track of partition state from the cluster, specifically each partition's master regime, which is part of the hybrid clock.
The atomic transfer of master, master restriction, hybrid clock, replication integrity, and client intelligence together guarantee that all read and write operations to the database are linearized.
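As one concrete illustration of the hybrid clock, here is a minimal sketch assuming a 64-bit packing invented for this example: a regime in the high bits that bumps on every master handoff, a 40-bit millisecond clock (relative to an assumed fixed epoch so it fits in 40 bits), and a 10-bit sub-millisecond counter (2^10 = 1024 writes per millisecond, roughly the million writes per second per key cited above). The actual bit layout used by Aerospike is not given in this post.

    public final class HybridClock {
        private static final int COUNTER_BITS = 10;  // ~1M writes/sec/key
        private static final int CLOCK_BITS = 40;    // millisecond resolution
        private static final long COUNTER_MASK = (1L << COUNTER_BITS) - 1;
        private static final long CLOCK_MASK = (1L << CLOCK_BITS) - 1;
        // Assumed epoch (2010-01-01 UTC) so current time fits in 40 bits.
        private static final long EPOCH_MS = 1262304000000L;

        private long last; // last value issued; enforces monotonicity

        // Pack (regime, ms, counter) so that numeric order equals write
        // order: the regime dominates, then time, then the counter.
        static long pack(long regime, long ms, long counter) {
            return (regime << (CLOCK_BITS + COUNTER_BITS))
                 | ((ms & CLOCK_MASK) << COUNTER_BITS)
                 | (counter & COUNTER_MASK);
        }

        // Issue the next clock value under the given master regime. If the
        // local clock stalls or drifts backwards, "last + 1" keeps values
        // strictly increasing by consuming sub-millisecond counter space.
        synchronized long next(long regime) {
            long now = pack(regime, System.currentTimeMillis() - EPOCH_MS, 0);
            last = Math.max(now, last + 1);
            return last;
        }

        public static void main(String[] args) {
            HybridClock clock = new HybridClock();
            long a = clock.next(1);
            long b = clock.next(1);
            long c = clock.next(2); // regime bump on master handoff
            System.out.println(a < b && b < c); // true: handoff stays ordered
        }
    }

And a similarly hedged sketch of a master-restriction rule; the post does not spell out Aerospike's exact condition, so this assumes the common majority-of-roster design, under which at most one sub-cluster can ever qualify:

    import java.util.Set;

    public final class MasterRestriction {
        // May this sub-cluster maintain or promote masters? Requiring a
        // strict majority of the last committed roster guarantees that two
        // disjoint sub-clusters can never both answer "yes".
        static boolean mayOwnMasters(Set<String> roster, Set<String> visibleNodes) {
            long present = roster.stream().filter(visibleNodes::contains).count();
            return 2 * present > roster.size();
        }
    }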
Balancing act
Please note that a key tradeoff exists between the complexity of the scheme and the degree of availability, i.e., the more available a strongly consistent system is, the higher the complexity of its implementation. For example, a simple implementation of a strong consistency mode would essentially block all reads and writes to the system whenever any of the cluster nodes is unavailable, i.e., whenever the cluster is not whole. A simple rolling upgrade that takes down one node at a time to upgrade software (a routine operation) would then result in a complete loss of availability. Our goal is to maintain both availability and consistency during minor hardware outages or planned maintenance. We placed stringent requirements on ourselves to ensure a high level of operational ease of use during such situations.
Requirements during a software upgrade:
Upgrades will be done by taking down one cluster node at a time, updating the software on the node, and bringing it back into the cluster – known as a rolling upgrade (sketched below). This implies that a node running the previous software version and another running the upgraded version can coexist in the same cluster during the rolling upgrade with no loss in database service.
When a node is taken down for a rolling upgrade, there can be no loss of availability of any data in the cluster for reads and writes (this assumes a replication factor of 2 or more, of course).
When a node returns to the cluster after the upgrade, the next node in the rolling upgrade list can be taken down immediately while preserving strong consistency.
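Here is a hedged sketch of a rolling-upgrade driver that honors these requirements; the helper methods are hypothetical stand-ins for site-specific tooling (they are not Aerospike APIs), and the loop structure is the point:

    import java.util.List;

    public final class RollingUpgrade {
        public static void run(List<String> nodes) {
            for (String node : nodes) {
                stopNode(node);        // take down exactly one node at a time
                upgradeSoftware(node); // install the new version
                startNode(node);       // rejoin the cluster
                // With strong consistency preserved, the next node can be
                // taken down as soon as this one is back in the cluster.
                waitUntilNodeRejoined(node);
            }
        }

        // Hypothetical operational hooks, for illustration only.
        static void stopNode(String node) {}
        static void upgradeSoftware(String node) {}
        static void startNode(String node) {}
        static void waitUntilNodeRejoined(String node) {}
    }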
Requirements when nodes are added/removed from the cluster:
For node additions, operational procedures need to be simple and should allow multiple new nodes to be added to the cluster at the same time.
For node removals (as opposed to failures), the operational procedures can be a bit more complex: removing nodes from a cluster needs to be done more carefully, especially if the system keeps only two copies of the data.
The undercurrents
Higher complexity can provide higher availability. However, it is important not to go too far up the complexity curve – beyond a point, the code base yields diminishing returns. Aerospike has implemented a system that preserves the following invariants (to be improved over time):
For sunny day scenarios – situations without ongoing hardware failures or network congestion – Aerospike preserves in Strong Consistency mode much of the high performance it delivers in Availability mode. The system currently works as follows:
For session consistency, or sequential consistency [4], in a 2-copy setup, Aerospike 4.0 can execute reads and writes at the same peak performance obtainable in AP mode.
For session consistency or sequential consistency with 3 or more copies, there is a bit more overhead during sunny-day processing of writes only, mostly in the form of additional network packets (metadata only) for two-phase coordination among the replicas of a single record (sketched below); this is necessary to notify all replicas that the record is fully replicated. Reads remain at the highest performance levels independent of the number of copies, except insofar as they are affected by the write load.
For strict linearizability, there is additional overhead for coordinating reads within the cluster by consulting all copies of the record (metadata only). This is required to ensure that reads across clients are globally ordered across time.
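A minimal sketch of that write-side coordination, with hypothetical transport stubs standing in for the real replication path; only the shape of the exchange (data first, then a small metadata-only notification that the record is fully replicated) comes from the description above:

    import java.util.List;

    public final class WriteCoordination {
        static void write(byte[] record, List<String> replicas) {
            // Phase 1: ship the record itself to every replica.
            for (String r : replicas) sendRecord(r, record);
            waitForAllAcks(replicas);
            // Phase 2: metadata-only packets telling each replica the
            // record is now fully replicated.
            for (String r : replicas) sendMark(r);
        }

        // Hypothetical transport stubs, for illustration only.
        static void sendRecord(String node, byte[] record) {}
        static void waitForAllAcks(List<String> nodes) {}
        static void sendMark(String node) {}
    }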
For rainy day scenarios – when a cluster splits into two or more split-brains – we experimented with how much of the data we could make available without compromising the linearizability guarantees. We eventually settled on the following:
Specifically, during any two-way split-brain situation, we found a way to make all the data available somewhere in the cluster, while preserving consistency.
As the number of split-brains increases to three or more, subsets of data (partitions) become unavailable, with availability degrading gracefully as the situation grows more complex. Last, but not least, we found that we were able to add these consistency features without significant impact on the straight-line performance of the system during normal operations, including scheduled maintenance tasks like rolling upgrades.
In Summary
The past 7 years have taught us to innovate continuously in the database space; we don't take the status quo for granted. Aerospike 4.0 is the result of several years of improvement to an extremely fast and scalable system, as described in [1]. We expect that a high-performance system with strong consistency will enable real-time customer engagement applications to penetrate to the core of all enterprises and help their top line. The beauty of Aerospike 4.0 is that it preserves the industry-leading performance, scalability, reliability, and low TCO while adding strong consistency. This enables it to be used as a system of record for real-time applications, becoming a cornerstone of the digital transformation that is rapidly taking place in today's enterprises. All enterprises need to engage in these areas in order to stay relevant, and Aerospike stands at the forefront, ready to help them with their ongoing digital transformation.
The 7-year itch is working well for us.
For more information:
[0] The 7-Year Itch: How Aerospike Decided to Transform the Database Status Quo, Part 1
[1] Aerospike: Architecture of a Real-Time Operational DBMS, Proceedings of VLDB, 2016.
[2] CAP Theorem: https://en.wikipedia.org/wiki/CAP_theorem
[3] Linearizability vs Serializability: http://www.bailis.org/blog/linearizability-versus-serializability/
[4] Strong Consistency Models: https://aphyr.com/posts/313-strong-consistency-models