
Aerospike 4.0, Strong Consistency, and Jepsen

Brian Bulkowski
Aerospike Founder and CTO
March 7, 2018 | 5 min read

“In hundreds of tests of SC mode through network partitions, 3.99.1.5 and higher versions have not shown any sign of nonlinearizable histories, lost increments to counters, or lost updates to sets.” – Kyle Kingsbury, Aerospike 3.99.0.3, 12-27-2017

We believe Jepsen has validated our core claims: within reasonable operational constraints, Aerospike does not lose data, nor does it allow stale or dirty reads, even under high concurrency, challenging network conditions, and crashes. To test these claims, Kyle ran the existing Jepsen tests as well as a variety of newly crafted “nemeses”. Along the way, Kyle also improved Jepsen itself with a higher-performance core engine and fixed several internal Jepsen bugs. We believed we could provide consistency and correctness because Aerospike’s architecture elects a single master for each record at any point in time, allowing every write to be synchronized through that master. Our shared-nothing architecture distributes these record masters, which are robust and well suited to a high-performance Lamport clock implementation.
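
To make the clock aspect concrete, here is a minimal, illustrative Lamport clock in Java. It is not Aerospike’s internal implementation; it simply shows the two operations such a clock needs: advancing on a local event and merging a timestamp received from another node.

    // Illustrative Lamport clock sketch; not Aerospike's internal code.
    // Each node keeps a counter that advances on local events and jumps
    // forward when it sees a larger timestamp from another node, so
    // causally related writes are consistently ordered.
    import java.util.concurrent.atomic.AtomicLong;

    public final class LamportClock {
        private final AtomicLong counter = new AtomicLong();

        // Advance for a local event, e.g. before applying a write as record master.
        public long tick() {
            return counter.incrementAndGet();
        }

        // Merge a timestamp carried by an incoming message (e.g. a replica write).
        public long observe(long remoteTimestamp) {
            return counter.updateAndGet(local -> Math.max(local, remoteTimestamp) + 1);
        }

        public long current() {
            return counter.get();
        }
    }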

Our internal testing has shown that high performance is achievable even with strong consistency. Preliminary results show essentially no performance loss with session consistency compared to Aerospike in AP mode, and very high performance with full linearizability:

                   Linearize SC    Session SC    Availability (AP)
    Read Latency   548 μs          225 μs        220 μs
    Write Latency  630 μs          640 μs        640 μs
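
From the application’s point of view, the choice between the two SC modes is made per read through the client policy. The following is a hedged sketch using the Aerospike Java client; the linearizeRead field and the session-consistency default are assumptions based on the client documentation of that era, and the host, namespace, set, and key are illustrative.

    // Hedged sketch: selecting session-consistency vs. linearizable reads
    // with the Aerospike Java client. Field names are assumed from the
    // client documentation; host, namespace, set, and key are examples.
    import com.aerospike.client.AerospikeClient;
    import com.aerospike.client.Key;
    import com.aerospike.client.Record;
    import com.aerospike.client.policy.Policy;

    public class ScReadModes {
        public static void main(String[] args) {
            AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
            Key key = new Key("test", "demo", "user-1");

            // Session consistency: the default read in an SC namespace; a client
            // always sees its own writes without an extra coordination round.
            Policy sessionRead = new Policy();
            Record r1 = client.get(sessionRead, key);

            // Full linearizability: opted into per read; costs additional latency
            // because the master confirms partition ownership before answering.
            Policy linearizedRead = new Policy();
            linearizedRead.linearizeRead = true;
            Record r2 = client.get(linearizedRead, key);

            client.close();
        }
    }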

The report discusses three different issues.

Section 3.1, “Dirty reads after partitions”, describes an issue that was resolved in 3.99.1.2; Kyle validated the fix. The resolution was simple: disable internal retransmissions. With this in mind, we increased testing in this area and, before 3.99.2.1, resolved other potential issues: we disabled retransmission by default in our clients and added an “InDoubt” flag to our client API. This feature is rare in the database world, but it allows more efficient error recovery at the application level, as sketched below.

In section 3.2, “Node Crashes”, Kyle points out that “impolite” (SIGKILL) crashes rapidly decrease availability in 3.99.0.2, and that sequential crashes can cause data loss. We addressed both points. Data loss from rapid sequential server crashes, followed by a “revive:” command and then another crash, was a real issue; we addressed it through a change to our “revive:” command, which was delivered to Kyle. Unavailability due to “dead” partitions stems primarily from potential data loss caused by buffered writes, which we addressed via our `commit-to-device` feature.
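
As an illustration of the “InDoubt” flag mentioned above, here is a hedged sketch using the Aerospike Java client. The getInDoubt() accessor is an assumption based on the client documentation, and the namespace, set, key, and bin names are purely illustrative.

    // Hedged sketch: using the "InDoubt" flag to decide whether a failed
    // write might still have been applied. getInDoubt() is assumed from the
    // Aerospike Java client documentation; names below are illustrative.
    import com.aerospike.client.AerospikeClient;
    import com.aerospike.client.AerospikeException;
    import com.aerospike.client.Bin;
    import com.aerospike.client.Key;
    import com.aerospike.client.policy.WritePolicy;

    public class InDoubtExample {
        public static void main(String[] args) {
            AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
            Key key = new Key("test", "demo", "counter-1");
            try {
                client.put(new WritePolicy(), key, new Bin("count", 1));
            } catch (AerospikeException ae) {
                if (ae.getInDoubt()) {
                    // The write may or may not have been applied (e.g. a timeout
                    // after it reached the server). Blindly retrying a
                    // non-idempotent operation here risks applying it twice.
                    System.out.println("Write in doubt; reconcile before retrying");
                } else {
                    // The write was definitively not applied; a retry is safe.
                    System.out.println("Write not applied; safe to retry");
                }
            } finally {
                client.close();
            }
        }
    }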

Based on Kyle’s measurements of unavailability, we accelerated a planned “commit-to-device” feature, which is now available in 4.0. Kyle received the feature in 3.99.2.2 and validated that, with it enabled, data loss is not observed on server crash. Availability is very high, limited only by restart and recluster time. If data replication happens quickly, there is only a short period of unavailability while the cluster reconfigures. If multiple servers fail in rapid succession, data has already been flushed correctly to storage, and restarts bring it back online. Performance with this feature enabled will be lower, but modern Flash drives use SRAM and capacitors to absorb these kinds of fast writes, and we have updated our Flash validation suite (ACT) accordingly: multi-million TPS systems are still very practical.
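
For reference, commit-to-device is enabled per namespace in the server configuration. The fragment below is a minimal sketch of an aerospike.conf namespace stanza, assuming 4.0-style strong-consistency configuration; the namespace name, replication factor, and device path are illustrative.

    # Minimal, illustrative aerospike.conf fragment: an SC namespace that
    # flushes each write to the device before acknowledging it.
    # Namespace name, replication factor, and device path are examples.
    namespace demo {
        replication-factor 2
        strong-consistency true

        storage-engine device {
            device /dev/nvme0n1
            commit-to-device true
        }
    }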

In section 3.3, “Wall clocks”, Kyle tested our claim that the system can handle clock skew of up to 27 seconds, and found our system to be consistent and correct within that limit. In a deployment environment, skew can be caused by drift in real-time clock hardware. However, the widely available NTP (Network Time Protocol) open source software can easily be configured to achieve millisecond-level, or at least sub-second, time synchronization. The Aerospike software provides extra help: the clustering gossip protocol checks for synchronization, warns at 15 seconds of skew, and halts writes in the cluster at 20 seconds of skew.
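
To make those thresholds concrete, the behavior amounts to comparing the largest observed inter-node skew against fixed limits. The sketch below is hypothetical Java rather than server code; only the threshold values (15 and 20 seconds, with correctness guaranteed up to 27 seconds) come from this post.

    // Hypothetical sketch of the skew thresholds described above: warn at
    // 15 s of skew and stop accepting writes at 20 s, staying well inside
    // the 27 s correctness limit. Not Aerospike server code.
    public final class ClockSkewPolicy {
        private static final long WARN_MS = 15_000;
        private static final long STOP_WRITES_MS = 20_000;

        public enum Action { OK, WARN, STOP_WRITES }

        // maxObservedSkewMs: largest clock difference observed between any
        // two nodes via the clustering heartbeat exchange.
        public static Action evaluate(long maxObservedSkewMs) {
            if (maxObservedSkewMs >= STOP_WRITES_MS) {
                return Action.STOP_WRITES;
            }
            if (maxObservedSkewMs >= WARN_MS) {
                return Action.WARN;
            }
            return Action.OK;
        }
    }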

The other source of clock skew is operating system “pauses”, which can occur at the virtual machine, container, and process levels. The first case is virtual machine live migration, in which a virtual machine “pauses” as it moves from one hardware chassis to another; Aerospike should generally not be run on virtual machines with live migration enabled. One cloud provider, Google Compute Engine, requires live migration but limits pauses to 10 seconds, within Aerospike’s safe range. Process and container pauses can also be created by operator intervention, but legitimate uses of such commands are rare, and they should be avoided on database nodes. We believe 27 seconds is a very acceptable real-world limit, and Kyle created custom code and “nemeses” that appear to validate it: data loss occurred with pauses above 27 seconds, and none was observed below that limit.

At the end of the test period, with Kyle testing our 3.99.2.1 release (which also went to early-access customers), he was unable to find data loss or linearizability violations, and he also measured the availability benefits of committing each write to a storage device.

Our work with Kyle has been extraordinarily fruitful. We’ve seen the improvements in the Jepsen infrastructure and code base, and we’ve been able to provide the best possible evidence that our statements regarding consistency are true.

Jepsen is the current gold standard for determining data correctness, and Kyle’s tests demonstrate that Aerospike 4.0 is capable of handling the most demanding data environments with strong consistency guarantees.
