Strong consistency
This page describes how to use the strong consistency (SC) mode in Aerospike Database. The design of its consistency modes is discussed in the architecture guide.
Strong Consistency is available only in Aerospike Database Enterprise Edition and Aerospike Cloud Managed Service.
Guaranteed writes and partition availability
SC mode guarantees that all writes to a single record will be applied in a specific, sequential order, and writes will not be re-ordered or skipped; in other words data will not be lost. The following are exceptions to this guarantee:
- A cluster node's Aerospike process (asd) pauses for more than 27 seconds, or the clock skew between cluster nodes is greater than 27 seconds.
  info: The clock skew threshold of 27 seconds is based on the default heartbeat interval of 150 ms and the default heartbeat timeout of 10. If those defaults are changed, the clock skew threshold increases or decreases accordingly.
  In the case of clock skew, Aerospike's gossip cluster protocol continually monitors the amount of skew. It sends an alert if the skew of any node becomes large, and disables writes to the cluster (at 20 seconds with default settings) before data loss would occur (at 27 seconds of skew with default settings).
- A replication-factor (RF) number of servers shut down uncleanly while commit-to-device is not enabled.
  info: A clean shutdown starts with a SIGTERM signal and ends with the Aerospike process (asd) stopping after logging "finished clean shutdown - exiting". This ensures that asd flushed data to disk and properly signalled other servers in the cluster. Server crashes and other signals (SIGKILL, etc.) cause unclean shutdowns.
- Data devices on an RF number of cluster nodes in the roster were wiped while asd was shut down.
Unavailable partitions
In each failure scenario affecting a namespace running in SC mode, Aerospike attempts to provide the best availability and minimize potential issues.
If nodes that are expected in the roster are missing from the cluster, their data partitions might become unavailable. Read-only access to records in an unavailable (or dead) partition is allowed under the most relaxed SC read mode, ALLOW_UNAVAILABLE (allowing stale but not dirty reads), and for reloading of recent changes.
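For illustration, a read that opts into this relaxed mode might look like the following Java sketch. It assumes a recent Java client where the SC read mode is exposed as the Policy.readModeSC enum (ReadModeSC.ALLOW_UNAVAILABLE); older clients and other languages expose it differently, and the host, namespace, and key are placeholders.

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.policy.Policy;
import com.aerospike.client.policy.ReadModeSC;

public class AllowUnavailableRead {
    public static void main(String[] args) {
        // Host, namespace, set, and key below are placeholders for this sketch.
        try (AerospikeClient client = new AerospikeClient("127.0.0.1", 3000)) {
            Policy readPolicy = new Policy();
            // Permit reads from unavailable partitions: results may be stale, but never dirty.
            readPolicy.readModeSC = ReadModeSC.ALLOW_UNAVAILABLE;

            Record record = client.get(readPolicy, new Key("test", "demo", "user-1"));
            System.out.println(record);
        }
    }
}
```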
The Consistency architecture guide describes the concepts of the roster, full partitions and subsets, and the simple rules governing whether a partition becomes unavailable when an RF number of rostered nodes are missing from the cluster. If fewer than RF nodes go down, all partitions remain available.
The initial step to deal with unavailable partitions is to restore the missing nodes by fixing the hardware or network issues causing them to be out of the cluster.
Dead partitions
When a node goes down due to an unclean shutdown (as described above), its data will not be immediately trusted when it starts up and rejoins the cluster. The node is marked with an "evade flag" (AKA "e flag"), and is not counted in the super-majority math. At this point unavailable partitions might transition to dead partitions to designate that they have potentially suffered lost writes.
Operators should inspect the scenario that led to the dead partitions, consider whether writes were actually lost, and decide whether to manually revive them. There are several mitigations that can significantly reduce the chances of actual data loss, and an auto-revive configuration allows operators to direct Aerospike to skip waiting for manual intervention before reviving dead partitions in certain situations. See below for more details.
Shutting down a node before it has completed a fast restart also causes an unclean shutdown. It does not lead to data loss but may lead to dead partitions. This scenario can be safely mitigated with a revive.
Manually reviving dead partitions
If an RF number of cluster nodes went down due to unclean shutdowns, there is a risk of lost writes if all of those nodes went down within a time interval shorter than flush-max-ms. If the operator knows this is not the case, the dead partitions may be safely revived.
If data devices were wiped on RF number of nodes the operator will need to recover this data from an external source, if available. Otherwise, reviving the dead partitions in this scenario will lead to lost data. Consider disabling the applications, reviving the damaged namespace, restoring from the external trusted source, and then enabling applications.
Auto revive
Aerospike Database 7.1 introduced the auto-revive configuration, which revives dead partitions caused by the scenario where RF roster nodes went down due to unclean shutdowns. Auto revive is selective; it will not revive partitions in the scenario of RF nodes having their storage devices wiped. That case must be manually mitigated, as described above.
Mitigating the effects of unclean shutdowns
The effect of an unclean shutdown can be completely avoided by setting the commit-to-device namespace option. With this option, simultaneous crashes do not cause data loss and never generate dead partitions.
However, enabling commit-to-device generates a flush on every write, and comes with a performance penalty except for low write-throughput use cases.
When the namespace uses persistent memory (PMem) storage, commit-to-device suffers no noticeable performance penalty. Similarly, as of Database 7.0, if the namespace data storage is shared memory (RAM) without storage-backed persistence, commit-to-device has no performance penalty at all.
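For reference, a minimal namespace stanza that enables SC together with commit-to-device might look like the following sketch. The namespace name and device path are placeholders, and you should verify each parameter against the configuration reference for your server version.

```
namespace example_sc {
    replication-factor 2
    strong-consistency true          # run this namespace in SC mode

    storage-engine device {
        device /dev/nvme0n1          # placeholder raw device
        commit-to-device true        # flush each write before acknowledging it
    }
}
```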
Reducing the chance of lost data
When commit-to-device isn't used due to performance considerations, a rack-aware deployment reduces the chance of an RF number of nodes going down due to unclean shutdowns within one flush-max-ms interval. In a cloud deployment, each rack lives in a different Availability Zone (AZ); AZs are independent data centers with separate hardware resources (power, networking, and so on). An operator may choose to use auto-revive in a multi-AZ deployment based on the reduced likelihood of this failure scenario.
Configuring for strong consistency
SC mode is enabled on an entire namespace. For information about enabling SC mode on a namespace, see Configuring Strong Consistency.
Managing strong consistency
Manage Nodes in Strong Consistency describes adding and removing nodes, starting and stopping servers cleanly, how to validate partition availability, and how to revive dead partitions.
It also describes the auto-revive feature added in Database 7.1.
Using Strong Consistency
The following sections discuss new API functionality related to SC. In general, SC mode is transparent to the developer. Primarily, you simply know that data is safe.
However, there are some new capabilities, managed through the client Policy object, as well as differences in error code meanings.
Linearizable reads
A new field exists on the Policy object. To attain linearizable reads, you must set the linearizableRead field to true. If you set this field on a read against a namespace that is not configured for SC, the read fails.
If you do not set this field on an SC namespace, you get Session Consistency.
The Policy object can be set in the initial AerospikeClient constructor call. All subsequent calls using the constructed AerospikeClient object will, by default, use that value. Otherwise, use this constructed Policy object on individual operations to denote whether to execute a fully linearized read or a Session Consistency read.
Several Policy objects - BatchPolicy, WritePolicy, QueryPolicy, and ScanPolicy - inherit from the Policy object. Of these, the new linearizableRead field only makes sense for the default Policy object and the inherited BatchPolicy object, where it is applied to all elements in the batch operation.
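As an illustration, the following Java sketch chooses the read mode on a per-operation basis. The exact field differs by client version: the text above describes a boolean linearizableRead field, while recent Java clients express the same choice through the Policy.readModeSC enum (LINEARIZE versus SESSION), which this sketch assumes; the host, namespace, and key are placeholders.

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.policy.Policy;
import com.aerospike.client.policy.ReadModeSC;

public class LinearizableReadExample {
    public static void main(String[] args) {
        try (AerospikeClient client = new AerospikeClient("127.0.0.1", 3000)) {
            Key key = new Key("test", "demo", "account-42");

            // Default for SC namespaces: session consistency.
            Policy sessionRead = new Policy();
            sessionRead.readModeSC = ReadModeSC.SESSION;
            Record r1 = client.get(sessionRead, key);

            // Pay the extra synchronization cost only where a globally linear view is required.
            Policy linearizedRead = new Policy();
            linearizedRead.readModeSC = ReadModeSC.LINEARIZE;
            Record r2 = client.get(linearizedRead, key);

            System.out.println(r1 + " / " + r2);
        }
    }
}
```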
InDoubt errors
A new field called InDoubt has been added to all error returns. It denotes the difference between a write that has certainly not been applied and one that may have been applied.
In most database APIs, such as the SQL standard, failures like TIMEOUT are "known" to be uncertain, but there is no specific flag to denote which errors are uncertain. Common practice is to read the database to determine whether a write has been applied.
For example, if the client driver timed out before contacting a server, the client driver may be certain that the transaction was not applied; if the client has attached to a server and sent the data but receives no response over TCP, the client is unsure whether the write has been applied. The Aerospike API improvement allows the client to denote which failures have certainly not been applied.
We believe this flag can be useful for the advanced programmer and reduce cases where an error requires reading from the database under high stress situations.
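For example, a write path might branch on this flag roughly as in the following Java sketch. It assumes the Java client, where the flag is surfaced through AerospikeException.getInDoubt(); the key, bin, and recovery actions are placeholders.

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.AerospikeException;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;

public class InDoubtExample {
    public static void main(String[] args) {
        try (AerospikeClient client = new AerospikeClient("127.0.0.1", 3000)) {
            Key key = new Key("test", "demo", "order-7");
            try {
                client.put(null, key, new Bin("status", "paid"));
            } catch (AerospikeException ae) {
                if (ae.getInDoubt()) {
                    // The write may or may not have been applied:
                    // re-read the record (or retry idempotently) to reconcile.
                    System.err.println("Write in doubt: " + ae.getMessage());
                } else {
                    // The write was certainly not applied; safe to retry or report.
                    System.err.println("Write not applied: " + ae.getMessage());
                }
            }
        }
    }
}
```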
Data unreachable errors
In SC, there will be periods where data is unavailable because of network partition or other outage. There are several errors the client could present if data is, or becomes, unavailable.
These errors already exist in Aerospike, but have important distinctions when running with SC. Here is the list of errors for some of the main client libraries:
Four different errors can be seen when a cluster has partitioned, or has multiple hardware failures resulting in data unavailability (using the Java client error codes here): PARTITION_UNAVAILABLE, INVALID_NODE_ERROR, TIMEOUT, and CONNECTION_ERROR.
Other clients may have different error codes for those conditions. Refer to the client-library-specific error code tables for details.
The error PARTITION_UNAVAILABLE is returned from the server when that server determines that its cluster does not have the correct data. Specifically, the client can connect to a cluster and initially receive a partition table that includes this partition's data. In this case, the client sends a request to the most recent server it has heard from. If that server is part of a cluster that no longer has data availability, the PARTITION_UNAVAILABLE error is returned.
The error INVALID_NODE_ERROR is generated by the client in cases where the client does not have a node to send a request to. This happens initially when the specified "seed node" addresses are incorrect, when the client can't connect to the particular node(s) in the list of seeds and thus never receives a full partition map, and also when the partition map received from the server does not contain the partition in question. INVALID_NODE_ERROR also occurs if the roster has been misconfigured, or if the server has dead_partitions or unavailable_partitions (and needs maintenance or re-clustering). If you get this error, validate that data in the cluster is available using the steps above.
Two other errors, TIMEOUT and CONNECTION_ERROR, are returned in cases where a network partition has happened or is happening. The client must have a partition map and thus a node to send the request to, but can't connect to that node, or has connected but the failure subsequently occurs. A CONNECTION_ERROR may persist for as long as the network partition does, and only goes away when the partition has healed or operator intervention has changed the cluster rosters.
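A client-side handler for these conditions might look roughly like the following Java sketch. It assumes the Java client's AerospikeException subclasses and the ResultCode constants named above; the logging and recovery actions are placeholders.

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.AerospikeException;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.ResultCode;

public class UnreachableDataExample {
    static Record readOrAlert(AerospikeClient client, Key key) {
        try {
            return client.get(null, key);
        } catch (AerospikeException.Timeout t) {
            // TIMEOUT: possibly a network partition in progress; the operation may be in doubt.
            System.err.println("Timeout reading " + key + ": " + t.getMessage());
        } catch (AerospikeException.Connection c) {
            // CONNECTION_ERROR: may persist until the partition heals or the roster is changed.
            System.err.println("Connection error reading " + key + ": " + c.getMessage());
        } catch (AerospikeException ae) {
            if (ae.getResultCode() == ResultCode.PARTITION_UNAVAILABLE) {
                // The cluster reports that this partition is currently unavailable.
                System.err.println("Partition unavailable for " + key);
            } else if (ae.getResultCode() == ResultCode.INVALID_NODE_ERROR) {
                // The client has no node to send the request to; check seeds, roster, partition map.
                System.err.println("No valid node for " + key);
            } else {
                throw ae;
            }
        }
        return null;
    }
}
```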
Consistency guarantees
Strong consistency
SC guarantees that a session (a client instance in our SDKs) will observe writes to a record in the order that they are applied. This means that if the client reads the record and then reads it again, the second read is guaranteed to return a committed version in the record's lineage at or after the point of the prior read. Linearizability extends this guarantee across all concurrent sessions, meaning that if a session reads the record after another session, the record is guaranteed to be the same or newer than the one observed by the other session.
In particular, writes that are acknowledged as committed have been applied and exist in the transaction timeline relative to other writes to the same record. This guarantee applies even in the face of network failures, outages, and partitions. Writes which are designated as "timeouts" (or "InDoubt" from the client API) may or may not be applied, but if they have been applied they will only be observed as such.
Aerospike's strong consistency guarantee is per-record, and involves no multi-record transaction semantics. Each record's write or update will be atomic and isolated, and ordering is guaranteed using a hybrid clock.
Aerospike provides both a full linearizable mode, which provides a single linear view among all clients that can observe data, and a more practical session consistency mode, which guarantees that an individual process sees the sequential set of updates. These two read policies, detailed below, can be chosen on a read-by-read basis, allowing the few transactions that require a higher guarantee to pay the extra synchronization price.
In the case of a "timeout" return value - which could be generated due to network congestion, external to any Aerospike issue - the write is guaranteed to be written completely, or not written at all; it will never be the case that the write is partially written (that is, it can never be the case that at least one copy is written but not all replicas are written). In case of a failure to replicate a write transaction across all replicas, the record will be left in the 'un-replicated' state, forcing a 're-replication' transaction prior to any subsequent transaction (read or write) on the record.
SC is configured on a per-namespace basis. Switching a namespace from one mode to another is impractical - creating a new namespace and migrating data is the recommended means.
Linearizability
In concurrent programming, an operation (or set of operations) is atomic, linearizable, indivisible or uninterruptible if it appears to the rest of the system to occur instantaneously - or is made available for reads instantaneously.
All accesses are seen by all parallel processes in the same order, sequentially. This guarantee is enforced by the Aerospike effective master for the record, and results in an in-transaction "health check" to determine the state of the other servers.
If a write is applied and observed to be applied by a client, no prior version of the record will be observed. With this client mode enabled, "global consistency" - referring to all clients attaching to the cluster - will see a single view of record state at a given time.
This mode requires extra synchronization on every read and thus incurs a performance penalty. Those synchronization packets do not look up or read individual records; they simply validate the existence and health of the individual partition.
Session consistency
This mode, called in other databases by the names "Monotonic reads, monotonic writes, read-your-writes, write-follows-reads", is the most practical of strong consistency modes.
Unlike the linearizable model, session consistency is scoped to a client session - which in this case is an Aerospike cluster object on an individual client system, unless shared memory shares cluster state between processes.
Session consistency is ideal for all scenarios where a device or user session is involved, since it provides monotonic reads, monotonic writes, and read-your-own-writes (RYW) guarantees.
Session consistency provides predictable consistency for a session, and maximum read throughput while offering the lowest latency writes and reads.
Performance considerations
SC mode is similar to Availability mode in performance when used with the following settings:
- Replication factor two,
- Session Consistency.
When the replication factor is more than 2, a write causes an extra "replication advise" packet to acting replicas. While the master does not wait for a response, the extra network packets will create load on the system.
When the linearizability read concern is enabled, during a read the master must send a request to every acting replica. These "regime check" packets - which do not do full reads - cause extra latency and packet load, and decrease performance.
Availability considerations
Although Aerospike allows operation with two copies of the data, availability in a failure case requires a replica for master promotion. Over the course of a failure, this involves three copies: the copy that failed, the new master, and the prospective replica. Without this third potential copy, a partition may remain unavailable. For this reason, a two-node cluster - with two copies of the data - is not available during a split.
Exceptions to consistency guarantees
This section describes some operational circumstances which could cause issues with consistency or durability.
Clock discontinuity
A clock discontinuity of more than approximately 27 seconds may result in lost writes. In this case, data is written to storage with a time value outside the 27 second clock resolution. A subsequent data merge may pick this uncommitted write over other successfully completing versions of the record, due to the precision of the stored clock values.
In the case where the discontinuity is less than 27 seconds, the resolution of the internal hybrid clock will choose the correct record.
Organic clock skew is not an issue: the cluster's heartbeat mechanism detects the drift (at a skew of 15 seconds by default) and logs warnings. If the skew becomes extreme (20 seconds by default), the node rejects writes, returning the "forbidden" error code, thus preventing consistency violations.
Clock discontinuities of this type tend to occur due to four specific reasons:
- Administrator error, where an operator sets the clock far into the future or far into the past
- Malfunctioning time synchronization components
- Hibernation of virtual machines
- Linux process pause or Docker container pause
Be certain to avoid these issues.
Clock discontinuity due to virtual machine live migrations
Virtual machines and virtualized environments can support safe strong consistency deployments.
However, two specific configuration issues are hazardous.
Live migration, the process of transparently moving a virtual machine from one physical chassis to another, causes certain hazards. Clocks move with discontinuities, and network packets can remain in lower-level buffers and be applied later.
If live migrations can be reliably limited to less than 27 seconds, strong consistency can be maintained. However, the better operational process is to safely stop Aerospike, move the virtual machine, then restart Aerospike.
The second case involves process pauses, which are also used in container environments. These create similar hazards and should not be executed. There are few operational reasons to use these features.
UDFs
The UDF system will function, but, currently, UDF reads will not be linearized, and UDF writes that fail in certain ways may result in inconsistencies.
Non-durable deletes, expiration, and data eviction
Non-durable deletes, including data expiration and eviction, are not strongly consistent. These removals from the database do not generate persistent "tombstones", so they may violate strong consistency guarantees. In these cases you may see deleted data return. For this reason, we generally require disabling eviction and expiration in SC configurations; however, as there are valid use cases, we allow a manual override.
However, there are cases where expiration and non-durable deletes may be required even for SC namespaces, and may be known to be safe to the architects, programmers, and operations staff.
These are generally cases where a few objects are being modified, and "old" objects are being expired and deleted. No writes are possible to objects which may be subject to expiration or deletion.
For example, writes may occur to an object - which represents a financial transaction - for a few seconds, then the object may exist in the database for several days without writes. This object may safely be expired, as the period of time between the writes and the expiration is very large.
If you are certain that your use case is suitable, you may enable non-durable deletes. Add the following namespace option in your /etc/aerospike/aerospike.conf:
strong-consistency-allow-expunge true
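For placement, the option belongs inside the namespace stanza; a minimal sketch (with a placeholder namespace name) might look like this:

```
namespace example_sc {
    # ... other namespace and storage settings as in the earlier sketch ...
    strong-consistency true
    strong-consistency-allow-expunge true    # permit non-durable deletes and expiration
}
```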
Durable deletes, which do generate tombstones, fully support strong consistency.
Client retransmit
If you enable Aerospike client write retransmission, you may find that certain test suites claim a consistency violation. This is because an initial write results in a timeout and is thus retransmitted, and so is potentially applied multiple times.
Several hazards are present.
First, the write may be applied multiple times. This write may then "surround" other writes, causing consistency violations. This can be avoided using the "read-modify-write" pattern and specifying a generation, as well as disabling retransmission.
Second, incorrect error codes may be generated. For example, a write may be correctly applied but a transient network fault might cause retransmission. On the second write, the disk may be full, generating a specific and not "InDoubt" error - even though the transaction was applied. This class of error can't be resolved by using the "read-modify-write" pattern.
We strongly recommend disabling the client's automatic retransmission functionality and retransmitting as desired in the application. While this may seem like extra work, the benefit of correct error codes while debugging is invaluable.
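As a sketch of that read-modify-write pattern in the Java client, the following reads a record, then writes it back only if the generation is unchanged, with automatic retries disabled; the key, bin, and values are placeholders, and the record is assumed to already exist.

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.policy.GenerationPolicy;
import com.aerospike.client.policy.WritePolicy;

public class ReadModifyWriteExample {
    public static void main(String[] args) {
        try (AerospikeClient client = new AerospikeClient("127.0.0.1", 3000)) {
            Key key = new Key("test", "demo", "counter-1");

            // Read the current record and remember its generation.
            Record record = client.get(null, key);
            long current = record.getLong("count");

            // Write back only if the generation is unchanged; do not retransmit automatically.
            WritePolicy wp = new WritePolicy();
            wp.generationPolicy = GenerationPolicy.EXPECT_GEN_EQUAL;
            wp.generation = record.generation;
            wp.maxRetries = 0;

            client.put(wp, key, new Bin("count", current + 1));
        }
    }
}
```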
Secondary index requests
If a query is executed, both stale reads and dirty reads may be returned. In the interest of performance, Aerospike currently returns data that is "In Doubt" on a master and has not been fully committed.
This violation only exists for queries, and will be rectified in subsequent releases.
Durability exceptions
Storage hardware failure
Some durability exceptions are detected, marking the partition as a dead_partition in administrative interfaces.
This happens when all nodes within the entire roster are available and connected, yet some partition data is not available. To allow a partition to accept reads and writes again, execute a "revive" command to override the error, or bring a server online with the missing data. Revived nodes restore availability only when all nodes are trusted.
These cases occur when SC is used with a data in memory configuration with no backing store, or when backing storage is either erased or lost.
Database 7.1 introduced the auto-revive feature, which selectively revives some partitions on startup. The feature is detailed on the Manage Nodes in Strong Consistency page.
Incorrect roster management
Reducing the roster by replication-factor or more nodes at a time may result in loss of record data. The procedure for safe addition and removal from a cluster must be followed. Follow the operational procedures regarding roster management closely.
Partial storage erasure
In the case where some number of sectors or portions of a drive have been erased by an operator, Aerospike will not be able to note the failure. Partial data erasure (e.g. malicious user or drive failure) on replication-factor or more nodes may erase records and escape detection.
Simultaneous server restarts
By default, Aerospike writes data initially to a buffer, and considers the data written when all the required server nodes have acknowledged receiving the data and placing it in the persistent queue.
Aerospike attempts to remove all other write queues from the system. The recommended Aerospike hardware configuration uses a RAW device mode which disables all operating system page caches and most device caches. We recommend disabling any hardware device write caches, which must be done in a device-specific fashion.
Aerospike will immediately begin the process of replicating data in case of a server failure. If replication occurs quickly, no writes will be lost.
To protect against data loss in the write queue during multiple rapid failures, enable commit-to-device as noted elsewhere. The default algorithms limit the amount of data lost and provide the highest levels of performance, but this feature can be enabled on an individual storage-based namespace if a higher level of guarantee is required.
In the case where a filesystem or device with a write buffer is used as storage, commit-to-device may not prevent these kinds of buffer-based data losses.
Upgrading from AP to SC namespaces
In general, changing a namespace from AP to SC is not supported. There are cases where consistency violations may occur during the change. We recommend that you create a new namespace, then back up and restore the data (the forklift upgrade procedure).