Bin convergence

Bin convergence handles write conflicts in mesh or active-active topologies, where the last known value in a time period is more important than all updates. It guarantees consistency in the final state but may lose intermediate updates. See Bin convergence in mesh topology for an overview of its architecture.

Configuring this feature requires specific attention to parameters across multiple sections of the configuration file. Here is the summary of configs that need to be used.

Config Name	Section	Description
`conflict-resolve-writes`	namespace	This should be set to `true` for the receiving namespace to store the bin-level LUTs to resolve conflicts. Not allowed when `single-bin` is `true`.
`src-id`	xdr	A non-zero unique ID should be selected for each DC and the same ID should be given to all the nodes in the DC.
`ship-bin-luts`	xdr->dc->namespace	This should be set to `true` for the sender to send the bin-level LUTs when shipping the record.
`bin-policy`	xdr->dc->namespace	Bin shipping should be enabled by setting this value to one of: `only-changed`, `changed-and-specified`, `changed-or-specified`.

Sample configuration:

namespace someNameSpaceName {
  ...
  conflict-resolve-writes true
}

xdr {
    src-id 1

    dc dataCenter1 {
        node-address-port someIpAdress1 somePort1
        namespace someNameSpaceName {
           bin-policy only-changed
       ship-bin-luts true
        }
    }
}

See the configuration example for an example in a three-datacenter mesh topology.

Overhead

1-byte per bin for src-id in addition to the 6-byte overhead based on the bin-policy. This overhead applies to all records in the namespace after conflict-resolve-writes is set to true, not only to sets using bin convergence.

Having hundreds or more bins per record incurs both a storage cost and a performance impact when using XDR bin convergence.

Restriction on client writes

Record replace not allowed

When bin-convergence is enabled (conflict-resolve-writes is set to true for the namespace), the client cannot do writes with a record replace policy. The record replace policy overwrites whatever is already present without reading the old version of the record on the disk. If some of the old bins are not part of new the record, those bins lose the last update time (LUT). Without the LUT of the deleted bins during the replace, the bins are not able to converge.

To overcome this restriction, the client should use the operate API in the clients instead of put. The operate API provides a way to perform multiple commands in a single call. The application can delete all the bins first (without knowing the bin names), then write the desired new bins. The delete all command converts all the existing bins to bin tombstones with LUTs that help with bin convergence.

Durable Deletes

Aerospike supports bin convergence even with record deletes as long as they are durable deletes. When bin convergence is enabled for a namespace and a record is durably deleted, the delete converts the record to a tombstone but also maintains bin tombstones along with the necessary LUTs and src-id.

These are called “bin cemeteries” which are available as a metric named xdr_bin_cemeteries. Any future updates to the record coming from the local client or a remote XDR write will be evaluated against the LUT & src-id of the bin tombstones. The update succeeds if it comes with a future LUT.

Internally, these durable deletes are converted to writes that delete all the bins. These operations are counted as writes under client_write_success, not under client_delete_success Similarly, these deletes show up as writes in the write latency histogram.

These tombstones are removed from the system by the regular tomb raider. If you change the default behavior of the tomb raider, do not set the value for tomb-raider-eligible-age too low. Set this to a value higher than the potential conflict window plus any expected lag. For more information, see Tomb Raider.

Dependencies

System Clock

Bin convergence relies heavily on timestamps, which are based on the system clock. We expect that the system clocks are set up with the Network Time Protocol (NTP).

Tie-break

The bin-level LUT has a millisecond resolution, which is often enough resolution to avoid ties. In case of a tie, the highest src-id takes precedence. Ensure that the src-id is unique across all clusters connected using XDR to avoid conflicts in case of a tie.

Client write behavior

In some cases, the system clock may go out of sync between nodes in different DCs. If a node’s system clock slips, the client writes may fail with error code 28 (LOST CONFLICT). This may happen because the incoming XDR writes would have written the bins with a higher timestamp. A subsequent local client write uses the local time and may be behind the XDR write’s timestamp. In this case, the client is rejected as the system does not allow the bin-level LUT to go back in time.

Forwarding from the active-active mesh

There are use cases where the writes happening in the active-active mesh DCs needs to be forwarded to one or more read-only DCs. For example, assume 3 DCs - A, B and C - are set up as an active-active mesh. Assume DC A requires forwarding all the writes to a read-only DC D. As bin shipping is based on bin LUTs, with bin convergence, XDR writes may bring in bins with older LUTs than the last ship time of the forwarded DC. The default bin-policy (all) would have to be used for such a use case. For other use cases, forwarding should not be turned on if bin convergence is being used.