Durable deletes

The durable deletes feature in Aerospike Database ensures that deleted records stay deleted across cold starts. If you don’t use this feature, deleted objects may reappear when the index is repopulated from storage.

By default, Aerospike reclaims memory by dropping the primary index entry for a deleted object. Durable deletes typically free storage when they generate a tombstone, a record without any bins that contains all metadata including the key. Tombstones correctly resolve conflicts and prevent previously persisted versions of deleted objects from resurrecting when the index is repopulated.

Tombstones and strong consistency mode

For namespaces in strong consistency mode, durable deletes are required by default.

Non-durable deletes, including data expiration and eviction, do not generate tombstones and are not strongly consistent. For details about managing non-durable deletes, see Non-durable deletes, expiration and data eviction.

Introduced in Aerospike Database 8.0.0, transactions generate a durable delete for each monitor record once the transaction is ended. Delete commands in a transaction must be durable deletes. The rate of tombstone creation is tied to the number of transactions that end each second.

Usage

Durable deletes can be specified as a client-side policy on a per-command basis, and used on the following calls:

Write - applicable only when the last bin is removed, resulting in record delete
Delete
Operate - applicable only when the last bin is removed, resulting in record delete
UDF - applicable only when UDF execution results in a record delete, either through delete call, or last bin removal
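
As a minimal sketch of a per-command durable delete, the following assumes the Aerospike Python client, whose `remove` command accepts a policy dictionary containing a boolean `durable_delete` key. The helper names `durable_delete_policy` and `delete_durably` are hypothetical and exist only for illustration.

```python
def durable_delete_policy():
    """Build a per-command policy that asks the server to write a tombstone.

    Assumption: the client reads a boolean `durable_delete` policy key,
    as the Aerospike Python client does for remove commands.
    """
    return {"durable_delete": True}


def delete_durably(client, key):
    """Durably delete `key` using an already connected client.

    `client` is any object exposing `remove(key, policy=...)`.
    `key` is a (namespace, set, user_key) tuple.
    Returns the policy that was sent, for logging or inspection.
    """
    policy = durable_delete_policy()
    client.remove(key, policy=policy)
    return policy
```

With a connected client, a call such as `delete_durably(client, ("test", "demo", "user1"))` would issue the delete with the durable flag set. The namespace, set, and key shown here are placeholders. Omitting the policy leaves the client default in place, which is a non-durable delete.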

Tombstones

Tombstones use RAM and disk space, which is reported in logs and statistics until the tomb raider removes them. Tombstones from durable deletes can be different sizes depending on the set name, your XDR configuration, the size of stored keys (if present), and whether you use bin convergence, as discussed in the Tombstone management section on this page.

Tombstones on cold start

During a cold start, Aerospike Database scans disks to rebuild the in-memory index tree for the records. Record versions are compared, and the version with the most recent last-update-time (LUT) is brought back, with record generation as the tiebreaker. For a durably deleted record, the tombstone is simply another version that participates in the comparison, and it prevents any older version of the record from returning. If a tombstone is the most recent version, it is reloaded into the index.
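
The comparison can be illustrated with a small model. This is a simplified sketch, not the server's implementation: the `Version` type and the function names are hypothetical, LUT and generation are plain integers, and details such as generation wraparound are ignored.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Version:
    """One on-disk version of a record, as seen during the disk scan."""
    lut: int                    # last-update-time
    generation: int             # used only to break LUT ties
    is_tombstone: bool = False  # True for a durable delete


def resolve(versions: Iterable[Version]) -> Optional[Version]:
    """Return the winning version: latest LUT, then highest generation."""
    return max(versions, key=lambda v: (v.lut, v.generation), default=None)


def visible_after_cold_start(versions: Iterable[Version]) -> bool:
    """True if a client could read the record once the index is rebuilt."""
    winner = resolve(versions)
    return winner is not None and not winner.is_tombstone
```

In this model, an old version on disk followed by a newer tombstone resolves to the tombstone, so the record stays deleted. If the delete was not durable, no tombstone takes part in the comparison and the old version wins.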

Tombstone management

Durable deletes create a tombstone that replaces a deleted object. A tombstone is a record without any bins. It contains all metadata, including the key.

The tombstone write is similar to a record update in that:

Both continue to occupy entries in the index, together with other record entries in the index.
Both are persisted on disk, together with previous copies of the record on disk.
Both have the same metadata as any other record:
- last-update-time – just like normal update.
- generation – increments just like normal update.
Both are replicated at the same replication factor specified on the namespace.
Both are migrated the same way current records are migrated.
Both are conflict resolved the same way as data records.

Similar to data records, memory used by tombstones is reclaimed when the tombstones are removed from the in-memory index, and the on-disk copy is eligible for defragmentation. Index memory is immediately reusable. The storage is reusable based on when the space is defragmented.

Capacity sizing and tombstones

For detailed information on calculating tombstone sizing requirements, see the Data Storage Size section in our Capacity Planning guide.

Sizing considerations described in this section include the size of tombstones:

From a standard durable delete
When bin convergence is enabled
When XDR is involved

Standard durable delete

The sizing impact of a standard durable delete is as follows:

Index space = 64 bytes
Disk space = (35 bytes + set name size + optional key size), rounded up to a multiple of 16 bytes, with a minimum of 48 bytes.
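
The disk formula above can be worked through with a short calculation. This sketch assumes that a size already on a 16-byte boundary is not rounded further, and that the stored key size is supplied by the caller in bytes. The constant and function names are hypothetical.

```python
INDEX_BYTES = 64           # index space used by each tombstone
DISK_OVERHEAD_BYTES = 35   # fixed part of the on-disk tombstone
DISK_ROUNDING_BYTES = 16   # on-disk sizes are rounded to this unit
DISK_MINIMUM_BYTES = 48    # smallest on-disk tombstone


def tombstone_disk_bytes(set_name: str = "", stored_key_bytes: int = 0) -> int:
    """Estimate the disk space of one tombstone from a standard durable delete."""
    raw = DISK_OVERHEAD_BYTES + len(set_name.encode("utf-8")) + stored_key_bytes
    # Ceiling division rounds up to a whole number of 16-byte units.
    rounded = -(-raw // DISK_ROUNDING_BYTES) * DISK_ROUNDING_BYTES
    return max(rounded, DISK_MINIMUM_BYTES)
```

For example, a tombstone in a set named `users` with no stored key needs 35 + 5 = 40 bytes, which rounds up to 48. Adding a 20-byte stored key gives 60 bytes, which rounds up to 64.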

Capacity sizing and bin convergence

Bin convergence is supported with record deletes as long as they are durable deletes. When bin convergence is enabled for a namespace and a record is durably deleted, the delete converts the record to a bin cemetery which maintains the necessary LUTs and src-id.

The sizing calculation for each bin type follows:

(meta_byte + bin_meta_lut + src_id) +
(bin_name_sz_byte + bin_name_len) +
(particle_type_byte + particle_sz)

where:

1. meta_byte + bin_meta_lut + src_id

Convergence feature (metadata)	meta_byte	bin_meta_lut	src_id
bin has metadata	=1	=5	=1
bin has no metadata	=0	=0	=0

2. bin_name_sz_byte + bin_name_len

bin_name_sz_byte = 1
bin_name_len = length of the bin name

3. particle_type_byte + particle_sz

Tombstone	particle_type_byte	particle_sz
bin is a tombstone	=0	=0
bin is not a tombstone	=1	=depends on type

For more information, see Bin Convergence.
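
The per-bin formula and the two tables above can be combined into a short calculation. This is a sketch under stated assumptions: the function names are hypothetical, and `bin_cemetery_bytes` assumes that every bin in a bin cemetery carries convergence metadata and is a tombstone, which is an interpretation of the description above rather than a documented rule.

```python
from typing import Iterable


def converged_bin_bytes(bin_name: str, has_metadata: bool,
                        is_tombstone: bool, particle_sz: int = 0) -> int:
    """Size in bytes of one bin when bin convergence is enabled."""
    # 1. meta_byte (1) + bin_meta_lut (5) + src_id (1), or nothing at all
    meta = (1 + 5 + 1) if has_metadata else 0
    # 2. bin_name_sz_byte (1) + bin_name_len
    name = 1 + len(bin_name.encode("utf-8"))
    # 3. particle_type_byte (1) + particle_sz, both zero for a bin tombstone
    particle = 0 if is_tombstone else 1 + particle_sz
    return meta + name + particle


def bin_cemetery_bytes(bin_names: Iterable[str]) -> int:
    """Total bin size of a bin cemetery (assumed: all bins are tombstones
    that keep their convergence metadata)."""
    return sum(converged_bin_bytes(n, True, True) for n in bin_names)
```

For a tombstoned bin named `age` with metadata, the result is 7 + 4 + 0 = 11 bytes. The value of `particle_sz` for a live bin depends on the data type and must be supplied by the caller.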

Tomb raider

The tomb raider, a special background mechanism, removes tombstones which are no longer needed.

When data is stored on SSD, the tomb raider can cause an increase in write-block sized reads as it sweeps the device. If the cluster is not correctly sized, this could introduce latency, potentially impacting your SLAs.
Starting with Aerospike Database 7.0.0, when data is stored in-memory, the tomb raider only reads the memory storage device, so it does not incur any SSD reads even when the namespaces use storage-backed persistence.

The conditions for a tombstone to be removed are as follows:

There are no previous copies of the record on disk.
- This condition assures that a cold start will not bring back any older copy.
The tombstone’s last-update-time is before the current time minus the configured tomb-raider-eligible-age.
- This condition prevents a node that was apart from the cluster for up to tomb-raider-eligible-age seconds from rejoining and reintroducing an older copy.
The node is not waiting for any incoming migration.
The tombstone has been successfully shipped by XDR (if using XDR 5.0.0 and above).

If all conditions are satisfied, the tombstone is reclaimed.
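
The conditions above can be read as a single predicate. The following sketch is illustrative only: the `TombstoneState` type and its field names are hypothetical, and times are plain seconds rather than the server's internal representation.

```python
from dataclasses import dataclass


@dataclass
class TombstoneState:
    """Facts about one tombstone, gathered by hypothetical bookkeeping."""
    lut: int                        # last-update-time, in seconds
    has_older_copy_on_disk: bool    # a previous version is still on disk
    node_awaiting_migrations: bool  # the node expects incoming migrations
    xdr_enabled: bool = False       # XDR ships this namespace
    shipped_by_xdr: bool = False    # the tombstone was shipped successfully


def is_removable(t: TombstoneState, now: int, eligible_age: int = 86400) -> bool:
    """Return True only if every removal condition holds."""
    if t.has_older_copy_on_disk:
        return False  # a cold start could bring the older copy back
    if t.lut >= now - eligible_age:
        return False  # not yet older than tomb-raider-eligible-age
    if t.node_awaiting_migrations:
        return False  # incoming data might still need the tombstone
    if t.xdr_enabled and not t.shipped_by_xdr:
        return False  # the delete has not reached the remote destination
    return True
```

The default of 86400 seconds mirrors the documented default for tomb-raider-eligible-age. A tombstone whose LUT is exactly `now - eligible_age` is treated as not yet eligible, because the condition requires the LUT to be strictly before that point.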

The background thread works in roughly the following steps:

Iterate through the index and mark all tombstones as candidates for removal (cenotaphs).
Scan each disk block for records, and unmark the cenotaph of any tombstone that still has a previous copy of its record on disk.
Iterate through the index again. All remaining cenotaphs are candidates for permanent removal.
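
The three steps above amount to a mark-and-sweep pass, which can be sketched with in-memory stand-ins for the index and the disk. The data shapes and the function name are hypothetical. The rule used in step 2, that an on-disk record with an older LUT than the index entry counts as a previous copy, is an assumption made for this illustration.

```python
from typing import Dict, List, Tuple


def tomb_raider_pass(index: Dict[str, dict],
                     disk_blocks: List[List[Tuple[str, int]]]) -> List[str]:
    """Return the digests whose tombstones are candidates for removal.

    index:       digest -> {"tombstone": bool, "lut": int}
    disk_blocks: blocks of (digest, lut) pairs found on disk
    """
    # Step 1: mark every tombstone in the index as a cenotaph.
    cenotaphs = {d for d, entry in index.items() if entry["tombstone"]}

    # Step 2: scan every block. An on-disk copy older than the index entry
    # is a previous version the tombstone still covers, so unmark it.
    for block in disk_blocks:
        for digest, lut in block:
            if digest in cenotaphs and lut < index[digest]["lut"]:
                cenotaphs.discard(digest)

    # Step 3: iterate the index again and collect what is still marked.
    return sorted(d for d in index if d in cenotaphs)
```

In this model, the tombstone's own on-disk copy has the same LUT as the index entry and therefore does not unmark it. Only an older version of the same record keeps the tombstone alive.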

For a non-persisted namespace, tombstone removal is a separate process that requires only one index iteration.

Cold start also removes unneeded tombstones as part of the disk reading:

All tombstones are marked as candidates for removal (cenotaphs) on initial bring-up.
If a live record that the tombstone covers is subsequently read, the cenotaph is unmarked and the tombstone stays.
Otherwise, at the end of cold start, all remaining cenotaphs are deleted.

The following configurations are available to control the behavior of the tomb raider:

tomb-raider-period - minimum time, in seconds, between runs. The default is 86400 (1 day).
tomb-raider-eligible-age - number of seconds to retain a tombstone, even after it is found to be safe to remove. The default is 86400 (1 day).
tomb-raider-sleep (storage-only) - number of microseconds to sleep between large block reads on disk. The default is 1000 µs (1 ms).

Expired and evicted records

Expired and evicted records do not generate tombstones. This behavior is intentional for the following reasons:

It allows maximum resource capacity to be used for data records instead of tombstones.
If resource capacity increases, for example, by increasing the memory capacity on a node, it is possible to revive the non-durably deleted record on cold-start.

Evicted records may also return under the following condition:

A replica has a shallower cold-start eviction time than the master. In this case, if the master node departs and the replica is cold started, the replica's records may revive.

Tombstones do not have an expiration time set, and thus are not eligible for eviction. Tombstones have their own deletion mechanism.

Scan, Batch

Scan and batch commands do not return tombstoned records.

Conflict resolution policy

The conflict resolution policy affects the behavior of durable deletes across cluster state changes. To guarantee correct propagation of durable deletes, the conflict-resolution-policy should be “last-update-time”. For use cases sensitive to network partitions, Aerospike recommends that you configure strong-consistency.

Tombstone metrics and logging

The tombstones ticker log line reflects the tombstones within the namespace on each node:

{test} tombstones: all 11252 xdr (11223,0) master 5501 prole 5751 non-replica 0

Additionally, the following namespace statistics are available to track tombstones:

asinfo -v "namespace/test" -l | grep tombstone
tombstones=11252
master_tombstones=5501
prole_tombstones=5751
xdr_tombstones=11223
non_replica_tombstones=0
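
For monitoring scripts, the statistics output above can be parsed into integers. This is a minimal sketch with hypothetical function names. The consistency check, that master, prole, and non-replica counts add up to the total, holds for the sample output above and is an assumption rather than a documented guarantee.

```python
from typing import Dict


def parse_tombstone_stats(text: str) -> Dict[str, int]:
    """Extract `name=value` tombstone statistics from asinfo output."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition("=")
        # Skip the command line itself and anything that is not a counter.
        if sep and "tombstones" in name and value.isdigit():
            stats[name] = int(value)
    return stats


def replica_counts_add_up(stats: Dict[str, int]) -> bool:
    """Check whether the per-role counts sum to the namespace total."""
    parts = (stats["master_tombstones"]
             + stats["prole_tombstones"]
             + stats["non_replica_tombstones"])
    return parts == stats["tombstones"]
```

Fed the sample output shown above, the parser yields five counters, and 5501 + 5751 + 0 matches the reported total of 11252.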

Client/server compatibility

To keep backward compatibility, the default client policy is a non-durable delete.
Client applications must be updated to use the durable deletes feature.

Community/Enterprise compatibility

If a drive written by an Enterprise Edition server contains tombstones and the server is downgraded to the Community Edition, cold start fails when reading the tombstones. Drives must be cleaned up for a successful restart.

One new error code is introduced:

AS_PROTO_RESULT_FAIL_ENTERPRISE_ONLY - returned if a command with a durable delete policy is issued against a Community Edition server.