Durable Deletes
Overview
Deleted objects do not stay deleted across cold starts, where the index is repopulated from storage, unless you use the durable deletes feature. By default, Aerospike drops the primary index entry for a deleted object to reclaim memory. A durable delete additionally generates a tombstone: a record without any bins that contains all metadata, including the key. The record's data is still removed and its storage is eventually freed. Tombstones resolve conflicts correctly and prevent previously persisted versions of deleted objects from resurrecting when the index is repopulated.
Tombstones and strong consistency mode
Durable deletes are supported in strong consistency. For namespaces in strong consistency mode, durable deletes are required by default.
Non-durable deletes, including data expiration and eviction, do not generate tombstones and are not strongly consistent. For details about managing non-durable deletes, see Non-durable deletes, expiration and data eviction.
Usage
Durable deletes can be specified as a client-side policy on a per-transaction basis, and apply to the following calls (see the example after this list):
- Write - applicable only when the last bin is removed, resulting in a record delete
- Delete
- Operate - applicable only when the last bin is removed, resulting in a record delete
- UDF - applicable only when UDF execution results in a record delete, either through a delete call or removal of the last bin
Durable deletes depend on accurate clocks. Verify that your clocks are synchronized across the cluster.
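As an illustration, the Aerospike Java client exposes this as the durableDelete field on WritePolicy. The following minimal sketch assumes a local server and uses placeholder namespace, set, and key names:

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.policy.WritePolicy;

public class DurableDeleteExample {
    public static void main(String[] args) {
        // Placeholder seed node; adjust host/port for your cluster.
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        try {
            // Per-transaction policy: request a durable delete (tombstone) instead of
            // simply dropping the primary index entry.
            WritePolicy policy = new WritePolicy();
            policy.durableDelete = true;

            // Placeholder namespace/set/key.
            Key key = new Key("test", "demo", "user-1");

            // Delete call: removes the record and writes a tombstone.
            client.delete(policy, key);
        } finally {
            client.close();
        }
    }
}
```

The same durableDelete flag applies to write, operate, and UDF calls when they result in a record delete.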
Tombstones
Tombstones use RAM and disk space, which is reported in logs and statistics, until the tomb raider removes them. Tombstones from durable deletes vary in size depending on the set name length, your XDR configuration, the size of stored keys (if present), and whether you use bin convergence, as discussed in the Tombstone management section of this article.
Tombstones on cold start
On a cold start, disks are scanned to rebuild the in-memory index tree for the records. Versions of each record are compared, and the version with the most recent last-update-time (LUT) is brought back, with record generation as the tiebreaker. For a record that is durably deleted, the tombstone is just another version that participates in the comparison, and it can prevent older versions of the record from returning. If the tombstone is the most recent version, it is reloaded into the index.
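The comparison can be pictured roughly as follows. This is an illustrative sketch only, not Aerospike source; the Version type is hypothetical:

```java
// Illustrative sketch: choose which version of a record survives a cold start.
// Latest last-update-time (LUT) wins; generation breaks ties.
// A tombstone is just another version, with no bins but its own LUT and generation.
class ColdStartResolution {
    static class Version {
        long lastUpdateTime;  // LUT
        int generation;
        boolean tombstone;    // true for a bin-less tombstone version
    }

    static Version resolve(Version a, Version b) {
        if (a.lastUpdateTime != b.lastUpdateTime) {
            return (a.lastUpdateTime > b.lastUpdateTime) ? a : b;
        }
        return (a.generation >= b.generation) ? a : b;
    }
}
```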
Tombstone management
Durable deletes create a tombstone when an object is deleted. A tombstone is a record without any bins; it contains all metadata, including the key.
A tombstone write is similar to a record update in that:
- Both occupy an entry in the index, alongside other record entries.
- Both are persisted on disk, together with previous copies of the record on disk.
- Both carry the same metadata as any other record:
  - last-update-time - updated just like a normal update.
  - generation - increments just like a normal update.
- Both are replicated at the replication factor specified on the namespace.
- Both are migrated the same way current records are migrated.
- Both are conflict-resolved the same way as data records.
As with data records, memory used by tombstones is reclaimed when they are removed from the in-memory index, and the on-disk copy becomes eligible for defragmentation. Index memory is immediately reusable; storage becomes reusable once the space is defragmented.
Capacity sizing and tombstones
For detailed information on calculating tombstone sizing requirements, see the Data Storage Size section in our Capacity Planning guide.
Sizing considerations described in this section include the size of tombstones:
- From a standard durable delete
- When bin convergence is enabled
- When XDR is involved
Sizing considerations are complex. We recommend that you open a use case and sizing discussion with Aerospike Solutions Architects to explore how using durable deletes affects cluster sizing.
Standard durable delete
The sizing impact of a standard durable delete is as follows:
- Index space = 64 bytes
- Disk space = (35 bytes + set name size + optional key size) rounded to the next 16 (minimum of 48 bytes).
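For example, a tombstone for a record in a set whose name is 5 bytes long, with no stored key, uses 64 bytes of index space and 35 + 5 = 40 bytes on disk, which rounds up to the 48-byte minimum; with a 20-byte stored key, the disk footprint is 35 + 5 + 20 = 60 bytes, rounded up to 64 bytes. (The set name and key lengths here are illustrative.)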
Capacity sizing and bin convergence
Bin convergence is supported with record deletes as long as they are durable deletes. When bin convergence is enabled for a namespace and a record is durably deleted, the delete converts the record to a bin cemetery which maintains the necessary LUTs and src-id.
The sizing calculation for each bin is:

(meta_byte + bin_meta_lut + src_id) + (bin_name_sz_byte + bin_name_len) + (particle_type_byte + particle_sz)
where:

1. meta_byte + bin_meta_lut + src_id

| Convergence feature (metadata) | meta_byte | bin_meta_lut | src_id |
|---|---|---|---|
| bin has metadata | 1 | 5 | 1 |
| bin has no metadata | 0 | 0 | 0 |

2. bin_name_sz_byte + bin_name_len

bin_name_sz_byte = 1
bin_name_len = length of the bin name

3. particle_type_byte + particle_sz

| Tombstone | particle_type_byte | particle_sz |
|---|---|---|
| bin is a tombstone | 0 | 0 |
| bin is not a tombstone | 1 | depends on type |
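For example, a tombstoned bin with convergence metadata and a 6-byte bin name occupies (1 + 5 + 1) + (1 + 6) + (0 + 0) = 14 bytes; the bin-name length here is illustrative.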
For more information, see Bin Convergence.
Tomb raider
The tomb raider, a special background mechanism, removes tombstones which are no longer needed. A potential impact of durable deletes is an increase in large-block reads on the SSD as the tomb raider sweeps the disk; if the cluster is not sized correctly, this can introduce latency that affects transactional operations. The conditions for a tombstone to be removed are as follows:
- There are no previous copies of the record on disk.
  - This condition ensures that a cold start will not bring back any older copy.
- The tombstone's last-update-time is before the current time minus the configured tomb-raider-eligible-age.
  - This condition prevents a node that has been apart from the cluster for tomb-raider-eligible-age seconds from rejoining and re-introducing an older copy.
- The node is not waiting for any incoming migration.
- The tombstone has been successfully shipped by XDR (if using XDR 5.0 and above).
If all conditions are satisfied, the tombstone is reclaimed.
The actual background thread is split into roughly the following steps (a sketch follows below):
- Iterate through the index to mark all tombstones as candidates for removal (cenotaphs).
- Scan each disk block for records, and un-mark the cenotaph for each record found.
- Iterate through the index again. All cenotaphs still marked are candidates for permanent removal.
For non-persisted namespaces, tombstone removal follows a separate path and requires only a single index iteration.
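The following illustrative sketch shows the shape of the three passes. It is not Aerospike source; Entry, Index, and Storage are hypothetical stand-ins for server internals:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types, defined only to make the sketch self-contained.
interface Entry {
    boolean isTombstone();
    boolean isCenotaph();
    void markCenotaph();
    void clearCenotaph();
    String key();
    long lastUpdateTime();
}

interface Index extends Iterable<Entry> {
    Entry find(String key);
    void remove(Entry e);
}

interface Storage {
    Iterable<Entry> scanLargeBlocks();
}

class TombRaiderSketch {
    static void sweep(Index index, Storage storage) {
        // Pass 1: mark every tombstone in the index as a removal candidate (cenotaph).
        for (Entry e : index) {
            if (e.isTombstone()) e.markCenotaph();
        }
        // Pass 2: scan disk blocks; an older on-disk copy of a marked key clears the
        // mark, since the tombstone is still needed to cover that copy on cold start.
        for (Entry onDisk : storage.scanLargeBlocks()) {
            Entry indexed = index.find(onDisk.key());
            if (indexed != null && indexed.isCenotaph()
                    && onDisk.lastUpdateTime() < indexed.lastUpdateTime()) {
                indexed.clearCenotaph();
            }
        }
        // Pass 3: tombstones still marked have no older copies on disk; subject to the
        // eligibility conditions listed above, they are removed permanently.
        List<Entry> removable = new ArrayList<>();
        for (Entry e : index) {
            if (e.isCenotaph()) removable.add(e);
        }
        for (Entry e : removable) {
            index.remove(e);
        }
    }
}
```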
Cold start also removes unneeded tombstones as part of the disk reading:
- All tombstones are marked as candidates for removal (cenotaphs) on initial bring-up.
- If a live record which the tombstone covers is subsequently read, the cenotaph is unmarked and the tombstone stays.
- Otherwise, at the end of the cold start, all remaining cenotaphs are deleted.
The following configurations are available to control the behavior of the tomb raider:
- tomb-raider-period - minimum amount of time, in seconds, between runs; default is 1 day (86400).
- tomb-raider-eligible-age - number of seconds to retain a tombstone, even after it is discovered to be safe to remove; default is 1 day (86400).
- tomb-raider-sleep (storage only) - number of microseconds to sleep between large-block reads on disk; default is 1000 µs (1 ms).
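As a sketch only, these parameters might be tuned in aerospike.conf as follows; the namespace name and values are illustrative, not recommendations:

```
namespace test {
    # Run the tomb raider at most once per hour (default 86400 seconds).
    tomb-raider-period 3600
    # Retain tombstones for two hours even after they become safe to remove (default 86400).
    tomb-raider-eligible-age 7200

    storage-engine device {
        # Sleep 2000 microseconds between large-block reads during the sweep (default 1000).
        tomb-raider-sleep 2000
    }
}
```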
Expired and evicted records
Expired and evicted records do not generate tombstones. This is desirable behavior:
- It allows maximum resource capacity to be used for data records instead of tombstones.
- If resource capacity later increases, for example by adding memory capacity on a node, non-durably deleted records can be revived on cold start.
There are also other conditions where evicted records may return:
- A replica with a shallower cold-start eviction-time than the master. In this case, when the master node departs and the replica is cold started, the replica's records may revive.
Tombstones do not have an expiration time set, and thus are not eligible for eviction. Tombstones have their own deletion mechanism.
Scan, Batch
Scan and batch operations do not return tombstoned records.
Conflict resolution policy
The conflict resolution policy affects durable delete behavior across cluster state changes. To guarantee correct propagation of durable deletes, conflict-resolution-policy should be set to "last-update-time". For use cases sensitive to network partitions, Aerospike recommends that you configure strong-consistency for the namespace.
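As a sketch (the namespace name is illustrative), the relevant namespace settings in aerospike.conf might look like:

```
namespace test {
    # AP mode: resolve conflicts by last-update-time so durable deletes propagate correctly.
    conflict-resolution-policy last-update-time

    # Alternatively, for partition-sensitive use cases, run the namespace in strong consistency mode:
    # strong-consistency true
}
```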
Tombstone reporting
Log lines report the status of tombstones within the namespace on the node:

```
Aug 24 2016 22:13:50 GMT: INFO (info): (ticker.c:336) {test} objects: all 4344 master 2209 prole 2135
Aug 24 2016 22:13:50 GMT: INFO (info): (ticker.c:372) {test} tombstones: all 2378 master 1293 prole 1085
```

Additionally, the following namespace statistics are available to track tombstones:

```
$ asinfo -v "namespace/test" -l | grep tomb
tombstones=2378
master_tombstones=1293
prole_tombstones=1085
tomb-raider-eligible-age=86400
tomb-raider-period=10000
```
Client/server compatibility
- The default client policy is NOT durable delete, to preserve backward compatibility.
- Client applications must be updated to use the durable delete feature.
Community/Enterprise compatibility
If a drive written by the Enterprise Edition contains tombstones and the node is downgraded to the Community Edition, cold start fails when reading the tombstones. Drives must be cleaned up for a successful restart.
One new error code is introduced:
AS_PROTO_RESULT_FAIL_ENTERPRISE_ONLY - returned if a durable delete policy is issued against a Community Edition server.