
Disk usage and defragmentation

This page describes how disk usage and defragmentation can impact cluster performance.

How do writes and updates work?

Aerospike always writes data using an in-memory buffer called a streaming write buffer (SWB). The SWB is 8 MiB in size and accumulates records before flushing them to disk. When the SWB fills up, it is flushed to the next available free block on disk and a new SWB takes its place.

Advanced reading
  • The flush-size configuration parameter controls the increment at which the SWB is flushed to disk. commit-to-device forces a flush for every write.
  • A flush-size buffer can also be flushed before it is full, based on the flush-max-ms configuration. The partial_writes metric (whose exact name depends on the storage type) counts such partial flushes.
  • The SWB is per-device, and each device has multiple SWBs to prevent polluting the post-write-cache.
  • These details also apply to in-memory namespaces (other than the flushing part itself).
  • The write_q metric (which depends on the storage type, along with its shadow-device counterpart) and the max-write-cache configuration parameter are also relevant.
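The buffering behavior described above can be sketched as a simple model. This is illustrative only: the class, callback, and record sizes are hypothetical, not Aerospike internals.

```python
# Minimal sketch of a streaming write buffer (SWB) model.
# Names and the flush callback are illustrative, not Aerospike APIs.

SWB_SIZE = 8 * 1024 * 1024  # 8 MiB, as described above

class StreamingWriteBuffer:
    def __init__(self, flush_to_disk):
        self.buffer = bytearray()
        self.flush_to_disk = flush_to_disk  # writes one block to a free wblock

    def write_record(self, record: bytes):
        # When the buffer cannot hold the next record, flush it as one
        # block to the next free block on disk and start a new buffer.
        if len(self.buffer) + len(record) > SWB_SIZE:
            self.flush_to_disk(bytes(self.buffer))
            self.buffer = bytearray()
        self.buffer += record

blocks = []
swb = StreamingWriteBuffer(blocks.append)
for _ in range(5):
    swb.write_record(b"x" * (3 * 1024 * 1024))  # five 3 MiB records
print(len(blocks))      # 2 full flushes have occurred
print(len(swb.buffer))  # 3 MiB still accumulating in the open SWB
```

The last record stays buffered until a later flush (in the real server, flush-max-ms bounds how long that can take).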

Aerospike does not perform in-place updates. When a record is updated, that is, when a bin is added, updated, or deleted, the entire record is rewritten into the current SWB. The old version is left behind in its original block, marked obsolete but still occupying space. Similarly, when a record is deleted, the index entry is removed, or a tombstone is written for durable deletes, leaving the deleted record in its original block while that space is marked as unused. These behaviors lead to fragmentation over time.

Advanced reading
  • Under very specific circumstances, Aerospike rewrites a record in place of the original: the updated record must be exactly the same size as the original and fall within the current flush-size segment.
  • The statistic for disk usage (data_used_bytes) does not account for the unused space within blocks. The available percent (data_avail_pct), however, does account for that space and only reflects the free blocks available to be written into.

What is fragmentation?

Fragmentation occurs when blocks on disk contain obsolete or deleted records, leaving unused space. Over time, as records are updated or deleted, blocks have less used space in them, causing the disk to be more fragmented.

Aerospike reclaims these blocks through continuous background defragmentation, which consolidates the live records into new blocks and returns the original blocks to the free queue.

When does fragmentation happen?

When a block’s used portion falls below a certain threshold, it becomes eligible for defragmentation and is added to the defragmentation queue.

A background process consumes this queue, rewrites the live records in fragmented blocks into new blocks, and marks the old blocks as free. The speed of this process is regulated by a sleep interval between blocks.
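The eligibility-and-consolidation cycle can be sketched as a toy model. Block sizes, names, and packing are simplified for illustration; none of this is Aerospike internal code.

```python
# Toy model of background defragmentation: blocks below the low-water mark
# are queued, their live data is consolidated into new blocks, and the
# originals are returned to the free queue.
import math

DEFRAG_LWM_PCT = 50  # default threshold

def defrag(blocks, lwm_pct=DEFRAG_LWM_PCT):
    """blocks: list of used fractions (0.0-1.0). Returns (new_blocks, freed)."""
    eligible = [b for b in blocks if b * 100 < lwm_pct]
    survivors = [b for b in blocks if b * 100 >= lwm_pct]
    live = sum(eligible)          # live data that must be rewritten
    # Consolidate live records into as few blocks as possible.
    new_full = math.floor(live)
    remainder = live - new_full
    new_blocks = survivors + [1.0] * new_full
    if remainder > 0:
        new_blocks.append(remainder)
    freed = len(eligible) - (len(new_blocks) - len(survivors))
    return new_blocks, freed

# Four blocks at 30% used hold 1.2 blocks of live data: after defrag,
# that data occupies 2 blocks (one full, one ~20% used) and 2 blocks
# are returned to the free queue.
new_blocks, freed = defrag([0.3, 0.3, 0.3, 0.3])
print(freed)  # 2
```

Note that the live data itself is rewritten, which is the source of the write amplification discussed below.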

What are the key defragmentation tuning parameters?

  • defrag-lwm-pct, or defrag low water mark percent, is the threshold of used space in any given block. A block that falls below this threshold is sent to the defragmentation queue. The default value is 50 percent.
  • defrag-sleep is the number of microseconds the defragmentation thread sleeps between processing blocks that have fallen below the defrag threshold. The default value is 1000 µs, or 1 ms.
Advanced reading
  • The defrag_q statistic, which is based on the storage type, represents the size of the defragmentation queue.
  • Blocks to be defragmented are read in flush-size increments, with a sleep between each read.
  • There are benchmark histogram statistics for the blocks read by the defragmentation thread. See Monitor latency with histograms for more information.
  • When a block is read for defragmentation, all of its records are checked against the primary index to determine which ones are still valid. This process can be particularly expensive if the primary index is stored on flash rather than in memory.
  • The defrag-sleep configuration does not account for how long it takes to read a block. It is the amount of time to sleep before reading the next block, or flush-size increment.
  • Blocks that are still in the post-write-cache are not eligible for defragmentation.
  • Blocks that have been defragmented may not be immediately available to be written over. These blocks may still be referenced elsewhere, for example by secondary indexes that require garbage collection to complete.
  • The asinfo -v "dump-wb-summary:ns=namespaceName" info command provides details about the data usage within blocks.
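For reference, both tuning parameters live in the storage-engine context of a namespace. A sketch of where they appear in aerospike.conf, assuming a device-backed namespace; the namespace name and device path are placeholders:

```
namespace test {
    # other namespace settings omitted
    storage-engine device {
        device /dev/nvme0n1
        defrag-lwm-pct 50    # blocks below 50% used become defrag-eligible
        defrag-sleep 1000    # microseconds to sleep between blocks
    }
}
```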

What is write amplification?

Write amplification refers to the phenomenon of more data being written to disk than the actual application workload requires, due to background operations like defragmentation.

  • At the default 50% defrag-lwm-pct, the write amplification is 2X: two blocks that are 50% full must be defragmented, consolidating the half-used portion of each into one full block, to yield one free block.
  • At 75% defrag-lwm-pct, the write amplification is 4X: four blocks that are 75% full must be defragmented; their live data refills three blocks, yielding one free block.
  • At 90% defrag-lwm-pct, the write amplification is 10X.
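The pattern above generalizes: at a low-water mark of p percent, roughly 1/(1 − p/100) block writes occur per free block produced. A small illustration of that relationship (an approximation; it ignores blocks freed by other means, such as mass deletes):

```python
# Defrag write amplification as a function of defrag-lwm-pct,
# matching the 2X / 4X / 10X examples above.

def write_amplification(defrag_lwm_pct: float) -> float:
    return 1.0 / (1.0 - defrag_lwm_pct / 100.0)

for pct in (50, 75, 90):
    print(pct, round(write_amplification(pct), 1))
# 50 -> 2.0, 75 -> 4.0, 90 -> 10.0
```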

What are the impacts of adjusting defrag-lwm-pct?

Increasing defrag-lwm-pct

  • When you increase defrag-lwm-pct dynamically, it often causes a transient surge of blocks in the defrag queue, affecting performance due to increased disk read and write I/O (not applicable to in-memory namespaces). The surge increases write amplification but generates free blocks to recover available percent.
  • In such cases, defragmentation may throttle, pausing intermittently based on max-write-cache, if the write queue backs up; a backed-up write queue can also cause “queue too deep” errors for other types of writes.

Decreasing defrag-lwm-pct

  • Decreasing defrag-lwm-pct lowers write amplification. This can be done temporarily, when there is enough available percent (free blocks) headroom, to maximize application workload performance.
  • Sustaining a lower defrag-lwm-pct value risks running out of free blocks well before the storage is even half used.

What are the impacts of adjusting defrag-sleep?

Increasing defrag-sleep

  • Increasing defrag-sleep can cause the defrag queue to back up, depending on how quickly blocks are becoming eligible.
  • The longer blocks stay in the defrag queue, the higher the chance that those blocks can be further depleted.
  • This can result in fewer defrag writes and decrease the write amplification.

Decreasing defrag-sleep

  • Decreasing defrag-sleep makes sense if the defrag queue is backing up and available percent risks getting too low.
  • This will, however, increase disk read and write I/O (not applicable to in-memory namespaces).

How do I handle sudden defragmentation activity spikes?

Sometimes, a large number of blocks become eligible to be defragmented at once, possibly due to truncate, mass expirations, mass tombstone deletion at the end of a tomb raider run, or a partition drop during migrations. This can result in a sudden spike in the defrag queue.

What if available percent is a concern?

  • Temporarily reduce defrag-sleep to increase throughput, for example from 1000 → 500, while monitoring disk activity to prevent overwhelming the storage subsystem.
  • Do not increase defrag-lwm-pct reactively. It controls eligibility, not urgency.
  • Monitor key disk metrics.
  • Potentially temporarily lower the defrag-lwm-pct until the defrag queue has been consumed.

After the spike is absorbed, restore configuration parameters to their previous values.
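Both parameters are dynamic and can be changed at runtime with the asinfo set-config command. A sketch, assuming a namespace named test; adjust the namespace name and values to your deployment:

```shell
# Speed up defrag during a spike (defrag-sleep is in microseconds):
asinfo -v "set-config:context=namespace;id=test;defrag-sleep=500"

# After the queue drains, restore the previous value:
asinfo -v "set-config:context=namespace;id=test;defrag-sleep=1000"
```

Dynamic changes do not persist across restarts; update aerospike.conf if a new value should be permanent.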

What if application performance impact is a concern and there is ample available percent (free blocks)?

  • Temporarily increase defrag-sleep, for example from 1000 → 10000, allowing blocks to deplete further before being processed by the defragmentation thread and reducing the defrag write load.
  • Consider temporarily lowering the defrag-lwm-pct until the defrag queue has been consumed or available percent is getting too low.

Is a high defrag queue always a problem?

  • A high defrag_q does not necessarily indicate a problem. It simply means that many blocks are eligible for defragmentation.
  • As long as available percent (free blocks) is not critically low, it’s OK to let the queue drain slowly. In fact, letting blocks sit longer in the queue can reduce write amplification, as they may deplete further before they are rewritten. This results in fewer live records being copied and a lower defrag write workload.

When should I take action?

You should only tune aggressively if the following are true:

  • The defrag queue is not shrinking over time.
  • The number of free blocks (free_wblocks) / available percent (data_avail_pct) is consistently dropping.
  • Application latency is being impacted due to disk I/O latency.

Otherwise, the default settings are usually sufficient to balance background activity with disk space recovery.
