Managing Storage
Overview
Aerospike can absorb large amounts of data in a short time. The following examples describe options for configuring Aerospike to manage storage, without requiring continual intervention by administrators.
See Configuring namespace data retention for definitions of expiration, eviction, and stop-writes, and the configuration parameters for controlling those processes.
Defragmentation
Aerospike runs a continuous background defragmentation process to maximize the amount of available storage. When write-block (wblock) usage drops below the defrag-lwm-pct limit, storage space occupied by stale data is reclaimed.
Write-blocks that are still in the post-write-queue (before Database 7.1) or post-write-cache (the name in Database 7.1 and later) are not candidates for defragmentation, even if the percent of live records in those write-blocks drops below the defrag-lwm-pct. Therefore, the post-write-cache (or post-write-queue) should be kept small compared to the overall device size, as the size allocated to the post-write-cache will not be defragmented.
In Database 7.1 and later, the write block size is hard coded to 8MiB and the write-block-size configuration parameter was removed. A new flushing mechanism was introduced using flush-size.
Starting with Database 7.0, all storage engines (device for SSDs, pmem for Intel Optane Persistent Memory (PMem), shmem for shared memory) use a unified storage format based on write-blocks; defragmentation works the same for all.
Aerospike requires free storage space in order to efficiently defragment the storage device while also performing a high volume of operations at low latency.
When defragmentation cannot keep up with storage requirements, you may have to increase the defragmentation rate.
You can use asadm to check storage statistics.
- In Database 7.0 and later, the metrics data_avail_pct and data_used_pct are common to all storage engines.
- Prior to Database 7.0, metrics included device_available_pct, pmem_available_pct, device_free_pct, and pmem_free_pct.
The following command shows the device_available_pct for the test namespace in Aerospike 6.x:
asadm --enable -e "show statistics like device_available_pct for test"
The Aerospike defragmentation mechanism
Aerospike writes data to namespace storage-engine in blocks.
- In Database 7.0 and earlier, the size is configured in the write-block-size parameter. Each wblock is filled with incoming write transactions and then flushed to a persistent storage device.
Database 7.1: flush-size
Each wblock is filled with incoming write transactions and then flushed to a persistent storage device.
The flush-size configuration parameter defines the size in bytes of each I/O unit that is written to disk. A flush event happens either when the 8MiB streaming write buffer (SWB) is full, or when the flush-max-ms period expires. At this point, the most recently written data is flushed from the SWB to disk in a series of flush-size units. These writes are appended to each other until the write block is full.
You can increase or decrease the flush size dynamically. The default value is 1MiB and the configured value of this parameter must be a power of 2. The options are: 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K, 1M, 2M, 4M, and 8M. In most direct-attached NVMe devices, the ideal size is 128K.
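The constraints on flush-size described above (a power of 2 between 4K and 8M) can be checked before applying a new value. The following is a minimal sketch, not part of Aerospike itself; the function names `is_valid_flush_size` and `flushes_per_write_block` are hypothetical helpers introduced here for illustration, and the arithmetic assumes a full 8MiB write block is flushed in equal flush-size units.

```python
WRITE_BLOCK_BYTES = 8 * 1024 * 1024  # 8MiB write block (Database 7.1 and later)
MIN_FLUSH_BYTES = 4 * 1024           # smallest listed option (4K)


def is_valid_flush_size(size_bytes: int) -> bool:
    """Return True if size_bytes is a power of 2 between 4K and 8M inclusive."""
    is_power_of_two = size_bytes > 0 and (size_bytes & (size_bytes - 1)) == 0
    return is_power_of_two and MIN_FLUSH_BYTES <= size_bytes <= WRITE_BLOCK_BYTES


def flushes_per_write_block(size_bytes: int) -> int:
    """Number of flush-size units needed to fill one 8MiB write block."""
    if not is_valid_flush_size(size_bytes):
        raise ValueError(f"{size_bytes} is not a valid flush-size")
    return WRITE_BLOCK_BYTES // size_bytes


if __name__ == "__main__":
    for label, size in (("128K", 128 * 1024), ("1M", 1024 * 1024)):
        print(label, "->", flushes_per_write_block(size), "flushes per write block")
```

With the 128K value suggested for direct-attached NVMe devices, a full write block is written in 64 units; with the 1MiB default it is written in 8.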
Each device associated with a namespace has a write queue and a cache. The configuration max-write-cache controls the number of bytes of pending write blocks that the system is allowed to keep before failing writes, if the write queue can't immediately flush a streaming write buffer to a write block on the disk.
Prior to Database 7.1
Each wblock is filled with incoming write transactions and then flushed to a persistent storage device as follows:
- When the streaming write buffer (of size write-block-size) is full, or when the next record to be written doesn't fit.
- When the streaming write buffer has not been flushed for flush-max-ms milliseconds (default of one second).
- On every write transaction when configured through the commit-to-device parameter for strong-consistency enabled namespaces.
In Database 7.0, an in-memory namespace without a storage-backed persistence device or file configured does not flush to any storage device; its wblocks reside in shared memory alone.
As records are updated or deleted, the active records capacity of the wblocks decreases. When a block usage level falls below the value set by the defrag-lwm-pct parameter, it becomes eligible for defragmentation and is queued up in the storage-engine.device[ix].defrag_q. The default value of defrag-lwm-pct is 50%.
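The eligibility rule above can be illustrated with a small calculation. This is a simplified sketch under stated assumptions, not the server's implementation: `live_pct` and `is_defrag_eligible` are hypothetical names, and block usage is modeled as live bytes divided by the wblock size.

```python
WBLOCK_BYTES = 8 * 1024 * 1024  # assumed wblock size for the example (8MiB)


def live_pct(live_bytes: int, wblock_bytes: int = WBLOCK_BYTES) -> float:
    """Percentage of the wblock still occupied by live (current) records."""
    return 100.0 * live_bytes / wblock_bytes


def is_defrag_eligible(live_bytes: int,
                       wblock_bytes: int = WBLOCK_BYTES,
                       defrag_lwm_pct: float = 50.0) -> bool:
    """A wblock qualifies once its usage falls below defrag-lwm-pct."""
    return live_pct(live_bytes, wblock_bytes) < defrag_lwm_pct


if __name__ == "__main__":
    mib = 1024 * 1024
    for live in (3 * mib, 5 * mib):
        print(f"{live_pct(live):.1f}% live -> eligible: {is_defrag_eligible(live)}")
```

In this model a wblock holding 3MiB of live data (37.5%) is queued at the default threshold, while one holding 5MiB (62.5%) is not; raising defrag-lwm-pct to 70 would make both eligible.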
The following four configuration parameters can be tuned for the defragmentation sub-system. You can set them dynamically, or in the aerospike.conf server configuration file for a persistent configuration:
- defrag-lwm-pct (default: 50%). A higher percentage means more blocks are scheduled to be reclaimed, and more dense data on the device. The default value provides a good balance between space usage and write amplification.
  - For a given use case it may be desirable to increase defrag-lwm-pct and gain more usable space on the storage devices. In such instances, for example when the workload is read-heavy, write-amplification may be less of a factor. This should be tested, particularly to observe the effect on defragmentation load during operations which generate a lot of deletions, such as truncation or partitions dropping during migration.
  - In Database 7.0, for an in-memory namespace without storage-backed persistence you can similarly tune the defrag-lwm-pct higher, but here the trade-off is between space usage and CPU consumption. This should be adjusted carefully and observed.
- defrag-sleep: The default sleep time is 1000 microseconds after each wblock is defragmented.
- defrag-startup-minimum defaults to 10%. If a minimum of 10% of data storage is not writable then the server will not join the cluster or open a service port. Storage might appear full to Aerospike, because it writes all data in wblocks.
  - Starting with Database 7.0 use data_used_pct to see how much of the total namespace data storage capacity is in use.
  - Prior to Database 7.0, use device_free_pct to see the total available writable space across all devices in the namespace.
- defrag-queue-min: The default is 0, do not defragment. Use a value greater than zero to define how many wblocks in the defrag-queue will initiate defragmentation.
The server log captures the defragmentation profile:
NAMESPACE-NAME /dev/sda: used-bytes 296160983424 free-wblocks 885103 write-q 0 write (12659541,43.3) defrag-q 0 defrag-read (11936852,39.1) defrag-write (3586533,10.2) shadow-write-q 0 tomb-raider-read (13758,598.0)
The details for each parameter are described in the log reference manual. The following metrics capture device statistics:
- storage-engine.device[ix].used_bytes
- storage-engine.device[ix].free_wblocks
- storage-engine.device[ix].write_q
- storage-engine.device[ix].writes
- storage-engine.device[ix].defrag_q
- storage-engine.device[ix].defrag_reads
- storage-engine.device[ix].defrag_writes
- storage-engine.device[ix].shadow_write_q
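To compare these values over time, the example log line above can be split into named fields. The sketch below is illustrative only: `parse_device_line` is a hypothetical helper, and it assumes each parenthesized pair is (cumulative total, rate per second), which should be confirmed against the log reference manual.

```python
import re

LOG_LINE = (
    "NAMESPACE-NAME /dev/sda: used-bytes 296160983424 free-wblocks 885103 "
    "write-q 0 write (12659541,43.3) defrag-q 0 defrag-read (11936852,39.1) "
    "defrag-write (3586533,10.2) shadow-write-q 0 tomb-raider-read (13758,598.0)"
)

# A field name followed by either a "(total,rate)" pair or a plain integer.
FIELD_RE = re.compile(r"([a-z][a-z-]*) (?:\((\d+),([\d.]+)\)|(\d+))")


def parse_device_line(line: str) -> dict:
    """Map each field name to an int, or to a (total, rate) tuple."""
    _, _, stats = line.partition(": ")  # drop the namespace and device prefix
    fields = {}
    for name, total, rate, value in FIELD_RE.findall(stats):
        if value:
            fields[name] = int(value)
        else:
            fields[name] = (int(total), float(rate))
    return fields


if __name__ == "__main__":
    fields = parse_device_line(LOG_LINE)
    write_rate = fields["write"][1]
    defrag_write_rate = fields["defrag-write"][1]
    # The write rate includes defrag writes, so subtract them to estimate
    # the rate generated by incoming traffic.
    print("free wblocks:", fields["free-wblocks"])
    print("non-defrag write rate:", round(write_rate - defrag_write_rate, 1))
```

Collecting `free-wblocks` and `defrag-q` from successive log lines makes it easier to spot a steadily shrinking pool of free wblocks or a growing defragmentation queue.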
In the example log line, the write rate is greater than the defragmentation write rate. Writes per second include the defrag writes per second. Initially, this may not pose a problem, but over a period of time you may run low on available wblocks. You may also want to monitor the defrag-q, which should not be constantly increasing. If you determine the node is falling behind and the logs show an empty defragmentation queue, consider raising the defrag-lwm-pct slightly. Be aware that raising the defrag-lwm-pct increases write amplification non-linearly.
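The non-linear effect can be approximated with a simple model. The sketch below is an assumption, not a formula taken from this page: it treats every reclaimed wblock as being exactly at the defrag-lwm-pct threshold, which gives a steady-state estimate of 1 / (1 - defrag-lwm-pct / 100). Real workloads will differ, and `write_amplification` is a hypothetical helper name.

```python
def write_amplification(defrag_lwm_pct: float) -> float:
    """Estimated device writes per client write under the simplified model.

    Assumes each defragmented wblock still holds defrag_lwm_pct percent
    live data that must be rewritten (an illustrative assumption).
    """
    if not 0 <= defrag_lwm_pct < 100:
        raise ValueError("defrag_lwm_pct must be in the range [0, 100)")
    return 1.0 / (1.0 - defrag_lwm_pct / 100.0)


if __name__ == "__main__":
    for pct in (50, 60, 75):
        print(f"defrag-lwm-pct {pct}: about {write_amplification(pct):.2f}x")
```

Under this model, moving from 50 to 60 raises the estimate from 2x to 2.5x, while moving from 50 to 75 doubles it to 4x, which is why small increments are advisable.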
Search for write and defrag-write in your server logs to see more useful information:
tail -f /var/log/aerospike/aerospike.log | grep -i -e write -e defrag-write
Increasing the defragmentation rate
You may need to temporarily decrease the defrag-sleep and increase the defrag-lwm-pct parameters.
Use the asadm command-line interface to change defrag-sleep:
Admin> enable
Admin+> manage config namespace TEST storage-engine param defrag-sleep to 500 with 10.0.0.1:3000
Expected output:
~Set Namespace Param defrag-sleep to 500~
Node|Response
10.0.0.1:3000|ok
Number of rows: 1
Change defrag-lwm-pct:
Admin+> manage config namespace TEST storage-engine param defrag-lwm-pct to 60 with 10.0.0.1:3000
Expected output:
~Set Namespace Param defrag-lwm-pct to 60~
Node|Response
10.0.0.1:3000|ok
Number of rows: 1
The new values will not persist after a server restart. Add your desired values to aerospike.conf, in the namespace storage-engine section, to make them persistent:
defrag-sleep 500
defrag-lwm-pct 60
Stop-writes
See the detailed description of namespace eviction and stop-writes configuration parameters.
Prior to Database 7.0, min-avail-pct referred to free wblocks (write blocks), while max-used-pct referred to namespace disk usage in bytes, compared to its total disk capacity.
Since Database 7.0, stop-writes-avail-pct refers to free wblocks (write blocks), while stop-writes-used-pct refers to namespace storage usage in bytes, compared to its total storage capacity.
You can dynamically modify any of these stop-writes configuration parameters with asadm:
asadm --enable -e "manage config namespace TEST param stop-writes-used-pct to 85 with 10.1.2.3"
Alternatively, use asinfo:
# asinfo only talks to one node at a time
asinfo -h 10.1.2.3 -v "set-config:context=namespace;id=TEST;stop-writes-used-pct=85"
You can view your configured stop-writes parameters and their state with the show stop-writes command.
Evictions
You may choose to use eviction as a data management strategy for your namespace storage engine. Evictions are disabled by default in Database 4.9 and later.
When an eviction threshold of the namespace is crossed it triggers the namespace supervisor (NSUP) to start evicting data.
Verifying evictions
The eviction counter is reset every time the server is restarted. Use the asadm info command to verify that evictions are working the way you want:
Admin> info
This prints the free disk and memory available for each namespace. It also prints the configured limits to the eviction threshold for both memory and disk.
asadm -e "show statistics namespace for TEST like hwm_breached"
Inspect the Aerospike log for messages that show you may be evicting data. Run the following command on individual nodes:
grep -e "hwm_breached" -e "stop_writes" /var/log/aerospike/aerospike.log
NSUP not keeping up
If NSUP is not able to keep up with expiring records, it might take the node a long time to restart, as the node will first remove expired records before rejoining the cluster. In Database 6.3 and later, if the NSUP cycle takes longer than 2 hours and deletes more than 1% of the namespace, a warning line is written to the server log.
You can monitor the NSUP statistics nsup_cycle_duration and nsup_cycle_deleted_pct.
These are the stats used by the Monitoring Stack to trigger alerts and visually warn users.
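If you collect these two statistics yourself, the warning condition described above (a cycle longer than 2 hours that deletes more than 1% of the namespace) can be reproduced in a script. This is a hedged sketch: `nsup_falling_behind` is a hypothetical helper, and it assumes nsup_cycle_duration is reported in seconds, which should be verified against the metrics reference.

```python
NSUP_WARN_DURATION_SECONDS = 2 * 60 * 60  # 2 hours
NSUP_WARN_DELETED_PCT = 1.0               # 1% of the namespace


def nsup_falling_behind(cycle_duration_seconds: float,
                        cycle_deleted_pct: float) -> bool:
    """True when both thresholds from the Database 6.3 warning are exceeded."""
    return (cycle_duration_seconds > NSUP_WARN_DURATION_SECONDS
            and cycle_deleted_pct > NSUP_WARN_DELETED_PCT)


if __name__ == "__main__":
    # Hypothetical sample readings: (nsup_cycle_duration, nsup_cycle_deleted_pct)
    for duration, deleted in ((600, 0.2), (9000, 3.5)):
        print(duration, deleted, "->", nsup_falling_behind(duration, deleted))
```

A node that repeatedly meets this condition is a candidate for a shorter nsup-period or more nsup-threads.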
You can control NSUP by dynamically configuring nsup-period and nsup-threads.
asadm --enable -e "manage config namespace TEST param nsup-threads to 3"
Nodes will not start if there is not enough storage
If the database does not have enough contiguous storage to start, and does not have enough space to defragment to get the space it needs, it will not start.
For persistence files for in-memory databases, specify the size of the persistence file (in contrast to using an SSD, where you use the entire SSD). The persistence file size can also run out of space and the same rules apply as for SSDs.
When a namespace runs low on storage
When a namespace can no longer write data, you will see error messages in the log, like this example message:
Sep 05 2022 21:28:48 GMT: INFO (namespace): (base/namespace.c:458) {test} lwm breached true, hwm_breached true, stop_writes true, memory sz:22971755648 nobjects:358933683 nbytesmem:0 hwm:23192823808 sw:34789232640, disk sz:216122189312 hwm:216116854784 sw:341237137408
This shows that the namespace test on the node has reached the high-water-mark for either disk or memory, and the stop-writes percentage. As a result, the namespace can no longer accept write requests. Messages that look like this are the result of the stop_writes parameter being true either on this node, or other nodes:
Sep 05 2022 21:28:48 GMT: INFO (rw): (base/thr_rw.c:2300) writing pickled failed 8 for digest 7318ad7422e51009
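When scanning many nodes, it can help to pull the namespace name and the hwm_breached and stop_writes flags out of lines like the first example message above. The following sketch is illustrative only; `namespace_flags` is a hypothetical helper and the pattern is based solely on the sample line shown here, so other server versions may format the line differently.

```python
import re

SAMPLE = (
    "Sep 05 2022 21:28:48 GMT: INFO (namespace): (base/namespace.c:458) {test} "
    "lwm breached true, hwm_breached true, stop_writes true, "
    "memory sz:22971755648 nobjects:358933683 nbytesmem:0 hwm:23192823808 "
    "sw:34789232640, disk sz:216122189312 hwm:216116854784 sw:341237137408"
)

NAMESPACE_RE = re.compile(r"\{([^}]+)\}")
FLAG_RE = re.compile(r"\b(hwm_breached|stop_writes) (true|false)\b")


def namespace_flags(line: str) -> dict:
    """Return the namespace name plus any boolean flags found in the line."""
    result = {}
    match = NAMESPACE_RE.search(line)
    if match:
        result["namespace"] = match.group(1)
    for name, value in FLAG_RE.findall(line):
        result[name] = (value == "true")
    return result


if __name__ == "__main__":
    print(namespace_flags(SAMPLE))
```

A line without these fields yields an empty dictionary, so the helper can be applied to every line of a log file and the empty results discarded.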
Resolve this by adjusting configuration parameters:
- Increase your defragmentation priority or rate.
- Slow your migration speed, if migrations are active.
- If you configured evictions, speed up your current eviction rate by reducing the evict-used-pct in Database 7.0 and later. Use high-water-disk-pct, high-water-memory-pct prior to Database 7.0.
- Increase the stop-writes configuration parameters such as stop-writes-sys-memory-pct. Use stop-writes-used-pct in Database 7.0 or later. Use stop-writes-pct prior to Database 7.0.
Increasing the stop-writes parameters should not be done on a permanent basis. You need to find a permanent solution by reviewing your capacity and ensuring that there is sufficient storage.
All of these parameters can be changed dynamically. To keep a change after a restart, also set it in the main Aerospike configuration file on the node.
Avoiding 0% available space
When storage runs low too frequently, you will see log entries similar to the following:
Apr 27 2022 02:53:12 GMT: WARNING (drv_ssd): (storage/drv_ssd.c:1844) could not allocate storage on device /dev/sdb
Since Database 7.0, when data_avail_pct goes to zero, all the subsequent writes will fail. This should not happen if the default stop-writes-avail-pct is not modified.
Prior to Database 7.0, when the device_available_pct (or pmem_available_pct for PMem storage) goes to zero, all the subsequent writes will fail. This should not happen if the default min-avail-pct is not modified.
Taking a server down increases traffic/data on the other nodes. Do not take any servers down if you are in a data overflow situation.
If only a single node is having problems because of a hardware problem, then taking down the problematic node may resolve the situation.
The solutions discussed here are short-term, temporary updates. In the longer term, you need to add capacity to resolve storage overflow problems.