Configure the primary index
The primary index of an Aerospike namespace can be stored in one of three media: shared memory (SHMem), which is the default; Intel® Optane™ Persistent Memory (PMem); or Flash (NVMe SSDs). Separate namespaces within the same cluster can use different types of primary index storage.
To specify a primary index storage method, use the namespace configuration parameter index-type.
- The default, index-type shmem, stores primary index metadata in shared memory segments.
- index-type pmem specifies PMem storage for the namespace primary index.
- index-type flash specifies NVMe SSD storage for the namespace primary index.
For sizing information, see capacity planning.
Primary index in memory
By default, the namespace primary index storage type in Aerospike Database Enterprise Edition variants (EE, FE, SE) is shared memory, equivalent to explicitly setting index-type shmem.
Primary index on Flash
The Aerospike All Flash feature allows primary indexes to be stored on NVMe SSDs. This index storage method is typically used for extremely large primary indexes with relatively small records.
It is important to understand the subtleties of sizing a primary index on Flash, because scaling up an All Flash namespace may require increasing partition-tree-sprigs, which in turn requires a rolling cold restart.
Set kernel parameters
The following Linux kernel parameters are required in an index on Flash deployment. enforce-best-practices verifies that these kernel parameters have the expected values.
/proc/sys/vm/dirty_bytes = 16777216
/proc/sys/vm/dirty_background_bytes = 1
/proc/sys/vm/dirty_expire_centisecs = 1
/proc/sys/vm/dirty_writeback_centisecs = 10
- When running as non-root, you must prepare these values before running the Aerospike server.
- When running as root, the server configures them automatically.
Either way, if these parameters can’t be set correctly, whether manually or automatically by the server, the node cannot start up with an index-type flash configuration.
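When running as non-root, it can help to verify the values before starting the server. The following is a minimal sketch, not part of Aerospike: the names EXPECTED_VM_SETTINGS, read_vm_setting, and find_mismatches are hypothetical, and the expected values are copied from the list above.

```python
from pathlib import Path
from typing import Callable, Dict, Tuple

# Expected values, copied from the kernel parameter list above.
EXPECTED_VM_SETTINGS: Dict[str, int] = {
    "dirty_bytes": 16777216,
    "dirty_background_bytes": 1,
    "dirty_expire_centisecs": 1,
    "dirty_writeback_centisecs": 10,
}


def read_vm_setting(name: str) -> int:
    """Read one integer value from /proc/sys/vm (Linux only)."""
    return int(Path("/proc/sys/vm", name).read_text().strip())


def find_mismatches(
    read: Callable[[str], int] = read_vm_setting,
) -> Dict[str, Tuple[int, int]]:
    """Return {name: (expected, actual)} for every setting that differs."""
    mismatches = {}
    for name, expected in EXPECTED_VM_SETTINGS.items():
        actual = read(name)
        if actual != expected:
            mismatches[name] = (expected, actual)
    return mismatches
```

Calling find_mismatches() on the target host returns an empty dictionary when all four values match. The read parameter exists only so the check can be exercised without touching /proc.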
Prepare and mount Flash devices
Aerospike instantiates at least 4 different arena allocations (as files) for each device partition configured for use by the namespace. This helps reduce contention against the same arena, which improves performance during heavy insertion loads.
An XFS filesystem is recommended because it has been shown to provide better concurrent access to files than EXT4.
sudo mkfs.xfs -f /dev/nvme0
sudo mount /dev/nvme0 /mnt/nvme0
- Using more physical devices will improve performance through increased parallelism of disk IO.
- Assigning more partitions per physical device doesn’t necessarily improve performance.
Subcontext configuration
In the index-type subcontext, define:
- A mount point for each mounted Flash device. Mount points can be shared across multiple namespaces.
- A mounts-budget (or mounts-size-limit before Database 7.0) directive to indicate this namespace’s share of device storage space available across the given mount points. Ensure the budget is less than or equal to the size of the filesystem. If mount points are not shared between namespaces, then simply specify the total available space.
- An optional eviction threshold, as a percent of the budget, defined through evict-mounts-pct (or mounts-high-water-pct before Database 7.0).
Example
Database 7.0 and later
namespace test {
    partition-tree-sprigs 1M # Typically very large for flash index - see sizing guide.
    index-type flash {
        mounts-budget 1T
        mount /mnt/nvme0 # will have a 250GiB budget
        mount /mnt/nvme1 # will have a 250GiB budget
        mount /mnt/nvme2 # will have a 250GiB budget
        mount /mnt/nvme3 # will have a 250GiB budget
    }
}
Prior to Database 7.0
namespace test {
    partition-tree-sprigs 1M # Typically very large for flash index - see sizing guide.
    index-type flash {
        mount /mnt/nvme0
        mount /mnt/nvme1
        mount /mnt/nvme2
        mount /mnt/nvme3
        mounts-size-limit 1T
    }
}
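The rule that the budget must not exceed the filesystem size can be checked before deploying a configuration. The sketch below is illustrative only: parse_size, filesystem_sizes, and budget_fits are hypothetical helpers, and the code assumes that size suffixes are binary multiples (1K = 1024 bytes).

```python
import shutil
from typing import Iterable, List

# Assumption: size suffixes are binary multiples (1K = 1024 bytes).
_SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text: str) -> int:
    """Convert a size such as '750G' or '1T' to bytes."""
    text = text.strip().upper()
    if text and text[-1] in _SUFFIXES:
        return int(text[:-1]) * _SUFFIXES[text[-1]]
    return int(text)


def filesystem_sizes(mounts: Iterable[str]) -> List[int]:
    """Total size in bytes of the filesystem at each mount point."""
    return [shutil.disk_usage(mount).total for mount in mounts]


def budget_fits(budget: str, sizes: Iterable[int]) -> bool:
    """True if the budget does not exceed the combined filesystem size."""
    return parse_size(budget) <= sum(sizes)
```

For the example above, budget_fits("1T", filesystem_sizes(["/mnt/nvme0", "/mnt/nvme1", "/mnt/nvme2", "/mnt/nvme3"])) would report whether the four filesystems can hold the budget. When mount points are shared, the budgets of all namespaces using them would need to be summed first.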
Capacity planning example
Here is a summary for calculating the disk space and memory required for a namespace with 4 billion records, a replication factor of 2, with the primary index on flash.
Number of sprigs required
- 4 billion records ÷ 4096 partitions ÷ 32 records per sprig, to retain half-fill-factor = ~30,517
- Round up to power of 2: 32,768 sprigs per partition
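The sprig calculation above can be sketched as follows. The function name sprigs_per_partition is hypothetical; the constants (4096 partitions, 32 records per sprig) are taken from the steps above.

```python
import math

PARTITIONS = 4096  # partitions per namespace, as used in the steps above


def sprigs_per_partition(records: int, records_per_sprig: int = 32) -> int:
    """Smallest power of two that keeps each sprig at or below the target fill."""
    needed = math.ceil(records / (PARTITIONS * records_per_sprig))
    sprigs = 1
    while sprigs < needed:
        sprigs *= 2  # round up to the next power of two
    return sprigs


print(sprigs_per_partition(4_000_000_000))  # prints 32768
```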
Disk space required
- 32,768 sprigs per partition × 4096 partitions × 2 replication factor × 4KiB size of each block = 1TiB for the whole cluster
- 1TiB required for the whole cluster ÷ 3 minimal number of nodes ÷ 0.8 with mounts-budget/mounts-size-limit at 80% = 427GiB per node
Because All Flash uses a filesystem with multiple files, the mount point size should be slightly larger than 427GiB (for actual usable index storage space) to accommodate the filesystem overheads. This is filesystem-dependent.
Memory required
With Database 5.7 or later, where 10 bytes are required per sprig:
- 32,768 sprigs per partition × 4096 partitions × 2 replication factor × 10 bytes memory required per sprig = 2,560MiB for the whole cluster
- 2,560MiB required for the whole cluster ÷ 3 minimal number of nodes ÷ 0.8 with mounts-budget/mounts-size-limit at 80% = 1,067MiB per node
Or with server versions prior to 5.7, where 13 bytes are required per sprig:
- 32,768 sprigs per partition × 4096 partitions × 2 replication factor × 13 bytes memory required per sprig = 3,328MiB for the whole cluster
- 3,328MiB required for the whole cluster ÷ 3 minimal number of nodes ÷ 0.8 with mounts-budget/mounts-size-limit at 80% = 1,387MiB per node
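The disk and memory arithmetic above can be reproduced with a short script, which makes it easier to try other record counts, replication factors, or cluster sizes. This is a sketch under the same assumptions as the worked example (4KiB per sprig on disk, 10 or 13 bytes per sprig in memory, 3 nodes, 80% fill); all function names are hypothetical.

```python
PARTITIONS = 4096        # partitions per namespace
BLOCK_BYTES = 4 * 1024   # 4KiB on-disk block per sprig
GIB = 1024 ** 3
MIB = 1024 ** 2


def cluster_disk_bytes(sprigs_per_partition: int, replication_factor: int) -> int:
    """Disk space for the primary index across the whole cluster."""
    return sprigs_per_partition * PARTITIONS * replication_factor * BLOCK_BYTES


def cluster_memory_bytes(
    sprigs_per_partition: int, replication_factor: int, bytes_per_sprig: int = 10
) -> int:
    """Memory for sprig overhead across the whole cluster."""
    return sprigs_per_partition * PARTITIONS * replication_factor * bytes_per_sprig


def per_node(cluster_bytes: int, nodes: int = 3, fill_fraction: float = 0.8) -> float:
    """Share of a cluster-wide total that each node must provide."""
    return cluster_bytes / nodes / fill_fraction


disk = cluster_disk_bytes(32768, 2)
memory = cluster_memory_bytes(32768, 2)
print(f"disk per node:   {per_node(disk) / GIB:.1f} GiB")    # 426.7 GiB
print(f"memory per node: {per_node(memory) / MIB:.1f} MiB")  # 1066.7 MiB
```

Passing bytes_per_sprig=13 to cluster_memory_bytes gives the figures for server versions prior to 5.7.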
Primary index in PMem
When the namespace primary index storage is configured to index-type pmem, Aerospike writes index metadata in multiple files spread across the configured PMem devices. This requires device partitions to be set up with an appropriate filesystem and mounted.
Aerospike requires PMem to be accessible using DAX (Direct Access), that is, using block devices such as /dev/pmem0:
- The NVDIMM regions must be configured as AppDirect regions, as in the following example from a machine with a 750-GiB AppDirect region:
sudo ipmctl show -region
SocketID  ISetID              PersistentMemoryType  Capacity   FreeCapacity  HealthState
0         0x59727f4821b32ccc  AppDirect             750.0 GiB  0.0 GiB       Healthy
- The NVDIMM regions must be turned into fsdax namespaces, as in the following example from the same machine:
sudo ndctl list
[
  {
    "dev":"namespace0.0",
    "mode":"fsdax",
    "blockdev":"pmem0",
    ...
  }
]
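Because ndctl list prints JSON, the fsdax requirement can be checked programmatically. The following sketch assumes the output contains the "dev" and "mode" fields shown above; the function name non_fsdax_namespaces is hypothetical.

```python
import json
from typing import List


def non_fsdax_namespaces(ndctl_json: str) -> List[str]:
    """Return the 'dev' names of namespaces whose mode is not 'fsdax'."""
    parsed = json.loads(ndctl_json)
    # Accept either a JSON array of namespaces or a single namespace object.
    namespaces = parsed if isinstance(parsed, list) else [parsed]
    return [ns.get("dev", "?") for ns in namespaces if ns.get("mode") != "fsdax"]
```

The input would typically be the captured standard output of sudo ndctl list. An empty result means every listed namespace is already in fsdax mode.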
Filesystem configuration
The PMem block device must contain a filesystem that is capable of DAX (Direct Access), such as XFS or ext4. On the machine in the above example, this could be accomplished in the usual way:
Prepare an XFS filesystem
sudo mkfs.xfs -f -d su=2m,sw=1 /dev/pmem0
Prepare an ext4 filesystem
sudo mkfs.ext4 /dev/pmem0
Mount the filesystem
Finally, the filesystem must be mounted with the dax mount option. Without this option, the Linux page cache is involved in all I/O to and from persistent memory, which would drastically reduce performance.
In the following example, we use /mnt/pmem0 as the mount point.
sudo mount -o dax /dev/pmem0 /mnt/pmem0
Remember to make the mount persistent to survive system reboots by adding it to /etc/fstab. The mount point configuration line can be copied from /etc/mtab to /etc/fstab.
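Since a missing dax option degrades performance silently, it may be worth confirming that the option is in effect after mounting. The sketch below parses text in the /proc/mounts format; the function name mount_uses_dax is hypothetical, and it assumes the option is reported as either dax or dax=always, which may vary by kernel version.

```python
def mount_uses_dax(mounts_text: str, mount_point: str) -> bool:
    """Return True if mount_point is listed with a DAX mount option.

    mounts_text is expected in the /proc/mounts format, where each line is:
    device mount_point fstype options dump pass
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            options = fields[3].split(",")
            # Assumption: the option is reported as "dax" or "dax=always".
            return "dax" in options or "dax=always" in options
    return False
```

On a live system, the first argument could be the text read from /proc/mounts and the second "/mnt/pmem0".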
Mount point configuration
In the index-type subcontext, define:
- A mount point for each PMem device. Primary index metadata will be evenly distributed across all of them. Mount points can be shared across multiple namespaces.
- A mounts-budget (or mounts-size-limit before Database 7.0) directive to indicate this namespace’s share of PMem storage space available across the given mount points. Ensure the budget is less than or equal to the size of the filesystem. If mount points are not shared between namespaces, then simply specify the total available space.
- An optional eviction threshold, as a percent of the budget, defined through evict-mounts-pct (or mounts-high-water-pct before Database 7.0).
Example
The following configuration snippet extends the earlier example and makes all of /mnt/pmem0 memory (for example, 750 GiB) available to the namespace:
Database 7.0 and later
namespace test {
    index-type pmem {
        mount /mnt/pmem0
        mounts-budget 750G
    }
}
Prior to Database 7.0
namespace test {
    index-type pmem {
        mount /mnt/pmem0
        mounts-size-limit 750G
    }
}
Primary index on Flash
The Aerospike All Flash feature allows primary indexes to be stored on NVMe SSDs. This index storage method is typically used for extremely large primary indexes with relatively small records. Accuracy is critical for certain aspects of capacity planning and configuration.
Enable flash index for a namespace
To enable a flash index for a namespace, add an index-type subsection with an index type of flash to its namespace section in the configuration file. The added index-type subsection must contain:
- One or more mount directives to indicate the mount points on the flash storage to be used for the flash index.
  A single namespace can use flash index storage across multiple mount points and will evenly distribute allocations across all of them.
  Conversely, mount points can be shared across multiple namespaces. The file names underlying namespaces’ flash index allocations are namespace-specific, which avoids file name clashes between namespaces when they share mount points.
- A mounts-budget (or mounts-size-limit before Database 7.0) directive to indicate this namespace’s share of the space available across the given mount points.
  When multiple namespaces share mount points, this configuration directive tells Aerospike how much of the total available storage space across mount points each namespace is expected to use.
  Ensure mounts-budget/mounts-size-limit is smaller than or equal to the size of the filesystem mount. If mount points are not shared between namespaces, then simply specify the total available space.
  The specified value, along with configuration item evict-mounts-pct (or mounts-high-water-pct before Database 7.0), which is disabled by default, forms the basis for calculating the eviction threshold.
Recommended filesystem type - XFS
An XFS file system is recommended because it has been shown to provide better concurrent access to files compared to ext4.
Recommendation for multiple physical devices
Having more physical devices improves performance by increasing parallelism across them. Having more partitions per physical device doesn’t necessarily improve performance. Aerospike instantiates at least 4 different arena allocations (files) and will allocate more if more devices (logical partitions or physical devices) are present. Instantiating more than one arena at a time reduces contention against the same arena, which is important during heavy insertion loads.
Database 7.0 and later
namespace test {
    partition-tree-sprigs 1M # Typically very large for flash index - see sizing guide.
    index-type flash {
        mounts-budget 1T
        mount /mnt/nvme0 # will have a 250GiB budget
        mount /mnt/nvme1 # will have a 250GiB budget
        mount /mnt/nvme2 # will have a 250GiB budget
        mount /mnt/nvme3 # will have a 250GiB budget
    }
}