SSD Setup
Aerospike’s flash-optimized architecture was designed with modern flash devices (SSDs) in mind. Installing and setting up SSDs is somewhat different from installing and setting up rotational drives. Check the list of approved flash devices to see if your model is there. You may also want to run the Aerospike Certification Tool (ACT) within your environment, even if your SSD is on the approved list.
When reading or writing records, Aerospike uses direct I/O to the drive through the Linux direct (or raw device) interface in order to achieve high performance. There is no filesystem on the data drive, so there is no need to RAID or stripe the drive. There does need to be a filesystem on any drive holding a flash index.
There are many factors that affect flash performance. The following instructions are for tuning flash to use with Aerospike. These methods may differ from other uses.
Proper flash configuration
The most important factors for any given flash device are (in very roughly descending order of importance):
Do not use disks that are unevenly worn
Unlike rotational drives, flash devices are affected by rewriting over the same area again and again. While Aerospike wears through the drive evenly, if the drive was used for another purpose in the past, it may have worn areas where the performance will suffer dramatically. This is particularly true if the SSD is in a cloud environment. If you don't know how the drive was used, you should assume it was unevenly worn.
Use a good model flash device
Flash device models vary greatly in performance. They even vary a lot depending on the use case. Proper selection of one is of great importance. See the approved flash device list for more information.
Connect the drives properly
Even the best flash will be hampered by a poor controller. The best way to connect flash devices to the motherboard is without RAID. This can be through a SATA controller directly on the motherboard or through a controller that allows direct access to the drive. Aerospike has generally found good results with disk controllers specifically designed for Apache Hadoop. For example, Aerospike tested the HP P420 and found very good results. In addition, if your RAID controller uses the LSI 2208 chip, you may be able to take advantage of the StorCLI (AKA MegaCLI) package, which has improved performance dramatically. See the using StorCLI (AKA MegaCLI) page for more information.
The preferred method for installing your flash devices is direct-connect or a RAID controller in pass-through mode (i.e. not using RAID features). We do not recommend RAID with Aerospike Database. If your server requires the use of RAID, it is best to configure each flash device as its own RAID 0 array, with write-through mode and no read-ahead for the cache policy.
Over-provision your flash devices
Flash over-provisioning sets aside space for the drive controller to do its work. You can think of it as reserved free space to move data around. When this space is too small, the controller must do much more work and thus takes more time to do its job. Some flash devices (such as the Intel S3700) come with enough free space and no changes are necessary. Others (such as the Intel S3500) need additional over-provisioning to maintain peak performance under load. The Aerospike approved list indicates whether additional over-provisioning was required for each model. For those disks that require it, we normally recommend 21%. So a 512 GB SSD will have about 400 GB of usable space (512 GB × 0.79 ≈ 404 GB) after the over-provisioning.
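The arithmetic behind the 21% recommendation can be sketched in a few lines of shell. The function name `usable_after_op` is a hypothetical helper introduced only for illustration; it uses integer arithmetic, so results are rounded down to whole GB.

```shell
# Hypothetical helper: capacity left for data after over-provisioning.
# $1: raw capacity in GB, $2: percentage to reserve (for example 21)
usable_after_op() {
    echo $(( $1 * (100 - $2) / 100 ))
}

usable_after_op 512 21   # prints 404
```

The same function can be used to estimate usable capacity for other drive sizes or reserve percentages before choosing a partition layout.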
There are 2 ways to over-provision flash devices.
- Using the Host Protected Area (HPA). This is a low-level way to protect an area for use by the controller. However, it may not be possible to set this on flash devices behind a RAID controller. In that case, you may either set the HPA from another machine and then move the drive, or use disk partitions (next).
- Using disk partitions. In this case you will actually be creating a partition that is smaller than the full size of the disk and using the partition rather than the disk. For many flash devices (such as those from Intel and Samsung), this will have the same impact as setting the HPA. Not all SSDs will make use of unpartitioned space as if it were reserved for over-provisioned use. Check with your manufacturer if you are not sure.
Refer to the instructions on how to over-provision.
Partition your flash devices
Some definitions:
- Partition - this is the standard definition of partition for all hard disks, whether they are rotational or SSD.
- MBR - the Master Boot Record. This is the mechanism used for many years to contain information on how the disk is to be partitioned. This has a limit of 2 TB. There is a limit of 4 primary partitions and 15 logical partitions when using MBR.
- GPT - the GUID Partition Table. This is the replacement for MBR, and it can coexist with MBR. GPT gets past the biggest limits of MBR. It allows for 128 partitions and a maximum size of 2^64 sectors. With 512B sectors, this translates to 9.4 ZB.
- Device - This can get confusing as it can refer to either an entire disk or a single partition on one. For the purposes here, device refers to a single device declaration within the OS layer (e.g. /dev/sdb or /dev/nvme0n1)
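The size limits quoted in the definitions above can be reproduced with shell arithmetic. This is a minimal sketch that assumes 512-byte sectors and relies on the fact that MBR stores sector addresses in 32 bits while GPT uses 64 bits; the variable names are illustrative only.

```shell
# MBR: 2^32 addressable sectors of 512 bytes each.
mbr_limit_bytes=$(( (1 << 32) * 512 ))
echo "MBR limit: ${mbr_limit_bytes} bytes ($(( mbr_limit_bytes >> 40 )) TiB)"

# GPT: 2^64 sectors * 2^9 bytes = 2^73 bytes. That value overflows 64-bit
# shell arithmetic, so work with the exponent instead of the raw number.
gpt_limit_exp=$(( 64 + 9 ))
echo "GPT limit: 2^${gpt_limit_exp} bytes ($(( 1 << (gpt_limit_exp - 70) )) ZiB)"
```

The GPT result of 8 ZiB is the binary-unit equivalent of the roughly 9.4 ZB figure given in decimal units.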
Independently of the over-provisioning considerations above, keep the following points in mind when deciding how many partitions to create on a flash device:
- Aerospike supports devices over 2TiB by requiring you to partition the physical device into smaller partitions. It may also be beneficial to partition 2TiB or smaller devices in order to increase parallelism as well as the number of device-specific threads. The number of partitions therefore matters more for performance than the partition size. Aerospike's best practices suggest benchmarking different configurations and usually keeping the number of partitions at most equal to the number of CPU cores. For example, on a 16 core system, it would be suggested to have 16 partitions of 800GiB each rather than 32 of 400GiB. But on a 32 core system, it may be beneficial to go up to 32 partitions of 400GiB each, potentially even higher, depending on the workload. Having too many partitions on the same physical device could also adversely impact performance. It is always recommended to fully benchmark the production workload at peak level on a small system in order to validate the configuration.
- Partitions on an SSD device do not have to be the same size. However, Aerospike does not treat them differently within a given namespace. So if you have a 100 GB partition and a 200 GB partition in the same namespace, the system will start running into performance issues when the 100 GB partition drops below 5% available space, and the namespace will go into stop_writes as soon as the smallest partition hits the threshold. Therefore, keep devices in the same namespace the same size. This limitation does not apply between different namespaces. However, the total throughput of the physical device is shared across the namespaces. So if you have 2 partitions on one device serving 2 different namespaces, one with little traffic and the other with tremendous traffic, the busy namespace may affect the performance of the quieter one.
- Having multiple partitions on an SSD device improves performance on cold start for large capacity devices. Cold starting a namespace requires the whole underlying storage device to be scanned, hence having more partitions allows for parallelizing this effort. For a given device capacity, results may differ between NVMe and SATA devices.
- While the raw read/write performance should not be impacted by the number of partitions, defragmentation is. Aerospike must do defragmentation at the application layer. This is currently single threaded per partition, and the amount of defragmentation work depends on how fast you are writing to the system. Having more partitions increases the number of threads for the defragmentation process (one thread per partition).
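The sizing guidance in the list above can be turned into a rough starting point for benchmarking. The sketch below assumes a hypothetical helper named `plan_partitions` that starts with one partition per CPU core and only increases the count when that is needed to keep each partition at or below 2 TiB. It is not an Aerospike tool, and its output should be validated by benchmarking the real workload.

```shell
# Hypothetical helper: suggest a partition count and size for one device.
# $1: device capacity in GiB, $2: number of CPU cores
plan_partitions() {
    capacity_gib=$1
    count=$2                      # start with one partition per core
    max_gib=2048                  # keep each partition at or below 2 TiB
    # Smallest count that keeps every partition within max_gib (ceiling).
    min_count=$(( (capacity_gib + max_gib - 1) / max_gib ))
    if [ "$count" -lt "$min_count" ]; then
        count=$min_count          # more partitions needed to stay under 2 TiB
    fi
    # Print the count and the equal size (in GiB) of each partition.
    echo "$count $(( capacity_gib / count ))"
}

plan_partitions 12800 16   # prints "16 800"
plan_partitions 12800 32   # prints "32 400"
```

Because every partition gets the same size, the resulting layout also satisfies the advice to keep devices within a namespace equal in size.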
Sharing devices across namespaces is not allowed
The Aerospike process aborts and logs the following messages if the same device is shared across namespaces:
Jun 09 2022 20:39:10 GMT: CRITICAL (drv_ssd): (drv_ssd.c:2237) /etc/aerospike/test.dat: previous namespace test now bar - check config or erase device
Jun 09 2022 20:39:10 GMT: WARNING (as): (signal.c:218) SIGUSR1 received, aborting Aerospike Enterprise Edition build 6.1.0.0 OS ubuntu20.04
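To catch this before starting the server, a configuration file can be scanned for storage paths that are declared more than once. The sketch below is an illustration only: `find_shared_devices` is a hypothetical helper, and it assumes that storage is declared on lines of the form `device <path>` or `file <path>`, as in a typical `aerospike.conf`.

```shell
# Hypothetical helper: print any storage path declared more than once.
# $1: path to an Aerospike configuration file
find_shared_devices() {
    # Select lines whose first word is "device" or "file", keep the path,
    # then report paths that occur on more than one line.
    awk '$1 == "device" || $1 == "file" { print $2 }' "$1" | sort | uniq -d
}

# Example usage (the path is an assumption about where the config lives):
# find_shared_devices /etc/aerospike/aerospike.conf
```

An empty result means no path was repeated. Any path printed should be assigned to a single namespace before the server is started.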
Example of partitioning a drive
Open fdisk against /dev/sdb
$ sudo fdisk /dev/sdb
Welcome to fdisk (util-linux 2.27.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.
Create a new DOS/GPT partition table, if one doesn't exist yet (note - existing data, if any, will be overwritten on /dev/sdb). GPT (preferred):
Command (m for help): g
Created a new GPT disklabel (GUID: 7A193027-381B-4712-8D76-117908120FF7).
DOS (in case GPT is not available):
Command (m for help): o
Created a new DOS disklabel with disk identifier 0xfe6c3bc5.
Create 2 new partitions, of example size 500GiB each
Command (m for help): n
Partition number (1-128, default 1):
First sector (2048-2097118, default 2048):
Last sector, +sectors or +size{K,M,G,T,P} (2048-2097118, default 2097118): +500G
Created a new partition 1 of type 'Linux filesystem' and of size 500 GiB.
Command (m for help): n
Partition number (2-128, default 2):
First sector (1026048-2097118, default 1026048):
Last sector, +sectors or +size{K,M,G,T,P} (2048-2097118, default 2097118): +500G
Created a new partition 2 of type 'Linux filesystem' and of size 500 GiB.
Write changes to disk:
Command (m for help): w
The partition table has been altered.
Syncing disks.
If you receive an error when writing the changes, do not see the "Syncing disks." message, or do not see the /dev/sdb1 and /dev/sdb2 devices afterwards, the kernel failed to refresh its view of the partition table and you will need to reboot the machine.
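The check for the new partition devices can be scripted. The following sketch uses two hypothetical helpers, `partition_name` and `partitions_present`. It relies on the Linux naming convention in which disks whose name ends in a digit (such as /dev/nvme0n1) use a "p" separator before the partition number, and it only tests whether the device nodes exist without modifying anything.

```shell
# Hypothetical helper: build the kernel name of partition N on a disk.
# $1: disk (e.g. /dev/sdb or /dev/nvme0n1), $2: partition number
partition_name() {
    case "$1" in
        *[0-9]) echo "${1}p${2}" ;;   # /dev/nvme0n1 -> /dev/nvme0n1p1
        *)      echo "${1}${2}" ;;    # /dev/sdb     -> /dev/sdb1
    esac
}

# Hypothetical helper: succeed only if partitions 1..N exist as block devices.
# $1: disk, $2: expected number of partitions
partitions_present() {
    i=1
    while [ "$i" -le "$2" ]; do
        [ -b "$(partition_name "$1" "$i")" ] || return 1
        i=$(( i + 1 ))
    done
}

partitions_present /dev/sdb 2 || echo "partitions missing: a reboot may be needed"
```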
Initialize the drives
Once you are ready to use the flash devices, you will need to initialize them before using them with Aerospike Database. See Initializing Solid State Drives (SSDs).