Amazon EC2 Capacity Planning
For details on disk and memory sizing, see the main capacity planning docs.
AWS offers a large number of instance types, but some are better suited to running Aerospike than others. Below we explore the different instance types and the configurations they lend themselves to.
Instance Type | Use Case |
---|---|
m5(d) | Low cost |
r5(d) | In-memory clusters |
c5(d) | Low latency/high throughput |
i3 | High data capacity |
i3en | Data capacity + throughput |
- Do not use burst instances: noisy neighbors could borrow CPU or bandwidth, causing latency spikes.
- Instances with a shared SSD controller: instances with less than a full-sized local SSD share an SSD controller.
- i3: needs to be over-provisioned by 20%.
- m5d, r5d, c5d: have multiple disk drives; test your pool to find the lowest common denominator.
- OS: Amazon Linux 2023 with Database 6.4 and later is recommended for compatibility and performance.
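As a rough illustration of the 20% over-provisioning guideline for i3 instances, the sketch below computes how much of a local SSD remains available for data. It assumes the guideline means leaving 20% of each device's raw capacity unused; the function name and the 1,900 GB device size are illustrative and not taken from Aerospike or AWS tooling.

```python
def usable_capacity_gb(raw_gb: float, overprovision_pct: float = 20.0) -> float:
    """Capacity left for data after reserving overprovision_pct of the raw device.

    Assumption: "over-provisioned by 20%" is read as "leave 20% of the raw
    capacity unused". Adjust if your sizing guidance defines it differently.
    """
    if not 0 <= overprovision_pct < 100:
        raise ValueError("overprovision_pct must be in the range [0, 100)")
    return raw_gb * (100 - overprovision_pct) / 100


# Hypothetical 1,900 GB local SSD
print(usable_capacity_gb(1900))
```

The result can be fed into the disk sizing described in the main capacity planning docs in place of the raw device size.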
Aerospike as In-Memory with no Persistence
The Data In-Memory without Persistence storage engine is ideal for a cache-based use case.
Aerospike Network Planning
Each network interface on an Amazon Linux HVM instance can handle about 250K packets per second. If you need higher performance per instance, you need to do one of the following:
Add More NICs/ENIs
Elastic Network Interfaces (ENIs) provide a way of adding multiple (virtual) NICs to an instance. A single NIC peaks at around 250K TPS, bottlenecked by the cores processing interrupts. Adding more interfaces lets the same instance process more packets per second. Using ENIs with private IPs is free of charge in AWS.
Note: You can specify separate network interfaces for service, info, and fabric traffic. This helps alleviate both packets-per-second and bandwidth concerns with individual ENIs, but adds to the complexity of your Aerospike cluster.
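For a first estimate of how many interfaces an instance might need, the approximate 250K packets-per-second ceiling quoted above can be divided into the target packet rate. The sketch below is only that arithmetic; the function name and the example target are hypothetical, and the number of ENIs you can actually attach depends on the instance type.

```python
PPS_PER_NIC = 250_000  # approximate per-interface ceiling quoted in this section


def enis_needed(target_pps: int, pps_per_nic: int = PPS_PER_NIC) -> int:
    """Estimate how many interfaces are needed to carry target_pps packets per second."""
    if target_pps < 0 or pps_per_nic <= 0:
        raise ValueError("target_pps must be >= 0 and pps_per_nic must be > 0")
    # Integer ceiling division; an instance always has at least one interface.
    return max(1, (target_pps + pps_per_nic - 1) // pps_per_nic)


# Hypothetical target of 600K packets per second
print(enis_needed(600_000))
```

Treat the result as a starting point for benchmarking rather than a guarantee, since the per-interface figure is itself approximate.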
Receive Packet Steering
Note: RPS is only available in kernel version 2.6.35 and later.
Another, simpler approach is to distribute interrupt processing over multiple cores using RPS:
echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus
With Aerospike, this eliminates the need to use multiple NICs/ENIs, making management easier, and resulting in similar TPS. A single NIC with RPS enabled can achieve up to 800K TPS with interrupts spread over 4 cores. Ensure your instance types have been sized appropriately for this.
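The value written to `rps_cpus` is a hexadecimal bitmask of CPU indices, so `f` (binary 1111) selects CPUs 0 through 3. The sketch below shows one way to derive the mask for a different set of cores. The helper name is hypothetical, and it deliberately handles only CPUs 0 to 31, because the kernel expects larger masks as comma-separated 32-bit groups.

```python
def rps_cpu_mask(cpus) -> str:
    """Return the hex bitmask string for an iterable of CPU indices (0-31 only)."""
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < 32:
            raise ValueError("this sketch only handles CPU indices 0-31")
        mask |= 1 << cpu  # set the bit corresponding to this CPU
    return format(mask, "x")


# CPUs 0-3, matching the value used in the command above
print(rps_cpu_mask(range(4)))  # f
```

The resulting string is what would be echoed into `/sys/class/net/eth0/queues/rx-0/rps_cpus`, assuming the interface and queue names match your instance.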
Aerospike as a Fast Persistent Data Store
The storage engine suited for this use case is the SSD Storage Engine.
Amazon EC2 provides storage in the form of Elastic Block Store, or EBS. EBS volumes are network-attached to virtual machine instances.
EBS performance is determined by the volume type: Provisioned IOPS or General Purpose. Provisioned IOPS (io1) volumes deliver consistent IOPS but are costly. General Purpose (gp2) volumes have variable performance based on size. See the AWS documentation on the relationship between volume size and IOPS for gp2 volumes.
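To make the size-to-IOPS relationship for gp2 concrete, the sketch below computes a volume's baseline IOPS. The constants (3 IOPS per GiB, a floor of 100, a cap of 16,000) are assumptions based on AWS's published gp2 figures and should be checked against the current AWS documentation; burst credits are not modeled, and the function name is hypothetical.

```python
# Assumed gp2 figures; verify against the current AWS EBS documentation.
GP2_IOPS_PER_GIB = 3
GP2_MIN_IOPS = 100
GP2_MAX_IOPS = 16_000


def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline (non-burst) IOPS for a gp2 volume of the given size in GiB."""
    if size_gib < 1:
        raise ValueError("size_gib must be at least 1")
    scaled = GP2_IOPS_PER_GIB * size_gib
    # Clamp the size-scaled value between the assumed floor and cap.
    return min(GP2_MAX_IOPS, max(GP2_MIN_IOPS, scaled))


for size in (20, 500, 6000):
    print(size, gp2_baseline_iops(size))
```

Under these assumptions, small volumes sit at the floor and very large volumes hit the cap, which is why volume size matters when gp2 is used for persistence.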
High Availability using Availability Zones
Amazon EC2 is hosted in multiple locations worldwide. These locations are composed of regions and Availability Zones. Each region is a separate geographic area and has multiple isolated locations known as Availability Zones. Amazon EC2 lets you place resources, such as instances and data, in multiple locations. Resources aren't replicated across regions unless you do so specifically. Amazon operates state-of-the-art, highly available datacenters. Although rare, failures can occur that affect the availability of instances in the same location. Refer to the AWS Regions and Availability Zones documentation for details.
Ephemeral SSD Based Cache Backed by EBS Persistence
Also known as the shadow device configuration.
Pros:
- The RAM requirement is the same as for the EBS-only persistence model.
- Provides the persistence offered by EBS while bypassing the EBS performance bottleneck by using ephemeral SSDs as a caching layer.
- Provides the best combination of performance and persistence by using ephemeral SSDs as a RAM alternative, with EBS as persistent storage.
Cons:
- More operational overhead than any other storage model.
- Requires instances that support the required number and size of ephemeral SSD instance store volumes.
Autoscaling
- There are no logical default thresholds or step sizes; they need to be based on your workload characteristics and other non-standard CloudWatch metrics.
- Autoscaling a cluster in can lose data! Ensure you know what your lower bound is and that it is properly set in your ASG logic!
- Autoscaling a cluster out and back in for daily cycles can cost more in cross-AZ network charges (migrations) than is saved in compute!
Don't lose data! Ensure you autoscale safely!
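One way to reason about the ASG lower bound is to estimate the smallest node count that can still hold every replica of your data. The sketch below is a simplified capacity model, not an Aerospike formula: the function name, the 50% maximum fill level, and the example figures are all assumptions, and it ignores memory, indexes, and migration headroom.

```python
import math


def min_safe_nodes(data_gb: float, replication_factor: int,
                   usable_gb_per_node: float, max_fill: float = 0.5) -> int:
    """Smallest node count that stores all replicas while staying below max_fill.

    All parameters are hypothetical inputs for illustration only.
    """
    if data_gb < 0 or replication_factor < 1 or usable_gb_per_node <= 0:
        raise ValueError("invalid sizing inputs")
    if not 0 < max_fill <= 1:
        raise ValueError("max_fill must be in the range (0, 1]")
    total_gb = data_gb * replication_factor          # every copy needs space
    per_node_gb = usable_gb_per_node * max_fill      # capacity we allow ourselves to use
    nodes = math.ceil(total_gb / per_node_gb)
    # Never go below the replication factor, or replicas cannot be placed on distinct nodes.
    return max(nodes, replication_factor)


# Hypothetical cluster: 2,000 GB of data, 2 copies, 1,520 GB usable per node
print(min_safe_nodes(2000, 2, 1520))
```

Whatever model you use, the resulting number belongs in the ASG minimum size so that a scale-in event cannot shrink the cluster below it.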
Suggested Reading
We performed benchmarks on AWS and have outlined our observations in external blog posts. The following provide helpful suggested steps and expectations before you get set up on AWS.