The goal of this tutorial is to explain basic intuitions about modeling
with Aerospike. The key to getting the most out of Aerospike is to find
the right way to match an application’s object model and data access
needs to Aerospike’s data model and access methods.
This notebook contains:
A modeling-oriented overview of Aerospike’s architecture.
Questions about an application to help determine how to align to
Aerospike’s data model and/or data types.
Simple example API calls associated with each.
This notebook does not include:
Discussion of when to normalize and denormalize data for a better
fit with Aerospike’s data model.
Detailed examples for each data model or data type.
Techniques for efficient reads or updates.
Other tutorials will focus on these facets of modeling in more detail.
This Jupyter Notebook requires the Aerospike Database running locally
with a Java kernel and the Aerospike Java Client. To create a Docker
container that satisfies the requirements and holds a copy of these
notebooks, visit the Aerospike Notebooks Repo.
Ask Maven to download and install the project object model (POM) of the
Aerospike Java Client.
%%loadFromPOM
<dependencies>
<dependency>
<groupId>com.aerospike</groupId>
<artifactId>aerospike-client</artifactId>
<version>5.0.0</version>
</dependency>
</dependencies>
Start the Aerospike Java Client and Connect
Create an instance of the Aerospike Java Client, and connect to the demo
cluster.
The default cluster location for the Docker container is localhost
port 3000. If your cluster is not running on your local machine,
modify localhost and 3000 to the values for your Aerospike cluster.
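import com.aerospike.client.AerospikeClient;
AerospikeClient client = new AerospikeClient("localhost", 3000);
System.out.println("Initialized the client and connected to the cluster.");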
Output:
Initialized the client and connected to the cluster.
A KV Store with Deliberate Structure
Aerospike differentiates itself from other Key-Value Stores through its
architecture and the structure and tools that follow from it. One can
throw documents of data at Aerospike and get performance good enough to
keep up with most applications. However, when applications must achieve
high performance at scale, expert use of Aerospike provides those
results. Those successful outcomes are due to structure that other
Key-Value Stores do not provide.
Aerospike Uses All Storage Media Types to Achieve High Performance
Aerospike was architected to efficiently store document-oriented data.
Aerospike platform priorities include:
Efficient parallel use of all of a machine’s storage media,
especially flash storage (SSD, PCIe, NVMe).
Reads with sub-millisecond latencies at very high throughput (100K
to 1M operations per second), while under a mixed read/write load.
The Aerospike data model is a direct result of these priorities. These
modeling notebooks teach the principles behind modeling that will result
in proper use.
Schema-less Relational Database
The pieces of the Aerospike data model can be thought of as a mirror of
the anatomy of a relational database.
Namespace → Relational Database
Primary Index → Primary Index
Set → Table
Record → Database Row
Bin → Field
However, despite the similarities to their RDBMS counterparts, each of
these elements has a well-defined purpose and characteristics that make
it scale differently from the others.
Match the App Object Model to Aerospike’s Data Model
The best practice is to consider both of the following questions when
creating an application’s data model:
How to match the application object model classes with the Aerospike
data model elements?
What are the application’s dimensions of data that must scale?
Because of Aerospike’s focus on scalability, properly matching the app
object and Aerospike data models will result in a highly performant and
scalable app.
From the application perspective, this consists of looking at the app’s
classes to determine the number and size of instances that will be
stored in the Aerospike database. Minimum size, maximum size, and
average size should all be considered, as well as the duration of
storage. In addition, consider implicit dimensions of storage, such as
how the data scales over time. Each object, including potentially
implicit dimensions, will be directly paired with one or more elements
of the Aerospike Data Model. Finally, consider the flows of how the data
is created, modified, and deleted.
To determine how to match with the Aerospike data model, let’s first
discuss the elements of the Aerospike Data Model.
At low read and write volumes, the above may seem like unnecessary
complexity. However, as the application scales, the structure provided
by the Aerospike data model allows Aerospike to be used surgically at
petabyte scale, and more efficiently (as measured by ROI x Performance)
than most varieties of database product. This is due to Aerospike’s
architecture and flexible data model, which create enough mesh points
to match a complex application’s object model and implicit data
dimensions.
The following sections share modeling-related details and API code for
working with those elements.
Namespace and Primary Index
The Namespace is a top-level data container that associates index
and data with related storage media and policies that govern the data.
Because each type of data in a data model has different read/write
profile demands, it is common to divide data among several namespaces.
For example, an ecommerce app might store the hottest sales items in
RAM, while the rest are stored in Flash. In such a circumstance, the
application may store some identical data in two namespaces: one
associating a subset of products with RAM storage and one associating
the full product data set with Flash storage.
Because Namespaces are defined in the Aerospike configuration file, some
changes require a rolling warm restart to take effect. This
differentiates a Namespace from other data containers.
Each Aerospike server in a cluster has a Primary Index per namespace
detailing the location of all records in all storage media on the node.
Within the index, each record has a 64-byte footprint. The weight of
this footprint suggests that most Aerospike records should be larger
than a single field of a simple data type. However, for the rare case of
extremely high throughput access, the index can contain a single numeric
element instead of the data record’s location.
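The 64-byte footprint turns index sizing into simple arithmetic. The following is a minimal sketch in plain Java, not part of the Aerospike client; the class and method names are hypothetical, and the calculation counts only the index entries themselves, ignoring any other overhead.

```java
// Hypothetical helper, not part of the Aerospike client: simple arithmetic
// based on the 64-byte per-record index footprint described above.
final class PrimaryIndexSizing {
    // Primary index footprint per record, as stated in this tutorial.
    static final long BYTES_PER_INDEX_ENTRY = 64L;

    // Bytes of primary index needed for the records held on one node.
    static long indexBytes(long recordsOnNode) {
        if (recordsOnNode < 0) {
            throw new IllegalArgumentException("record count must not be negative");
        }
        return Math.multiplyExact(recordsOnNode, BYTES_PER_INDEX_ENTRY);
    }

    // Index bytes spent per byte of stored data, for an average record size.
    static double indexToDataRatio(long averageRecordBytes) {
        if (averageRecordBytes <= 0) {
            throw new IllegalArgumentException("record size must be positive");
        }
        return (double) BYTES_PER_INDEX_ENTRY / averageRecordBytes;
    }

    public static void main(String[] args) {
        // Index bytes for one billion records on a node.
        System.out.println(indexBytes(1_000_000_000L));
        // Ratio for a record holding a single 8-byte value.
        System.out.println(indexToDataRatio(8));
        // Ratio for a 1 KiB record.
        System.out.println(indexToDataRatio(1024));
    }
}
```

The ratio makes the point in the text concrete: a record holding one 8-byte value spends eight times more space on its index entry than on its data, while a 1 KiB record spends only a small fraction.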
Set
The Set is an optional label representing a segment of Records in a
Namespace. A set facilitates fast access to its members.
Key and Digest
A Record is uniquely identified by a namespace and Digest. The
digest is a client-generated 20-byte RIPEMD-160 hash of the set name and
the user key. The user key is the application’s unique identifier for a
record in Aerospike – a string, a number, or a bytestream. The user key
can optionally be stored in the Aerospike Database.
Creating a Key using Namespace, Set, and User Key
The following is Java Client code to create a key using the namespace,
set, and user key.
import com.aerospike.client.Key;
String namespaceName = "test";
String setName = "dm101set";
Integer theKey = 0; // A key can be an integer, string, or blob.
Key key = new Key(namespaceName, setName, theKey);
System.out.println("Key created.");
Output:
Key created.
Record
Aerospike offers record-level ACID compliance. That is, Aerospike
allows execution of multiple operations on a record as one atomic,
consistent, isolated, and durable transaction by way of the operate
method.
The structure of a record is a Map containing:
Metadata
Expiration
Last Update Time
Generation Counter
Map of Bins
Bin
A Bin is a flexible container that contains one data Value. A Value
has an associated scalar or collection data type; however, a Bin’s data
type is not formally declared in a schema.
Creating a Simple Record Containing An Integer and A String Bin
The following Java client code uses the key from the previous code
example to put integer and string data into a record in Aerospike.
Lists and Maps are Collection Data Types (CDTs). These are
flexible, schema-less data types that can contain Values of any data
type, either scalar data or collection data. Collection Data Types can
be nested as deeply as necessary to match an application’s needs.
A List is a collection of Values. For data efficiency, Lists are
frequently used as tuples, a lightweight record structure using
position instead of field names.
A Map is a collection of mapkey/Value pairs. Maps are commonly used for
JSON-like data structures.
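To make the tuple and nested-CDT shapes concrete, here is a minimal sketch using only standard Java collections, which the Aerospike Java Client accepts as List and Map bin values. The class and method names are hypothetical. The values mirror the world-record sample shown in this notebook's output; splitting that output into four elements (time, athlete, location, date) is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of CDT shapes using plain Java collections.
final class CdtShapes {
    // A tuple: meaning is carried by position, not by field name.
    // Assumed positions: 0 = time in seconds, 1 = athlete,
    // 2 = location, 3 = date.
    static List<Object> worldRecordTuple() {
        return new ArrayList<>(Arrays.<Object>asList(
                9.92, "Carl Lewis", "Seoul, South Korea", "September 24, 1988"));
    }

    // A JSON-like Map whose single map key holds a List of tuples,
    // that is, a CDT nested two levels deep.
    static Map<String, List<List<Object>>> worldRecordsMap() {
        List<List<Object>> records = new ArrayList<>();
        records.add(worldRecordTuple());
        Map<String, List<List<Object>>> map = new HashMap<>();
        map.put("world-records", records);
        return map;
    }

    public static void main(String[] args) {
        System.out.println("tuple: " + worldRecordTuple());
        System.out.println("maptuple: " + worldRecordsMap());
    }
}
```

Positions keep the tuple compact because no field names are stored with each instance, at the cost of the application having to remember what each position means.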
Because a Record contains one or more Bins, and a Bin or CDT can contain
a scalar data type or collection data type, the most common question to
consider when creating an application’s data model in Aerospike is
whether to store a class instance as a Record, Bin, CDT, or nested CDT.
Output:
After operation, tuple: {world-records=[[9.92, Carl Lewis, Seoul, South Korea, September 24, 1988]]}
Distinguishing Questions
From a modeling perspective, each Aerospike data model element is a
potential mesh point with the application object model that will help
you to store object instances from the application in Aerospike. The
following questions give broad guidance on how to fit them together. The
text after each question explains the intuition to apply.
Questions related to Storage Medium
Q: Does the application require a specific storage medium for a
particular type of data, to achieve a necessary scale and frequency of
reads or writes?
The easiest way to match data to hardware is to assign it to the right
namespace. Namespaces associate index and data with storage media, like
fast NVMe drives, persistent memory, or DRAM.
Q: Does the application need to store an integer or float with
extremely frequent reads or writes?
Aerospike can store an integer or float in the primary index, instead
of storing the location of the data. This provides even faster
access than storing the data in DRAM.
Questions related to Data Reads and Writes
Q: Does the application have an object class for which a large number of
instances need to be stored in Aerospike and are frequently read
together by the application?
There are two common options:
If the instance size is very small (for example, the size is
measured in bytes, not KiB) and instances are rarely updated, then
multiple such instances can be stored grouped in one or more Aerospike
Records. The instances can be split among records using a naming
convention based on how the application seeks to access the data. For
example, IoT sensor readings are commonly written once, as observed, and
then read by date range. They can therefore be stored by a primary key
made from the object name and date.
If the instance size is medium to large, then each instance is
better stored as a Record, using a common Set name to group
the Records. The fastest way to read a group of application Records is
either:
Scanning an Aerospike Namespace or Set of Records.
Creating a secondary index and querying against that secondary
index. At large scale, the time taken by parsing every Record in
a Namespace as part of a scan is significant. Set indexes (added
in version 5.6), a type of secondary index, will make scanning a
Set much faster, if the Set is a small percentage of the
Namespace.
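The naming convention from the IoT example above can be sketched in plain Java. This is an illustration under stated assumptions: the class, the method names, and the "name:date" key format are hypothetical. Each returned string would serve as the user key of one Aerospike Record holding that day's readings.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical naming convention: one Record per sensor per day.
final class SensorKeys {
    // User key made from the object name and the date.
    // LocalDate.toString() yields the ISO-8601 form yyyy-MM-dd.
    static String userKey(String sensorName, LocalDate day) {
        return sensorName + ":" + day;
    }

    // User keys for every day in an inclusive date range, so that a
    // read by date range becomes a read of a known list of keys.
    static List<String> userKeysForRange(String sensorName, LocalDate from, LocalDate to) {
        List<String> keys = new ArrayList<>();
        for (LocalDate day = from; !day.isAfter(to); day = day.plusDays(1)) {
            keys.add(userKey(sensorName, day));
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(userKeysForRange(
                "sensor-7", LocalDate.of(2021, 2, 27), LocalDate.of(2021, 3, 1)));
    }
}
```

Because the keys are computable from the query itself, the application needs neither a scan nor a secondary index to find the records for a date range.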
Q: Do writes occur grouped into transactions or are individual
pieces of data updated one by one?
Aerospike provides single-record transactions that are ACID compliant.
Store data requiring atomic updates in one or more Bins in the same
Aerospike Record, and use the Operate API to execute a
multiple-operation transaction. If updates occur element by element,
data can be stored in one or more Records, or in one or more Bins of
data.
Q: During a single database transaction when updating data in an
instance of an application object, are both of the following true?
Reads are interspersed with writes
Operations are executed on different parts of a record, as if they
were different objects
It can be helpful to store data that behaves as if it belongs to
different objects in separate Bins.
When applying transaction operations on a record using Operate(),
the Aerospike client delivers the return values from operations per Bin.
These return values can be accessed in order, making transaction results
easier to work with when data is put in separate Bins.
Q: Is the size of a set of application records large? (For example,
measured in MiB rather than in KiB.)
There is an inherent trade-off in record size, as updating an app record
will require a read, modify, and write of the entire Aerospike record.
Consider storing the app record in more than one Aerospike record,
rather than in a single monolithic record.
When data is large, taking advantage of an intrinsic property of the
data, like a timestamp, can help to distribute data in an intuitive way
across records. Including timestamp in a set name or user key name, for
example, allows more efficient reads and writes. It will also allow
graceful rotation of data.
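One way to picture splitting a large application record by timestamp is to group its elements into hourly buckets, each of which would become a separate Aerospike Record. The sketch below uses only the Java standard library; the class, the hourly granularity, and the "id:yyyyMMddHH" key format are all assumptions chosen for illustration.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: split one large app record into hourly buckets.
final class HourlyBuckets {
    // Formats an Instant as a UTC hour, for example 2021072218.
    private static final DateTimeFormatter HOUR =
            DateTimeFormatter.ofPattern("yyyyMMddHH").withZone(ZoneOffset.UTC);

    // User key for the bucket that a timestamp falls into.
    static String bucketKey(String objectId, Instant timestamp) {
        return objectId + ":" + HOUR.format(timestamp);
    }

    // Groups event times by bucket key. Each map entry stands for the
    // content of one smaller record. TreeMap keeps keys in time order.
    static Map<String, List<Instant>> split(String objectId, List<Instant> eventTimes) {
        Map<String, List<Instant>> buckets = new TreeMap<>();
        for (Instant t : eventTimes) {
            buckets.computeIfAbsent(bucketKey(objectId, t), k -> new ArrayList<>()).add(t);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<Instant> times = new ArrayList<>();
        times.add(Instant.parse("2021-07-22T18:01:48Z"));
        times.add(Instant.parse("2021-07-22T18:59:59Z"));
        times.add(Instant.parse("2021-07-22T19:00:00Z"));
        System.out.println(split("user42", times).keySet());
    }
}
```

With this layout an update touches only the bucket for the current hour, so the read, modify, and write cycle described above applies to a small record instead of the whole history.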
Questions related to Deletes
Q: If there are no updates, can data naturally age out of the
application?
It is common for applications to naturally allow data to expire after
creation or update. Aerospike records have an Expiration metadata
field that can be used to automatically expire data and reclaim storage
space. All operations can be configured with a policy to set or update
the Expiration.
An example of this is a bank keeping track of a customer’s put stock
option. An option grants the holder the right to make a stock
transaction until a specified date that is determined at the purchase
time. Once the expiration has passed, the option expires and the holder
no longer has the right. The bank would model this in their computer
systems using Expiration.
Q: Does the application require a group of associated records
that are created at distinct times to be removed at the same time?
The most common way to explicitly rotate out data at intervals is to
store Aerospike Records in Sets and truncate the Sets from the
associated Namespace, when appropriate.
An example of this is data that accrues over the course of a day, but
then is worthless. One way to model the data would be to insert into a
Set named for the day, and at the end of the day, the application would
truncate the Set.
Questions related to Application Scale
Q: Does your application volume result in multiple servers routinely
suffering simultaneous downtime for disrepair or service?
It is common for Aerospike clusters, when the model is architected
properly, to replace competing databases at a 1:5 (Aerospike:Other)
ratio. When handling downtime, it will be important to configure whether
Aerospike will run in AP mode or SC mode.
AP or High Availability Mode – Prioritizes data availability for
reads over data replication.
SC or Strong Consistency Mode – Prioritizes writes and data
replication across an Aerospike cluster over reads.
Output:
Initialized the client and connected to the cluster.
Key created.
Read from Aerospike –
Generation count: 2
Record expiration: 364672908
int: 8
str: modeling
tuple: [9.92, Carl Lewis, Seoul, South Korea, September 24, 1988]
maptuple: {world-records=[[9.92, Carl Lewis, Seoul, South Korea, September 24, 1988]]}
Set Truncated.
Server connection(s) closed.
Takeaways – Data Modeling is an Art and a Science
Data modeling with Aerospike is a science, but deep enough that it will
seem like an art at first. An intuitive matching of your application
object model with Aerospike’s data model will generally result in a
successful application.
When pushing the envelope of performance, do not hesitate to use
additional resources. A great way to learn more about modeling is to
post questions to the data modeling discussion forum.
This is especially worthwhile to optimize Aerospike performance for an
application. In addition, discussing requirements with Aerospike’s
Solutions Architect team can result in further performance improvements
and increase your ROI using Aerospike.
Knowing the Right Questions to Ask is the First Step
By nature, the above is not a complete treatment of modeling. This
notebook may be updated with additional questions over time.