Introduction to Data Modeling
Last updated: June 22, 2021
The goal of this tutorial is to explain basic intuitions about modeling with Aerospike. The key to getting the most out of Aerospike is to find the right way to match an application’s object model and data access needs to Aerospike’s data model and access methods.
This notebook contains:
- A modeling-oriented overview of Aerospike’s architecture.
- Questions about an application to help determine how to align to Aerospike’s data model and/or data types.
- Simple example API calls associated with each.
This notebook does not include:
- Discussion of when to normalize or denormalize data for a better fit with Aerospike’s data model.
- Detailed examples for each data model or data type.
- Techniques for efficient reads or updates.
Other tutorials will focus on these facets of modeling in more detail.
This Jupyter Notebook requires the Aerospike Database running locally, along with a Java kernel and the Aerospike Java Client. To create a Docker container that satisfies the requirements and holds a copy of these notebooks, visit the Aerospike Notebooks Repo.
Notebook Setup
Import Jupyter Java Integration
Make it easier to work with Java in Jupyter.
import io.github.spencerpark.ijava.IJava;
import io.github.spencerpark.jupyter.kernel.magic.common.Shell;
IJava.getKernelInstance().getMagics().registerMagics(Shell.class);
Start Aerospike
Ensure Aerospike Database is running locally.
%sh asd
Download the Aerospike Java Client
Ask Maven to download and install the project object model (POM) of the Aerospike Java Client.
%%loadFromPOM
<dependencies>
<dependency>
<groupId>com.aerospike</groupId>
<artifactId>aerospike-client</artifactId>
<version>5.0.0</version>
</dependency>
</dependencies>
Start the Aerospike Java Client and Connect
Create an instance of the Aerospike Java Client, and connect to the demo cluster.
The default cluster location for the Docker container is localhost port 3000. If your cluster is not running on your local machine, modify localhost and 3000 to the values for your Aerospike cluster.
import com.aerospike.client.AerospikeClient;
AerospikeClient client = new AerospikeClient("localhost", 3000);
System.out.println("Initialized the client and connected to the cluster.");
Output:
Initialized the client and connected to the cluster.
A KV Store with Deliberate Structure
Aerospike differentiates itself from other Key-Value Stores through its architecture and the structure and tools that follow from it. One can throw documents of data at Aerospike and get enough performance to keep up with most applications. However, when an application must achieve high performance at scale, expert use of Aerospike delivers it. Those outcomes come from structure that other Key-Value Stores do not provide.
Aerospike Uses All Storage Media Types to Achieve High Performance
Aerospike was architected to efficiently store document-oriented data. Aerospike platform priorities include:
- Efficient parallel use of all of a machine’s storage media, especially flash storage (SSD, PCIe, NVMe).
- Reads with sub-millisecond latencies at very high throughput (100K to 1M operations per second), while under a mixed read/write load.
The Aerospike data model is a direct result of these priorities. These modeling notebooks teach the principles behind modeling that will result in proper use.
Schema-less Relational Database
The pieces of the Aerospike data model can be thought of as a mirror of the anatomy of a relational database.
- Namespace → Relational Database
- Primary Index → Primary Index
- Set → Table
- Record → Database Row
- Bin → Field
However, despite the similarities to their RDBMS counterparts, each of these has a well-defined purpose and characteristics that make it scale differently from the others.
Match the App Object Model to Aerospike's Data Model
The best practice is to consider both of the following questions when creating an application’s data model:
- How to match the application object model classes with the Aerospike data model elements?
- What are the application’s dimensions of data that must scale?
Because of Aerospike’s focus on scalability, properly matching the app object and Aerospike data models will result in a highly performant and scalable app.
From the application perspective, this consists of looking at the app's classes to determine the number and size of instances that will be stored in the Aerospike database. Minimum size, maximum size, and average size should all be considered, as well as the duration of storage. In addition, consider implicit dimensions of storage, such as how the data scales over time. Each object, including potentially implicit dimensions, will be directly paired with one or more elements of the Aerospike Data Model. Finally, consider the flows of how the data is created, modified, and deleted.
To determine how to match with the Aerospike data model, let's first discuss the elements of the Aerospike Data Model.
Elements of the Aerospike Data Model
The following are the elements of the Aerospike Data Model:
- Namespace and Primary Index
- Set
- Key and Digest
- Record
- Bin
- Collection Data Types
  - List
  - Map
At low read and write volumes, the above may seem like unnecessary complexity. However, as the application scales, the structure provided by the Aerospike data model allows Aerospike to be used surgically at petabyte scale, more efficiently (measured as ROI x Performance) than most varieties of database product. This is due to Aerospike’s architecture and a flexible data model that creates enough mesh points to match a complex application's object model and implicit data dimensions.
The following sections share modeling-related details and API code for working with those elements.
Namespace and Primary Index
The Namespace is a top-level data container that associates index and data with related storage media and with the policies that govern the data. Because each type of data in a data model has different read/write profile demands, it is common to divide data across more than one namespace. For example, an ecommerce app might store the hottest sales items in RAM, while the rest are stored in Flash. In such a circumstance, the application may store some identical data in 2 namespaces – 1 associating a subset of products with RAM storage and 1 associating the full product data set with Flash storage.
Because Namespaces are defined in the Aerospike configuration file, some changes require a rolling warm restart to take effect. This differentiates a Namespace from other data containers.
Each Aerospike server in a cluster has a Primary Index per namespace detailing the location of all records in all storage media on the node. Within the index, each record has a 64-byte footprint. The weight of this footprint suggests that most Aerospike records should be larger than a simple data type field. However, for the rare case of extremely high throughput access, the index can contain a single numeric element instead of the data record’s location.
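The 64-byte footprint makes the index cost of a candidate model easy to estimate. The following is a minimal sketch rather than an official sizing tool: the class and method names are hypothetical, and it deliberately ignores replication and any other server overhead.

```java
// Hypothetical helper: estimates primary index cost from the 64-byte
// per-record footprint described above. Replication and other server
// overhead are intentionally left out.
class IndexFootprint {
    static final long INDEX_BYTES_PER_RECORD = 64L;

    // Total primary index bytes for a given number of records.
    static long indexBytes(long recordCount) {
        return recordCount * INDEX_BYTES_PER_RECORD;
    }

    // Index bytes spent per byte of stored data. Values above 1.0 mean
    // the index entry is larger than the data it points to.
    static double indexToDataRatio(long averageRecordBytes) {
        return (double) INDEX_BYTES_PER_RECORD / averageRecordBytes;
    }

    public static void main(String[] args) {
        System.out.println(indexBytes(1_000_000_000L)); // 64000000000
        System.out.println(indexToDataRatio(8));        // 8.0
        System.out.println(indexToDataRatio(2048));     // 0.03125
    }
}
```

A record holding a lone 8-byte integer costs eight times its own size in index space, while a 2 KiB record costs about 3%. That is the arithmetic behind preferring records larger than a single simple field.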
Set
The Set is an optional label representing a segment of Records in a Namespace. A set facilitates fast access to its members.
Key and Digest
A Record is uniquely identified by a namespace and Digest. The digest is a client-generated RIPEMD-160 20-byte hash of the set name and the user key. The user key is the application’s unique identifier for a record in Aerospike – a string, a number, or a bytestream. The user key can be optionally stored in the Aerospike Database.
Creating a Key using Namespace, Set, and User Key
The following is Java Client code to create a key using the namespace, set, and user key.
import com.aerospike.client.Key;
String namespaceName = "test";
String setName = "dm101set";
Integer theKey = 0; // A key can be an integer, string, or blob.
Key key = new Key(namespaceName, setName, theKey);
System.out.println("Key created." );
Output:
Key created.
Record
Aerospike offers record-level ACID-compliance. That is, Aerospike allows execution of multiple record-operations as one atomic, consistent, isolated, and durable transaction by way of the operate method.
The structure of a record is a Map containing:
- Metadata
  - Expiration
  - Last Update Time
  - Generation Counter
- Map of Bins
Bin
A Bin is a flexible container that contains one data Value. A Value has an associated scalar or collection data type; however, a Bin's data type is not formally declared in a schema.
Creating a Simple Record Containing An Integer and A String Bin
The following Java client code uses the key from the previous code example to put integer and string data into a record in Aerospike.
import com.aerospike.client.Bin;
import com.aerospike.client.policy.ClientPolicy;
String aString = "modeling";
Integer anInteger = 8;
String stringBinName = "str";
String integerBinName = "int";
ClientPolicy clientPolicy = new ClientPolicy();
Bin bin0 = new Bin(stringBinName, aString);
Bin bin1 = new Bin(integerBinName, anInteger);
client.put(clientPolicy.writePolicyDefault, key, bin0, bin1);
System.out.println("Put data into Aerospike: " + stringBinName + "=" + aString + ", " + integerBinName + "=" + anInteger);
Output:
Put data into Aerospike: str=modeling, int=8
Reading the Record
Use the same key to read the record.
import com.aerospike.client.Record;
Record record = client.get(null, key);
System.out.println("Generation count: " + record.generation);
System.out.println("Record expiration: " + record.expiration);
System.out.println("Bins: " + record.bins);
Output:
Generation count: 1
Record expiration: 364672907
Bins: {str=modeling, int=8}
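The expiration printed above is not a Unix timestamp. In the Aerospike Java Client, `Record.expiration` is documented as seconds counted from January 1, 2010 00:00:00 GMT. The sketch below converts the sample value to a calendar time using plain `java.time`. The class and method names are hypothetical, and no Aerospike calls are involved.

```java
import java.time.Instant;

// Hypothetical helper for reading the raw expiration value shown above.
class ExpirationDecoder {
    // Aerospike counts record expiration from this epoch rather than 1970.
    static final Instant AEROSPIKE_EPOCH = Instant.parse("2010-01-01T00:00:00Z");

    // Converts seconds-since-2010 into an absolute point in time.
    static Instant toInstant(long expirationSeconds) {
        return AEROSPIKE_EPOCH.plusSeconds(expirationSeconds);
    }

    public static void main(String[] args) {
        System.out.println(toInstant(364672907L)); // 2021-07-22T18:01:47Z
    }
}
```

The sample record therefore expires roughly 30 days after this notebook's last-updated date. If only the remaining lifetime matters, the client's `Record.getTimeToLive()` method may be more convenient than decoding the raw value.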
Collection Data Types
Lists and Maps are Collection Data Types (CDTs). These are flexible, schema-less data types that can contain Values of any data type, either scalar data or collection data. Collection Data Types can be nested as deeply as necessary to match an application’s needs.
A List is a collection of Values. For data efficiency, Lists are frequently used as tuples, a lightweight record structure using position instead of field names.
A Map is a collection of mapkey/Value pairs. Maps are commonly used for JSON-like data structures.
Because a Record contains one or more Bins, and a Bin or CDT can contain a scalar data type or collection data type, the most common question to consider when creating an application's data model in Aerospike is whether to store a class instance as a Record, Bin, CDT, or nested CDT.
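To make the difference between the two CDT shapes concrete, here is a minimal plain-Java sketch (no Aerospike calls) of the same value expressed as a positional tuple and as a field-named map. The class name, the position constants, and the field names are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TupleVersusMap {
    // Positions give a tuple its meaning, so name them once as constants.
    static final int TIME = 0, ATHLETE = 1, LOCATION = 2, DATE = 3;

    // Compact form: no field names are stored with the data.
    static List<Object> asTuple() {
        return List.of(9.92, "Carl Lewis", "Seoul, South Korea", "September 24, 1988");
    }

    // Same data with field names, closer to a JSON document.
    static Map<String, Object> asMap(List<Object> tuple) {
        Map<String, Object> map = new LinkedHashMap<>();
        map.put("time", tuple.get(TIME));
        map.put("athlete", tuple.get(ATHLETE));
        map.put("location", tuple.get(LOCATION));
        map.put("date", tuple.get(DATE));
        return map;
    }

    public static void main(String[] args) {
        List<Object> tuple = asTuple();
        System.out.println(tuple.get(ATHLETE)); // Carl Lewis
        System.out.println(asMap(tuple));
    }
}
```

The tuple avoids repeating field names in every stored instance, at the cost of every reader needing to know the positions. The map is self-describing but larger.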
Lists
Create a tuple and put it in Aerospike.
import com.aerospike.client.Value;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
ArrayList<Value> aTuple = new ArrayList<Value>();
aTuple.add(Value.get(9.92));
aTuple.add(Value.get("Carl Lewis"));
aTuple.add(Value.get("Seoul, South Korea"));
aTuple.add(Value.get("September 24, 1988"));
String tupleBinName = "tuple";
Bin bin2 = new Bin(tupleBinName, aTuple);
client.put(clientPolicy.writePolicyDefault, key, bin2);
Record record = client.get(null, key);
System.out.println("Put data into Aerospike: " + tupleBinName + "=" + aTuple);
System.out.println("After operation, Bins: " + record.bins);
System.out.println( tupleBinName + ": " + record.getValue(tupleBinName));
Output:
Put data into Aerospike: tuple=[9.92, Carl Lewis, Seoul, South Korea, September 24, 1988]
After operation, Bins: {str=modeling, int=8, tuple=[9.92, Carl Lewis, Seoul, South Korea, September 24, 1988]}
tuple: [9.92, Carl Lewis, Seoul, South Korea, September 24, 1988]
Maps
Rather than use a simple tuple, this model needs a Map containing a list of Tuples. Reuse the Tuple Bin.
import java.util.HashMap;
String tupleMapKey = "world-records";
ArrayList<Value> tupleList = new ArrayList<Value>();
tupleList.add(Value.get(aTuple));
HashMap<String, ArrayList<Value>> wrMap = new HashMap<>();
wrMap.put(tupleMapKey, tupleList);
Bin bin2 = new Bin(tupleBinName, wrMap);
client.put(clientPolicy.writePolicyDefault, key, bin2);
Record record = client.get(null, key);
System.out.println("After operation, " + tupleBinName + ": " + record.getValue(tupleBinName));
Output:
After operation, tuple: {world-records=[[9.92, Carl Lewis, Seoul, South Korea, September 24, 1988]]}
Distinguishing Questions
From a modeling perspective, each Aerospike data model element is a potential mesh point with the application object model, one that helps you store the application's object instances in Aerospike. The following questions give broad guidance on how to fit them together. The text after each question explains the intuition to apply.
Questions related to Storage Medium
Q: Does the application require a specific storage medium for a particular type of data, to achieve a necessary scale and frequency of reads or writes?
The easiest way to match data to hardware is to assign it to the right namespace. Namespaces associate index and data with storage media, like fast NVMe drives, persistent memory, or DRAM.
Q: Does the application need to store an integer or float with extremely frequent reads or writes?
Aerospike can store an integer or float in the primary index, instead of storing a memory location of the data. This provides even faster access than storing in DRAM media.
Questions related to Data Reads and Writes
Q: Does the application have an object class for which a large number of instances need to be stored in Aerospike and are frequently read together by the application?
There are two common options:
- If the instance size is very small (for example, measured in bytes rather than KiB) and rarely updated, then multiple such instances can be stored grouped in one or more Aerospike Records. The instances can be split among records using a naming convention based on how the application seeks to access the data. For example, IoT sensor readings are commonly written once, as observed, and then read by date range. They can therefore be stored by a primary key made from the object name and date.
- If the instance size is medium to large, then the object is better stored as a Record, using a common Set name to group the Records. The fastest way to read a group of application Records is either:
  - Scanning an Aerospike Namespace or Set of Records.
  - Creating a secondary index and querying against that secondary index.
At large scale, the time taken by parsing every Record in a Namespace as part of a scan is significant. Set indexes (added in version 5.6), a type of secondary index, will make scanning a Set much faster, if the Set is a small percentage of the Namespace.
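The sensor example in the first option can be sketched in plain Java. The key format (`name:yyyyMMdd`) and all names below are assumptions made for illustration. Any convention works as long as writers and readers agree on it.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical naming convention: one Aerospike record per sensor per day.
class SensorKeys {
    static final DateTimeFormatter DAY = DateTimeFormatter.BASIC_ISO_DATE; // yyyyMMdd

    // Builds a user key such as "sensor7:20210622".
    static String userKey(String sensorName, LocalDate day) {
        return sensorName + ":" + day.format(DAY);
    }

    // Lists the user keys needed to read an inclusive date range.
    static List<String> userKeysForRange(String sensorName, LocalDate from, LocalDate to) {
        List<String> keys = new ArrayList<>();
        for (LocalDate d = from; !d.isAfter(to); d = d.plusDays(1)) {
            keys.add(userKey(sensorName, d));
        }
        return keys;
    }

    public static void main(String[] args) {
        // [sensor7:20210621, sensor7:20210622, sensor7:20210623]
        System.out.println(userKeysForRange("sensor7",
                LocalDate.of(2021, 6, 21), LocalDate.of(2021, 6, 23)));
    }
}
```

Each string would serve as the user key passed to `new Key(namespace, set, userKey)`. Because the keys for a date range can be computed up front, the application can read exactly the records it needs instead of scanning.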
Q: Do writes occur grouped into transactions or are individual pieces of data updated one by one?
Aerospike provides single-record transactions that are ACID compliant. Store data requiring atomic updates in one or more Bins in the same Aerospike Record, and use the Operate API to execute a multiple-operation transaction. If updates occur element by element, data can be stored in one or more Records, or in one or more Bins of data.
Q: During a single database transaction when updating data in an instance of an application object, are both of the following true?
- Reads are interspersed with writes
- Operations are executed on different parts of a record, as if they were different objects
It can be helpful to store data that behaves like different objects in separate Bins.
When applying transaction operations on a record using Operate(), the Aerospike client delivers the return values from operations per Bin. These return values can be accessed in order, making transaction results easier to work with when data is put in separate Bins.
Q: Is the size of a set of application records large (for example, measured in MiB rather than KiB)?
There is an inherent trade-off in record size, as updating an app record will require a read, modify, and write of the entire Aerospike record. Consider storing the app record in more than one Aerospike record, rather than in a single monolithic record.
When data is large, taking advantage of an intrinsic property of the data, like a timestamp, can help to distribute data in an intuitive way across records. Including timestamp in a set name or user key name, for example, allows more efficient reads and writes. It will also allow graceful rotation of data.
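One way to picture splitting an app record is to partition its elements into fixed-size chunks and give each chunk its own user key. This is a sketch under stated assumptions: the chunk size, the `baseKey:chunkIndex` key format, and all names are hypothetical. A real split would more likely follow an intrinsic property of the data, such as the timestamp mentioned above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: spreads one large app record over several smaller ones.
class RecordSplitter {
    // Splits items into chunks of at most chunkSize elements,
    // keyed "baseKey:0", "baseKey:1", and so on, in insertion order.
    static <T> Map<String, List<T>> split(String baseKey, List<T> items, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        Map<String, List<T>> chunks = new LinkedHashMap<>();
        for (int start = 0; start < items.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, items.size());
            String userKey = baseKey + ":" + (start / chunkSize);
            chunks.put(userKey, new ArrayList<>(items.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // {user42:0=[1, 2], user42:1=[3, 4], user42:2=[5]}
        System.out.println(split("user42", List.of(1, 2, 3, 4, 5), 2));
    }
}
```

With this layout, changing one element rewrites only the chunk that holds it, rather than the whole monolithic record.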
Questions related to Deletes
Q: If there are no updates, can data naturally age out of the application?
It is common for applications to naturally allow data to expire after creation or update. Aerospike records have an Expiration metadata field that can be used to automatically expire data and reclaim storage space. All operations can be configured with a policy to set or update the Expiration.
An example of this is a bank keeping track of a customer's put stock option. An option grants the holder the right to make a stock transaction until a specified date that is determined at the purchase time. Once the expiration has passed, the option expires and the holder no longer has the right. The bank would model this in their computer systems using Expiration.
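As a sketch of the option example: the Aerospike Java Client's `WritePolicy` has an `expiration` field that holds a record's time-to-live in seconds, so the application only needs to compute the seconds remaining until the option's expiry. The class name, method name, and dates below are hypothetical, and the code is plain Java with no Aerospike calls.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper: turns an option's expiry time into a TTL in seconds.
class OptionTtl {
    // Returns the seconds from 'now' until 'expiry'. Non-positive results are
    // rejected because such values carry special meanings in the client
    // (for example, 0 defers to the namespace default).
    static int ttlSeconds(Instant now, Instant expiry) {
        long seconds = Duration.between(now, expiry).getSeconds();
        if (seconds <= 0) {
            throw new IllegalArgumentException("option has already expired");
        }
        return Math.toIntExact(seconds);
    }

    public static void main(String[] args) {
        Instant purchase = Instant.parse("2021-06-22T14:00:00Z");
        Instant expiry = Instant.parse("2021-07-16T20:00:00Z");
        System.out.println(ttlSeconds(purchase, expiry)); // 2095200
    }
}
```

The result would be assigned to a `WritePolicy`'s `expiration` field before the record is written, leaving the server to remove the record once the option lapses.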
Q: Does the application require that a group of associated records created at distinct times be removed at the same time?
The most common way to explicitly rotate out data at intervals is to store Aerospike Records in Sets and truncate the Sets from the associated Namespace, when appropriate.
An example of this is data that accrues over the course of a day, but then is worthless. One way to model the data would be to insert into a Set named for the day, and at the end of the day, the application would truncate the Set.
Questions related to Application Scale
Q: Does your application volume result in multiple servers routinely suffering simultaneous downtime for disrepair or service?
It is common for Aerospike clusters, when the model is architected properly, to replace competing databases at a 1:5 (Aerospike:Other) ratio. When handling downtime, it is important to configure whether Aerospike runs in AP mode or SC mode.
- AP or High Availability Mode – Prioritizes data availability for reads over data replication.
- SC or Strong Consistency Mode – Prioritizes writes and data replication across an Aerospike cluster over reads.
Notebook Cleanup
Truncate the Set
Truncate the set from the Aerospike Database.
import com.aerospike.client.policy.InfoPolicy;
InfoPolicy infoPolicy = new InfoPolicy();
client.truncate(infoPolicy, namespaceName, setName, null);
System.out.println("Set Truncated.");
Output:
Set Truncated.
Close the Client connections to Aerospike
client.close();
System.out.println("Server connection(s) closed.");
Output:
Server connection(s) closed.
Code Summary
Overview
Here is a collection of all of the non-Jupyter code from this tutorial.
- Import Java Libraries.
- Import Aerospike Client Libraries.
- Start the Aerospike Client.
- Create a Key using Namespace Set and User Key.
- Create Bins of Data.
- String
- Integer
- List
- Map
- Put Bins into an Aerospike Record.
- Get the Record from Aerospike.
- Truncate the Set.
- Close Client Connections.
// Import Java Libraries
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.HashMap;
// Import Aerospike Client Libraries
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Bin;
import com.aerospike.client.policy.ClientPolicy;
import com.aerospike.client.Value;
import com.aerospike.client.Record;
import com.aerospike.client.policy.InfoPolicy;
InfoPolicy infoPolicy = new InfoPolicy();
// Start the Aerospike Client.
AerospikeClient client = new AerospikeClient("localhost", 3000);
System.out.println("Initialized the client and connected to the cluster.");
// Create a Key using Namespace Set and User Key
String namespaceName = "test";
String setName = "dm101set";
Integer theKey = 0; // A key can be an integer, string, or blob.
Key key = new Key(namespaceName, setName, theKey);
System.out.println("Key created." );
// Create Bins of Data.
// A. Integer
Integer anInteger = 8;
String integerBinName = "int";
ClientPolicy clientPolicy = new ClientPolicy();
Bin bin0 = new Bin(integerBinName, anInteger);
// B. String
String aString = "modeling";
String stringBinName = "str";
Bin bin1 = new Bin(stringBinName, aString);
// C. List
ArrayList<Value> aTuple = new ArrayList<Value>();
aTuple.add(Value.get(9.92));
aTuple.add(Value.get("Carl Lewis"));
aTuple.add(Value.get("Seoul, South Korea"));
aTuple.add(Value.get("September 24, 1988"));
String tupleBinName = "tuple";
Bin bin2 = new Bin(tupleBinName, aTuple);
client.put(clientPolicy.writePolicyDefault, key, bin2);
// D. Map
String mapTupleBinName = "maptuple";
String tupleMapKey = "world-records";
ArrayList<Value> tupleList = new ArrayList<Value>();
tupleList.add(Value.get(aTuple));
HashMap<String, ArrayList<Value>> wrMap = new HashMap<>();
wrMap.put(tupleMapKey, tupleList);
Bin bin3 = new Bin(mapTupleBinName, wrMap);
// Put the Bins into Aerospike
client.put(clientPolicy.writePolicyDefault, key, bin0, bin1, bin2, bin3);
// Get the Record from Aerospike.
Record record = client.get(null, key);
System.out.println("Read from Aerospike –");
System.out.println("Generation count: " + record.generation);
System.out.println("Record expiration: " + record.expiration);
System.out.println( integerBinName + ": " + record.getValue(integerBinName));
System.out.println( stringBinName + ": " + record.getValue(stringBinName));
System.out.println( tupleBinName + ": " + record.getValue(tupleBinName));
System.out.println( mapTupleBinName + ": " + record.getValue(mapTupleBinName));
// Truncate the Set.
client.truncate(infoPolicy, namespaceName, setName, null);
System.out.println("Set Truncated.");
// Close Client Connections.
client.close();
System.out.println("Server connection(s) closed.");
Output:
Initialized the client and connected to the cluster.
Key created.
Read from Aerospike –
Generation count: 2
Record expiration: 364672908
int: 8
str: modeling
tuple: [9.92, Carl Lewis, Seoul, South Korea, September 24, 1988]
maptuple: {world-records=[[9.92, Carl Lewis, Seoul, South Korea, September 24, 1988]]}
Set Truncated.
Server connection(s) closed.
Takeaways – Data Modeling is an Art and a Science
Data modeling with Aerospike is a science, but deep enough that it will seem like an art at first. An intuitive matching of your application object model with Aerospike's data model will generally result in a successful application.
When pushing the envelope of performance, do not hesitate to use additional resources. A great way to learn more about modeling is to post questions to the data modeling discussion forum. This is especially worthwhile when optimizing Aerospike performance for an application. In addition, discussing requirements with Aerospike's Solutions Architect team can yield further performance improvements and increase your ROI using Aerospike.
Knowing the Right Questions to Ask is the First Step
By nature, the above is an incomplete treatment of modeling. This notebook may be updated with additional questions over time. Please submit feedback to help refine it.
What's Next?
Next steps
Have questions? Don't hesitate to reach out if you have additional questions about data modeling at https://discuss.aerospike.com/c/how-developers-are-using-aerospike/data-modeling/143.
Want to check out other Java notebooks?
Are you running this from Binder? Download the Aerospike Notebook Repo and work with Aerospike Database and Jupyter locally using a Docker container.
Additional Resources
- Want to get started with Java? Download or install the Aerospike Java Client (https://aerospike.com/apidocs/java/com/aerospike/client/cdt/MapOperation.html).
- What are Namespaces, Sets, and Bins? Check out the Aerospike Data Model.
- How robust is the Aerospike Database? Browse the Aerospike Database Architecture.