Glossary

The Aerospike schemaless data model gives application designers maximum flexibility. Aerospike uses the following terms to differentiate it from the relational database (RDBMS) world. In our documentation, we introduce Aerospike concepts with their corresponding common RDBMS terms.

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

ACID compliant

ACID compliance refers to database transaction characteristics of Atomicity, Consistency, Isolation, and Durability (ACID).

Atomicity: Ensures that the transaction’s commands are executed in order, and all record modifications succeed. Otherwise, all record modifications fail and are rolled back to their state before the transaction.
Consistency: Ensures that transactions only make changes in predictable ways.
Isolation: Ensures that record modifications inside the transaction aren’t viewable from the outside, until it is committed, preventing concurrent transactions from interfering with each other. Aerospike transactions guarantee strict serializability.
Durability: Ensures that data is saved after a transaction is completed, even if there is system failure such as a power outage.

Adopting ACID principles puts Aerospike in industry-standard compliance with reliability, validity, and accuracy measures. These principles ensure that there is no data loss or corruption due to network errors, disruptions, or hardware failures. Industries that require ACID compliance include financial institutions, manufacturing operations, transportation, IoT environments, and energy production.

AQL

Aerospike Quick Look. A client built around a familiar and common query language. May be familiar to SQL users but does not maintain parity with SQL by design.

all flash

When the primary index and the data are both stored on NVMe flash devices rather than storing just the primary index in memory.

asadm

Aerospike admin tool. A multifunctional utility to extract and change configuration, configure authentication, and analyze performance and health information from a cluster or a collectinfo file. Python based.

asd

Aerospike Daemon. An Aerospike database process that is created by the user and runs on a server or node.

asmt

Aerospike Shared Memory Tool. Enables primary and secondary indexes to be backed up from shared memory to files and restored from files to shared memory. This allows the database to be restarted and the indexes restored, enabling a fast restart.

available mode

Aerospike’s default consistency mode represents an Available and Partition-tolerant (AP) mode from the CAP theorem perspective. In the event of a network partition, any sub-cluster of servers claims complete ownership of all data partitions. This is in contrast to Strong Consistency (SC) mode (CP).

batch operations

Batch operations are repeating computing tasks that can be kicked off and left unattended until they run to completion. The term batch operations arose when punched cards were used to tell computers what to do when performing more than one program. When multiple directions were needed, these cards were run in batches.

In the database world, batch operations refers to the processing of a large number of like tasks (batch reads and batch writes are most common) instead of processing each task separately. Batch updates also fall under batch operations. Batch updates are sets of multiple update statements that are submitted to the database for processing as a batch.

Batch operations usually save compute resources and time because executing a hundred (or a million) individual reads or writes usually takes much longer than executing those operations in a batch.

batch

A batch transaction is when you have the key or digest and know the records you want to access, which are sent together directly to the relevant nodes. A batch groups multiple operations into one unit and transmits them over a single network socket to each cluster node.

batched commands

Batch commands are repeating computing tasks that can be kicked off and left unattended until they run to completion.

Batched operations refers to the processing of a group of commands such as reads, writes, and deletes, instead of processing each task separately. Batch updates also fall under batched commands. Batch updates are sets of multiple update statements that are submitted to the database for processing as a batch.

Batched commands usually save compute resources and time because executing multiple individual reads or writes usually takes much longer than executing those operations in a batch.

big data

Big data refers to the recent trend of managing and processing increasingly large structured and unstructured datasets that are available to businesses. Big Data has been characterized by the alliteration of the “4 Vs of Big Data” – volume, velocity, variety and veracity (sometimes a fifth V is added for “value.” Big data is also arriving from different sources (vehicles, wearables, appliances, artificial intelligence), making it a challenge for traditional relational databases to handle with low latency.

Big data is important because of how it can be used and the vast and growing collection of new, exciting use cases that it has inspired. Through analysis, the data can show how to improve business inefficiencies, predict user and market behaviors, or to create new revenue streams and markets. Businesses can use big data to figure out why a product or service failed, detect fraud early and recalculate risks. More and more data is used in machine learning and artificial intelligence applications, which in turn will drive further data growth.

Examples of big data can include social media analysis, the stock exchange simulations, and to analyze complex systems and machines such as jet engines, oil derricks, and traffic systems. The application of big data is nearly limitless in scope and potential.

bin

A sub-object of a record in Aerospike. Each bin has a data type, which does not need to match the data types of bins in other records. In the Aerospike database, each record (similar to a row in a relational database) stores data using one or more bins (like columns in a relational database). The major difference between bins and RDBMS columns is that you don’t need to define a schema. Each record can have multiple bins. Bins accept these data types (which are also referred to as “particles” in documentation and messages about bins):

Boolean
Bytes
Double
Geospatial
HyperLogLog
Integer
List
Maps
String

For information about these data types and how bins support them, see “Scalar Data Types”

Although the bin for a given record or object must be typed, bins in different rows do not have to be the same type. There are some internal performance optimizations for single-bin namespaces.

CAP theorem

CAP theorem states that distributed systems can provide at most two of the following three properties: Consistency, Availability, and Partition Tolerance. In Aerospike, you can choose Available and Partition-Tolerant (AP) or Consistent and Partition-Tolerant (CP), which is known as strong consistency (SC) in Aerospike. Consistent and Available (CA) cannot be implemented in practice for distributed systems as consistency and availability fail during partition events.

client

A library included by the user’s application, which provides an API that allows the application to perform operations against the Aerospike database cluster. In our documentation, client, API, and application are used interchangeably. The client is written in a language such as Java, C, C#, Go, Python, Node.js, Ruby, Rust and others

cloud based database

A cloud based database is one that is built, deployed and accessed in a public, private or hybrid cloud. A cloud based database has similar functions to a traditional database, but offers greater flexibility with cloud computing.

Among the other benefits of a cloud based database is that users can host databases without buying dedicated hardware. It also enables the user or the provider to manage it and is easy to access through a web interface or an API provided by the vendor.

In addition, a cloud based database can also support relational and NoSQL databases. Its storage can be scaled up to handle a growing demand or be decommissioned quickly if a project is abandoned.

cloud managed services

Cloud managed services are a set of applications or utilities that are provided to end users typically using a web interface in the cloud. These services can provide a wide range of business or technical functions that hide the complexity of cloud platform management and control. This can include migration, maintenance and optimization.

Cloud managed services can help businesses achieve greater agility and efficiency with essential business processes without having to hire and train a technical team to keep those systems running.

Cloud managed services typically operate in public and hybrid cloud environments. One organization may decide to have their entire infrastructure in the cloud, while others only want their CRM solutions. Cloud managed services can take on various tasks, such as engineering on demand, operations management, continuous support, hosting and implementation.

The benefits of cloud managed services include lower infrastructure costs, elastic scalability, more predictable pricing, automatic upgrades, disaster recovery support, and enhanced availability and security.

cluster

Aerospike is a distributed database, made up of a collection of one or more database nodes, a cluster. The cluster acts together to distribute and replicate both data and traffic. Client applications use Aerospike APIs to interact with the cluster, rather than with individual nodes. This means that the application does not need to know cluster configuration. Data in the cluster is evenly distributed to ensure an even resource consumption on the nodes. As you add or remove nodes from the cluster, it dynamically adjusts without needing any application code or configuration changes.

cold start

A restart scenario where, after shutdown, disks are scanned, and the index is rebuilt from storage. Records not durably deleted may resurrect, potentially reverting to older states. This contrasts with a fast restart.

collection data type (CDT)

Collection data types (CDTs) are flexible, schema-free containers that can hold scalar data or nest other collections within them. The elements in a CDT can be of mixed types. CDTs are a superset of JSON, supporting more data types as elements of the collection, such as integers for map keys and binary data as list values.

consistency

Requirement in database systems that any given database transaction must change affected data only in allowed ways. Any data written to the database must be valid according to all defined rules.

cross-datacenter replication

Cross-datacenter replication (XDR) lets data be reproduced – or replicated – across clusters that can be located in different clouds and various data centers.

The replication guards against data-center failure. It’s also used to supply high-performance access to globally distributed applications that are mission critical. Cross-datacenter replication guarantees continuous service because if one of the data centers has a problem, there is backup data in another center.

Once replications are established, they continuously replicate until paused or deleted.

The telecommunications industry relies on cross-datacenter replication because data availability, consistency, resilience and low latency are critical.

DBaaS

DBaaS (database as a service) is a cloud computing managed service that provides various database services without having to understand the underlying hardware, software, or database operations.

DBaaS providers host the database infrastructure and typically provide a web interface to add and query data, although they often also provide access to the data using standard tools or special APIs. These providers take care of scalability, resilience, restoration, security and maintenance. They often offer 24/7 support and geo-replication for availability and backups.

The benefits of DBaaS is that it’s simpler to deploy, and thus more immediate, and sometimes more cost effective. This can lead to faster and deployments for developers and businesses and provide greater agility to business operations. DBaaS can be an attractive option for small businesses and startups that do not own data centers or racks of computers.

data intensive applications

Data intensive applications handle large quantities of data (multiple terabytes and petabytes) that can be complex and distributed across various locations. Data intensive applications process data in multistep analytical pipelines, including transformation and fusion stages.

Some examples of data intensive applications include stock trading applications, user behavior analysis, market simulations, and digital marketing. A stock trading application needs user account information access and also information about the market and portfolios. In digital marketing, there may be several campaigns running at one time, in addition to using demographic information to target specific ads to specific consumers.

When looking at data intensive applications, it’s important to consider the optimal methods of handling high volumes of different types of data, scalability, resilience and security.

data pipeline tools

Data pipeline tools are used to automate data extraction, cleaning and loading in order to make the process more efficient, reliable and secure. It makes ingestion from various data sources to a single destination easier and more consistent.

There are free data pipeline tools like FOSS (free and open-source software) that can be customized to fit specific use cases. However, it can be more difficult to scale FOSS, and there is a lack of technical support.

Data pipeline tools are important because they take massive amounts of raw data and transform it into data that is ready for analytics, data apps and machine learning systems. For example, data pipeline tools can be used to deliver sales data to sales and marketing as part of a customer 360 initiative, or recommend financial services to a small business owner.

data pipeline

In analytics, a data pipeline is a collections of systems that covers the entire data journey from extraction from the data sources, to ingest into a file system, database, or storage service, to the ETL systems that transform and prepare data for analysis, to the analytics data processing engines, and the output of data to dashboards, BI tools, and data applications.

Many organizations have several, even hundreds of data pipelines that service different lines of business or use cases. Effective design and implementation of data pipelines helps organizations gain better and more insights by effectively capturing, organizing, routing, processing and visualizing data.

As more data becomes available from more sources, creating effective data pipelines is essential to connecting and coordinating different data sources, storage layers, data processing systems, analytics tools and applications. Since data scientists, developers and business leaders may all want to work with data in different ways, a flexible data pipeline architecture is essential so that relevant details for each team can be gathered, stored and made available for whatever analysis is needed.

Design goals of an effective data pipeline architecture is that it’s scalable, flexible, cost-effective and optimized for a wide variety of analytical tasks.

data storage layer

A data storage layer is where your gathered data is stored and saved for when it is needed. There are four layers in data warehouse architecture: data source layer, data staging layer, data storage layer and data presentation layer. The data storage layer makes it easier to back up files to ensure they remain safe and can be recovered quickly if computer hackers strike or there is some sort of outage.

In the data storage layer, the data is cleaned, transformed and prepared with a specific structure. This enables access by those within a business who require the data for various reasons.

data synchronization

Data synchronization is required when two or more systems want to access and manipulate the same datasets with accuracy and consistency. Data synchronization can take place in memory in the case of a traditional relational database, or it may be required with datasets that are widely distributed – in different cities, regions, or data centers.

In order to achieve effective Data synchronization, a database/data platform must prepare and cleanse data, check for errors or duplication and then ensure consistency before it can be distributed, replicated, and synchronized. This is important because if synchronized data is changed by any replica, those updates must be reflected throughout the system to avoid errors, prevent fraud, protect private data and deliver accurate, up-to-date information and insights.

Data synchronization is becoming more vital as the population grows mobile and globalization continues. Data synchronization is also important with the growing accessibility to cloud-based data.

Some of the data synchronization methods include data replication in databases, file synchronization – typically used for home/cloud backups – and version control methods to synchronize files that might be changed by more than one user simultaneously. A distributed file system usually requires that devices be connected in order to sync multiple file versions. Mirror computing provides different sources with the same copy of the data set.

defragmentation

When records are updated or deleted, the percentage of active records in a previously written block may fall below the defrag-lwm-pct threshold. This makes the block eligible for defragmentation, where records from partially empty blocks are read and rewritten to a new write block to optimize space and access efficiency.

demand-side platform

A demand-side platform is a marketing automation tool that helps mobile advertisers buy mobile, search and video ads from a marketplace where publishers list ad inventory. A demand-side platform provides a way for managing ads across various real-time bidding networks.

Demand-side platforms run independently of networks like Facebook or Instagram. As third-party software, they provide advertisers with one place to buy, analyze and manage advertising across many networks.

One of the advantages of using a demand-side platform is greater efficiency. Since advertisers only have to use one dashboard, more information is available than from a single network. Ads can also be better targeted using the available data, which can lead to higher conversion rates.

Demand-side platforms are considered an important tool to mobile marketing because they are automated and provide a way for campaigns to easily be set up and managed. In addition, campaign performance can be seen in real time, providing a way for advertisers to make changes as needed to gain the greatest benefit.

demarshaling

Also known as deserializing. This process converts a serialized data structure, such as from incoming network communication, into an internal data structure. The reverse operation, converting an internal structure to a serial format, is called marshaling or serializing.

digest

The Primary Index Digest is a 20-byte unique object identifier created on the client side by hashing the user key and, if available, the record’s set name using the RIPEMD-160 algorithm, which takes a key of any length and always returns a 20-byte digest. By default the record saves the digest but not the key, which saves storage for long keys over 20 bytes.

distributed SQL

Distributed SQL is the ability to query a single logical relational database across multiple servers (clusters) with standard SQL syntax. Distributed SQL databases have strong consistency across clusters, data centers, or other geographic/availability zones. Distributed SQL databases are important because they are capable of scaling out quickly by adding additional cluster nodes and can therefore handle very large datasets.

Distributed SQL is suited to use cases where dramatic surges and troughs of activity are common, such as ecommerce sites that experience large surges of activity during holidays, or betting sites that experience an avalanche of activity during large sporting events.

Distributed SQL databases provide ample headroom and enough capacity to handle such sudden high demand at optimal operating cost, scaling infrastructure back down after the big game or holiday. Other uses for distributed SQL include streaming media that requires large amounts of data to customize offerings for users. The flexibility offered by distributed SQL can help eliminate downtime and lead to cost savings when users can quickly scale up or down, depending on their needs.

edge data

Edge data is data that is created as a result of edge computing processes, which is done at or near the physical location of the user or the source of the data. Being at the edge has connotations of limited network bandwidth and being outside the perimeter of data centers and the cloud.

The benefit of placing computing services closer to locations such as bank branch offices – or even oil derricks – is that local users and analysts get immediate and more reliable services and insights. With an effective edge computing mechanism, data can be cleansed and transformed at the point of origin, thus reducing the amount of ETL work done by core systems.

As more organizations with remote locations are trying to handle growing data volumes, edge computing provides a way to apply storage and compute resources in the most efficient and cost effective manner.

fast restart

Also known as warm start, this feature in Aerospike Enterprise Edition allows the server to store the primary index and other critical metadata in Linux shared memory. When restarting the Aerospike server (asd), this metadata enables a quick rebuild of the index without fully regenerating its state from the storage drive. The Aerospike Shared Memory Tool (ASMT) can be used to save metadata to disk for fast restarts across reboots.

five-nines uptime

Five-nines uptime – or 99.999% – refers to the amount of time a network or service is available to users or other systems over a certain period, usually a year. This means there will be about 5.26 minutes of total downtime, either planned or unplanned.

Six-nines (99.9999%) = .526 minutes
Five-nines (99.999%) = 5.26 minutes
Four-nines (99.99%) = 53 minutes
Three-nines (99.9%) = 8 Hours and 46 Minutes
Two-nines (99%) = 3 Days, 15 Hours, 36 Minutes

Five-nines uptime is achieved by adding redundancy, failover, fast restart/respawn of processes so that no single component or combination of component failures can crash the entire system. Steps are also taken to ensure the crossover between redundant systems doesn’t become a failure point. The availability is also enhanced when failures are detectable as they occur and reliance on staff is reduced in order to cut human error.

Five nines of uptime is becoming more critical for organizations that rely on high operational performance, such as hospitals or data centers. Practically speaking, five-nines and above uptime is considered “always on”.

heartbeat

Messages exchanged between nodes in an Aerospike cluster to monitor and detect changes in cluster membership. The heartbeat protocol can be multicast (using IP multicast) or mesh (using configured IP addresses of peer nodes to connect).

high availability database

A high availability database is a database that is designed to operate with no interruptions in service, even if there are hardware outages or network problems. High availability databases often exceed even what’s stipulated in a service level agreement.

A high availability database ensures greater uptime by eliminating single points of failure, ensuring a reliable crossover between redundant systems and detecting failures right away, such as through environmental problems or hardware or software failures.

Typical high availability database features include server or node failover, hot standby, data replication and distributed microservice architecture.

Many businesses today have critical databases and applications, such as data warehouses and ecommerce applications that require high availability. High availability databases are important to reduce the risk of losing revenue or dissatisfied customers.

hotkey

A hotkey (also hot key or hot-key) is a specific key or digest subjected to a disproportionately large number of read/write operations in a short time window. This can occur when multiple clients or processes attempt to access or modify the same data element simultaneously, leading to a concentrated workload on a single node.

When a server node receives too many concurrent requests for the same key, it may reject the request with a KEY_BUSY error to avoid uneven load distribution. This also increments the fail_key_busy statistic for monitoring such scenarios.

Write hotkeys are logged, while read hotkeys are logged only in strong consistency mode. You can also enable key logging by setting the rw-client logging context to detail for hotkey analysis.

hybrid storage

Hybrid storage is a storage strategy that blends the use of flash storage, solid state drives (SSDs) and mechanical disk drives in order to provide the optimal combination of cost and performance for a given set of workloads. A hybrid storage approach enables a myriad set of different applications and use cases to have the storage performance they need at the right price point provided by the hybrid storage platform.

One of the benefits of hybrid storage is that it enables organizations to leverage high performance storage – such as flash drives or SSDs – when it is needed. Organizations can determine whether data is hot, warm or cold and then choose the most appropriate storage medium for the application. This enables businesses to craft a plan about how data will be used and when to achieve the greatest impact and efficiency.

Hybrid storage can sometimes be implemented in a single storage system. This offers users a single point of accountability for hardware and software issues. This can be important when businesses are looking for greater efficiency when data volumes are increasing and storing everything on flash storage can be too expensive.

Index

A data structure or mechanism that improves the speed and efficiency of data retrieval operations within a database or other data storage systems.

key value NoSQL database

A key value NoSQL database used to describe a new generation of non-relational databases that use a key-value method to store data as a collection of key-value pairs in order to get fast lookup results on very large datasets. In this way, the key becomes the unique identifier. A key value NOSQL database is considered the simplest type of NoSQL databases.

A key value NoSQL database offers rapid data storage (writes) and information retrieval (reads) because of its simple data structure and lack of a predefined schema. It also has a high performance because of common use of integrated caching features that enables users to store and retrieve data very quickly. Because of its relative architectural simplicity, a key value NoSQL database can or scale out quickly in cloud environments without causing operational disruptions.

key

The unique identifier of a record in Aerospike, similar to how a primary key in an RDBMS identifies a single record in a table. By default, the key is not stored with the record to optimize storage. The key is the distinct (set, userKey) pair in a specified namespace. The userKey data type can be a string, integer or bytes (blob). For example in a namespace user_profiles, a specific user record can be identified by the key (eu-users, 'foo@gmail.com').

KNN (exact)

An exact k Nearest Neighbor search, which is an exhaustive search technique that gets the best results but can take a long time to perform.

Linearizability

Linearizability is one of the strongest single-object consistency models, and implies that every operation appears to take place atomically, in some order, consistent with the real-time ordering of those operations. This means that if operation A completes before operation B begins, then B should logically take effect after A. Linearizability is a single-object model, but the scope of an object varies. Some systems provide linearizability on individual keys in a key-value store; others might provide linearizable operations on multiple keys in a table, or multiple tables in a database—but not between different tables or databases, respectively. When you need linearizability across multiple objects, try strict serializability. When real-time constraints are not important, but you still want each process to observe a logical order (but allows other processes to observe a different logical order), try sequential consistency.

low latency algorithmic trading

Low latency algorithmic trading is a process for carrying out orders using automated and pre-programmed trading directions to account for different prices, timing and volume. Faster execution is achieved through low latency, which delivers data under a millisecond in order to make faster decisions.

There are many factors that can impact the low latency of algorithmic trading, such as the distance between the exchange and the trading system and the efficiency of the trading system architecture. This architecture might include network adaptors, the operating system choice, code efficiency and programming language.

Algorithmic trading is mostly used by institutional investors and big brokerage firms to reduce the expense associated with trading.

lut

Last Update Time of a namespace or set, measured in nanoseconds since the last update.

master record

The primary copy of a record in a namespace with a replication factor. For example, with a replication factor of 2, there is one master record and one replica. Writes occur on the master, which may also be referred to as the write-master, while the replica is known as the write-prole.

migration

When nodes are added or removed from a database cluster, data migrates between the remaining nodes. After migrations are complete, the data in the new cluster is evenly distributed. Migrations occur when the cluster topology changes, such as when a node is added or removed or during network issues. During migrations, record data moves as part of the partition it is mapped to via its key hash.

monitor

The monitor keeps track of the records locked and written by a transaction. If the client goes away without fully committing or aborting the transaction, the monitor steps in after the transaction timeout deadline to prevent it from dangling.

multi-cloud environment

A multi-cloud environment is where more than one cloud computing is used. It might be a combination of public, private or edge clouds. These clouds may be used in various combinations in order to distribute applications and services.

A multi-cloud environment, for example, might be used to speed up the delivery or transformation of apps. Or, an enterprise might consider spreading across different clouds so that it’s not dependent on one vendor. Some applications, such as logistics, retail and manufacturing may need to be distributed at the edge to be physically closer to users and deliver faster results or better customer experiences.

multi-model database

A multi-model database is a database management system that supports different data modeling approaches in a single storage engine. This provides a single backend database that can service a wider range of data processing and data retrieval tasks and use cases. This differs from most database management systems that are organized around a single data model, such as relational, key-value, wide-column, document, or graph, that decides how data is organized, stored and manipulated.

Aerospike is a multi-model database supporting key-value, wide-column, document and graph data modeling.

NewSQL

NewSQL is a relational database management system (RDMS) that aims to provide NoSQL system scalability while also maintaining the consistency of a traditional database system.

NewSQL combines ACID (atomicity, consistency, isolation and durability) compliance with horizontal scaling for online transaction processing workloads. Enterprise systems that handle data, such as financial and order processing systems, are too big for a traditional relational database. At the same time, these enterprise systems aren’t practical for NoSQL systems because they have transactional and consistency requirements. NewSQL provides the scale and reliability without requiring more infrastructure or development expenditures.

NewSQL uses SQL to ingest new information, execute transaction processing at a large scale, and change the contents of the database. The main categories of NewSQL include new architectures, transparent sharding middleware, SQL engines and database as a service (DBaaS).

NoSQL .NET database

A NoSQL .NET database means the database is written in .NET, which is a no-cost, open-source cross platform for building different applications. .NET enables different languages, editors and libraries to build for mobile, web, games and the Internet of Things (IoT).

For example, .NET can be written in languages of C#, F# and visual basic. Whatever the chosen language, .NET will run natively on any compatible operating system. This enables many different types of apps to be built. .NET also has a set of base class libraries and APIs that are common to all .NET applications.

.NET is popular with software developers and was built by Microsoft for building many different types of applications.

NoSQL database design

NoSQL database design is focused on how an application will query the data, rather than concentrating on the relationships within the data.

NoSQL database design stresses access patterns over abstract data models. That’s why best practices for NoSQL database design call for a graph of the ways that applications will query the data, and the necessary workload support.

NoSQL database design also looks at how often the dataset will be changed, how much data will be stored and the requirements for availability, performance and consistency.

NoSQL database design means choosing the right type of database for a certain application. These database types can be key-value stores, wide-column stores, document databases and graph databases.

NoSQL document database

A NoSQL document database is a NoSQL database that can store, retrieve, and manipulate document-oriented (or also known as semi-structured) information. Document databases are more efficient, intuitive, and flexible at handling this semi-structured data than relational models because relational databases must convert documents into relational tables (rows and columns) to store and manage them. Modern NoSQL document databases are also designed to scale out in server clusters and cloud infrastructure.

Instead of storing data in fixed rows and columns, document databases use flexible data models like JSON or JSON-like data structures. The semi-structured nature of document data means that every document object in the database can have a unique structure. This means that users can add new objects without changing the entire database. In addition, users can customize documents to have the same or different structures.

NoSQL graph database

A NoSQL graph database is designed to handle huge sets of structured, semi-structured or unstructured data. A NoSQL graph database can integrate heterogeneous data from a variety of sources and make links between different datasets. It does this by focusing on the relationships between different entities and then surmising new knowledge from information on hand.

The NoSQL graph database is more flexible than a relational database, and also considered more dynamic and less expensive. Its ability to handle massive loads of unstructured data that can come from areas such as the Internet of Things (IoT) is also considered an advantage.

namespace

A top-level data container in Aerospike. It is a physical collection of similar records within a storage engine that share common policies, such as replication factor, encryption, and storage type. A namespace is similar to a tablespace in an RDBMS. Aerospike database clusters contain one or more namespaces. Namespaces segregate data with different storage requirements. For example, some data may have high performance/low storage requirements more suitable for RAM, while other data can be stored on SSD storage. The Aerospike schemaless data model allows you to mix data types within a namespace. You can store data on users and URLs in the same namespace, and separate them using sets.

node

An Aerospike database cluster is made of one or more nodes. These are the individual servers that act together as a distributed database to fulfill client requests. Each node holds a portion of the data and contributes to the overall computing power of the cluster.

nsup

Namespace supervisor, the main server thread responsible for handling expirations and evictions within a namespace.

operational workload

Operational workload refers to an application’s ongoing work and what it is being asked to do. When considering an operational workload, issues that are considered include what data is being processed, how that data is processed and whether it is in a structured or unstructured environment.

Other considerations to determine an operational workload can include the data volume during a specific period, how much effort has to be put into it and the time it will take to repeat that effort. Operational workload will look at duty cycles and working set sizes to make these determinations.

Determining the correct operational workload is important in order to create a more effective design and operation, while also optimizing the workloads.

particle

Synonym for “data type” in documentation and messages referring to bins. For example, “Boolean particle” means “Boolean data type” in reference to bins.

policy

Policies control the behavior of individual operations against the database, such as reading records, performing read and write operations on distinct data types within the record. They also dictate the operational behavior of a namespace or the entire database node or cluster.

primary index

Primary key index (PI) is a set of 20-byte RIPEMD-160 hashes created from the set and identifier portion of the record key tuple (namespace, set, identifier). The first 12 bytes of this hash determine the partition within the namespace to which the record is assigned. The hash is stored in a hash table that links to a red-black tree data structure called a sprig, containing the data location metadata. When looking up a record using its primary key, a digest is created from the set and identifier, allowing the client to locate the partition and node. Once the request is received, the digest is used to find the record entry in the hash table and retrieve the metadata to access the full record.

pristine blocks

Blocks that have never been written to by Aerospike. Aerospike prioritizes writing to blocks that have been cleared by defragmentation before using pristine blocks. This improves cold start performance as unwritten blocks can be skipped during indexing.

provisional record

A record that was locked during a transaction.

query languages

Query languages are programming languages for searching a database or dataset, changing its contents, or retrieving information. ANSI SQL is the best known and most widely used query language, but the Big Data revolution introduced many more specialized query languages – especially for NoSQL databases. While early query languages required database expertise to use, the interfaces have evolved and made it possible for anyone to access database information.

The main types of query modes are the menu (choose from a prescribed list), the fill-in-the-blank technique (use keywords in the search feature) and the structured query. The structured query is often used with relational databases and has a formal syntax that is considered a programming language.

Another of the query languages is natural language, which is seen as the most flexible and is allowed in some commercial database management software. This natural query language looks for action words and synonyms, and identifies the names of file, records and fields.

query

A request for all records matching specific criteria. Queries can be performed against a primary index (key) or a secondary index (bin value). Primary index queries support read-only operations like fetching records from a namespace or set or those before a given last update time (LUT). They can also perform background read-write queries with UDF-defined actions. Secondary index queries locate records by bin values, with the number and location of matching records often unknown at the time of the query.

Retrieval-augmented Generation (RAG)

Retrieval-augmented Generation (RAG) is a model that combines retrieval-based and generation-based approaches to generate more accurate and informative responses by augmenting generative models with retrieved documents.

real time data analytics platforms

Real-time data analytics platforms give organizations a way to use their real-time data by enabling extraction of valuable information and trends. They also provide better analytics and visualization by connecting data sources.

By measuring data in real time, businesses can make decisions based on the latest information. Because real-time analytics was once time-consuming and expensive, it was used in only the most mission-critical cases. Now, the growth of real-time data from all kinds of connected devices, and the emergence of the cloud, means real-time analytics is much more accessible.

Real-time analytics platforms are serving a wide variety of industries. For example, the logistics industry can use it to track shipments and optimize routes. More organizations need data faster to enable better predictions in order to stay competitive in a hyper-connected and more competitive world.

real time data management

Real time data management entails real-time processing that handles workloads that continually fluctuate. For example, a stock market requires real time database management because it’s always changing. This is different from a traditional database with persistent data that isn’t usually affected by time.

Real time data management is also used in other industries such as banking, law, medical records, multimedia, accounting, reservation systems and scientific data analysis. These databases require speed so that data can be processed, results provided and immediate action taken. For example, an airport radar system needs data to be immediately processed so that it’s clear in real time where various airplanes are located.

real time database

A system using real-time processing to handle ever-changing workloads. Real-time databases are traditional databases that are used in fields such as banking, law, medical records, multimedia and science. A stock market is an example of a real-time database because it is dynamic and changes rapidly.

The term real-time database applies to databases that handle data streaming in real time, including in-memory data grids, in-memory databases, NewSQL databases, NoSQL databases and time-series databases.

One of the benefits of a real-time database is being able to store data that enriches streaming data. A real-time database also enables continuous queries to process ongoing events from people, apps and machines. Instead of the data growing stale, it can be used immediately.

real time web applications

Real-time web applications are apps that enable interactive usage by users, systems, or applications. They operate within a time frame of under a second, or even a millisecond, enabling users to get information as soon as they ask for it. This means that users do not have to check on the information themselves or rely on software to check periodically for updates.

Real-time web applications can be things like instant messaging, gaming, status updates, alerts, and dashboards. The term real-time is often debated

record block

The initial landing spot for an incoming written record. One record block holds only one record, though one record can span multiple blocks.

record

An object containing data identified by a single key, similar to a row in an RDBMS. Each record is stored in a partition and optionally in a set.

record/object

A record (or object) is similar to a row in an RDBMS. It is a contiguous storage unit for all the data uniquely identified by a single key. A record is subdivided into bins (like columns in an RDBMS)

replication factor (RF)

The number of copies of each record maintained in a namespace.

roll back

An abort of a transaction due to a failure of some sort. An abort rolls back the transaction by removing the provisional records (records that were locked during a transaction) Roll back is a component of a commit. Rolls are performed by either the client or the monitor. The last step of a commit is to delete the monitor record.

roll forward

A commit of the provisional records (records that were locked during a transaction). The roll forward command removes the older version. Roll forward is a component of a commit. Rolls are performed by either the client or the monitor. The last step of a commit is to delete the monitor record.

rw-hash

Replica Write hash, a structure used to park transactions that require coordination with another node before responding to the client. It is used for write transactions, read transactions during migrations, and for parking read transactions in strong consistency-enabled namespaces.

SC mode

Strong consistency mode ensures that while writes in progress can be reordered, the read mode determines the order in which the app observes the writes. Sequential will see a progressing record version order for each record but the order across records may differ from client to client. Linearizable ensures that all clients see the records progress in the same order. From the CAP theorem perspective, it represents Consistent and Partition-tolerant (CP) behavior, as opposed to Aerospike’s default Available and Partition-tolerant (AP) mode.

SQL distribution

SQL distribution means a single logical database is deployed on a cluster of servers in one or more data centers. The SQL distribution is known for strong consistency, high availability, resiliency and distributed use of data across different geographic environments.

SQL distribution provides a seamless developer and customer experience. For example, developers don’t have to worry about ACID (atomicity, consistency, isolation, durability) compliance or complex joins. Users can rely on better performance and scalability as data grows.

SQL distribution does have a strict schema and structured data.

sequential consistency

Sequential consistency is a strong safety property for concurrent systems. Informally, sequential consistency implies that operations appear to take place in some total order, and that that order is consistent with the order of operations on each individual process. A process in a sequentially consistent system may be far ahead of, or behind, other processes. For example, they may read arbitrarily stale state. However, once a process A has observed some operation from process B, it can never observe a state prior to B. This, combined with the total ordering property, makes sequential consistency a surprisingly strong model for programmers. In Aerospike, only cluster disruptions cause linearizability and sequential/session consistency to differ. While the cluster is stable without any migrations, they are indistinguishable from one another other than the linearizability performance penalty. When you need real-time constraints, such as a case where you want to tell some other process about an event using a side channel, and have that process observe that event, try linearizability. When a shared total order across all processes isn’t required, use sequential consistency.

serializability

Serializability is a transactional model. Transactions can involve several primitive sub-operations performed in order. Serializability guarantees that operations take place atomically where a transaction’s sub-operations do not appear to interleave with sub-operations from other transactions. It is also a multi-object property where operations can act on multiple objects in the system. Serializability applies not only to the particular objects involved in a transaction, but to the system as a whole. Also we provide strict serializability and currently do not allow relaxing this. We do not allow scans and queries as part of transactions.

server footprint

A server footprint is the amount of space – either physically or online – that computer hardware or software occupies. This might entail equipment such as servers, switchers, routers and storage in a facility. Software might include how much memory is required to run a program.

Server footprints are increasing with digital transformation and more devices supplying data or demanding online connections, such as streaming channels, online banking or smart vehicles. Some organizations are looking to reduce their server footprint to lesson the impact on the environment and save money.

service thread

A worker thread on a cluster node responsible for receiving client requests and executing transactions.

set index

A logical subset of records within a namespace. Set indexes reduce the number of full primary index scans needed to find a record. They are most effective for sets smaller than 1% of the namespace.

set

An optional method of logically grouping records within a namespace using a record attribute. Sets function like tables in an RDBMS but do not require a schema. A set is not a distinct storage unit, but instead it is a collection of records within a namespace. The namespace does have its own dedicated storage.

sindex

Secondary Index (SI) locates records within a namespace or set by a bin value. Each node builds its own sindex with references only to local data. A secondary index can include both master and replica records.

sprig

A memory-based binary tree data structure used by Aerospike to store and retrieve primary index data.

storage engine

The physical storage medium and the method by which data is written to the medium.

strong consistency

Referred to as SC mode in Aerospike Database. Guarantees that all writes to a single record will be applied in a specific, sequential order, and writes will not be re-ordered or skipped to ensure that data is not lost.

subset

A flag used during migrations to indicate that a partition is not yet full. This flag is removed when migration to the partition completes. For example, during an add-node operation, both the replaced and new partitions are marked as subsets until migration finishes.

supply-side platform

A supply-side platform is an advertising technology platform used by publishers to manage, sell and optimize their ad space (or inventory), using websites and mobile applications. These ads can be video or display ads.

Supply-side platforms connect directly to ad networks, data-management platforms, demand-side platforms and ad exchanges to sell ad inventory for websites and app owners.

Supply-side platforms are beneficial for publishers who may be managing complex and volatile programmatic ad purchases with different ad networks at one time. These platforms help ensure the various requirements and limitations for those ad networks are met.

Supply-side platforms may use advanced algorithms to predict which network provides the most effective results during a certain period.

system metadata

Also known as SMD, system metadata stores critical system information such as secondary indexes, user-defined function definitions, user permissions, and eviction data. It is typically located at /opt/aerospike/smd on the node.

tending

The process by which the client discovers the cluster’s addresses and maps partitions to nodes. Tending begins with a seed connection, where the client retrieves a list of cluster node addresses, partitions, and generations. The client regularly checks for partition updates and monitors socket usage.

time based graph database

A time based graph database (or time series database) is one that is built specifically for handling metrics, events or measurements that are time-stamped. It stores nodes and relationships instead of tables or documents.

Graph databases are often used in fraud detection and recommendation engines. A graph database helps determine relationships between potential purchasers, personal information such as an email address and what purchases the user is making similar to others with common interests.

Further information can be gathered from the time series, which can track measurements or events that are tracked and aggregated. Examples include clicks, trades in a market or application performance modeling. While financial data was one of the initial uses of a time series database, the focus has grown with sensors being included in everything from cars to microwaves to phones.

transaction ID, TxnID, TRID

In Aerospike, a transaction ID (TxnID or TRID) is the identifier for all the commands inside a distributed transaction. A TRID refers to an identifier the server return to the client after it launches a query, for the purpose of monitoring its progress.

transaction id / TR ID

An identifier returned by the node to the client for a query. In Aerospike versions post-6.0, this ID is returned immediately.

transactional workload

A transactional workload means that over time, the database is getting requests for data and various changes to that data from different users. The modifications that are made are known as transactions.

For example, a transactional workload is built to aid in transactions such as in banking or accounting systems. Relational databases such as MySQL were designed to handle transactional workloads. They can scale as needed, ensure transactional consistency and have quick, responsive queries.

tsvc

The transaction service (tsvc) subsystem implements the execution of read/write commands, including transactions, queries, and info commands. tsvc errors happen before records are accessed for reads or writes. They’re counted separately from tsvc timeouts.

UDF

A User-Defined Function (UDF) is code written by a developer that runs inside the Aerospike database server. UDFs can significantly extend the capability of the Aerospike Database engine in functionality and in performance. Aerospike currently only supports Lua as a UDF language.

write block

The storage location for record blocks, also called a streaming write buffer (swb) or wblock. A write block cannot span multiple records, and its size determines the record size limit. The default size is 1 MiB. Write blocks are flushed when full or after the flush-max-ms interval (default 1 second).

write queue

A temporary cache in RAM where write blocks are stored before being written to the storage engine.

xdr

Cross-Datacenter Replication, which asynchronously replicates records across high-latency network links. XDR can replicate full namespaces, sets within namespaces, or specific bins within records.