Primary index (PI) queries
Overview
This page describes how to query entire data sets in the Aerospike database without blocking other database operations.
Prior to Database 6.0, primary index (PI) queries were called scans and secondary index (SI) queries were called queries. See the Queries feature guide.
Common uses of PI queries
You can use PI queries to:
- Retrieve all or a specified count of records in a namespace or set. This is called a read-only PI query.
- Filter for records that have been updated since a specific Last Update Time (LUT).
- Do regular database maintenance by querying all records in a set or namespace and selectively updating records with a User-Defined Function (UDF) or an array of multi-ops. This is called a read-write PI query (or background query).
Applications can send query requests to all partitions in the cluster, to specific partitions, or to specific digests within partitions.
Read-only PI queries
A client application executes a command to start a read-only query. This command initiates parallel requests to each node in the cluster. As the query iterates through each partition, it returns the current version of each record to the client.
Many database tasks, such as index creation and backups, also use a data scan as the underlying mechanism.
Read-only PI queries have the following features:
- Filter records by set name.
- Filter records by Filter Expressions, such as last-update-time > X.
- Filter by count. The servers return the specified number of records. This can be used for pagination.
- Return only record digests and metadata (generation and TTL).
- Return specified bins.
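The features listed above can be illustrated with a small in-memory sketch. This is a toy model only, not the Aerospike client API: the Record class, the pi_query function, and all of their parameters are hypothetical names invented for this example, and partitions are modeled as plain Python lists.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List, Optional, Sequence


@dataclass
class Record:
    digest: str              # stand-in for the record digest
    set_name: str
    bins: Dict[str, object]
    generation: int
    ttl: int
    lut: int                 # last update time, in arbitrary units


def pi_query(
    partitions: Sequence[List[Record]],
    set_name: Optional[str] = None,
    predicate: Optional[Callable[[Record], bool]] = None,
    max_records: Optional[int] = None,
    bin_names: Optional[Sequence[str]] = None,
    metadata_only: bool = False,
) -> Iterator[dict]:
    """Toy read-only query: walk every partition and yield matching records."""
    returned = 0
    for partition in partitions:
        for rec in partition:
            # Filter by count: stop once the requested number was returned.
            if max_records is not None and returned >= max_records:
                return
            # Filter by set name.
            if set_name is not None and rec.set_name != set_name:
                continue
            # Filter by an expression, modeled here as a Python predicate.
            if predicate is not None and not predicate(rec):
                continue
            row = {"digest": rec.digest, "generation": rec.generation, "ttl": rec.ttl}
            if not metadata_only:
                # Return all bins, or only the requested ones.
                if bin_names is None:
                    row["bins"] = dict(rec.bins)
                else:
                    row["bins"] = {n: rec.bins[n] for n in bin_names if n in rec.bins}
            yield row
            returned += 1


# Hypothetical sample data: three partitions holding four records.
SAMPLE_PARTITIONS = [
    [Record("d1", "users", {"name": "a", "age": 30}, 1, 3600, 100),
     Record("d2", "orders", {"total": 5}, 2, 0, 150)],
    [Record("d3", "users", {"name": "b", "age": 41}, 3, 3600, 200)],
    [Record("d4", "users", {"name": "c", "age": 25}, 1, 3600, 300)],
]

# Records in the "users" set updated after time 150, returning only the "name" bin.
recent_users = list(pi_query(
    SAMPLE_PARTITIONS,
    set_name="users",
    predicate=lambda r: r.lut > 150,
    bin_names=["name"],
))
```

In this sketch the count limit is checked before the other filters so that the generator stops as soon as enough records have been returned, which mirrors the pagination use case in a simplified way.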
Background read-write query
A client application can also issue an asynchronous background query to the database and apply either a Lua UDF or a series of write multi-ops to each record. This is more efficient than the client-side query for cases where data needs to be manipulated. Multi-ops are typically more efficient than using Lua UDFs because the server doesn't need to translate internal objects to another language. Many client libraries also provide an API to poll for the completion of a background query.
Like read-only PI queries, read-write PI queries are often used for database maintenance, and can rely on arbitrary rules for grooming your data. For example, you can use a UDF to compare the last_visited value of a record to some specified date/time. If the value is too old, which implies that the record has not been updated for a long time, the application can delete the record. The application can apply a combination of such rules to fine-tune the query.
The application can also use generic grooming functions and pass parameters when the query executes. This approach is powerful because cleanup processing is done as close to the data source as possible.
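The grooming pattern described above can be sketched as follows. This is a toy in-memory model rather than Aerospike code: real UDFs are written in Lua and run on the server, while here the rule is a plain Python function. The names DELETE, groom_stale, background_query, and the record layout are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

DELETE = object()  # sentinel a grooming function returns to request deletion


def groom_stale(bins: dict, cutoff: int):
    """Hypothetical generic grooming rule, parameterized by a cutoff time."""
    if bins.get("last_visited", 0) < cutoff:
        return DELETE
    return bins


def background_query(
    partitions: List[Dict[str, dict]],
    set_name: str,
    udf: Callable,
    *args,
) -> Dict[str, int]:
    """Apply udf to every record of set_name, modifying the data in place."""
    stats = {"kept": 0, "deleted": 0}
    for partition in partitions:
        for digest in list(partition):  # copy the keys: records may be deleted
            record = partition[digest]
            if record["set"] != set_name:
                continue
            result = udf(record["bins"], *args)  # parameters passed at query time
            if result is DELETE:
                del partition[digest]
                stats["deleted"] += 1
            else:
                record["bins"] = result
                stats["kept"] += 1
    return stats


# Hypothetical sample data: digest -> record, grouped by partition.
GROOM_PARTITIONS = [
    {"d1": {"set": "users", "bins": {"last_visited": 100}},
     "d2": {"set": "orders", "bins": {"last_visited": 50}}},
    {"d3": {"set": "users", "bins": {"last_visited": 300}}},
]

# Delete "users" records whose last_visited value is older than 200.
groom_stats = background_query(GROOM_PARTITIONS, "users", groom_stale, 200)
```

Passing the cutoff as an argument keeps the rule generic, so the same function can be reused with different thresholds each time the query runs.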
Read-write PI queries have the following features:
- Filter records by set name.
- Filter records by Filter Expressions, such as last-update-time > X.
- Read/Update using a User-Defined Function (UDF).
- Update using write multi-ops; that is, operations on bins.
In the event of disruptions to the cluster, a background query might not process all of the records.
Quotas on PI queries
Rate quotas, introduced in Aerospike Enterprise Edition 5.6, can limit the disk I/O performed by a specific user, including their query operations. See Rate Quotas for more information.
Client application examples
The Aerospike Developer Hub contains client library code examples for all supported client libraries.
Related documentation
Historical evolution of scan features
Database 6.0 changes
- The query and scan subsystems were unified. Scans are deprecated in the clients, and the Query API handles both types, primary index (PI) queries (AKA scans) and secondary index (SI) queries.
Database 5.6 changes
- Optional configurable set indexes added to speed up PI queries of sets which are small compared to their namespace.
Database 4.9 changes
- Read/Write multi-op support added to background read-write PI queries.
- PI queries are issued per partition instead of per node. This resolves the issue where a scan could return duplicated records or not return some records during cluster state changes. Clients use the 'scan per partition' feature to make sure each partition is processed exactly one time, even during cluster change events where partition ownership shifts between cluster nodes.
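The idea behind issuing queries per partition can be sketched with a toy model. This is not how the client libraries are implemented; it only illustrates tracking progress per partition and retrying a partition whose owner changed. ToyCluster, query_partition, query_per_partition, and the on_round hook are hypothetical names invented for this example.

```python
from typing import Callable, Dict, List, Optional


class ToyCluster:
    """Minimal stand-in for a cluster: a partition map plus partition data."""

    def __init__(self, owners: Dict[int, str], data: Dict[int, List[str]]):
        self.owners = dict(owners)  # partition id -> owning node name
        self.data = data            # partition id -> records

    def query_partition(self, node: str, pid: int) -> Optional[List[str]]:
        """Return the partition's records, or None if node does not own it."""
        if self.owners[pid] != node:
            return None
        return list(self.data[pid])


def query_per_partition(
    cluster: ToyCluster,
    on_round: Optional[Callable[[int], None]] = None,
    max_rounds: int = 5,
) -> List[str]:
    """Process every partition exactly once, retrying after ownership shifts."""
    remaining = set(cluster.data)
    results: List[str] = []
    for round_no in range(max_rounds):
        if not remaining:
            break
        # The client's (possibly stale) view of the partition map.
        snapshot = {pid: cluster.owners[pid] for pid in remaining}
        if on_round is not None:
            on_round(round_no)  # hook to simulate a cluster change mid-query
        for pid, node in sorted(snapshot.items()):
            records = cluster.query_partition(node, pid)
            if records is None:
                continue  # ownership moved: retry next round with a fresh map
            results.extend(records)
            remaining.discard(pid)  # a finished partition is never re-queried
    if remaining:
        raise RuntimeError(f"partitions not processed: {sorted(remaining)}")
    return results


# Hypothetical 4-partition cluster on two nodes.
toy_cluster = ToyCluster(
    owners={0: "A", 1: "B", 2: "A", 3: "B"},
    data={0: ["r0"], 1: ["r1a", "r1b"], 2: ["r2"], 3: ["r3"]},
)


def shift_partition_2(round_no: int) -> None:
    # During the first round, partition 2 moves from node A to node B.
    if round_no == 0:
        toy_cluster.owners[2] = "B"


toy_results = query_per_partition(toy_cluster, on_round=shift_partition_2)
```

Because completion is tracked per partition rather than per node, the partition that moved is picked up in a later round from its new owner, and the partitions already finished are not read again.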
See Manage Scan for more information.