Ingestion and indexing
Overview
This page highlights key concepts related to ingesting and indexing new data in Aerospike Vector Search (AVS). AVS updates vector records in real time, but index records are updated asynchronously. This means that records may be available for retrieval right away, but the same records may not appear immediately in search results.
Supported indexes - HNSW
AVS supports only the Hierarchical Navigable Small World (HNSW) index type. HNSW constructs a multi-layer graph in which nodes represent data points and edges connect each node to its nearest neighbors. Nearest neighbors are calculated according to the distance metric chosen for your index; at query time, the same metric is applied to the search vector embedding. The neighborhood is the set of nodes closest to a given node within the graph. The world is the entire set of nodes and edges in the graph, representing the high-dimensional space. HNSW makes search efficient by navigating through hierarchical levels: higher levels contain fewer nodes, so a search can narrow in on the right region of the graph before descending to the denser lower levels.
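The claim that higher levels contain fewer nodes follows from how HNSW assigns each new node a top layer. The sketch below uses the level-assignment rule from the original HNSW algorithm, not AVS internals. The function names, the parameter `m`, and its value of 16 are illustrative assumptions.

```python
import math
import random
from collections import Counter


def assign_level(m: int, rng: random.Random) -> int:
    """Draw the top layer for a new node using the standard HNSW rule.

    level = floor(-ln(U) * mL), with U uniform in (0, 1] and mL = 1 / ln(m).
    """
    level_multiplier = 1.0 / math.log(m)
    u = 1.0 - rng.random()  # shift [0, 1) to (0, 1] so log() never sees 0
    return int(-math.log(u) * level_multiplier)


def level_histogram(num_nodes: int, m: int = 16, seed: int = 7) -> Counter:
    """Count how many nodes have each level as their top layer."""
    rng = random.Random(seed)
    return Counter(assign_level(m, rng) for _ in range(num_nodes))


if __name__ == "__main__":
    for level, count in sorted(level_histogram(100_000).items()):
        print(f"top layer {level}: {count} nodes")
```

Under this rule, the probability that a node reaches layer 1 or higher is 1/m, so each layer holds roughly a factor of m fewer nodes than the one below it.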
Because each record affects other records in its neighborhood, AVS performs HNSW queries during ingestion to pre-hydrate the index cache. These queries are not reported as query requests, but they do show as reads against the storage layer.
Record data updates
Each insert or update in AVS is processed in several steps. First, record data, including the vector, is written to the Aerospike Database (ASDB). You can see your record data immediately in ASDB. To be indexed, each record must contain at least one vector in the vector field specified by an index. Specifying multiple vectors and indexes creates multiple indexing processes for a single record, but it enables multiple search approaches on the same data.
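The indexing rule above can be illustrated with a small sketch. It is not the AVS client API: the record layout, the index names, and the `matching_indexes` helper are all hypothetical, and the check only models the rule that a record is picked up by an index when it carries a vector in that index's vector field.

```python
from typing import Any, Dict, List


def matching_indexes(record: Dict[str, Any], index_fields: Dict[str, str]) -> List[str]:
    """Return the names of indexes whose vector field is populated in the record.

    index_fields maps a (hypothetical) index name to the record field it indexes.
    """
    matches = []
    for index_name, vector_field in index_fields.items():
        value = record.get(vector_field)
        is_vector = (
            isinstance(value, list)
            and len(value) > 0
            and all(isinstance(x, (int, float)) for x in value)
        )
        if is_vector:
            matches.append(index_name)
    return matches


if __name__ == "__main__":
    record = {"title": "red shoe", "text_vec": [0.1, 0.2], "image_vec": [0.3, 0.4]}
    indexes = {"text_idx": "text_vec", "image_idx": "image_vec", "audio_idx": "audio_vec"}
    # Two indexes apply, so this one record would be indexed twice.
    print(matching_indexes(record, indexes))
```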
Aerospike recommends that when you upsert a record, you assign it to a specific set. This helps with monitoring and operations.
Index construction
A unique aspect of AVS is its ability to manage index construction across all AVS nodes concurrently. While vector record updates are committed directly to ASDB, index records are processed asynchronously by working with items from an indexing queue. This is done in batches, and index construction is spread across all AVS nodes to maximize the use of CPU cores in your AVS cluster, allowing you to scale up for specific ingestion needs. Keep in mind, ingestion is highly dependent on host memory and storage layer configuration.
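The split between immediate record writes and deferred, batched indexing can be modeled with a toy example. This is a single-process sketch under simplifying assumptions, not how AVS is implemented: the `ToyIngest` class and its attributes are hypothetical stand-ins for ASDB storage, the indexing queue, and the searchable index.

```python
import queue
from typing import Dict, List, Set


class ToyIngest:
    """Toy model: record writes are immediate, indexing is deferred."""

    def __init__(self) -> None:
        self.records: Dict[str, List[float]] = {}  # stands in for record storage
        self.pending: queue.Queue = queue.Queue()  # stands in for the indexing queue
        self.indexed: Set[str] = set()             # keys visible to search

    def upsert(self, key: str, vector: List[float]) -> None:
        self.records[key] = vector  # readable by key right away
        self.pending.put(key)       # indexing happens later

    def index_next_batch(self, max_batch: int) -> List[str]:
        """Index up to max_batch queued keys and return them in queue order."""
        batch: List[str] = []
        while len(batch) < max_batch:
            try:
                batch.append(self.pending.get_nowait())
            except queue.Empty:
                break
        self.indexed.update(batch)
        return batch


if __name__ == "__main__":
    ingest = ToyIngest()
    for i in range(5):
        ingest.upsert(f"k{i}", [float(i)])
    print("stored:", len(ingest.records), "indexed:", len(ingest.indexed))
    ingest.index_next_batch(2)
    print("queue size after one batch:", ingest.pending.qsize())
```

In this model, the queue size plays the same role as the queue depth you would watch during ingestion: it grows with writes and shrinks as batches are indexed.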
For each item in the indexing queue, AVS processes the vector for indexing by assembling its neighborhoods and committing them to ASDB. An index record contains a copy of the vector itself along with the associated neighbors for that vector at a given layer of the HNSW graph. Index construction takes advantage of Advanced Vector Extensions (AVX), which allow for single instruction, multiple data (SIMD) parallel processing.
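To make the shape of an index record concrete, the sketch below builds one for a single layer. It is illustrative only: the `IndexRecord` fields are a hypothetical layout, not the actual ASDB schema; Euclidean distance is a stand-in for whichever metric the index uses; and the neighbors are found by a brute-force scan rather than the greedy graph traversal that HNSW performs.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class IndexRecord:
    """Hypothetical index record: a vector copy plus its neighbors at one layer."""
    key: str
    layer: int
    vector: List[float]
    neighbors: List[str]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_index_record(
    key: str, vectors: Dict[str, List[float]], layer: int, max_neighbors: int
) -> IndexRecord:
    """Assemble the neighborhood of `key` by scanning every other vector."""
    target = vectors[key]
    others = [k for k in vectors if k != key]
    others.sort(key=lambda k: euclidean(target, vectors[k]))  # closest first
    return IndexRecord(key, layer, list(target), others[:max_neighbors])


if __name__ == "__main__":
    vectors = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 3.0], "d": [5.0, 5.0]}
    print(build_index_record("a", vectors, layer=0, max_neighbors=2))
```

Because adding a vector can change which nodes are closest to its neighbors, a real index must also update the neighbor lists of nearby records, which is why each ingested record affects others in its neighborhood.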
You can monitor index construction with two metrics: indexing_queue_size, which tracks your ingest queue, and requests_metric, which tracks your total indexed records.
Index healing
During HNSW index construction, it is important to rebalance the neighborhood graphs regularly. The index healing process runs periodically in the background and performs the following key functions:
Account for AVS node outages: Since each node holds in memory a queue of records to be indexed, the healer ensures that those records are indexed, even if an outage occurs.
Rebalance the graph: To maintain index quality, it might be necessary to rebalance or rebuild parts of the graph periodically, especially if the addition of new items significantly alters the data distribution. This process, however, can be resource-intensive and might not always be feasible.
Index garbage collection: The healer is responsible for deleting index records in the Aerospike Database to free up storage.
You can configure defaults for the healer at the node level and override those configurations at the index level. You can monitor healer performance using the healer_cycle metric.
Waiting for index construction
AVS constructs an index asynchronously as a background process, and in some circumstances it can be helpful to wait for index construction to complete. This functionality is built into the client, and you can monitor it using the indexing_queue_size metric or by reviewing the UNMERGED property of the asvec index list command. Waiting for index construction is important when you want to confirm the recall quality of search results, because records that are not yet indexed cannot appear in those results.
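If you want to script the wait yourself, a generic polling loop is one option. The sketch below does not use the AVS client's built-in wait functionality and makes no assumption about its API. The `get_unmerged` callable is a hypothetical hook that you would back with whatever source you have available, such as the indexing_queue_size metric or the UNMERGED value reported by asvec index list.

```python
import time
from typing import Callable


def wait_for_index(
    get_unmerged: Callable[[], int],
    timeout_s: float = 60.0,
    poll_s: float = 0.5,
) -> bool:
    """Poll until get_unmerged() reports 0 or the timeout elapses.

    Returns True if the backlog drained in time, False otherwise.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if get_unmerged() == 0:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


if __name__ == "__main__":
    # Simulated backlog that drains over three polls.
    backlog = iter([3, 1, 0])
    print(wait_for_index(lambda: next(backlog), timeout_s=5.0, poll_s=0.01))
```

Returning a boolean rather than raising on timeout leaves the caller free to decide whether to run a recall check anyway or to report the index as still building.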