
Ingestion and indexing

This page highlights key concepts related to ingesting and indexing new data in Aerospike Vector Search (AVS). AVS updates vector records in real time, but index records are updated asynchronously. This means that a record can be retrieved as soon as it is written, but it may not appear in search results until its index records are updated.

Supported indexes - HNSW

AVS supports only the Hierarchical Navigable Small World (HNSW) index type. An HNSW index is a multi-layer graph in which nodes represent data points and edges connect each node to its nearest neighbors. Nearest neighbors are calculated according to the distance metric chosen for your index and a given search vector embedding. The neighborhood is the set of nodes closest to a given node within the graph. The world is the entire set of nodes and edges in the graph, representing the high-dimensional space. HNSW makes search efficient by navigating through hierarchical levels: higher levels contain fewer nodes, so a search can cover long distances quickly before refining its result in the denser lower levels.
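To make the layering concrete, the following minimal sketch shows the level-assignment rule from the original HNSW algorithm, in which each node's top layer is drawn from an exponentially decaying distribution. It is an illustration only and does not show AVS internals; the function names and the parameter `m` (maximum neighbors per node) are assumptions introduced here.

```python
import math
import random
from collections import Counter


def assign_level(m: int, rng: random.Random) -> int:
    """Draw the top layer for a new node, as in the original HNSW algorithm.

    Illustrative only: this is not taken from the AVS implementation.
    """
    level_multiplier = 1.0 / math.log(m)
    # 1.0 - rng.random() lies in (0, 1], so the logarithm is always defined.
    u = 1.0 - rng.random()
    return int(-math.log(u) * level_multiplier)


def level_histogram(num_nodes: int, m: int = 16, seed: int = 0) -> Counter:
    """Count how many nodes receive each top layer."""
    rng = random.Random(seed)
    return Counter(assign_level(m, rng) for _ in range(num_nodes))


if __name__ == "__main__":
    for level, count in sorted(level_histogram(10_000).items()):
        print(f"layer {level}: {count} nodes have this as their top layer")
```

Running the sketch shows most nodes living only in layer 0, with each higher layer holding a small fraction of the layer below it, which is what lets a search skip across the graph quickly at the top.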

Because each record affects other records in its neighborhood, AVS performs HNSW queries during ingestion to pre-hydrate the index cache. These queries are not reported as query requests, but they do show as reads against the storage layer.

Record data updates

Each insert or update in AVS is processed in several steps. First, record data, including the vector, is written to the Aerospike Database (ASDB), where it is visible immediately. To be indexed, a record must contain at least one vector in the vector field specified by an index. A record with multiple vectors covered by multiple indexes triggers a separate indexing process for each index, but it enables multiple search approaches on the same data.
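As a rough illustration of the rule that a record is indexed only when it carries a vector in an index's vector field, the sketch below checks one record against several index definitions. It does not use the AVS client; the dictionary keys `name` and `vector_field`, the field names, and the function `indexes_for_record` are all hypothetical.

```python
from typing import Any


def indexes_for_record(
    record: dict[str, Any], indexes: list[dict[str, str]]
) -> list[str]:
    """Return the names of indexes whose vector field is present in the record.

    `record` maps field names to values; each index is a dict with the
    hypothetical keys "name" and "vector_field".
    """
    matched = []
    for index in indexes:
        value = record.get(index["vector_field"])
        # Treat a non-empty list of numbers as a vector.
        if isinstance(value, list) and value and all(
            isinstance(x, (int, float)) for x in value
        ):
            matched.append(index["name"])
    return matched


# One record with two vectors, checked against three hypothetical indexes.
record = {
    "title": "example document",
    "text_embedding": [0.1, 0.2, 0.3],
    "image_embedding": [0.9, 0.8],
}
indexes = [
    {"name": "text_idx", "vector_field": "text_embedding"},
    {"name": "image_idx", "vector_field": "image_embedding"},
    {"name": "audio_idx", "vector_field": "audio_embedding"},
]
print(indexes_for_record(record, indexes))
```

Here the record would be picked up by the two indexes whose vector fields it contains and ignored by the third, so it incurs two indexing processes.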

tip

Aerospike recommends that when you upsert a record, you assign it to a specific set. This helps with monitoring and operations.

Index construction

A unique aspect of AVS is its ability to manage index construction across all AVS nodes concurrently. While vector record updates are committed directly to ASDB, index records are built asynchronously from items in an indexing queue. Items are processed in batches, and index construction is spread across all AVS nodes to maximize the use of CPU cores in your AVS cluster, allowing you to scale up for specific ingestion needs. Keep in mind that ingestion performance depends heavily on host memory and storage layer configuration.
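The batching described above can be pictured with a small single-process sketch: writes append to a queue and return, and a separate step later drains the queue in fixed-size batches. This is not how AVS distributes work across nodes; the function `drain_in_batches` and the batch size are assumptions chosen for illustration.

```python
from collections import deque
from typing import Iterator


def drain_in_batches(queue: deque, batch_size: int) -> Iterator[list]:
    """Yield items from the queue in batches of at most `batch_size`."""
    while queue:
        batch = []
        while queue and len(batch) < batch_size:
            batch.append(queue.popleft())
        yield batch


# Seven pending records, indexed three at a time.
indexing_queue = deque(f"record-{i}" for i in range(7))
for batch in drain_in_batches(indexing_queue, batch_size=3):
    print(f"indexing batch of {len(batch)}: {batch}")
```

Because the queue decouples writes from index work, the queue length at any moment is a direct measure of how far indexing lags behind ingestion.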

For each item in the indexing queue, AVS assembles the neighborhood for the vector and commits it to ASDB. An index record contains a copy of the vector itself along with the associated neighbors for that vector at a given layer of the HNSW graph. Index construction takes advantage of Advanced Vector Extensions (AVX), which allow single instruction, multiple data (SIMD) parallel processing.
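The shape of an index record described above (a vector copy plus its neighbors at one layer) can be sketched as a small data class. The class `IndexRecord`, its field names, and `build_index_record` are hypothetical, and the neighborhood here is assembled by brute-force distance comparison rather than by HNSW graph search, purely to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class IndexRecord:
    """Hypothetical index record: a vector copy plus neighbors at one layer."""

    key: str
    layer: int
    vector: list[float]
    neighbors: list[str] = field(default_factory=list)


def squared_euclidean(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_index_record(
    key: str,
    vectors: dict[str, list[float]],
    layer: int = 0,
    max_neighbors: int = 2,
) -> IndexRecord:
    """Assemble a neighborhood by brute force (not HNSW's graph search)."""
    others = [k for k in vectors if k != key]
    others.sort(key=lambda k: squared_euclidean(vectors[key], vectors[k]))
    # Store a copy of the vector, mirroring the description in the text.
    return IndexRecord(key, layer, list(vectors[key]), others[:max_neighbors])


vectors = {
    "a": [0.0, 0.0],
    "b": [1.0, 0.0],
    "c": [0.0, 2.0],
    "d": [5.0, 5.0],
}
print(build_index_record("a", vectors))
```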

tip

You can monitor index construction with two metrics: indexing_queue_size tracks your ingest queue, and requests_metric tracks your total indexed records.

Index healing

During HNSW index construction, it is important to rebalance the neighborhood graphs regularly. The index healing process runs periodically in the background and performs the following key functions:

  • Account for AVS node outages: Since each node holds in memory a queue of records to be indexed, the healer ensures that those records are indexed, even if an outage occurs.

  • Rebalance the graph: To maintain index quality, it might be necessary to rebalance or rebuild parts of the graph periodically, especially if the addition of new items significantly alters the data distribution. This process, however, can be resource-intensive and might not always be feasible.

  • Index garbage collection: The healer is responsible for deleting index records in the Aerospike Database to free up storage.

tip

You can configure defaults for the healer at the node level and override those configurations at the index level. You can monitor healer performance using the healer_cycle metric.
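The precedence between node-level defaults and index-level overrides can be illustrated with a simple dictionary merge. The setting names below are invented for the example and are not actual AVS configuration keys; consult the AVS configuration reference for the real ones.

```python
def effective_healer_config(node_defaults: dict, index_overrides: dict) -> dict:
    """Merge node-level defaults with index-level overrides; overrides win."""
    return {**node_defaults, **index_overrides}


# Hypothetical setting names, for illustration only.
node_defaults = {"schedule_seconds": 3600, "parallelism": 4}
index_overrides = {"schedule_seconds": 600}
print(effective_healer_config(node_defaults, index_overrides))
```

In this sketch the index keeps the node's parallelism but runs on its own, shorter schedule.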

Waiting for index construction

AVS constructs an index asynchronously as a background process, and in some circumstances it can be helpful to wait for index construction to complete. This functionality is built into the client, and you can also track progress using the indexing_queue_size metric or by reviewing the UNMERGED property of the asvec index list command. Waiting for index construction is important when you want to confirm the recall quality of search results, because records that are not yet indexed cannot appear in those results.
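If you read the queue size yourself rather than relying on the client's built-in wait, the idea reduces to polling with a timeout. The sketch below assumes a hypothetical callable `get_queue_size` that returns the current value of the indexing queue metric; how you obtain that value depends on your monitoring setup and is not shown.

```python
import time
from typing import Callable


def wait_for_empty_queue(
    get_queue_size: Callable[[], int],
    timeout_seconds: float = 60.0,
    poll_interval_seconds: float = 1.0,
) -> bool:
    """Poll until the reported queue size is zero; return False on timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        if get_queue_size() == 0:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_seconds)


# Stand-in for a real metric reader: reports a shrinking queue.
remaining = iter([3, 2, 1, 0])
print(wait_for_empty_queue(lambda: next(remaining), poll_interval_seconds=0.01))
```

Returning a boolean rather than raising on timeout lets a test harness decide whether to proceed with a recall measurement or to report that indexing has not caught up.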