Aerospike Vector is now live with LangChain integrationTell me more
Blog

Query JSON Documents Faster (and More) with New CDT Indexing

neel-phadnis-20ed9332d626250e23b8226a5208e50c
Neel Phadnis
Director - Developer Ecosystem
September 1, 2022|10 min read

Source: Photo by Cameron Ballard on Unsplash

The Collection Data Types (CDTs) in Aerospike are List and Map. They offer powerful capabilities to model and access your data for speed-at-scale. A major use of the CDTs is to store and process JSON documents efficiently. In the recent Aerospike Database 6.1 release, secondary index capabilities over the CDTs have been enhanced to make the CDTs even more useful and powerful for JSON documents in addition to other uses.

Understanding Context-Path and Path Specifiers

A CDT element is identified by its context-path. A CDT element’s context-path is defined as the path from the root to the element.

A context-path is very similar to JSONPath in a JSON document, but differs from JSONPath in some respects. Like JSONPath, a context-path describes the path from the root or the top level of the CDT to a nested element.

While a node in a JSONPath is an index in an array or a field in a map (object), each node in a context-path, on the other hand, uniquely identifies an element at that level by one of the following specifiers:

  • Index (physical position, 0 indexed)

  • Key (applicable only to a Map)

  • Rank (relative value position, with 0 being the lowest and -1 being the highest)

  • Value (the first element with that value)

So a context path is a concatenation of specifiers that identify path nodes. For example, consider a Map:

{k1”: 1, “k2”: 2, “k3”: [11, 12, 13], “k4”: {“k11”: 11, “k22”: 22}}

The context path for the value 22 is: By-Key(“k4”), By-Key(“k22”) or By-Key(“k4”), By-Value(22). The JSONPath for it is: $.k4.k22.

Consider another nested object represented as a Map at the top level (level 0): it has a List at level 1, and a Map at level 2.

Object = {  “id1”: [ {“a”: 1, “b”: 2}, {“c”: 3, “d”: 4} ],
            “id2”: [ {“e”: 5, “f”: 6}, {“g”: 7, “h”: 8}] }

A context path to the nested element “c”, can look like: By-Key(“id1”), By-Index(1), By-Key(“c”). The JSONPath of “c” in the corresponding JSON document looks very similar: $.id1[1].c.

Note, however, that “c” can be reached using other context-paths, such as, By-Index(0), By-Rank(1), By-Value(3). There are no alternative contiguous JSONPaths to “c”. Also, a JSONPath can skip a node and can point to more than one element in a JSON document. For example, $.id1..c will point to all “c” nodes below “id1”. A context-path does not allow an interim node to be skipped and cannot point to more than one element in a CDT. So a similar construct By-Key(“id1”)..By-Key(“c”) is not supported.

New CDT Indexing Capabilities

In a nutshell, in Aerospike Database 6.1 and later, any CDT element can be indexed irrespective of their nesting level. Specifically, there are two main new capabilities to highlight:

  1. Elements in a CDT can now be indexed based on their "index” (meaning physical position of the element in CDT), key, rank, or value. So it is now possible to create a secondary index, say, on the element at rank -1 (the highest value) of a List, so that the equality and range queries for the highest value in the List across records can be efficiently executed. For example, retrieve all users that have a personal best score greater than 100 from their lifetime scores List.

  2. Embedded List or Map can now be indexed, So in a List bin with value

    [1, 2, 3, “s1”, “s2”, “s3”, [11, 12 13]]

    , the embedded List

    [11, 12, 13]

    can now be indexed. Many complex objects, especially JSON documents, have deep hierarchies. Now, the elements below the top level can be indexed and efficiently queried on. So in a Map bin

    {k1”: 1, “k2”: 2, “k3”: [11, 12, 13], “k4”: {“k11”: 11, “k22”: 22}}

    , a secondary index can be created on the List

    “k3”

    , the Map

    “k4”

    , as well as all the elements within them. For example, this makes it possible to retrieve all records with a value 31 in the

    “k3”

    list or records with the value of

    “k11”

    in the range 10-20.

The following types of secondary indexes can be created on a CDT element. Note, since a given CDT element can hold a value of any type, only values of the specified type are indexed.

  • Non-collection type: Index an element for values of one of the following types.

    • Integer

    • String

    • Geospatial

  • List collection type: Index List values of an element. All values within a List of one of the following types are indexed.

    • Integer List values

    • String List values

    • Geospatial List values

  • Map collection type: Index Map values of an element. All Map keys or Map values of one of the following types are indexed.

    • Integer Map keys

    • String Map keys

    • Integer Map values

    • String Map values

    • Geospatial Map values

Also new in 6.1 is the ability to index all namespace records, in addition to the previously supported set specific indexing. Thus, an index can be specified on a CDT element across the namespace for querying records from the entire namespace.

Many Indexes on Same Element

In many cases, it is required to create multiple indexes on the same CDT element.

Multiple Types

Multiple indexes of different types are allowed on the same context-path in order to index the respective data type values.

As CDTs do not conform to a schema, an element can be of any type. A secondary index is defined for values of a specific type, and only considers values of that type. Other type values at that context path will be simply ignored. Thus, an integer index at a Map key or List rank will only index integer values at that path. So in a record with List [1, 2, 3, “s1”, “s2”, “s3”], in order to select the record on equality query on “s3”, a string index must be present on all List string values or a specific string element using index, rank, or value in the List.

JSON documents are saved as CDTs, but are simpler as they have single type values in a List (numeric or string) and Map keys (string), and therefore will need only one index value type on these collection types. Map values, however, can be mixed (string or numeric).

Multiple Paths

The same element in a CDT can be arrived at in multiple ways via different context paths using different specifiers. For example, an element at Map index X may also be the rank Y and key Z. Multiple indexes of the same value type therefore are possible on the same static element. However, they are semantically different.

Important: As values in a CDT change, the context paths that were pointing to the same element may no longer point to one or the same element. Queries based on indexes defined using different context paths pointing to the same static element in one CDT can yield different results. The application should define the indexes and queries carefully with data and application semantics in mind.

Querying Values Across Multiple Elements

In some cases, a List or Map is not readily available for indexing, as the elements may be distributed in multiple places in the CDT. An example is a List of Maps, where we want to find out if a specific key in any Map has a specific value. Such an array (List) of objects (Maps) is a common occurrence in JSON documents, and querying a specific object field for a value may be necessary. For a concrete example, consider the JSON document of Nobel laureates in a year and category :

{
  "year" : "2021",
  "category" : "chemistry",
  "laureates" : [ {
     "id" : "1002",
     "name" : "Benjamin List"
    }, {
     "id" : "1003",
     "name" : "David MacMillan"
  } ]
}

In order to issue a query “find the Nobel prize(s) by the winner’s name”, the name field in all objects in the laureates List need to be indexed across such records.

While such a collection type is not directly supported for indexing, the following solution can be implemented: Create a separate List of names of laureates in the record, and index that List.

laureate_names_in_record: ["Benjamin List", "David MacMillan", ..]

It is then possible to find the Nobel prize(s) by the winner's name.

For data that does not change, such as the names of Nobel laureates in the above example, this is relatively straightforward. In order to index values in a record that can change, the indexed List must be kept in sync with the changing values. Depending on where the values are in the CDT and how values are updated, it may be possible to update a value and its index List entry together atomically using the multi-op CDT operations.

Examples

In order to make it clearer, we will describe with examples. You can follow along in the interactive notebook CDT Indexing.

Examples are shown for two categories of CDT index below:

  1. Non-collection index

  2. Collection index

    • LIST (all values in the List) for a List, or

    • Either MAP_KEYS (all keys) or MAP_VALUES(all values) for a Map

The CDT model is a superset JSON, and therefore the examples described in CDT terminology below are also applicable to JSON documents.

Non-collection Index

A non-collection index supports equality queries on integer and string values with equal filter, and range queries on integer values with the range filter.

Equality queries

Get records with a specific integer or string value:

  • At a specific index or rank position of a List or a Map

    • records with 100/“ABC” at index 0 of a list or a map

    • records with 100/”ABC” at rank -1 (highest value) of a list or a map

  • At a specific key position of a Map

    • records with 100/“ABC” at key XYZ of a map

Range queries

Range queries are supported on integer values only.

Get records having an integer value within a range:

  • At a specific index or rank position of a List or a Map

    • records with value in range 1-100 at index 0 of a list or a map

    • records with value in range 1-100 at rank -1 (highest value) of a list or a map

  • At a specific key position of a Map

    • records with value in range 1-100 at key XYZ of a map

Collection Index

A collection index supports equality queries on integer and string values with the contains filter, and range queries on integer values with the range filter. The collection type is LIST, MAP_KEYS, or MAP_VALUES in the createIndex Java API.

Equality queries

Get records with a specific integer or string value.

  • In a List

    • records with a list containing 100/“ABC”

  • In a Map’s keys

    • records with a map containing 100/“ABC” as a key

  • In a Map’s values

    • records with a map containing 100/“ABC” as a value

Range queries

Range queries are supported on integer values only.

  • In a List

    • records with a list containing a value in range 1-100

  • In a Map’s keys

    • records with a map containing a key in range 1-100

  • In a Map’s values

    • records with a map containing a value in range 1-100

Takeaways

In Aerospike Database 6.1+, any CDT element can be indexed irrespective of their nesting level. Elements in a CDT can also be indexed based on their position, key, rank, or value. A CDT element can have multiple context-paths with different semantics, and therefore the application should carefully determine the correct context-path while defining an index, with data and application semantics in mind. When values that need to be indexed are not available in one List or Map, consider replicating the values in a separate List for defining an index, and keep the indexed List in sync with the values. View and work on the examples in the notebook CDT Indexing and the Aerospike Sandbox.