Version: Graph 2.4.2

Supernodes

Overview

This page describes what supernodes are in the Aerospike Graph Service (AGS) and how to designate and manage them.

What is a supernode?

A supernode is a vertex with a disproportionately high number of incoming or outgoing edges. The number of edges at which a vertex becomes a supernode depends on the storage engine configuration of the Aerospike database associated with AGS, and on the value of the max-record-size configuration option.

Under the Hybrid Memory Model (the default for Aerospike database namespaces):

  • If max-record-size is set to 1MiB (the default), any vertex with approximately 6,500 or more edges is a supernode.

  • If max-record-size is set to 128KiB, any vertex with approximately 800 or more edges is a supernode.

For in-memory namespaces:

  • If max-record-size is set to 8MiB, any vertex with approximately 50,000 or more edges is a supernode.
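
The thresholds listed above can be collected into a small lookup, for example to flag likely supernodes in a data-modeling script before loading. The following is a minimal sketch in plain Java. The class SupernodeThresholds and its methods are hypothetical and are not part of AGS, and the numbers are only the approximate values listed on this page. AGS performs its own supernode detection.

```java
import java.util.Map;

/** Hypothetical helper (not part of AGS): approximate supernode thresholds from this page. */
final class SupernodeThresholds {
    static final long KIB = 1024L;
    static final long MIB = 1024L * KIB;

    // Hybrid Memory Model namespaces: max-record-size in bytes -> approximate edge threshold.
    private static final Map<Long, Integer> HYBRID = Map.of(
            1 * MIB, 6_500,
            128 * KIB, 800);

    // In-memory namespaces: max-record-size in bytes -> approximate edge threshold.
    private static final Map<Long, Integer> IN_MEMORY = Map.of(
            8 * MIB, 50_000);

    /** Returns the approximate threshold, or throws if this page lists none for the configuration. */
    static int approximateThreshold(boolean inMemory, long maxRecordSizeBytes) {
        Integer threshold = (inMemory ? IN_MEMORY : HYBRID).get(maxRecordSizeBytes);
        if (threshold == null) {
            throw new IllegalArgumentException("No documented threshold for this configuration");
        }
        return threshold;
    }

    /** True when the edge count meets or exceeds the approximate threshold. */
    static boolean isLikelySupernode(long edgeCount, boolean inMemory, long maxRecordSizeBytes) {
        return edgeCount >= approximateThreshold(inMemory, maxRecordSizeBytes);
    }

    public static void main(String[] args) {
        // A vertex with 7,000 edges in a hybrid-memory namespace with the default 1MiB record size.
        System.out.println(isLikelySupernode(7_000, false, 1 * MIB));
    }
}
```

Because the thresholds are approximate, a check like this is best treated as a modeling hint rather than a guarantee of how AGS will store a given vertex.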

Because supernodes are connected to so many other vertices in the graph, traversing them can cause performance problems. Allowing supernodes to exist and, more importantly, traversing over them should be a conscious decision when modeling your data.

Designating supernodes

AGS makes a clear distinction between regular vertices and supernodes. Regular vertices maintain inline record edge lists (adjacency lists) to optimize database lookups and improve query performance. Supernodes are maintained in multi-record edge lists that allow for lazy composition of edges as necessary for traversal.

The ~supernode flag is a virtual property that denotes that a vertex is or will become a supernode. Set it manually when you know that a vertex will be a supernode, so that AGS doesn't populate the inline record edge lists for a vertex that cannot use them efficiently.

In addition to the ~supernode flag being set manually, AGS automatically assigns this flag to vertices that become supernodes through the addition of many edges.

Example

The following example demonstrates how to set the ~supernode flag on a newly created vertex. The vertex with the label test is internally marked as a supernode and the record edge list is ignored.

Vertex v = g.addV("test").next();
g.V(v.id()).property("~supernode", true).iterate();

Considerations

  • Once the ~supernode flag is set, it cannot be unset. It remains in place for the duration of the life of the vertex.

  • You cannot read the value of the ~supernode flag. If you read it back, it returns nothing.

  • The value you assign to the flag doesn't matter. Any value assigned to the ~supernode flag is treated as true.

Filtering out supernodes

When composing queries, you can check whether a vertex has the ~supernode property in order to filter supernodes out. The property has no value, so you can only check for its existence.

g.V().hasNot("~supernode").outE().inV() // <- Correct usage
g.V().has("~supernode", false).outE().inV() // <- Incorrect
g.V().hasNot("~supernode", true).outE().inV() // <- Also incorrect

Traversing the edges of supernodes

To optimize query performance, include property filters when traversing the edges of supernode vertices. These filters narrow the query scope, which reduces the amount of data retrieved from the storage layer into AGS.

Example:

g.V().hasLabel("potentialSupernodes").outE().has("propertyFoo", "valueFoo").inV()

AGS supports equality comparisons on strings and numbers (integers and longs):

  • P.eq (=)

and the following comparison operators for numbers:

  • P.gt (>)
  • P.gte (>=)
  • P.lt (<)
  • P.lte (<=)

Limits on compound predicates

Compound predicates such as within, between, and and/or degrade query performance. Wherever possible, replace a single .has() step that uses a compound predicate with multiple .has() steps that each use a single predicate. The chained form results in better query performance.

g.V().hasLabel("potentialSupernodes").outE().has("foo", P.between(1, 5)).inV() // Compound predicate - optimization won't apply
g.V().hasLabel("potentialSupernodes").outE().has("foo", P.gte(1)).has("foo", P.lt(5)).inV() // Equivalent single predicates - optimized
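
When expanding a compound predicate, it helps to confirm that the chained single predicates select exactly the same values. The sketch below models the standard TinkerPop semantics with plain Java predicates: P.between(first, second) includes first and excludes second, so it matches P.gte(first) followed by P.lt(second), whereas P.within(...) matches only the listed values and has no range equivalent. The class PredicateExpansion is hypothetical and does not call TinkerPop or AGS.

```java
import java.util.Set;
import java.util.function.Predicate;

/** Plain-Java model of the predicate semantics; illustrative only. */
final class PredicateExpansion {
    // Models P.between(lo, hi): lo is inclusive, hi is exclusive.
    static Predicate<Integer> between(int lo, int hi) {
        return x -> x >= lo && x < hi;
    }

    // Models chained .has("foo", P.gte(lo)).has("foo", P.lt(hi)): both filters must pass.
    static Predicate<Integer> gteThenLt(int lo, int hi) {
        Predicate<Integer> gte = x -> x >= lo;
        Predicate<Integer> lt = x -> x < hi;
        return gte.and(lt);
    }

    // Models P.within(values...): membership in the listed values only.
    static Predicate<Integer> within(Integer... values) {
        Set<Integer> allowed = Set.of(values);
        return allowed::contains;
    }

    public static void main(String[] args) {
        // Print how each predicate treats the values 0 through 6.
        for (int x = 0; x <= 6; x++) {
            System.out.printf("%d between=%b chained=%b within=%b%n",
                    x, between(1, 5).test(x), gteThenLt(1, 5).test(x), within(1, 5).test(x));
        }
    }
}
```

The between and chained columns agree for every value, while within(1, 5) accepts only 1 and 5.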

Log warning for supernode traversals

An unoptimized query traversing over supernodes may result in drastically degraded performance because the query may retrieve a large number of outgoing/incoming vertices from the database.

To assist in diagnosing performance issues, AGS logs queries in which a traversal encounters one or more supernodes and identifies the ID of the first supernode.

Example log entry:

12:00:00.000 [main] WARN  c.a.f.p.t.step.util.TraversalUtil - The traversal, ".V().hasLabel("potentialSupernode").outE()", walks over the Edges of an existing supernode in the Graph which may cause unexpected performance.
Consider adjusting the traversal to filter out supernode Vertices or adding filters to the Edges if required.
ID of first supernode Vertex encountered by this traversal: "supernode1"

Note: The log does not record every occurrence of an identical supernode traversal. After the initial warning, AGS logs another warning for the same traversal only after 10 additional, different supernode traversals have occurred.

Supernode traversal warnings in the server log are enabled by default. To turn them off, set the aerospike.graph.log.supernode.warning configuration option to false.
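
The warning throttling described in this section can be pictured with a small model. The sketch below is an assumption about one way to produce that behavior, not the AGS implementation: it remembers the 10 most recently warned traversals in insertion order and warns again for a traversal only after it has been pushed out by 10 other distinct traversals. The class SupernodeWarningThrottle and the capacity parameter are hypothetical.

```java
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

/** Illustrative model of warning throttling; not the AGS implementation. */
final class SupernodeWarningThrottle {
    private final int capacity;
    // Insertion-ordered set of traversals that were warned about recently.
    private final Set<String> recentlyWarned = new LinkedHashSet<>();

    SupernodeWarningThrottle(int capacity) {
        this.capacity = capacity;
    }

    /** Returns true if a warning would be logged for this traversal. */
    boolean shouldWarn(String traversal) {
        if (recentlyWarned.contains(traversal)) {
            return false; // Already warned and not yet displaced by other traversals.
        }
        recentlyWarned.add(traversal);
        if (recentlyWarned.size() > capacity) {
            // Drop the oldest remembered traversal so it can be warned about again.
            Iterator<String> oldest = recentlyWarned.iterator();
            oldest.next();
            oldest.remove();
        }
        return true;
    }

    public static void main(String[] args) {
        SupernodeWarningThrottle throttle = new SupernodeWarningThrottle(10);
        System.out.println(throttle.shouldWarn("g.V().hasLabel(\"a\").outE()")); // first occurrence
        System.out.println(throttle.shouldWarn("g.V().hasLabel(\"a\").outE()")); // immediate repeat
    }
}
```

In this model a repeated traversal stays silent until 10 different traversals have triggered warnings after it, which is why a missing warning does not necessarily mean a query avoided supernodes.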

Bulk loading considerations

When loading datasets larger than 1TB with the bulk loader, the Spark driver may be overwhelmed when attempting to calculate supernodes. To prevent this, enable sampling in the Spark driver to reduce memory and compute resources required for supernode calculation. The aerospike.graphloader.supernode.sampling-percentage configuration option controls supernode sampling for bulk loading operations.

The sampling percentage is expressed as a floating-point fraction between 0 and 1.0. For example:

aerospike.graphloader.supernode.sampling-percentage=0.1

sets the sampling percentage to 10% of the dataset.

aerospike.graphloader.supernode.sampling-percentage=0.01

sets the sampling percentage to 1% of the dataset.
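
Because the option takes a fraction rather than a whole-number percentage, it is easy to write 10 where 0.1 is intended. The following minimal sketch converts a human-readable percentage into the key=value form shown above. The class SamplingConfig and its methods are hypothetical and not part of the bulk loader. The range check simply mirrors the stated range of 0 to 1.0; whether the loader accepts the endpoints is not stated on this page.

```java
import java.math.BigDecimal;

/** Hypothetical helper (not part of the bulk loader) for building the sampling option. */
final class SamplingConfig {
    static final String KEY = "aerospike.graphloader.supernode.sampling-percentage";

    /** Converts a percentage such as 10 (meaning 10%) into the fraction the option expects. */
    static String toFraction(double percent) {
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be between 0 and 100: " + percent);
        }
        // BigDecimal avoids scientific notation for very small fractions.
        return BigDecimal.valueOf(percent)
                .movePointLeft(2)
                .stripTrailingZeros()
                .toPlainString();
    }

    /** Renders the option in key=value form. */
    static String toConfigLine(double percent) {
        return KEY + "=" + toFraction(percent);
    }

    public static void main(String[] args) {
        System.out.println(toConfigLine(10)); // 10% of the dataset
        System.out.println(toConfigLine(1));  // 1% of the dataset
    }
}
```

A lower fraction reduces the work done by the Spark driver, at the cost of basing supernode calculation on a smaller sample of the dataset.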