Health check

This page describes how Aerospike’s health check detects health outliers in a cluster, how to configure the health check, and how to write custom health checks.

Health outliers are nodes or devices that deviate from the normal or expected pattern of database activity or performance metrics. Outliers can be detected by exceptionally high or low values that don’t fit the general trend observed in the majority of data points.

Outliers are important in database monitoring because they can indicate hardware failures, network issues, misconfigurations, or resource exhaustion that may be affecting cluster performance.

How health check works

Aerospike’s health check runs independently on every node, maintaining statistics, comparing them every 1 minute, and reporting anomalies if any are detected. Health check requires a minimum 4-node cluster to detect outliers, as it needs multiple data points for statistical comparison.

The Aerospike health check tracks two categories of statistics:

  • Cluster statistics - Track the health of peer nodes in the cluster (4 statistics)
  • Local statistics - Track the health of the node itself (1 statistic)

Maintaining statistics

The challenge in maintaining statistics is the amount of history that needs to be stored. To overcome this, Aerospike uses a sliding window-based method to store the most recent data:

  • The previous 30 minutes of information is maintained for every statistic
  • Statistics are compared every 1 minute to detect anomalies
  • At least 1 minute of statistics must be accumulated before outliers can be detected
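
The windowing described above can be pictured as a fixed-size buffer. The following Python sketch is illustrative only and is not Aerospike's implementation: it assumes one sample per minute for each statistic, and the names SlidingWindowStat, WINDOW_MINUTES, add, and moving_average are hypothetical.

```python
from collections import deque

WINDOW_MINUTES = 30  # from the text: 30 minutes of history per statistic


class SlidingWindowStat:
    """Keeps only the most recent samples (assumed: one sample per minute)."""

    def __init__(self, window=WINDOW_MINUTES):
        # A bounded deque drops the oldest sample automatically when full.
        self.samples = deque(maxlen=window)

    def add(self, value):
        self.samples.append(value)

    def moving_average(self):
        # Returns None until at least one sample has been accumulated.
        if not self.samples:
            return None
        return sum(self.samples) / len(self.samples)


stat = SlidingWindowStat()
for minute in range(1, 36):  # 35 one-minute samples: 1, 2, ..., 35
    stat.add(minute)
```

After 35 samples, the window holds only the last 30, so memory use stays constant regardless of how long the node has been running.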

Cluster statistics

The following cluster statistics track the health of peer nodes in the cluster. They are maintained for all visible nodes and reflect the health of other nodes.

Number of fabric connections opened

Many fabric connections are attempted when earlier connections have failed or timed out. This can happen when the network between nodes is bad, fabric is misconfigured, or a firewall is restricting traffic. One example is where TLS misconfiguration leads to churn in fabric connections.

Number of node arrivals

This metric tracks how many times a node has successfully established a connection after a period of instability. When the network connection between nodes is intermittent (unreliable) and the heartbeat uses multicast, this count may increase rapidly, reflecting repeated successful reconnection attempts even if the core communication channels (like fabric-opened connections) remain open but degraded.

Number of proxy requests

This metric highlights a node where bad communication or unresponsiveness triggers clients to initiate proxy requests to reach data on that node.

Replica latency

This statistic increases due to two primary issues: either the replica node is running out of resources, such as memory or CPU, or there’s a poor connection between the master and the replica. This does not necessarily indicate an issue with the replica’s storage device, as write operations are, by default, buffered in memory before being flushed to disk unless commit-to-device is true.

Local statistics

Local statistics track the health of the node on which the health check runs, rather than the health of peer nodes.

Device read latency

Device read latency is useful only when multiple devices are attached to a namespace. A device is flagged as bad if the read latency to that device is significantly greater than the read latency to other devices in the same namespace. In some cases, one bad device drastically increases the read/write latency of the node.

Statistic thresholds

For a statistic to be part of the outlier detection algorithm, the value of the statistic must be greater than the minimum threshold. The minimum thresholds for different statistics are listed in the following table.

Statistic                              Min threshold (8.1.0+)   Min threshold (Pre-8.1.0)
Number of fabric connections opened    60                       5
Number of node arrivals                10                       1
Number of proxy requests               20                       2
Replica latency                        0                        0
Device read latency                    0                        0
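
As a rough illustration of the "greater than the minimum threshold" rule, the following Python sketch encodes the table above and checks whether a value is eligible for outlier detection. The names MIN_THRESHOLDS and is_eligible are hypothetical, and the keys are the table's display names rather than identifiers used by the server.

```python
# Minimum thresholds from the table above, keyed by statistic name.
# Each value is (8.1.0 and later, before 8.1.0).
MIN_THRESHOLDS = {
    "Number of fabric connections opened": (60, 5),
    "Number of node arrivals": (10, 1),
    "Number of proxy requests": (20, 2),
    "Replica latency": (0, 0),
    "Device read latency": (0, 0),
}


def is_eligible(statistic, value, pre_8_1_0=False):
    """True if value is strictly greater than the statistic's minimum threshold."""
    new, old = MIN_THRESHOLDS[statistic]
    threshold = old if pre_8_1_0 else new
    return value > threshold


# 12 fabric connections clear the pre-8.1.0 threshold but not the 8.1.0+ one.
old_result = is_eligible("Number of fabric connections opened", 12, pre_8_1_0=True)
new_result = is_eligible("Number of fabric connections opened", 12)
```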

Configuration

Health check is a service-level configuration and is disabled by default. You can enable it statically in aerospike.conf or dynamically with asinfo commands.

Static configuration

To enable the health check, add enable-health-check true to the service section of /etc/aerospike/aerospike.conf. To disable the health check, change true to false.

service {
    ...
    enable-health-check true
    ...
}

Dynamic configuration

Use the following commands to enable or disable health check dynamically on a running node:

asinfo -v "set-config:context=service;enable-health-check=true"
asinfo -v "set-config:context=service;enable-health-check=false"

Info commands

Before you run health-outliers or health-stats, you must set enable-health-check to true on all nodes.

The health-outliers command

health-outliers compares metrics for each host to the same metrics on other nodes within the cluster and returns a list of outliers and their information.

Output

Returns a list of tuples:

<id=. . . :confidence_pct=. . . :reason=. . . >

Where:

  • id= The node ID or device ID that is declared an outlier.

  • confidence_pct= The confidence that the ID is an outlier. A high value indicates that the statistic is highly skewed.

  • reason= The statistic that caused the ID to be declared an outlier.

Example

asinfo -v health-outliers -l
id=bb9040011ac4202:confidence_pct=100:reason=fabric_connections_opened
id=bb9040011ac4202:confidence_pct=100:reason=proxies
id=bb9040011ac4202:confidence_pct=100:reason=node_arrivals
id=/opt/aerospike/data/bar2.dat:namespace=test:confidence_pct=100:reason=device_read_latency
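
To process this output in a script, each line can be split into key-value pairs. The following Python sketch parses lines like those in the example and orders them by confidence, highest first. The function names parse_tuple and rank_outliers are hypothetical, and the parser assumes that no value contains a colon.

```python
def parse_tuple(line):
    """Splits 'k1=v1:k2=v2' into a dict. Assumes values contain no ':'."""
    fields = {}
    for part in line.strip().split(":"):
        key, _, value = part.partition("=")
        fields[key] = value
    return fields


def rank_outliers(lines):
    """Parses health-outliers lines and sorts them by confidence, highest first."""
    outliers = [parse_tuple(line) for line in lines if line.strip()]
    return sorted(outliers, key=lambda o: int(o["confidence_pct"]), reverse=True)


# Two lines copied from the example output above.
sample = [
    "id=bb9040011ac4202:confidence_pct=100:reason=fabric_connections_opened",
    "id=/opt/aerospike/data/bar2.dat:namespace=test:confidence_pct=100:reason=device_read_latency",
]
ranked = rank_outliers(sample)
```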

The health-stats command

health-stats returns the list of statistics maintained by the health check, for reference. Filter its output to build the input arrays for a custom detector.

Output

Returns a list of tuples:

<stat=. . . :value=. . . :[node=. . .:device=. . .:namespace=. . .]>

Where:

  • stat= Identifies the type of statistic. The name combines the statistic type with any other identifiers that define the stat group, for example test_device_read_latency for devices in the namespace test. See Customize your outlier detector.

  • value= The current moving average.

The stat entry serves as a group ID. To build the array of values for a group, filter the tuples on stat and collect the corresponding value entries.

Other entries in the tuple are additional information about the stat:

  • node= The node ID.

  • device= The device name. Only one of node or device will appear per tuple, depending on the stat type.

  • namespace= Optional entry that specifies the namespace corresponding to the stat.

Example

asinfo -v health-stats -l
stat=fabric_connections_opened:value=0:node=BB9070011AC4202
. . .
stat=fabric_connections_opened:value=0:node=BB9030011AC4202
. . .
stat=fabric_connections_opened:value=0:node=BB9050011AC4202
. . .
stat=fabric_connections_opened:value=153:node=BB9040011AC4202
. . .
stat=test_device_read_latency:value=1408:device=/opt/aerospike/data/bar1.dat:namespace=test
stat=test_device_read_latency:value=306132:device=/opt/aerospike/data/bar2.dat:namespace=test
stat=test_device_read_latency:value=11630:device=/opt/aerospike/data/bar3.dat:namespace=test
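
The grouping by stat can be sketched in a few lines of Python. The function name group_by_stat is hypothetical, and the parser assumes that no value contains a colon and that every value is numeric. The sample lines are copied from the example above.

```python
from collections import defaultdict


def group_by_stat(lines):
    """Maps each stat name to a list of (node or device, value) pairs."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line.startswith("stat="):
            continue  # skip blank or elided lines
        # partition("=")[::2] yields (key, value) for each 'key=value' part.
        fields = dict(part.partition("=")[::2] for part in line.split(":"))
        ident = fields.get("node") or fields.get("device")
        groups[fields["stat"]].append((ident, float(fields["value"])))
    return dict(groups)


sample = [
    "stat=fabric_connections_opened:value=0:node=BB9070011AC4202",
    "stat=fabric_connections_opened:value=0:node=BB9030011AC4202",
    "stat=fabric_connections_opened:value=0:node=BB9050011AC4202",
    "stat=fabric_connections_opened:value=153:node=BB9040011AC4202",
    "stat=test_device_read_latency:value=1408:device=/opt/aerospike/data/bar1.dat:namespace=test",
    "stat=test_device_read_latency:value=306132:device=/opt/aerospike/data/bar2.dat:namespace=test",
    "stat=test_device_read_latency:value=11630:device=/opt/aerospike/data/bar3.dat:namespace=test",
]
groups = group_by_stat(sample)
```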

Detecting outliers

Aerospike health check uses the interquartile range (IQR) with a k-factor of 3 to compute the upper boundary for normal behavior. Values above this boundary are candidate outliers.

In addition to the upper bound, health check uses a minimum confidence percentage heuristic as another constraint to prune outliers. Outliers with less than 50% confidence are not reported. When multiple outliers are present, start diagnosing from the outlier with the highest confidence percentage.
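
The upper boundary can be illustrated with the conventional IQR fence, Q3 + k × IQR, where IQR = Q3 - Q1. The following Python sketch is an approximation under stated assumptions, not the server's code: the quartile method (the standard library's "inclusive" method) is an assumption, the confidence percentage is not modeled, the sample values are made up, and the function names are hypothetical.

```python
import statistics

K_FACTOR = 3  # from the text


def iqr_upper_bound(values, k=K_FACTOR):
    """Upper fence Q3 + k * IQR. The quartile method is an assumption."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 + k * (q3 - q1)


def find_outliers(values, k=K_FACTOR):
    """Returns the values that lie above the upper fence."""
    bound = iqr_upper_bound(values, k)
    return [v for v in values if v > bound]


# Hypothetical per-node values for one statistic in an 8-node cluster.
values = [2, 3, 3, 4, 4, 5, 5, 90]
bound = iqr_upper_bound(values)
outliers = find_outliers(values)
```

With very small samples, such as a 4-node cluster, the result depends heavily on how the quartiles are computed, so a different quartile method can flag different values.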

False positives

During testing and production use, several scenarios have been identified that could lead to false positives in the outlier detection system.

Low-throughput environments

In a low-throughput setup, a single write operation can trigger a false positive. When a write is sent to one replica, its latency will be non-zero, while all other nodes show zero latency due to inactivity. This discrepancy can cause the system to incorrectly flag the replica with the non-zero latency as an outlier.

Node addition (Pre-8.1.0)

Prior to Database 8.1.0, a newly added node could be incorrectly marked as an outlier because the default threshold for opened fabric connections was 5, while 12 connections are typically opened to a new node when it joins the cluster. Database 8.1.0 increases the threshold to 60 to address this issue.

Customize your outlier detector

Develop your own outlier detector for best results in your cluster. An outlier detector accepts an array of values as input and returns a list of outliers. Leverage health-stats output to write a custom detector.

The following steps are an example of how to write a custom detector.

  1. Run health-stats and use its output as the source of the input arrays for the custom detector. Treat the other details in each tuple as attributes. Start with the following output:

    stat=fabric_connections_opened:value=0:node=BB9070011AC4202
    stat=fabric_connections_opened:value=0:node=BB9030011AC4202
    stat=fabric_connections_opened:value=0:node=BB9050011AC4202
    stat=fabric_connections_opened:value=153:node=BB9040011AC4202
    stat=test_device_read_latency:value=1408:device=/opt/aerospike/data/bar1.dat:namespace=test
    stat=test_device_read_latency:value=306132:device=/opt/aerospike/data/bar2.dat:namespace=test
    stat=test_device_read_latency:value=11630:device=/opt/aerospike/data/bar3.dat:namespace=test
  2. Filtering on the stat entry forms the following groups:

    Group 1: stat=fabric_connections_opened

    stat=fabric_connections_opened:value=0:node=BB9070011AC4202
    stat=fabric_connections_opened:value=0:node=BB9030011AC4202
    stat=fabric_connections_opened:value=0:node=BB9050011AC4202
    stat=fabric_connections_opened:value=153:node=BB9040011AC4202

    The input to the outlier detector will be [0, 0, 0, 153]

    Group 2: stat=test_device_read_latency

    stat=test_device_read_latency:value=1408:device=/opt/aerospike/data/bar1.dat:namespace=test
    stat=test_device_read_latency:value=306132:device=/opt/aerospike/data/bar2.dat:namespace=test
    stat=test_device_read_latency:value=11630:device=/opt/aerospike/data/bar3.dat:namespace=test

    The input to the outlier detector will be [1408, 11630, 306132]
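
The steps above can be completed with a detector of your choice. The following Python sketch runs a pluggable detector over the two groups and maps the flagged values back to their node or device IDs. The rule shown, flagging values above max(floor, factor × median), is a made-up example and is not the algorithm used by health-outliers. The names median_multiple_detector and detect and the factor and floor defaults are hypothetical.

```python
import statistics

# Groups from the steps above: stat -> {node or device: value}.
groups = {
    "fabric_connections_opened": {
        "BB9070011AC4202": 0,
        "BB9030011AC4202": 0,
        "BB9050011AC4202": 0,
        "BB9040011AC4202": 153,
    },
    "test_device_read_latency": {
        "/opt/aerospike/data/bar1.dat": 1408,
        "/opt/aerospike/data/bar2.dat": 306132,
        "/opt/aerospike/data/bar3.dat": 11630,
    },
}


def median_multiple_detector(values, factor=10, floor=60):
    """Hypothetical rule: flag values above max(floor, factor * median)."""
    limit = max(floor, factor * statistics.median(values))
    return [v for v in values if v > limit]


def detect(groups, detector):
    """Runs detector on each group's value array and maps results back to IDs."""
    report = {}
    for stat, by_id in groups.items():
        flagged = set(detector(list(by_id.values())))
        report[stat] = [ident for ident, v in by_id.items() if v in flagged]
    return report


report = detect(groups, median_multiple_detector)
```

Because detect accepts any function that takes an array of values and returns the outlying values, you can swap in a different rule without changing the grouping logic.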
