# Health check

This page describes how Aerospike’s health check detects health outliers in a cluster, how to configure the health check, and how to write custom health checks.

Health outliers are nodes or devices that deviate from the normal or expected pattern of database activity or performance metrics. Outliers can be detected by exceptionally high or low values that don’t fit the general trend observed in the majority of data points.

Outliers are important in database monitoring because they can indicate hardware failures, network issues, misconfigurations, or resource exhaustion that may be affecting cluster performance.

## How health check works

Aerospike’s health check runs independently on every node. Each node maintains statistics, compares them every minute, and reports any anomalies it detects. Health check requires a cluster of at least four nodes to detect outliers because it needs multiple data points for statistical comparison.

The Aerospike health check tracks two categories of statistics:

-   **Cluster statistics** - Track the health of peer nodes in the cluster (4 statistics)
-   **Local statistics** - Track the health of the node itself (1 statistic)

## Maintaining statistics

The challenge in maintaining statistics is the amount of history that needs to be stored. To overcome this, Aerospike uses a sliding window-based method to store the most recent data:

-   The previous **30 minutes** of information is maintained for every statistic
-   Statistics are compared **every 1 minute** to detect anomalies
-   At least **1 minute** of statistics must be accumulated before outliers can be detected
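
The sliding window described above can be sketched in a few lines of Python. This is only an illustration, not the server's implementation: the class name `SlidingWindowStat`, the one-sample-per-minute cadence, and the use of a plain mean as the moving average are all assumptions made for the example.

```python
from collections import deque


class SlidingWindowStat:
    """Keep only the most recent `window` samples of one statistic.

    Hypothetical helper: assumes one sample is recorded per minute,
    so window=30 corresponds to 30 minutes of history.
    """

    def __init__(self, window=30):
        # A deque with maxlen silently drops the oldest sample when full.
        self.samples = deque(maxlen=window)

    def record(self, value):
        self.samples.append(value)

    def ready(self):
        # At least one minute (one sample) is needed before comparing.
        return len(self.samples) >= 1

    def moving_average(self):
        if not self.samples:
            raise ValueError("no samples recorded yet")
        return sum(self.samples) / len(self.samples)


stat = SlidingWindowStat(window=30)
for minute in range(45):          # 45 minutes of samples, only 30 are kept
    stat.record(minute)
print(len(stat.samples), stat.moving_average())
```

Bounding the window keeps memory use constant per statistic, no matter how long the node has been running.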

### Cluster statistics

The following cluster statistics track the health of peer nodes in the cluster. They are maintained for all visible nodes and reflect the health of other nodes.

#### Number of fabric connections opened

Many fabric connections are attempted when earlier connections have failed or timed out. This can happen when the network between nodes is bad, fabric is misconfigured, or a firewall is restricting traffic. One example is where TLS misconfiguration leads to churn in fabric connections.

#### Number of node arrivals

This metric tracks how many times a node has successfully established a connection after a period of instability. When the network connection between nodes is intermittent (unreliable) and the heartbeat uses multicast, this count may increase rapidly, reflecting repeated successful reconnection attempts even if the core communication channels (like fabric-opened connections) remain open but degraded.

#### Number of proxy requests

This metric highlights a node where bad communication or unresponsiveness triggers clients to initiate proxy requests to reach data on that node.

#### Replica latency

This statistic increases due to two primary issues: either the replica node is running out of resources, such as memory or CPU, or there’s a poor connection between the master and the replica. This does not necessarily indicate an issue with the replica’s storage device, as write operations are, by default, buffered in memory before being flushed to disk unless [commit-to-device](https://aerospike.com/docs/database/reference/config#namespace__commit-to-device) is `true`.

### Local statistics

Local statistics track the health of the node where the metric is run, rather than peer nodes.

#### Device read latency

Device read latency is useful only when multiple devices are attached to a namespace. A device is flagged as bad if the read latency to that device is significantly greater than the read latency to other devices in the same namespace. In some cases, one bad device drastically increases the read/write latency of the node.

### Statistic thresholds

For a statistic to be part of the outlier detection algorithm, the value of the statistic must be greater than the minimum threshold. The minimum thresholds for different statistics are listed in the following table.

| Statistic | Min threshold (8.1.0+) | Min threshold (Pre-8.1.0) |
| --- | --- | --- |
| Number of fabric connections opened | 60 | 5 |
| Number of node arrivals | 10 | 1 |
| Number of proxy requests | 20 | 2 |
| Replica latency | 0 | 0 |
| Device read latency | 0 | 0 |
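
As a rough illustration of how the minimum thresholds gate a statistic, the following Python sketch encodes the 8.1.0+ column of the table. The dictionary keys are hypothetical labels chosen to resemble the `reason` names in the `health-outliers` output (`replica_latency` in particular is an assumed name), and the function is not part of any Aerospike API.

```python
# Minimum thresholds from the 8.1.0+ column of the table above.
# Keys are illustrative labels, not confirmed server identifiers.
MIN_THRESHOLDS = {
    "fabric_connections_opened": 60,
    "node_arrivals": 10,
    "proxies": 20,
    "replica_latency": 0,        # assumed label
    "device_read_latency": 0,
}


def is_eligible(stat, value, thresholds=MIN_THRESHOLDS):
    """Return True if `value` is strictly greater than the stat's minimum."""
    return value > thresholds[stat]


print(is_eligible("fabric_connections_opened", 153))  # above the minimum
print(is_eligible("fabric_connections_opened", 12))   # below the minimum
```

With the 8.1.0+ values, the 12 fabric connections typically opened to a newly added node stay below the minimum of 60 and are not considered.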

## Configuration

Health check is a service-level configuration and is **disabled by default**. You can enable it statically in `aerospike.conf` or dynamically with `asinfo` commands.

::: note
The `enable-health-check` configuration enables both cluster statistics and local statistics. There is no separate configuration for each category.
:::

### Static configuration

To enable the health check, add `enable-health-check true` to the service section of `/etc/aerospike/aerospike.conf`. To disable the health check, change `true` to `false`.

```plaintext
service {

  ...

  enable-health-check true

  ...

}
```

### Dynamic configuration

Use the following commands to enable or disable health check dynamically on a running node:

```txt
asinfo -v "set-config:context=service;enable-health-check=true"

asinfo -v "set-config:context=service;enable-health-check=false"
```
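
Because the setting must be enabled on every node before the info commands are useful, it can be convenient to generate the command once per node. The Python sketch below only builds and prints the command lines. The host addresses are hypothetical, and it assumes the `asinfo` `-h` option is used to select the target host.

```python
import shlex


def health_check_argv(host, enabled):
    """Build the asinfo argument list that toggles health check on one host."""
    value = "true" if enabled else "false"
    command = f"set-config:context=service;enable-health-check={value}"
    return ["asinfo", "-h", host, "-v", command]


hosts = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]  # hypothetical addresses
for host in hosts:
    # shlex.join quotes the command so the ';' survives a shell.
    print(shlex.join(health_check_argv(host, True)))
```

Each argument list can also be passed to `subprocess.run(argv, check=True)` to execute it directly instead of printing it.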

## Info commands

Before you run `health-outliers` or `health-stats`, you must enable [`enable-health-check`](https://aerospike.com/docs/database/reference/config#service__enable-health-check) on all nodes.

### The health-outliers command

[`health-outliers`](https://aerospike.com/docs/database/reference/info#health-outliers) compares metrics for each host to the same metrics on other nodes within the cluster and returns a list of outliers and their information.

#### Output

Returns a list of tuples:

`<id=. . . :[namespace=. . . :]confidence_pct=. . . :reason=. . . >`

Where:

-   **id=** The node ID or device ID that is declared an outlier.
    
-   **namespace=** Optional entry that appears for device outliers, as in the example below.
    
-   **confidence\_pct=** A percentage. A higher value indicates that the statistic is more heavily skewed.
    
-   **reason=** The statistic responsible for declaring the ID an outlier.
    

#### Example

```plaintext
asinfo -v health-outliers -l

id=bb9040011ac4202:confidence_pct=100:reason=fabric_connections_opened

id=bb9040011ac4202:confidence_pct=100:reason=proxies

id=bb9040011ac4202:confidence_pct=100:reason=node_arrivals

id=/opt/aerospike/data/bar2.dat:namespace=test:confidence_pct=100:reason=device_read_latency
```
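
Output in this shape is straightforward to post-process. The following Python sketch parses rows like those in the example and orders them by confidence so that diagnosis can start with the strongest signal. The function names are hypothetical, and the parser assumes that field values never contain a `:` character.

```python
def parse_outlier(line):
    """Split one health-outliers row into a dict of its key=value fields."""
    fields = dict(part.split("=", 1) for part in line.strip().split(":"))
    # float() is used so the sketch does not assume integer percentages.
    fields["confidence_pct"] = float(fields["confidence_pct"])
    return fields


def triage_order(lines):
    """Parse non-empty rows and sort them by confidence, highest first."""
    rows = [parse_outlier(line) for line in lines if line.strip()]
    return sorted(rows, key=lambda row: row["confidence_pct"], reverse=True)


sample = """
id=bb9040011ac4202:confidence_pct=100:reason=fabric_connections_opened
id=/opt/aerospike/data/bar2.dat:namespace=test:confidence_pct=100:reason=device_read_latency
""".splitlines()

for row in triage_order(sample):
    print(row["id"], row["reason"], row["confidence_pct"])
```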

### The health-stats command

[`health-stats`](https://aerospike.com/docs/database/reference/info#health-stats) returns the statistics that the health check maintains. Filter its output to build the input arrays for a custom outlier detector.

#### Output

Returns a list of tuples:

`<stat=. . . :value=. . . :[node=. . .:device=. . .:namespace=. . .]>`

Where:

-   **stat=** Identifies the statistic group. The name combines the statistic type with any other identifier that makes the group unique, for example the namespace prefix in `test_device_read_latency`. See [Customize your outlier detector](#customize-your-outlier-detector).
    
-   **value=** The current moving average.
    

The `stat` entry serves as a group ID. To build the array of values for a group, filter the tuples on `stat` and collect the corresponding `value` entries.

Other entries in the tuple are additional information about the stat:

-   **node=** The node ID.
    
-   **device=** The device name. Only one of node or device will appear per tuple, depending on the stat type.
    
-   **namespace=** Optional entry that specifies the namespace corresponding to the stat.
    

#### Example

```plaintext
asinfo -v health-stats -l

stat=fabric_connections_opened:value=0:node=BB9070011AC4202

. . .

stat=fabric_connections_opened:value=0:node=BB9030011AC4202

. . .

stat=fabric_connections_opened:value=0:node=BB9050011AC4202

. . .

stat=fabric_connections_opened:value=153:node=BB9040011AC4202

. . .

stat=test_device_read_latency:value=1408:device=/opt/aerospike/data/bar1.dat:namespace=test

stat=test_device_read_latency:value=306132:device=/opt/aerospike/data/bar2.dat:namespace=test

stat=test_device_read_latency:value=11630:device=/opt/aerospike/data/bar3.dat:namespace=test
```
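
The grouping role of `stat` can be illustrated with a short Python sketch that turns rows like the ones above into one array per stat. The function name is hypothetical, values are converted with `float()` because the value format is not specified here, and the parser assumes no field value contains a `:` character.

```python
from collections import defaultdict


def group_health_stats(lines):
    """Map each stat name to a list of (owner, value) pairs."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line.startswith("stat="):
            continue  # skip blank lines and elided rows
        fields = dict(part.split("=", 1) for part in line.split(":"))
        # Each row carries either a node ID or a device name.
        owner = fields.get("node") or fields.get("device")
        groups[fields["stat"]].append((owner, float(fields["value"])))
    return dict(groups)


sample = """
stat=fabric_connections_opened:value=0:node=BB9070011AC4202
stat=fabric_connections_opened:value=153:node=BB9040011AC4202
stat=test_device_read_latency:value=1408:device=/opt/aerospike/data/bar1.dat:namespace=test
""".splitlines()

for stat, pairs in group_health_stats(sample).items():
    print(stat, [value for _, value in pairs])
```

Keeping the owner next to each value makes it possible to report which node or device a flagged value belongs to.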

## Detecting outliers

Aerospike health check uses the interquartile range (IQR) method with a k-factor of 3 to compute the upper bound for normal behavior.

In addition to the upper bound, health check uses a minimum confidence percentage heuristic as another constraint to prune outliers. Outliers with less than 50% confidence are not reported. When multiple outliers are present, start diagnosing from the outlier with the highest confidence percentage.
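
For readers unfamiliar with the IQR method, the Python sketch below shows the textbook form of the upper bound, `Q3 + k * (Q3 - Q1)`, with `k = 3`. It is not the server's code: the quartile convention (linear interpolation through `statistics.quantiles`), the function names, and the sample values are assumptions made for illustration, and the confidence percentage is not modeled.

```python
import statistics


def iqr_upper_bound(values, k=3.0):
    """Return Q3 + k * IQR using linearly interpolated quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 + k * (q3 - q1)


def find_outliers(values, k=3.0):
    """Return the values that lie strictly above the IQR upper bound."""
    bound = iqr_upper_bound(values, k)
    return [v for v in values if v > bound]


sample = [2, 3, 3, 4, 4, 5, 153]   # hypothetical per-node values
print(iqr_upper_bound(sample))     # 9.0
print(find_outliers(sample))       # [153]
```

Different quartile conventions can give different bounds on very small samples, so a result near the bound may change with the convention used.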

### False positives

During testing and production use, several scenarios have been identified that could lead to false positives in the outlier detection system.

#### Low-throughput environments

In a low-throughput setup, a single write operation can trigger a false positive. When a write is sent to one replica, its latency will be non-zero, while all other nodes show zero latency due to inactivity. This discrepancy can cause the system to incorrectly flag the replica with the non-zero latency as an outlier.

#### Node addition (Pre-8.1.0)

Prior to Database 8.1.0, a newly added node could be incorrectly marked as an outlier because the default threshold for opened fabric connections was 5, when 12 connections are typically opened to a new node upon its arrival in the cluster. Database 8.1.0 increases the connection threshold to 60 to address this issue.

## Customize your outlier detector

Develop your own outlier detector for best results in your cluster. An outlier detector accepts an array of values as input and returns a list of outliers. Use the `health-stats` output as the input to a custom detector.

The following steps are an example of how to write a custom detector.

1.  Filter on `health-stats` and use the results as the input array elements of the custom detector. Treat other details in the statistics as attributes. Start with the following elements of an input array:
    
    ```plaintext
    stat=fabric_connections_opened:value=0:node=BB9070011AC4202
    
    stat=fabric_connections_opened:value=0:node=BB9030011AC4202
    
    stat=fabric_connections_opened:value=0:node=BB9050011AC4202
    
    stat=fabric_connections_opened:value=153:node=BB9040011AC4202
    
    stat=test_device_read_latency:value=1408:device=/opt/aerospike/data/bar1.dat:namespace=test
    
    stat=test_device_read_latency:value=306132:device=/opt/aerospike/data/bar2.dat:namespace=test
    
    stat=test_device_read_latency:value=11630:device=/opt/aerospike/data/bar3.dat:namespace=test
    ```
    
2.  Grouping the rows by their `stat` entry forms the following groups:
    
    **Group 1: stat=fabric\_connections\_opened**
    
    ```plaintext
    stat=fabric_connections_opened:value=0:node=BB9070011AC4202
    
    stat=fabric_connections_opened:value=0:node=BB9030011AC4202
    
    stat=fabric_connections_opened:value=0:node=BB9050011AC4202
    
    stat=fabric_connections_opened:value=153:node=BB9040011AC4202
    ```
    
    The input to the outlier detector will be `[0, 0, 0, 153]`
    
    **Group 2: stat=test\_device\_read\_latency**
    
    ```plaintext
    stat=test_device_read_latency:value=1408:device=/opt/aerospike/data/bar1.dat:namespace=test
    
    stat=test_device_read_latency:value=306132:device=/opt/aerospike/data/bar2.dat:namespace=test
    
    stat=test_device_read_latency:value=11630:device=/opt/aerospike/data/bar3.dat:namespace=test
    ```
    
    The input to the outlier detector will be `[1408, 11630, 306132]`
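
Once the groups are formed, any detection rule can be applied to each array. The Python sketch below uses a deliberately simple rule that is not Aerospike's algorithm: a value is flagged when it exceeds both a fixed floor and a multiple of the group median. The function name, the factor of 10, and the choice to reuse the fabric-connection minimum threshold of 60 as a floor are all hypothetical.

```python
import statistics


def median_ratio_outliers(values, factor=10.0, floor=0.0):
    """Return values greater than max(floor, factor * median of the group)."""
    limit = max(floor, factor * statistics.median(values))
    return [v for v in values if v > limit]


# The two input arrays formed in step 2.
groups = {
    "fabric_connections_opened": [0, 0, 0, 153],
    "test_device_read_latency": [1408, 11630, 306132],
}

# Hypothetical per-stat floors so that tiny counts are never flagged.
floors = {"fabric_connections_opened": 60}

for stat, values in groups.items():
    flagged = median_ratio_outliers(values, floor=floors.get(stat, 0.0))
    print(stat, flagged)
```

With these inputs the rule flags `153` in the first group and `306132` in the second. Tune the factor and floors against your own cluster's normal behavior before relying on a rule like this.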