Key Metrics to Monitor

Aerospike recommends that you monitor the metrics listed here.

For the complete list of metrics, see the Metric Reference.

Operating system and server health

In addition to monitoring Aerospike metrics, you should also monitor system metrics with the Prometheus Node Exporter or an OS-specific tool.

Finding total namespace memory

The metric memory_used_bytes was removed in Database 7.0 to streamline configuration and capacity planning, and to stabilize overhead so that memory usage calculations are more accurate. In Database 7.0, no single metric reports the amount of memory used in the namespace. Instead, a combination of metrics provides the same information as memory_used_bytes.

You allocated a specific amount of storage for your namespace when you created it, and you set a limit in the system-memory-pct parameter that tells Aerospike when memory is full enough to stop writing to the namespace. To determine the total memory used in the namespace before you reach that limit, add up the values of whichever of the following metrics are stored in memory in your namespace:

  • data_used_bytes
  • index_used_bytes
  • set_index_used_bytes
  • sindex_used_bytes
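
The addition described above can be sketched in Python. This is a minimal illustration, assuming the metric values have already been collected into a dictionary keyed by metric name. The function name total_memory_used_bytes, the stats dictionary, and the in_memory parameter are hypothetical and are not part of any Aerospike tool.

```python
# Metrics that can contribute to namespace memory use, as listed above.
MEMORY_METRICS = (
    "data_used_bytes",
    "index_used_bytes",
    "set_index_used_bytes",
    "sindex_used_bytes",
)


def total_memory_used_bytes(stats, in_memory=MEMORY_METRICS):
    """Sum the metrics named in in_memory, treating missing ones as 0.

    stats is a hypothetical dict of metric name -> value for one namespace.
    Pass only the metrics whose data is actually stored in memory.
    """
    return sum(int(stats.get(name, 0)) for name in in_memory)
```

If, for example, only the primary index and secondary indexes are held in memory, you would call the function with in_memory=("index_used_bytes", "sindex_used_bytes") so that on-disk data is not counted.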

You may also run the info namespace command in the Aerospike Admin (asadm) tool. See Aerospike Admin - Info namespace for more information.

Admin> info namespace
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Usage Information (2023-10-13 15:59:46 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Namespace| Node|Evictions| Stop|~System Memory~|~Primary Index~~|~~Secondary~~~|~~~~~~~~~~~~~~~~~~~Storage Engine~~~~~~~~~~~~~~~~~~
| | |Writes| Avail%| Evict%| Type| Used|~~~~Index~~~~~| Type| Used| Used%|Evict%| Used|Avail%|Avail
| | | | | | | | Type| Used| | | | | Stop%| |Stop%
bar |172.17.0.3:3000| 0.000 |False | 74| 0|shmem|625.000 KB|shmem|0.000 B |memory|625.000 KB|0.06 %| 0.0 %|70.0 %|97.0 %|5.0 %
bar | | 0.000 | | | | |625.000 KB| |0.000 B | |625.000 KB|0.06 %| | | |
test |172.17.0.3:3000| 0.000 |False | 74| 0|shmem|625.000 KB|shmem|0.000 B |memory|625.000 KB|0.06 %| 0.0 %|70.0 %|87.0 %|5.0 %
test | | 0.000 | | | | |625.000 KB| |0.000 B | |625.000 KB|0.06 %| | | |
Number of rows: 2

aerospike_namespace_clock_skew_stop_writes

Context: namespace

Introduced: 4.0

If clock_skew_stop_writes is true, it is a critical ALERT.

Verify that clocks are synchronized across the cluster.

aerospike_namespace_data_avail_pct

Context: namespace

Introduced: 7.0

aerospike_namespace_dead_partitions

Context: namespace

Introduced: 4.0

If dead_partitions is not zero, critical ALERT. If you are certain that there are no potential data inconsistencies or if data inconsistencies are acceptable, consider issuing revive and recluster commands.

aerospike_namespace_hwm_breached

Context: namespace

Introduced: 3.9

If hwm_breached is true, alert your operations group that memory or disk resources are strained. This condition might indicate the need to increase cluster capacity.

aerospike_namespace_stop_writes

Context: namespace

Introduced: 3.9

If stop_writes is true, critical ALERT.

Until the cause is corrected, the system will reject all writes.

aerospike_namespace_unavailable_partitions

Context: namespace

Introduced: 4.0

If unavailable_partitions is not zero, critical ALERT.

Check for network issues and make sure the cluster forms properly.

aerospike_node_stats_client_connections

Context: node_stats

Introduced: -

  • If client_connections is below an expected low value, then this condition might indicate a problem with the network between clients and server.

  • If client_connections is greater than an expected high value, then this condition might indicate a problem with clients rapidly opening and closing sockets.

  • If client_connections is at or near proto_fd_max, then the server is either currently unable to accept new connections or might soon be unable to do so.
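
The three conditions above could be checked along the following lines. This is only a sketch: expected_low, expected_high, and near_ratio are hypothetical, site-specific thresholds rather than Aerospike defaults, and the function name is an assumption made for illustration.

```python
def check_client_connections(client_connections, proto_fd_max,
                             expected_low, expected_high, near_ratio=0.9):
    """Return a list of warnings for one node's client_connections value.

    expected_low, expected_high and near_ratio are hypothetical thresholds
    chosen by the operator; they are not Aerospike settings.
    """
    warnings = []
    if client_connections < expected_low:
        warnings.append("below expected low: check client-server network")
    if client_connections > expected_high:
        warnings.append("above expected high: check for socket churn")
    if client_connections >= near_ratio * proto_fd_max:
        warnings.append("at or near proto_fd_max: new connections at risk")
    return warnings
```

Returning a list rather than a single status lets two conditions be reported together, for example a value that is both above the expected high and close to proto_fd_max.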

aerospike_node_stats_client_connections_opened

Context: node_stats

Introduced: 5.6

If client_connections_opened changes unexpectedly, without clients having been added or removed and without a significant change in workload, this condition might indicate a slowdown on a node or a connectivity issue on the node.

aerospike_node_stats_cluster_size

Context: node_stats

Introduced: -

If cluster_size does not equal the expected cluster size and the cluster is not undergoing maintenance, your operations group needs to investigate.

aerospike_node_stats_fabric_connections_opened

Context: node_stats

Introduced: 5.6

If fabric_connections_opened is unexpectedly changing, alert operations, because this condition would indicate a connectivity problem with a node or a cluster change.

aerospike_node_stats_heartbeat_connections_opened

Context: node_stats

Introduced: 5.6

If heartbeat_connections_opened is unexpectedly changing, alert operations, because this condition would indicate a connectivity problem with a node or a cluster change.

aerospike_node_stats_system_free_mem_kbytes

Context: node_stats

Introduced: -

If system_free_mem_kbytes is abnormally low, this condition could indicate that the server is approaching the limits of the available RAM. Operations should investigate and potentially add nodes or increase per-node RAM.

aerospike_node_stats_system_free_mem_pct

Context: node_stats

Introduced: -

If system_free_mem_pct is abnormally low, this condition could indicate that the server is approaching the limits of the available RAM. Operations should investigate and potentially add nodes or increase per-node RAM.

aerospike_xdr_lag

Context: xdr

Introduced: 5.0.0

If lag is consistently greater than a few seconds, this condition might indicate network connectivity issues or errors writing at a destination cluster.
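
One way to turn "consistently greater than a few seconds" into a check is to require that every reading in a recent window exceeds a limit. The sketch below assumes lag readings are collected at a regular interval into a list, oldest first. The 5-second limit, the window of 6 samples, and the function name are hypothetical choices, not Aerospike recommendations.

```python
def lag_consistently_high(lag_samples, max_lag_seconds=5, window=6):
    """True when each of the last `window` lag readings exceeds the limit.

    max_lag_seconds and window are hypothetical values; tune them to the
    sampling interval and to what "a few seconds" means for your links.
    """
    if len(lag_samples) < window:
        return False  # not enough history to call the lag consistent
    return all(lag > max_lag_seconds for lag in lag_samples[-window:])
```

Requiring the whole window to be above the limit avoids alerting on a single spike. The same pattern could be applied to other metrics in this list that are described as "consistently" above or below an expected value.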

Other metrics to watch

aerospike_namespace_client_delete_error

Context: namespace

Introduced: 3.9

Compare client_delete_error to client_delete_success.

If ratio is higher than acceptable, alert operations to investigate.
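
The comparison of an error metric to its success metric can be sketched as follows. The text does not define the ratio precisely, so this example assumes it means the share of errors in the total of errors and successes. The function names and the 1% limit are hypothetical. These metrics are counters, so in practice you would likely pass the difference between two samples rather than the raw values.

```python
def error_ratio(errors, successes):
    """Return errors / (errors + successes), or 0.0 when there is no traffic."""
    total = errors + successes
    return errors / total if total else 0.0


def ratio_exceeded(errors, successes, max_ratio=0.01):
    """True when the error share is above max_ratio (a hypothetical 1% limit)."""
    return error_ratio(errors, successes) > max_ratio
```

The same two functions apply to the other error and success (or complete) pairs in this list, such as client_read_error and client_read_success.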

aerospike_namespace_client_read_error

Context: namespace

Introduced: 3.9

Compare client_read_error to client_read_success.

If ratio is higher than acceptable, alert operations to investigate.

aerospike_namespace_client_udf_error

Context: namespace

Introduced: 3.9

Compare client_udf_error to client_udf_complete.

If ratio is higher than acceptable, alert operations to investigate.

aerospike_namespace_client_write_error

Context: namespace

Introduced: 3.9

Compare client_write_error to client_write_success.

If ratio is higher than acceptable, alert operations to investigate.

For more details, see the knowledge base article "Why is my client_write_error metrics incrementing?"

aerospike_namespace_index_flash_alloc_pct

Context: namespace

Introduced: 5.6

If index_flash_alloc_pct gets close to or greater than 100%, alert operations to review the sizing of the namespace.

aerospike_namespace_pi_query_aggr_error

Context: namespace

Introduced: 6.0

Compare pi_query_aggr_error to pi_query_aggr_complete.

If ratio is higher than acceptable, alert operations to investigate.

aerospike_namespace_pi_query_long_basic_error

Context: namespace

Introduced: 6.0

Compare pi_query_long_basic_error to pi_query_long_basic_complete.

If ratio is higher than acceptable, alert operations to investigate.

aerospike_namespace_pi_query_ops_bg_error

Context: namespace

Introduced: 6.0

Compare pi_query_ops_bg_error to pi_query_ops_bg_complete.

If ratio is higher than acceptable, alert operations to investigate.

aerospike_namespace_pi_query_short_basic_error

Context: namespace

Introduced: 6.0

Compare pi_query_short_basic_error to pi_query_short_basic_complete.

If ratio is higher than acceptable, alert operations to investigate.

aerospike_namespace_pi_query_udf_bg_error

Context: namespace

Introduced: 6.0

Compare pi_query_udf_bg_error to pi_query_udf_bg_complete.

If ratio is higher than acceptable, alert operations to investigate.

aerospike_namespace_storage_engine_device_defrag_q

Context: namespace

Introduced: 4.3

Measured per-device or per-file depending on the storage configuration.

If storage-engine.device[ix].defrag_q or storage-engine.file[ix].defrag_q continues to increase over time, alert operations to investigate.
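
A simple way to detect a queue that "continues to increase over time" is to check whether a run of recent samples rises at every step. The sketch below assumes defrag_q readings for one device or file are kept in a list, oldest first. The function name and the default of 5 samples are hypothetical.

```python
def keeps_increasing(samples, min_samples=5):
    """True when the last min_samples readings rise strictly at each step.

    min_samples is a hypothetical setting; a longer run gives fewer
    false alarms at the cost of a slower alert.
    """
    if len(samples) < min_samples:
        return False  # not enough history to judge a trend
    recent = samples[-min_samples:]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

A strict step-by-step rise is a deliberately conservative test. A queue that grows with occasional dips would need a different approach, such as comparing averages over two windows.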

aerospike_namespace_storage_engine_file_write_q

Context: namespace

Introduced: 4.3

Measured per-device or per-file depending on the storage configuration.

If storage-engine.device[ix].write_q or storage-engine.file[ix].write_q is greater than 1, alert operations to investigate.

aerospike_node_stats_batch_index_error

Context: node_stats

Introduced: 3.9

Compare batch_index_error to batch_index_complete. If ratio is higher than acceptable, alert operations to investigate.

aerospike_node_stats_heap_efficiency_pct

Context: node_stats

Introduced: 3.10.1

If heap_efficiency_pct goes below 60% or 50% (depending on configuration), advise your operations group to investigate.

aerospike_node_stats_rw_in_progress

Context: node_stats

Introduced: 3.9

Depends on expected workload.

If rw_in_progress is higher than expected, or if it deviates more than is acceptable from the established baseline over time, alert operations to investigate the cause. This condition might indicate a slowdown on a particular node or overloading on the fabric.

aerospike_xdr_abandoned

Context: xdr

Introduced: 5.0.0

If abandoned is consistently higher than expected, alert operations to investigate.

aerospike_xdr_lap_us

Context: xdr

Introduced: 5.0.0

If lap_us is consistently higher than expected, alert operations to investigate.

aerospike_xdr_latency_ms

Context: xdr

Introduced: 5.0.0

Depending on configuration, latency_ms should be within the latency of the link between the data centers.

If latency_ms increases beyond what you expect from the distance (or known link latency) between clusters, alert operations to investigate.

aerospike_xdr_recoveries

Context: xdr

Introduced: 5.0.0

If recoveries is consistently increasing, alert operations to investigate.

aerospike_xdr_recoveries_pending

Context: xdr

Introduced: 5.0.0

If recoveries_pending is unexpectedly increasing, alert operations to investigate.

aerospike_xdr_retry_conn_reset

Context: xdr

Introduced: 5.0.0

If retry_conn_reset is consistently higher than expected, alert operations to investigate.

aerospike_xdr_retry_dest

Context: xdr

Introduced: 5.0.0

If retry_dest is consistently higher than expected, alert operations to investigate.

aerospike_xdr_retry_no_node

Context: xdr

Introduced: 5.1

If retry_no_node is consistently higher than expected, alert operations to investigate.

aerospike_xdr_success

Context: xdr

Introduced: 5.0.0

If success is consistently lower than expected, alert operations to investigate.