Key Metrics to Monitor
Aerospike recommends that you monitor the metrics listed here.
For the complete list of metrics, see the Metric Reference.
Operating system and server health
In addition to monitoring Aerospike metrics, you should also monitor system metrics with the Prometheus Node Exporter or an OS-specific tool.
Finding total namespace memory
The metric memory_used_bytes was removed in Database 7.0 to streamline configuration and capacity planning, and to stabilize overhead so that memory usage calculations are more accurate.
In Database 7.0, no single metric reports the amount of memory used in the namespace. Instead, a combination of metrics provides the same information as memory_used_bytes.
You allocated a specific amount of storage for your namespace when you created it. You also set a limit in the system-memory-pct parameter that tells Aerospike when memory is full enough to stop writing to the namespace.
Before you reach that limit, you can determine the total memory used in the namespace by adding individual metrics. Add up the values of whichever of the following metrics correspond to data or indexes that your namespace stores in memory:
data_used_bytes
index_used_bytes
set_index_used_bytes
sindex_used_bytes
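The addition described above can be sketched in Python. This is a minimal illustration, not an official tool: it assumes the namespace statistics arrive as a semicolon-separated `key=value` string, and the helper names `parse_info` and `total_used_memory_bytes` and the sample values are hypothetical.

```python
# Metrics that can contribute to in-memory usage, as listed above.
MEMORY_METRICS = (
    "data_used_bytes",
    "index_used_bytes",
    "set_index_used_bytes",
    "sindex_used_bytes",
)


def parse_info(raw):
    """Parse a semicolon-separated key=value string into a dict of strings."""
    pairs = (item.split("=", 1) for item in raw.strip().split(";") if "=" in item)
    return {key: value for key, value in pairs}


def total_used_memory_bytes(stats, in_memory=MEMORY_METRICS):
    """Sum only the metrics that are stored in memory for this namespace."""
    return sum(int(stats.get(name, 0)) for name in in_memory)


# Hypothetical sample values, for illustration only.
sample = (
    "data_used_bytes=640000;index_used_bytes=640000;"
    "set_index_used_bytes=0;sindex_used_bytes=0"
)
stats = parse_info(sample)
print(total_used_memory_bytes(stats))
```

If, for example, only the indexes are held in memory, pass a shorter tuple such as `in_memory=("index_used_bytes", "sindex_used_bytes")` so that the data bytes are left out of the total.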
You can also run the info namespace command in the Aerospike Admin (asadm) tool.
See Aerospike Admin - Info namespace for more information.
Admin> info namespace
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Usage Information (2023-10-13 15:59:46 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Namespace| Node|Evictions| Stop|~System Memory~|~Primary Index~~|~~Secondary~~~|~~~~~~~~~~~~~~~~~~~Storage Engine~~~~~~~~~~~~~~~~~~
| | |Writes| Avail%| Evict%| Type| Used|~~~~Index~~~~~| Type| Used| Used%|Evict%| Used|Avail%|Avail
| | | | | | | | Type| Used| | | | | Stop%| |Stop%
bar |172.17.0.3:3000| 0.000 |False | 74| 0|shmem|625.000 KB|shmem|0.000 B |memory|625.000 KB|0.06 %| 0.0 %|70.0 %|97.0 %|5.0 %
bar | | 0.000 | | | | |625.000 KB| |0.000 B | |625.000 KB|0.06 %| | | |
test |172.17.0.3:3000| 0.000 |False | 74| 0|shmem|625.000 KB|shmem|0.000 B |memory|625.000 KB|0.06 %| 0.0 %|70.0 %|87.0 %|5.0 %
test | | 0.000 | | | | |625.000 KB| |0.000 B | |625.000 KB|0.06 %| | | |
Number of rows: 2
Recommended alert metrics
Context: namespace
Introduced: 4.0
If clock_skew_stop_writes is true, this is a critical ALERT.
Verify that clocks are synchronized across the cluster.
Context: namespace
Introduced: 7.0
Context: namespace
Introduced: 4.0
If dead_partitions is not zero, this is a critical ALERT. If you are certain that there are no potential data inconsistencies, or if data inconsistencies are acceptable, consider issuing the revive and recluster commands.
Context: namespace
Introduced: 3.9
If hwm_breached is true, alert your operations group that memory or disk resources are strained. This condition might indicate the need to increase cluster capacity.
Context: namespace
Introduced: 3.9
If stop_writes is true, this is a critical ALERT.
Until the cause is corrected, the system rejects all writes.
Context: namespace
Introduced: 4.0
If unavailable_partitions is not zero, this is a critical ALERT.
Check for network issues and make sure the cluster forms properly.
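The namespace-level conditions above can be evaluated together. The following Python sketch assumes the statistics have already been collected into a dict of strings, with boolean metrics reported as `true` or `false`. The function name `namespace_alerts`, the severity labels, and the sample values are hypothetical.

```python
def is_true(value):
    """Interpret a metric value reported as the string 'true' or 'false'."""
    return str(value).strip().lower() == "true"


def namespace_alerts(stats):
    """Return (severity, metric) pairs for the conditions described above."""
    alerts = []
    # Boolean conditions that call for a critical alert.
    for name in ("clock_skew_stop_writes", "stop_writes"):
        if is_true(stats.get(name, "false")):
            alerts.append(("critical", name))
    # Partition counts that must be zero.
    for name in ("dead_partitions", "unavailable_partitions"):
        if int(stats.get(name, 0)) != 0:
            alerts.append(("critical", name))
    # A breached high-water mark signals strained resources.
    if is_true(stats.get("hwm_breached", "false")):
        alerts.append(("warning", "hwm_breached"))
    return alerts


# Hypothetical sample values, for illustration only.
sample = {
    "clock_skew_stop_writes": "false",
    "stop_writes": "true",
    "dead_partitions": "0",
    "unavailable_partitions": "12",
    "hwm_breached": "false",
}
for severity, metric in namespace_alerts(sample):
    print(severity, metric)
```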
Context: node_stats
Introduced: -
- If client_connections is below an expected low value, this condition might indicate a problem with the network between clients and server.
- If client_connections is greater than an expected high value, this condition might indicate a problem with clients rapidly opening and closing sockets.
- If client_connections is at or near proto_fd_max, the server is either currently unable to accept new connections or might soon be unable to do so.
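A minimal Python sketch of the three client_connections checks follows. The expected low and high values depend on your deployment, and the `near_limit_ratio` of 0.9 is an assumed interpretation of "near", not a documented threshold. The function name and sample numbers are hypothetical.

```python
def check_client_connections(client_connections, proto_fd_max,
                             expected_low, expected_high,
                             near_limit_ratio=0.9):
    """Return a list of findings for the three conditions described above."""
    findings = []
    if client_connections < expected_low:
        findings.append("below expected low value")
    if client_connections > expected_high:
        findings.append("above expected high value")
    # "Near" is assumed to mean within 10% of the limit.
    if client_connections >= near_limit_ratio * proto_fd_max:
        findings.append("at or near proto_fd_max")
    return findings


# Hypothetical values, for illustration only.
print(check_client_connections(14500, proto_fd_max=15000,
                               expected_low=100, expected_high=10000))
```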
Context: node_stats
Introduced: 5.6
If client_connections_opened changes unexpectedly without clients having been added or removed, and without a significant change in workload, this condition might indicate a slowdown on a node or a connectivity issue on the node.
Context: node_stats
Introduced: -
If cluster_size does not equal the expected cluster size and the cluster is not undergoing maintenance, your operations group needs to investigate.
Context: node_stats
Introduced: 5.6
If fabric_connections_opened changes unexpectedly, raise an alert. This condition can indicate a connectivity problem with a node or a cluster change.
Context: node_stats
Introduced: 5.6
If heartbeat_connections_opened changes unexpectedly, raise an alert. This condition can indicate a connectivity problem with a node or a cluster change.
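One way to detect an unexpected change in the connections-opened metrics is to compare the most recent increase with the typical increase over earlier sampling intervals. The Python sketch below assumes the metric is sampled at a fixed interval and behaves as a cumulative count. The function name, the `tolerance` factor of 3, and the sample values are hypothetical choices, not recommended settings.

```python
from statistics import mean


def unexpected_rate_change(counter_samples, tolerance=3.0):
    """Flag when the latest per-interval increase deviates from the baseline.

    counter_samples holds successive readings of a cumulative counter,
    oldest first, taken at a fixed sampling interval.
    """
    deltas = [later - earlier
              for earlier, later in zip(counter_samples, counter_samples[1:])]
    if len(deltas) < 3:
        return False  # not enough history to form a baseline
    *baseline, latest = deltas
    typical = mean(baseline)
    # Use at least 1 as the scale so an idle baseline does not flag tiny changes.
    return abs(latest - typical) > tolerance * max(typical, 1)


# Hypothetical readings of fabric_connections_opened, for illustration only.
print(unexpected_rate_change([100, 110, 120, 130, 400]))
print(unexpected_rate_change([100, 110, 120, 130, 141]))
```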
Context: node_stats
Introduced: -
If system_free_mem_kbytes is abnormally low, the server might be approaching the limits of the available RAM. Operations should investigate and potentially add nodes or increase per-node RAM.
Context: node_stats
Introduced: -
If system_free_mem_pct is abnormally low, the server might be approaching the limits of the available RAM. Operations should investigate and potentially add nodes or increase per-node RAM.
Context: xdr
Introduced: 5.0.0
If lag is consistently greater than a few seconds, this condition might indicate network connectivity issues or errors writing at a destination cluster.
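"Consistently" can be made concrete by requiring that every sample in a recent window exceeds the threshold, which avoids alerting on a single spike. The Python sketch below is one possible interpretation. The function name, the window size of 6, the threshold of 3 seconds, and the sample readings are all hypothetical.

```python
def consistently_above(samples, threshold, window=6):
    """Return True if each of the last `window` samples exceeds `threshold`."""
    recent = samples[-window:]
    return len(recent) == window and all(value > threshold for value in recent)


# Hypothetical lag readings in seconds, oldest first.
lag_seconds = [0, 1, 4, 6, 7, 9, 8, 12]
print(consistently_above(lag_seconds, threshold=3))
```

The same helper can be applied to any metric in this document that should alert only when a value stays high over several samples.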
Other metrics to watch
Context: namespace
Introduced: 3.9
Compare client_delete_error to client_delete_success.
If the ratio is higher than acceptable, alert operations to investigate.
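The error-to-success comparisons in this section can be sketched as follows. Because these metrics are cumulative counts, the sketch computes the ratio over the change between two samples rather than over lifetime totals. Defining the ratio as errors divided by errors plus successes is an assumption, as are the function name, the 1% threshold, and the sample values.

```python
def error_ratio(prev, curr, error_key, success_key):
    """Fraction of operations that failed between two samples of the counters."""
    errors = curr[error_key] - prev[error_key]
    successes = curr[success_key] - prev[success_key]
    # Negative deltas suggest a counter reset, for example after a restart.
    if errors < 0 or successes < 0:
        return 0.0
    total = errors + successes
    return errors / total if total else 0.0


# Hypothetical samples taken one interval apart, for illustration only.
prev = {"client_delete_error": 10, "client_delete_success": 1000}
curr = {"client_delete_error": 15, "client_delete_success": 1995}

ratio = error_ratio(prev, curr, "client_delete_error", "client_delete_success")
ACCEPTABLE_RATIO = 0.01  # assumed threshold; choose one that fits your workload
if ratio > ACCEPTABLE_RATIO:
    print(f"alert operations: delete error ratio {ratio:.3%}")
else:
    print(f"delete error ratio {ratio:.3%} is within the acceptable range")
```

The same function works for the other metric pairs listed here by changing the two key names.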
Context: namespace
Introduced: 3.9
Compare client_read_error to client_read_success.
If the ratio is higher than acceptable, alert operations to investigate.
Context: namespace
Introduced: 3.9
Compare client_udf_error to client_udf_complete.
If the ratio is higher than acceptable, alert operations to investigate.
Context: namespace
Introduced: 3.9
Compare client_write_error to client_write_success.
If the ratio is higher than acceptable, alert operations to investigate.
For more details, see the knowledge base article Why is my client_write_error metrics incrementing?.
Context: namespace
Introduced: 5.6
If index_flash_alloc_pct approaches or exceeds 100%, alert operations to review the sizing of the namespace.
Context: namespace
Introduced: 6.0
Compare pi_query_aggr_error to pi_query_aggr_complete.
If the ratio is higher than acceptable, alert operations to investigate.
Context: namespace
Introduced: 6.0
Compare pi_query_long_basic_error to pi_query_long_basic_complete.
If the ratio is higher than acceptable, alert operations to investigate.
Context: namespace
Introduced: 6.0
Compare pi_query_ops_bg_error to pi_query_ops_bg_complete.
If the ratio is higher than acceptable, alert operations to investigate.
Context: namespace
Introduced: 6.0
Compare pi_query_short_basic_error to pi_query_short_basic_complete.
If the ratio is higher than acceptable, alert operations to investigate.
Context: namespace
Introduced: 6.0
Compare pi_query_udf_bg_error to pi_query_udf_bg_complete.
If the ratio is higher than acceptable, alert operations to investigate.
Context: namespace
Introduced: 4.3
Measured per-device or per-file depending on the storage configuration.
If storage-engine.device[ix].defrag_q or storage-engine.file[ix].defrag_q continues to increase over time, alert operations to investigate.
Context: namespace
Introduced: 4.3
Measured per-device or per-file depending on the storage configuration.
If storage-engine.device[ix].write_q or storage-engine.file[ix].write_q is greater than 1, alert operations to investigate.
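The two queue conditions above differ in kind: defrag_q matters when it keeps growing, while write_q matters as soon as it exceeds 1. The Python sketch below shows one way to check both from a series of samples for a single device or file. Treating "continues to increase" as five consecutive rising samples is an assumption, and the function names and sample values are hypothetical.

```python
def keeps_increasing(samples, min_points=5):
    """Return True if the last `min_points` samples are strictly increasing."""
    recent = samples[-min_points:]
    if len(recent) < min_points:
        return False  # not enough history yet
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


def queue_findings(defrag_q_samples, write_q_samples):
    """Check one device or file against the two queue conditions above."""
    findings = []
    if keeps_increasing(defrag_q_samples):
        findings.append("defrag_q keeps increasing")
    if write_q_samples and write_q_samples[-1] > 1:
        findings.append("write_q is greater than 1")
    return findings


# Hypothetical samples for one device, oldest first.
print(queue_findings([0, 3, 7, 12, 20, 31], [0, 0, 1, 0, 4]))
```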
Context: node_stats
Introduced: 3.9
Compare batch_index_error to batch_index_complete.
If the ratio is higher than acceptable, alert operations to investigate.
Context: node_stats
Introduced: 3.10.1
If heap_efficiency_pct goes below 60% or 50% (depending on configuration), advise your operations group to investigate.
Context: node_stats
Introduced: 3.9
Depends on expected workload.
If rw_in_progress is higher than expected, or deviates more than acceptable from the established baseline over time, alert operations to investigate the cause. This condition might indicate a slowdown on a particular node or overloading on the fabric.
Context: xdr
Introduced: 5.0.0
If abandoned is consistently higher than expected, alert operations to investigate.
Context: xdr
Introduced: 5.0.0
If lap_us is consistently higher than expected, alert operations to investigate.
Context: xdr
Introduced: 5.0.0
Depending on configuration, latency_ms should be within the latency of the link between the data centers.
If latency_ms increases beyond what you expect from the distance (or known link latency) between clusters, alert operations to investigate.
Context: xdr
Introduced: 5.0.0
If recoveries is consistently increasing, alert operations to investigate.
Context: xdr
Introduced: 5.0.0
If recoveries_pending is unexpectedly increasing, alert operations to investigate.
Context: xdr
Introduced: 5.0.0
If retry_conn_reset is consistently higher than expected, alert operations to investigate.
Context: xdr
Introduced: 5.0.0
If retry_dest is consistently higher than expected, alert operations to investigate.
Context: xdr
Introduced: 5.1
If retry_no_node is consistently higher than expected, alert operations to investigate.
Context: xdr
Introduced: 5.0.0
If success is consistently lower than expected, alert operations to investigate.