Metrics
C# client 7.2.0 provides two levels of metrics: ClusterStats and Extended.
ClusterStats
ClusterStats is an on-demand snapshot of a cluster’s thread and connection usage. To obtain ClusterStats on an active AerospikeClient instance:
ClusterStats stats = client.GetClusterStats();
ClusterStats contains the following:
- nodes: Metrics for each node.
  - syncStats: Sync connections.
    - inUse: Active connections from connection pool(s) currently executing commands.
    - inPool: Initialized connections in connection pool(s) that are not currently active.
    - opened: Total number of node connections opened since node was started.
    - closed: Total number of node connections closed since node was started.
  - asyncStats: Async connections.
    - inUse: Active connections from connection pool(s) currently executing commands.
    - inPool: Initialized connections in connection pool(s) that are not currently active.
    - opened: Total number of node connections opened since node was started.
    - closed: Total number of node connections closed since node was started.
  - errorCount: Command error count since node was started. If the error is retryable, multiple errors per command may occur.
  - timeoutCount: Command timeout count since node was started. If the timeout is retryable (i.e., socketTimeout), multiple timeouts per command may occur.
- threadsInUse: Number of active threads executing sync batch/scan/query commands.
- completionPortsInUse: Number of active async completion ports.
- invalidNodeCount: Count of add node failures since the client was started. If greater than zero, some peer nodes are not accessible by the client.
- retryCount: Count of command retries since the client was started.
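The counters above are raw values, so a monitoring loop usually derives ratios from them. The following is a minimal sketch of two such derived checks. The helper names `PoolUtilization` and `HasUnreachablePeers` are hypothetical and take plain integers rather than the client's own types; copy the inUse, inPool, and invalidNodeCount values out of a ClusterStats snapshot before calling them.

```csharp
using System;

// Hypothetical helper: fraction of initialized pool connections that are
// currently executing commands, computed from the inUse and inPool counters.
double PoolUtilization(int inUse, int inPool)
{
    int total = inUse + inPool;
    return total == 0 ? 0.0 : (double)inUse / total;
}

// Hypothetical helper: per the field description, a non-zero invalidNodeCount
// means some peer nodes are not accessible by the client.
bool HasUnreachablePeers(int invalidNodeCount) => invalidNodeCount > 0;

// Example values chosen for illustration only.
double utilization = PoolUtilization(3, 9);
Console.WriteLine($"pool utilization: {utilization:P0}, unreachable peers: {HasUnreachablePeers(0)}");
```

A utilization that stays close to 1.0 suggests the pool is saturated, while a value near 0.0 means most initialized connections are idle.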
Extended Metrics
To use extended metrics, you must explicitly enable them so that the client tracks latency and command counts for every node. Enabling extended metrics incurs a performance penalty. To enable:
MetricsPolicy mp = new MetricsPolicy();
mp.ReportDir = "/var/log/aerospike/metrics";
// Write a metrics snapshot every 600 tend iterations
// (approximately every 600 seconds with the default 1-second tend interval).
mp.Interval = 600;
client.EnableMetrics(mp);
To disable:
client.DisableMetrics();
The MetricsPolicy fields are:
- Listeners: Listener that handles metrics notification events. The default listener implementation writes the metrics snapshot to a file. The listener fields could also be overridden to send the metrics snapshot directly to OpenTelemetry. The Listener fields are the following:
  - OnEnable: Called when metrics have been enabled for the cluster.
  - OnSnapshot: Called when a metrics snapshot has been requested for the given cluster.
  - OnNodeClose: Called when a node is being dropped from the cluster.
  - OnDisable: Called when metrics have been disabled for the cluster.
- ReportDir: Directory path to write metrics log files for listeners that write logs.
- ReportSizeLimit: Metrics file size soft limit in bytes for listeners that write logs. When ReportSizeLimit is reached or exceeded, the current metrics file is closed and a new metrics file is created with a new timestamp. If ReportSizeLimit is set to 0, the metrics file size is unbounded and the file is only closed when DisableMetrics() or Close() is called. Default: 0.
- Interval: Number of cluster tend iterations between metrics notification events. One tend iteration is defined as ClientPolicy.tendInterval (default: 1 second) plus the time to tend all nodes. Default: 30.
- LatencyColumns: Number of elapsed time range buckets in latency histograms. Default: 7.
- LatencyShift: Power of 2 multiple between each range bucket in latency histograms starting at column 3. The bucket units are in milliseconds. The first 2 buckets are <=1ms and >1ms. Examples:

// latency_columns=5 latency_shift=3
<=1ms >1ms >8ms >64ms >512ms
// latency_columns=7 latency_shift=2
<=1ms >1ms >4ms >16ms >64ms >256ms >1024ms
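The relationship between LatencyColumns, LatencyShift, and the bucket boundaries can be sketched in a few lines. This is an illustrative reconstruction of the rule described above, not the client's implementation; `LatencyBucketLabels` is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: builds the bucket labels implied by the two policy values.
// The first bucket is always <=1ms and the second is >1ms; from the third
// bucket on, each lower bound is 2^latencyShift times the previous one.
List<string> LatencyBucketLabels(int latencyColumns, int latencyShift)
{
    var labels = new List<string> { "<=1ms" };
    long lowerBound = 1; // lower bound of the second bucket, in milliseconds
    for (int column = 1; column < latencyColumns; column++)
    {
        labels.Add($">{lowerBound}ms");
        lowerBound <<= latencyShift; // multiply by 2^latencyShift
    }
    return labels;
}

Console.WriteLine(string.Join(" ", LatencyBucketLabels(5, 3)));
// <=1ms >1ms >8ms >64ms >512ms
Console.WriteLine(string.Join(" ", LatencyBucketLabels(7, 2)));
// <=1ms >1ms >4ms >16ms >64ms >256ms >1024ms
```

Both calls reproduce the two example rows given for LatencyShift.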
The default extended metrics file includes all ClusterStats fields plus the following:
- clusterName: Cluster name.
- cpu: Current CPU usage percentage of the client process.
- mem: Current memory usage of the client process.
- tranCount: Count of commands since client was started.
- delayQueueTimeoutCount: Count since client was started of async commands that timed out in the delay queue before the command was processed.
- nodes: Metrics for each node.
  - nodeName: Node name.
  - address: Node IP address.
  - port: Node port.
  - latency: Latency buckets for the following types:
    - conn: Connection creation latency.
    - write: Single record write commands.
    - read: Single record read commands.
    - batch: Batch read/write commands.
    - query: Scan/Query commands.
Extended metrics file format: <report_dir>/metrics-yyyyMMddHHmmss.log
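The timestamp in the file name follows the .NET custom format string yyyyMMddHHmmss, so it can be produced and parsed with standard library calls. The sketch below assumes that naming pattern only; `MetricsFilePath` and `MetricsFileTimestamp` are hypothetical helpers, and whether the client writes the timestamp in local time or UTC is not stated here, so the parsed value carries no time zone.

```csharp
using System;
using System.Globalization;
using System.IO;

// Hypothetical helper: builds a path of the form <report_dir>/metrics-yyyyMMddHHmmss.log.
string MetricsFilePath(string reportDir, DateTime timestamp) =>
    reportDir.TrimEnd('/') + "/metrics-" +
    timestamp.ToString("yyyyMMddHHmmss", CultureInfo.InvariantCulture) + ".log";

// Hypothetical helper: recovers the timestamp embedded in a metrics file name.
DateTime MetricsFileTimestamp(string path)
{
    string name = Path.GetFileNameWithoutExtension(path); // e.g. "metrics-20230803175645"
    string stamp = name.Substring("metrics-".Length);
    return DateTime.ParseExact(stamp, "yyyyMMddHHmmss", CultureInfo.InvariantCulture);
}

string path = MetricsFilePath("/var/log/aerospike/metrics", new DateTime(2023, 8, 3, 17, 56, 45));
Console.WriteLine(path);
// /var/log/aerospike/metrics/metrics-20230803175645.log
Console.WriteLine(MetricsFileTimestamp(path) == new DateTime(2023, 8, 3, 17, 56, 45));
// True
```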
Extended metrics file example:
2023-08-03 17:56:45.444 header(1) cluster[name,cpu,mem,invalidNodeCount,tranCount,retryCount,delayQueueTimeoutCount,asyncThreadsInUse,asyncCompletionPortsInUse,node[]] node[name,address,port,syncConn,asyncConn,errors,timeouts,latency[]] conn[inUse,inPool,opened,closed] latency(5,3)[type[l1,l2,l3...]]
2023-08-03 17:57:45.472 cluster[,0,29539536,0,86,0,0,0,0,[[BB9BF3DDF290C00,172.16.70.243,3000,0,1,2,0,0,0,0,0,0,0,[conn[0,0,0,0,0],write[6,1,0,0,0],read[14,0,0,0,0],batch[6,3,0,0,0],query[0,0,0,0,0]]],[BCDBF3DDF290C00,172.16.70.243,3020,0,1,2,0,0,0,0,0,2,0,[conn[1,0,0,0,0],write[13,1,0,0,0],read[3,0,0,0,0],batch[9,0,0,0,0],query[0,0,0,0,0]]],[BC3BF3DDF290C00,172.16.70.243,3010,0,1,2,0,0,0,0,0,0,0,[conn[1,0,0,0,0],write[7,1,0,0,0],read[27,0,0,0,0],batch[10,0,0,0,0],query[0,0,0,0,0]]]]]
2023-08-03 17:58:45.476 cluster[,0,29539536,0,86,0,0,0,0,[[BB9BF3DDF290C00,172.16.70.243,3000,0,1,2,0,0,0,0,0,0,0,[conn[0,0,0,0,0],write[6,1,0,0,0],read[14,0,0,0,0],batch[6,3,0,0,0],query[0,0,0,0,0]]],[BCDBF3DDF290C00,172.16.70.243,3020,0,1,2,0,0,0,0,0,2,0,[conn[1,0,0,0,0],write[13,1,0,0,0],read[3,0,0,0,0],batch[9,0,0,0,0],query[0,0,0,0,0]]],[BC3BF3DDF290C00,172.16.70.243,3010,0,1,2,0,0,0,0,0,0,0,[conn[1,0,0,0,0],write[7,1,0,0,0],read[27,0,0,0,0],batch[10,0,0,0,0],query[0,0,0,0,0]]]]]
2023-08-03 17:59:45.483 cluster[,0,29539536,0,86,0,0,0,0,[[BB9BF3DDF290C00,172.16.70.243,3000,0,1,2,0,0,0,0,0,0,0,[conn[0,0,0,0,0],write[6,1,0,0,0],read[14,0,0,0,0],batch[6,3,0,0,0],query[0,0,0,0,0]]],[BCDBF3DDF290C00,172.16.70.243,3020,0,1,2,0,0,0,0,0,2,0,[conn[1,0,0,0,0],write[13,1,0,0,0],read[3,0,0,0,0],batch[9,0,0,0,0],query[0,0,0,0,0]]],[BC3BF3DDF290C00,172.16.70.243,3010,0,1,2,0,0,0,0,0,0,0,[conn[1,0,0,0,0],write[7,1,0,0,0],read[27,0,0,0,0],batch[10,0,0,0,0],query[0,0,0,0,0]]]]]
...
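Because each snapshot line embeds the latency histograms as type[count,count,...] segments, a quick way to inspect them is to sum one latency type across all nodes. The sketch below is a lightweight regex-based approach that assumes the segment shape shown in the example above; it is not a full parser for the file format, and `SumLatencyBuckets` is a hypothetical helper. The sample string is a trimmed excerpt of the first two node entries from the example.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: sums the per-bucket counts of one latency type
// (for example "write") across every node entry in a single snapshot line.
long[] SumLatencyBuckets(string snapshotLine, string type)
{
    // Matches segments such as write[6,1,0,0,0]. The lookbehind keeps the type
    // name from matching as the tail of a longer identifier.
    var pattern = new Regex(@"(?<![A-Za-z])" + Regex.Escape(type) + @"\[(\d+(?:,\d+)*)\]");
    long[] totals = Array.Empty<long>();
    foreach (Match match in pattern.Matches(snapshotLine))
    {
        long[] counts = match.Groups[1].Value
            .Split(',')
            .Select(s => long.Parse(s, CultureInfo.InvariantCulture))
            .ToArray();
        if (totals.Length == 0) totals = new long[counts.Length];
        for (int i = 0; i < counts.Length && i < totals.Length; i++) totals[i] += counts[i];
    }
    return totals;
}

// Trimmed excerpt of two node entries from the example snapshot line.
string sample = "[conn[0,0,0,0,0],write[6,1,0,0,0],read[14,0,0,0,0]],"
              + "[conn[1,0,0,0,0],write[13,1,0,0,0],read[3,0,0,0,0]]";
Console.WriteLine(string.Join(",", SumLatencyBuckets(sample, "write")));
// 19,2,0,0,0
```

The header line is skipped naturally, since its segments contain field names rather than digits and therefore do not match the pattern.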