Metrics

Some Aerospike client libraries expose cluster statistics through a data structure. The Python client works differently: extended metrics are off by default, and you must explicitly enable them so that the client tracks latency and command counts for every node.

To enable:

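from aerospike_helpers.metrics import MetricsPolicy

policy = MetricsPolicy(
    report_dir="/var/log/aerospike/metrics",  # directory for metrics log files
    interval=600,  # cluster tend iterations between snapshots
)
# client is an aerospike.Client object
client.enable_metrics(policy=policy)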

To disable:

client.disable_metrics()

The MetricsPolicy fields are:

metrics_listeners: Listeners that handle metrics notification events. If set to None, the default listener implementation is used. It writes each metrics snapshot to a file that a separate offline application can later read and forward to OpenTelemetry. Otherwise, all listeners set in the class instance are used.

The listener can be overridden to send the metrics snapshot directly to OpenTelemetry.

The following is a list of metrics_listeners fields:
- enable_listener: Called when metrics have been enabled for the cluster.
- snapshot_listener: Called when a metrics snapshot has been requested for the given cluster.
- node_close_listener: Called when a node is dropped from the cluster.
- disable_listener: Called when metrics have been disabled for the cluster.
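
As a rough illustration of overriding the listeners, the sketch below hands every event to a caller-supplied callable instead of writing a file. `SnapshotForwarder`, `export`, and `build_policy` are hypothetical names. The listener argument lists (nothing for enable, the cluster for snapshot and disable, the node for node close) and the `MetricsListeners` import from `aerospike_helpers.metrics` are assumptions that should be checked against your client version.

```python
class SnapshotForwarder:
    """Hypothetical helper: hands every metrics event to an export callable."""

    def __init__(self, export):
        # export is any callable taking (event_name, payload), for example a
        # thin wrapper around an OpenTelemetry exporter.
        self.export = export

    # Assumed signatures: enable receives nothing, snapshot and disable
    # receive the cluster object, node_close receives the dropped node.
    def on_enable(self):
        self.export("enable", None)

    def on_snapshot(self, cluster):
        self.export("snapshot", cluster)

    def on_node_close(self, node):
        self.export("node_close", node)

    def on_disable(self, cluster):
        self.export("disable", cluster)


def build_policy(forwarder):
    """Wire the forwarder into a MetricsPolicy (requires the aerospike package)."""
    # Imported here so the rest of the sketch runs without the client installed.
    from aerospike_helpers.metrics import MetricsListeners, MetricsPolicy

    listeners = MetricsListeners(
        enable_listener=forwarder.on_enable,
        snapshot_listener=forwarder.on_snapshot,
        node_close_listener=forwarder.on_node_close,
        disable_listener=forwarder.on_disable,
    )
    return MetricsPolicy(metrics_listeners=listeners)


# Exercise the forwarder by hand with a placeholder payload.
events = []
forwarder = SnapshotForwarder(lambda name, payload: events.append(name))
forwarder.on_enable()
forwarder.on_snapshot(object())
print(events)
```

With the client installed, passing `build_policy(forwarder)` to `client.enable_metrics(policy=...)` would route events through the forwarder, under the assumptions stated above.
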

report_dir: Directory path to write metrics log files for listeners that write logs.

report_size_limit: Metrics file size soft limit, in bytes, for listeners that write logs. When report_size_limit is reached or exceeded, the current metrics file is closed and a new metrics file is created with a new timestamp. If report_size_limit is set to 0, the metrics file size is unbounded and the file is only closed when aerospike.Client.disable_metrics() or aerospike.Client.close() is called. Defaults to 0.

interval: Number of cluster tend iterations between metrics notification events. One tend iteration is defined as tend_interval in the client configuration, plus the time to tend all nodes. Defaults to 30.

latency_columns: Number of elapsed time range buckets in latency histograms. Defaults to 7.

latency_shift: Power of 2 multiple between each range bucket in latency histograms starting at column 3. The bucket units are in milliseconds. The first 2 buckets are <=1ms and >1ms. Examples:
```
# latency_columns=7 latency_shift=1
# <=1ms >1ms >2ms >4ms >8ms >16ms >32ms

# latency_columns=5 latency_shift=3
# <=1ms >1ms >8ms >64ms >512ms
```
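
The bucket boundaries in the two examples above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic as described here; `latency_bucket_labels` is a hypothetical helper, not part of the client.

```python
def latency_bucket_labels(latency_columns, latency_shift):
    """Return the histogram column labels for the given policy settings."""
    labels = ["<=1ms"]
    limit = 1  # lower bound, in milliseconds, of the next ">" bucket
    for _ in range(latency_columns - 1):
        labels.append(f">{limit}ms")
        # Each following bucket starts 2**latency_shift times higher.
        limit <<= latency_shift
    return labels


print(" ".join(latency_bucket_labels(7, 1)))
print(" ".join(latency_bucket_labels(5, 3)))
```
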

The default extended metrics file includes:

cluster: Metrics about the cluster connected to by the client.
- name: Cluster name.
- cpu: Current CPU usage percentage of the client process.
- mem: Current memory usage of the client process.
- invalidNodeCount: Count of add node failures in the most recent cluster tend iteration.
- commandCount: Count of commands since the client was started.
- retryCount: Count of command retries since the client was started.
- delayQueueTimeoutCount: Count, since the client was started, of async commands that timed out in the delay queue before the command was processed.
- eventloop: Metrics for each async event loop.
  - processSize: Approximate number of commands actively being processed on the event loop.
  - queueSize: Approximate number of commands stored on this event loop’s delay queue that have not been started yet.

node: Metrics for each node.
- name: Node name.
- address: Node IP address.
- port: Node port.
- syncConn: Sync connections.
  - inUse: Active connections from connection pools currently executing commands.
  - inPool: Initialized connections in connection pools that are not currently active.
  - opened: Total number of node connections opened since node was started.
  - closed: Total number of node connections closed since node was started.
- asyncConn: Async connections. These should always be 0 for the Python client.
  - inUse: Active connections from connection pools currently executing commands.
  - inPool: Initialized connections in connection pools that are not currently active.
  - opened: Total number of node connections opened since node was started.
  - closed: Total number of node connections closed since node was started.
- errors: Command error count since node was started. If the error is retryable, multiple errors per command may occur.
- timeouts: Command timeout count since node was started. If the timeout is retryable (such as socket_timeout), multiple timeouts per command may occur.
- latency: Latency buckets for the following types:
  - conn: Connection creation latency.
  - write: Single record write commands.
  - read: Single record read commands.
  - batch: Batch read/write commands.
  - query: Scan/Query commands.

Extended metrics file name format: <report_dir>/metrics-yyyyMMddHHmmss.log
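
When collecting these files for offline processing, the timestamp can be recovered from the file name. The sketch below assumes only the name pattern shown above; `metrics_file_timestamp` and the example path are hypothetical, and no time zone is inferred.

```python
from datetime import datetime
from pathlib import Path


def metrics_file_timestamp(path):
    """Return the datetime embedded in a metrics file name, or None if it does not match."""
    name = Path(path).name
    prefix, suffix = "metrics-", ".log"
    if not (name.startswith(prefix) and name.endswith(suffix)):
        return None
    stamp = name[len(prefix):-len(suffix)]
    try:
        # yyyyMMddHHmmss corresponds to %Y%m%d%H%M%S in strptime notation.
        return datetime.strptime(stamp, "%Y%m%d%H%M%S")
    except ValueError:
        return None


print(metrics_file_timestamp("/var/log/aerospike/metrics/metrics-20230803175645.log"))
```
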

Extended metrics file example:

2023-08-03 17:56:45.444 header(1) cluster[name,cpu,mem,invalidNodeCount,commandCount,retryCount,delayQueueTimeoutCount,eventloop[],node[]] eventloop[processSize,queueSize] node[name,address,port,syncConn,asyncConn,errors,timeouts,latency[]] conn[inUse,inPool,opened,closed] latency(5,3)[type[l1,l2,l3...]]
2023-08-03 17:57:45.472 cluster[,0,29539536,0,86,0,0,[],[[BB9BF3DDF290C00,172.16.70.243,3000,0,1,2,0,0,0,0,0,0,0,[conn[0,0,0,0,0],write[6,1,0,0,0],read[14,0,0,0,0],batch[6,3,0,0,0],query[0,0,0,0,0]]],[BCDBF3DDF290C00,172.16.70.243,3020,0,1,2,0,0,0,0,0,2,0,[conn[1,0,0,0,0],write[13,1,0,0,0],read[3,0,0,0,0],batch[9,0,0,0,0],query[0,0,0,0,0]]],[BC3BF3DDF290C00,172.16.70.243,3010,0,1,2,0,0,0,0,0,0,0,[conn[1,0,0,0,0],write[7,1,0,0,0],read[27,0,0,0,0],batch[10,0,0,0,0],query[0,0,0,0,0]]]]]
2023-08-03 17:58:45.476 cluster[,0,29539536,0,86,0,0,[],[[BB9BF3DDF290C00,172.16.70.243,3000,0,1,2,0,0,0,0,0,0,0,[conn[0,0,0,0,0],write[6,1,0,0,0],read[14,0,0,0,0],batch[6,3,0,0,0],query[0,0,0,0,0]]],[BCDBF3DDF290C00,172.16.70.243,3020,0,1,2,0,0,0,0,0,2,0,[conn[1,0,0,0,0],write[13,1,0,0,0],read[3,0,0,0,0],batch[9,0,0,0,0],query[0,0,0,0,0]]],[BC3BF3DDF290C00,172.16.70.243,3010,0,1,2,0,0,0,0,0,0,0,[conn[1,0,0,0,0],write[7,1,0,0,0],read[27,0,0,0,0],batch[10,0,0,0,0],query[0,0,0,0,0]]]]]
2023-08-03 17:59:45.483 cluster[,0,29539536,0,86,0,0,[],[[BB9BF3DDF290C00,172.16.70.243,3000,0,1,2,0,0,0,0,0,0,0,[conn[0,0,0,0,0],write[6,1,0,0,0],read[14,0,0,0,0],batch[6,3,0,0,0],query[0,0,0,0,0]]],[BCDBF3DDF290C00,172.16.70.243,3020,0,1,2,0,0,0,0,0,2,0,[conn[1,0,0,0,0],write[13,1,0,0,0],read[3,0,0,0,0],batch[9,0,0,0,0],query[0,0,0,0,0]]],[BC3BF3DDF290C00,172.16.70.243,3010,0,1,2,0,0,0,0,0,0,0,[conn[1,0,0,0,0],write[7,1,0,0,0],read[27,0,0,0,0],batch[10,0,0,0,0],query[0,0,0,0,0]]]]]
...
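
An offline reader for this file needs to unpack the nested bracket notation. The following is a minimal sketch of one way to do that, assuming the layout announced by the header line in the example above. `parse_list`, `parse_snapshot`, and `CLUSTER_FIELDS` are hypothetical names, values are kept as strings, and malformed or truncated lines are not handled.

```python
CLUSTER_FIELDS = [
    "name", "cpu", "mem", "invalidNodeCount", "commandCount",
    "retryCount", "delayQueueTimeoutCount", "eventloop", "node",
]


def parse_list(text, pos):
    """Parse the bracketed list opening at text[pos]; return (items, next_pos)."""
    pos += 1  # skip the opening '['
    if text[pos] == "]":
        return [], pos + 1  # empty list such as eventloop "[]"
    items, token = [], ""
    while True:
        ch = text[pos]
        if ch == "[":
            sub, pos = parse_list(text, pos)
            # A label before '[' (conn[...], write[...]) yields a (label, values) pair.
            items.append((token, sub) if token else sub)
            token = None  # this element has already been stored
        elif ch in ",]":
            if token is not None:
                items.append(token)
            token = ""
            pos += 1
            if ch == "]":
                return items, pos
        else:
            token += ch
            pos += 1


def parse_snapshot(line):
    """Return {'timestamp', 'cluster'} for a snapshot line, or None for other lines."""
    timestamp, sep, body = line.partition(" cluster[")
    if not sep or " header(" in timestamp:
        return None  # header line or anything that is not a snapshot
    values, _ = parse_list("[" + body, 0)
    return {"timestamp": timestamp, "cluster": dict(zip(CLUSTER_FIELDS, values))}


# One-node version of a snapshot line from the example file.
SAMPLE = ("2023-08-03 17:57:45.472 cluster[,0,29539536,0,86,0,0,[],"
          "[[BB9BF3DDF290C00,172.16.70.243,3000,0,1,2,0,0,0,0,0,0,0,"
          "[conn[0,0,0,0,0],write[6,1,0,0,0],read[14,0,0,0,0],"
          "batch[6,3,0,0,0],query[0,0,0,0,0]]]]]")
snapshot = parse_snapshot(SAMPLE)
node = snapshot["cluster"]["node"][0]
print(snapshot["cluster"]["commandCount"], node[0], dict(node[-1])["write"])
```

Each node comes back as a flat list (name, address, port, four syncConn counters, four asyncConn counters, errors, timeouts, then the latency pairs), which mirrors the order given in the header line.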