Monitor latency with histograms

Histograms plot the distribution of latencies for various transaction types that are processed by an Aerospike cluster. You can use them for in-depth analysis of the activity on your system.

Writes to the log file

Periodically, the Aerospike server writes histograms to the log file. The frequency of writes to the log file is controlled by the ticker-interval configuration parameter.

Histograms begin with the text “histogram dump:” followed by:

the histogram name
the number of measurements represented by the histogram
a maximum of four buckets each
- The form for each bucket is (BUCKET_NUMBER: NUM_OBSERVED). The threshold for a given bucket is 2^{BUCKET_NUMBER} units.

Example histogram dump

The following example shows a read histogram for the namespace test where 36,453,689,581 measurements are reported, 36,414,385,962 of which were less than 1ms (2⁰ms):

Jun 09 2020 21:40:06 GMT: INFO (info): (hist.c:240) histogram dump: {test}-read (36453689581 total) msec
Jun 09 2020 21:40:06 GMT: INFO (info): (hist.c:257)  (00: 36414385962) (01: 0031233450) (02: 0007841324) (03: 0000215526)
Jun 09 2020 21:40:06 GMT: INFO (info): (hist.c:257)  (04: 0000002198) (05: 0000000693) (06: 0000000590) (07: 0000000716)
Jun 09 2020 21:40:06 GMT: INFO (info): (hist.c:257)  (08: 0000001405) (09: 0000002365) (10: 0000004646) (11: 0000000706)

Some histograms provide metrics which are not latency related, such as accessed object sizes on read/write transactions, and number of records returned by secondary index queries.

High level characteristics of histogram output

Histograms only appear in the logs if they contain data. For example, a batch-specific histogram on a server which does not do any batch read commands would not be present.
Read and write histograms are tracked on per-namespace level basis. If latencies are only observed on a specific namespace, it is easier to track down based on the specific data in the namespace and the application writing to it.
To analyze transactions in detail, turn on the relevant configuration-enabled benchmark histograms specific to the transaction type.
Turning on a couple of the non-default histograms usually does not impact the performance of a cluster. However, in some cases, turning on more than a few of those configurable histograms could have a noticeable impact on the performance of a cluster.
Analyzing a time period that bridges a server restart distorts results of time slices and average/maximum calculations.
Histogram data can also be retrieved through the asadm show latencies command.
By default, latency histograms are measured in milliseconds, and can be configured to microseconds. See [microsecond-histograms]((/database/reference/config#microsecond-histograms) for details.

Notational conventions

Required braces, like {ns}, where ns is the name of a namespace, indicate an input variable you supply on the asinfo command line.
Brackets, like fabric-[channel]-send-init, for the different fabric channels, indicate an output name of the particular histogram type.

Benchmarks

Benchmarks are active as either auto-enabled or configuration-enabled.

Auto-enabled benchmarks

Auto-enabled benchmarks are a set of histograms that measure end-to-end performance of Aerospike subsystems and appear only when the histogram has data to display. They give a high-level view of what is happening on a node at a given point in time. These benchmark stats are written to the Aerospike log file at 10 second intervals and can be used to spot things such as latency in a certain transaction type.

{ns}-read: Time taken for read requests from the time they are received at the node to when the response leaves the node.
{ns}-write: Time taken for end-to-end writes, which includes the time taken for replica writes. Does not include deletes.
{ns}-udf: Time taken for UDF transactions.
batch-index: Batch read “parent” transaction. Tracks the time taken from the start of the transaction to getting a response. If batch calls are showing high latencies on the client side, consider changing the maxConcurrentThreads. For example, in the Java client library, set Batch Policy from the default 1 (sequential) to 0 (or a number greater than 0). Configuring the maxConcurrentThreads allows concurrent synchronous batch request threads to be sent to server nodes at any time.
batch-rec-count: The number of records per batch request sent to the cluster node. This histogram shows the distribution of counts, and not latencies. This histogram is introduced in Database 7.1.
{ns}-batch-sub-read: Time taken for batch (sub) read requests from the time they are received at the node to when the response leaves the node. Introduced in Database 6.4.
{ns}-batch-sub-write: Time taken for batch (sub) writes from end-to-end (includes the time taken for replica write). Does not include deletes. Introduced in Database 6.4.
{ns}-batch-sub-udf: Time taken for batch (sub) UDF transactions. Introduced in Database 6.4.
{ns}-query: Time taken for query transactions. Removed as of Database 6.0.
{ns}-query-rec-count: Records returned per query, showing the distribution of counts, not latencies. The first bucket in the histogram is always empty since queries not returning any records are not counted. Removed as of Database 6.0.
{ns}-pi-query: Time taken for primary index short query transactions. Introduced in Database 6.0.
{ns}-pi-query-rec-count: Displays the records returned per primary index short query. This histogram shows the distribution of counts, and not latencies. The first bucket in histogram is always empty as queries not returning any records do not get counted in this histogram. Introduced in Database 6.0.
{ns}-si-query: Time taken for secondary index short query transactions. This histogram is introduced in Database 6.0.
{ns}-si-query-rec-count: Displays the records returned per secondary index short query. This histogram shows the distribution of counts, and not latencies. The first bucket in histogram is always empty as queries not returning any records do not get counted in this histogram. Introduced in Database 6.0.
{ns}-re-repl: Latencies when re-replicating. Only available in strong consistency enabled namespaces.
{ns}-read-touch: Time taken for read commands that also touch the record according to the default-read-touch-ttl-pct configuration parameter (or equivalent client policy). Introduced in Database 7.1.
{ns}-writes-per-mrt: Displays the number of records written per transaction. This histogram shows the distribution of counts, and not latencies. Introduced in Database 8.0. If transactions with writes have been executed, then this histogram will appear in the server’s ticker log. For all of these histograms which appear in the logs, we use the asloglatency tool to view them.

Configuration-enabled benchmarks

Configuration-enabled benchmarks include specific histograms that you can enable or disable dynamically. These benchmarks are more granular than the auto-enabled benchmarks. If you contact Aerospike Support for assistance with a problem, they may request that you enable a few sets that are relevant to your issue. If you keep these benchmarks permanently enabled, they may impact performance.

General transaction histogram

To enable a specific benchmark, use the following command:

asadm -e 'enable; manage config service param <configuration> to true'

Type of Histogram	Description	Configuration
info	Time taken for an info protocol transaction.	enable-hist-info

Specific transaction benchmarks

The transaction benchmarks are monitored at the per-namespace level.

Proxy transaction analysis

Displays time for a proxy transaction from the sender view.

asadm -e 'enable; manage config namespace <namespaceName> param enable-hist-proxy to true'

proxy: Time taken for proxy commands.

Fabric transaction analysis

Displays the time for a fabric transaction from sender’s view.

asadm -e 'enable; manage config service param enable-benchmarks-fabric to true'

fabric-[channel]-send-init: Tracks the time taken for preparing a message for sending over fabric (intra cluster communication). Does not include any network operations. Replace [channel] with one of bulk (migrations), ctrl (paxos and migration control), meta (smd) or rw (read/write transactions).
fabric-[channel]-send-fragment: Tracks the time taken from end of send-init and stops after the network send. Can be hit multiple times if message was too big to fit in the send buffer. Replace [channel] with one of bulk (migrations), ctrl (paxos and migration control), meta (smd) or rw (read/write transactions).
fabric-[channel]-recv-fragment: Tracks the time taken from the beginning of the first byte of a message from network and ends after network recv. Can be hit multiple times if message was big enough to require multiple network recv calls. Replace [channel] with one of bulk (migrations), ctrl (paxos and migration control), meta (smd) or rw (read/write transactions).
fabric-[channel]-recv-cb: Tracks the time taken to process the transaction on the receiving node (rw, bulk, control, meta). Replace [channel] with one of bulk (migrations), ctrl (paxos and migration control), meta (smd) or rw (read/write transactions).

Ops subtransaction analysis

Displays slices of internally generated ops subtransactions from a background scan or query.

asadm -e 'enable; manage config namespace <namespaceName> param enable-benchmarks-ops-sub to true'

ops-sub-start: Time taken from the beginning of a transaction to when a node partition is reserved.
ops-sub-restart: Time taken for a partition to be reserved to a partition being re-reserved (possible internal restart/reattempt, e.g. “key busy”). See the write-restart histogram for further details about transactions being reattempted.
ops-sub-dup-res: Time taken from a partition reserved to duplicates resolved (only during ongoing data rebalancing). Timeouts are not included.
ops-sub-master: Time taken from partition reserved or after duplicate resolution to an actual master record applied.
ops-sub-repl-write: Time taken from master record written to replica(s) written.
ops-sub-response: Time taken from master write (if write-commit-level configured to master) or replica write to client response completion.

Read transaction analysis

Displays slices of a single-record read command.

asadm -e 'enable; manage config namespace <namespaceName> param enable-benchmarks-read to true'

read-start: Time taken from the beginning of a transaction to when a node partition is reserved. High latencies here could indicate bottleneck at the network or CPU (and often irq balancing) as well as in the transaction queue. There are no transaction queues. Thus only the demarshal time slice is relevant.
read-restart: Time taken for a partition to be reserved to a partition being re-reserved (possible internal restart/reattempt, e.g. “key busy” on duplicate resolve). For this slice, in the case of a potential hotkey, the clock runs from when the transaction gets into the rw-hash, until it is reattempted. The process is similar but there is no initial transaction queue or thread phase. After the transaction being processed completes, all transactions waiting on the rw-hash are put into an internal transaction queue in order to be reattempted.

The same transaction can be re-queued multiple times. Each is counted as a separate transaction in this slice. However, the clock resets every time a transaction gets into the rw-hash. A transaction being re-queued multiple times will not have its total accumulated time accounted for, but only the time it waited in the rw-hash (queued behind a transaction in progress) the last time prior to going through.
read-dup-res: Time taken from a partition reserved to duplicates resolved (only during ongoing data rebalancing). Timeouts are not included.
read-local: Time taken from partition reserved or after duplicate resolution to an actual local record read. For data-in-memory true, this is the slice for reading from memory. For data-in-memory false, this is the slice for reading from the disk except when the record is still in the post-write-cache or in cached (read-page-cache enabled). To troubleshoot further, enable storage benchmarks.
read-response: Time taken from local record read to client response completion. High latencies here could indicate a bottleneck on the network or the client, getting slow to consume responses from the server.
read-repl-ping: Time taken for reading all the copies of a record when the linearize read policy is in effect. This only applies for strong consistency enabled namespaces.

Write transaction analysis

Displays slices of a single-record write command, excluding deletes.

asadm -e 'enable; manage config namespace <namespaceName> param enable-benchmarks-write to true'

write-start: Time taken from the beginning of a transaction to when a node partition is reserved. High latency here could indicate a bottleneck at the network or CPU, and often IRQ balancing.

A network bottleneck could be explained by max-record-size exceeding the MTU allowed by the network. If leveraging batch writes, an MTU that is smaller than the record sizes or batch size can cause subsequent fragments of a transaction to be delayed when picked up by a service thread.

When compression is enabled, higher latencies could also potentially be observed in this slice. There are no transaction queues, so only the demarshal time slice is relevant.
write-restart: Time taken for a partition to be reserved to a partition being re-reserved (possible internal restart/reattempt, e.g. “key busy”). For this slice, in the case of a potential hotkey, the clock runs from when the transaction gets into the rw-hash, until it is reattempted.

There is no initial transaction queue or thread phase; after the transaction being processed completes, all transactions waiting on the rw-hash are put into an internal transaction queue to be reattempted.

The same transaction can be re-queued multiple times. Each is counted as a separate transaction in this slice. However, the clock resets every time a transaction gets into the rw-hash. A transaction being re-queued multiple times will not have its total accumulated time accounted for, but rather only the time it waited in the rw-hash (queued behind a transaction in progress) the last time prior to going through. This also includes restarts from duplicate resolution if any.
write-dup-res: Time taken from a partition reserved to duplicates resolved (only during ongoing data rebalancing). Timeouts are not included.
write-master: Time between when the partition is reserved or after a duplicate resolution finishes, to when an actual master record write to the streaming write buffer (SWB) in-memory takes place. Latencies displayed here typically cover the time it took to update secondary indexes when necessary, and/or the read portion of the record write (if a write is an update, the record is read before a write is performed). In strong consistency namespaces, if the record exists and is not replicated, Aerospike leaves the write-master slice immediately and then triggers two write commands – one for re-repl and one for the initial, postponed write. The following are details of what is counted in this slice:
- read before write because of update
- read before write because of replace and using sendKey=true
- compression at rest
- sindex
- larger number of bins with bin convergence
- commit-to-device true (in SC)
- binop processing
- filter processing (if the filter requires reading a bin to evaluate)
write-repl-write: Time taken from master record written to replica(s) written.
write-response: If write-commit-level-override is configured to master, this is the time taken from master or replica write to client response completion. High latencies here could indicate a bottleneck on the network or the client getting slow to consume responses from the server.

UDF transaction analysis

Displays slices of a single-record UDF transaction.

asadm -e 'enable; manage config namespace <namespaceName> param enable-benchmarks-udf to true'

udf-start: Time taken from the beginning of a transaction to when a node partition is reserved. This includes the demarshal part (tracked for all transactions under svc-demarshal for Database versions prior to 4.8) as well as the time spent in the transaction queue (tracked for all transactions under svc-queue for Database versions prior to 4.8). As of Database 4.7, there are no transaction queues and thus only the demarshal time slice is relevant.
udf-restart: Time taken for a partition to be reserved to a partition being re-reserved (possible internal restart/reattempt, e.g. “key busy”). See the write-restart histogram for further details about transactions being reattempted.
udf-dup-res: Time taken from a partition reserved to duplicates resolved (only during ongoing data rebalancing). Timeouts are not included.
udf-master: Time taken from partition reserved or after duplicate resolution to an actual master record applied.
udf-repl-write: Time taken from master record written to replica(s) written.
udf-response: Time taken from master write (if write-commit-level configured to master) or replica write to client response completion.

Sub-UDF transaction analysis

Displays the slices of internally generated UDF transactions, usually by a background UDF, scan or a query.

asadm -e 'enable; manage config namespace <namespaceName> param enable-benchmarks-udf-sub to true'

udf-sub-start: Time taken from the beginning of a transaction to when a node partition is reserved. There are no transaction queues and thus only the demarshal time slice is relevant.
udf-sub-restart: Time taken for a partition to be reserved to a partition being re-reserved (possible internal restart/reattempt, e.g. “key busy”). See the write-restart histogram for further details about transactions being reattempted.
udf-sub-dup-res: Time taken from a partition reserved to duplicates resolved (only during ongoing data rebalancing). Timeouts are not included.
udf-sub-master: Time taken from partition reserved or after duplicate resolution to an actual master record applied.
udf-sub-response: Time taken from master write (if write-commit-level configured to master) or replica write to client response completion.

Batch transaction analysis

Displays the slices for a batch transaction.

asadm -e 'enable; manage config namespace <namespaceName> param enable-benchmarks-batch-sub to true'

batch-sub-udf-master: Time taken from partition reserved or after duplicate resolution to an actual master record applied using UDF.
batch-sub-repl-write: Time taken from master record written to replica(s) written.
batch-sub-write-master: Time taken from partition reserved or after duplicate resolution to an actual master record applied.
batch-sub-prestart: The demarshal time for a batch sub-transaction. For single record transactions the ‘-start’ slice covers the time from the demarshal to the partition reservation. For batch transactions multiple individual transactions are involved so the portion to deserialize off the network socket (demarshal) and internally queue (batch-sub-prestart) is separated from the portion where individual transactions (batch-sub-transactions) wait to be picked up by a service thread until the partition reservation (batch-sub-start). Higher latencies in this slice are typically caused by larger transactions where some TCP segments take longer to get to the server. For example, inadequate MTU or client/network delays can cause the deserialization to wait for TCP segments to arrive.
batch-sub-start: Time taken from a batch sub-transaction being queued until a node partition is reserved. Higher latencies in this slice indicate batch sub-transactions waiting longer before being picked up by a service thread. Depending on the situation, if the system can handle the workload, increasing the number of service-threads or tuning the inline policies on the client side can help.
- Prior to Database 4.8, the time taken from the beginning of a transaction to when a node partition is reserved. This includes the demarshal part (tracked for all transactions under svc-demarshal) as well as the time spent in the transaction queue (tracked for all transactions under svc-queue). There are no transaction queues and only the demarshal time slice is relevant.
- Prior to Database 4.4, this histogram only included the final “chunk” when reading a large request out of the socket, and could under-report the latency.
- Database 4.8.0.1 introduced a problem where this histogram also contains data points that should instead be in the batch-sub-restart histogram. This is fixed as of Database 4.8.0.27, and also fixed for other server lineages in versions 4.9.0.25,5.0.0.28, 5.1.0.26, 5.2.0.18, 5.3.0.9, and 5.4.0.4.
batch-sub-restart: Time taken from a partition being reserved to a partition being re-reserved (possible internal restart/reattempt). See the write-restart histogram for details about transactions being reattempted.
batch-sub-dup-res: Time taken from a partition reserved to duplicates resolved (only during ongoing data rebalancing). Timeouts are not included.
batch-sub-read-local: Time taken from partition reserved or after duplicate resolution to an actual local record read.
batch-sub-response: Time taken from local record read to batch parent response completion.
batch-sub-repl-ping: Time taken for reading all the copies of a record when the linearize read policy is in effect. This only applies for strong consistency enabled namespaces.

Storage transaction analysis

Storage benchmarks are a set of device-targeted histograms that measure performance of the storage subsystem in time slices. They provide insight into performance issues related to storage devices and PMem storage files. Due to their performance impact, they are disabled by default. You can enable them by passing the set-config command to the asinfo utility.

To enable storage-benchmarks:

asadm -e 'enable; manage config namespace NAMESPACE storage-engine param enable-benchmarks-storage to true'

{ns}-[device]-read: Tracks the time taken to read from disk. Not available for PMem storage.
{ns}-[device]-write: Tracks amount of time taken to write a write block to disk. Not available for PMem storage. Since Database 7.1, the write-block-size configuration parameter has been replaced by flush-size, which specifies the size of these writes. Prior to Database 7.1, the size of a write block is the configured write-block-size. When running with commit-to-device set to true with versions prior to Database 5.3, this counter will only account for full blocks written and therefore will only count blocks written through the defragmentation process as client writes would write to disk individually rather than at a block level. But when running with commit-to-device set to true with Database 5.3 and later, client writes will also be included in the histogram.
{ns}-[device]-shadow-write: Tracks amount of time taken to write a write block to a shadow device, when using shadow devices. Not available for PMem storage. In Database 7.1, the write-block-size configuration parameter was replaced by flush-size, which specifies the size of these writes. Prior to Database 7.1, the size of a write block is the configured write-block-size. When running with commit-to-device set to true with versions prior to 5.3, this counter will only account for full blocks written and therefore will only count blocks written through the defragmentation process as client writes would write to disk individually rather than at a block level. But when running with commit-to-device set to true with Database 5.3 and later, client writes will also be included in the histogram.
{ns}-[device]-large-block-read: Tracks the amount of time to read a write block from disk. Not available for PMem storage. In Database 7.1, the write-block-size configuration parameter was replaced by flush-size, which specifies the size of these reads. Prior to Database 7.1, the size of a write-block is the configured write-block-size. As of Database 5.3, tomb-raider device reads are included in this histogram.
{ns}-[device]-fsync: Tracks the time taken for fsync. Available in Database 4.3.0.10 and below. Deprecated in Database 4.3.1.3.
{ns}-device-read-size: Tracks the size of the objects being read.
{ns}-device-write-size: Tracks the size of the objects being written. This includes replica writes from other nodes (on namespaces with replication factor 2 or more). Does not include defragmentation writes.

Secondary index transaction analysis

In Database 5.7 and earlier, secondary index benchmarks are a set of benchmarks targeted to a particular index measuring various components of a query and index. Due to their performance impact, they are disabled by default. You can enable them by passing the sindex-histogram command to the asinfo utility.

To enable secondary index benchmarks for a specific index:

asinfo -h [host ip] -v "sindex-histogram:ns=NAMESPACE;indexname=INDEX;enable=true"

[SINDEX_NAME]_write_us: Histogram measuring secondary index write performance, unit is microseconds.
[SINDEX_NAME]_si_prep_us: Histogram of time spent gathering secondary index key values from records, unit is microseconds.
[SINDEX_NAME]_delete_us: Histogram measuring secondary index delete performance, unit is microseconds.
[SINDEX_NAME]_query: Histogram for entire query execution time, unit is milliseconds.
[SINDEX_NAME]_query_batch_lookup_us: Histogram of time spent performing secondary index lookup from primary index digest tree for the qualifying records for the specific secondary index query, unit is microseconds.
[SINDEX_NAME]_query_batch_io_us: Histogram of time spent reading qualified records from storage for the secondary index query, unit is microseconds.
[SINDEX_NAME]_query_row_count: Histogram of rows selected by query. Example: If a secondary index lookup returns 100 records, this value will be 100. This is not necessarily the number of records returned to the application, though. The actual number of records returned to the query is (query_row_count - query_diff_count), which is added in the namespace level histogram, query_rec_count.
[SINDEX_NAME]_query_diff_count: Histogram of false positives returned by sindex.

Example: If a sindex returned 100 records, but the query read only 95 records, that would indicate there were five false positives that may be the result of garbage collection.