---
title: "ABS Monitoring"
description: "Monitor Aerospike Backup Service (ABS) using Prometheus metrics, Grafana dashboards, PromQL queries, and alerts."
---

# ABS Monitoring

Aerospike Backup Service (ABS) exposes system metrics that [Prometheus](https://prometheus.io/) can scrape.

## Prometheus configuration

ABS exposes metrics directly on its HTTP port, so you don’t need a dedicated Prometheus exporter. By default, metrics are available at `http://<ABS_HOST>:8080/metrics`. You can change the port with the [`service.http.port`](https://aerospike.com/docs/database/tools/backup-and-restore/backup-service/config#service.http.port) parameter.

The following example shows a standalone Prometheus configuration for scraping ABS metrics:

/etc/prometheus/prometheus.yml

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'aerospike-backup-service'
    static_configs:
      - targets: ['abs-service:8080']
```

Replace `abs-service:8080` with your ABS host and port.

::: note
If you also monitor Aerospike Database servers with [Aerospike Prometheus Exporter](https://aerospike.com/docs/database/observe/monitor/components/), keep the ABS scrape configuration as a separate `job_name` to avoid confusion and make it easier to manage alerts and dashboards for each service independently.
:::
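
As a sketch of that separation, the following configuration defines one scrape job per service. The exporter host name `aerospike-exporter` and the `aerospike` job name are hypothetical, and port `9145` is assumed to be the exporter's listening port; adjust them to match your deployment.

```yaml
scrape_configs:
  # ABS exposes its own /metrics endpoint, so no exporter is involved here.
  - job_name: 'aerospike-backup-service'
    static_configs:
      - targets: ['abs-service:8080']

  # Separate job for database metrics (host name and port are assumptions).
  - job_name: 'aerospike'
    static_configs:
      - targets: ['aerospike-exporter:9145']
```

With separate jobs, every ABS series carries `job="aerospike-backup-service"`, which lets alerts and dashboards select ABS metrics without matching database metrics.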

## Grafana dashboard

A pre-built [Grafana dashboard](https://grafana.com/grafana/dashboards/21375-aerospike-backup-service/) is available for visualizing ABS metrics. The dashboard includes panels for backup success and failure rates, backup duration, and restore operations.

## Metrics

ABS includes the following application metrics:

| Name | Description | Labels |
| --- | --- | --- |
| `aerospike_backup_service_backup_duration_seconds` | Duration in seconds of finished backups by routine and type (full/incremental) | routine, type |
| `aerospike_backup_service_backup_events_total` | Backup service job events by routine, type (full/incremental), and outcome (success, failure, canceled, retry, skip) | routine, type, outcome |
| `aerospike_backup_service_backup_progress_pct` | Progress of backup processes in percentage | routine, type |
| `aerospike_backup_service_last_successful_backup_timestamp` | Unix timestamp of the last successful backup per routine and type (full/incremental) | routine, type |
| `aerospike_backup_service_restore_in_progress` | Number of restore processes running |  |

::: note
For in-depth guidance on using the pipeline metric and `aerospike_backup_service_backup_duration_seconds` to tune backup throughput, see [Performance tuning](https://aerospike.com/docs/database/tools/backup-and-restore/backup-service/performance-tuning).
:::

### Backup cancellation outcomes

The `aerospike_backup_service_backup_events_total{outcome="canceled"}` series increases when a backup is canceled.

-   If a user explicitly cancels a backup (using the [Cancel all jobs for a backup routine](https://aerospike.com/docs/database/tools/backup-and-restore/backup-service/api-examples#cancel-all-jobs-for-a-backup-routine) endpoint) or disables the routine, this metric increase is expected.
-   If the service shuts down gracefully during a running backup, logs can report the backup as canceled, but Prometheus may miss the final increment if it cannot scrape before shutdown completes.
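
If cancellations are rare in your environment, a low-severity alert can help separate expected cancellations from unexpected ones. The following is a minimal sketch, not a recommended default: the group name, alert name, 30-minute window, and `info` severity are all assumptions.

```yaml
groups:
  - name: abs-cancellations        # hypothetical group name
    rules:
      - alert: BackupCanceled      # hypothetical alert name
        # Fires when the canceled counter grew during the last 30 minutes.
        expr: increase(aerospike_backup_service_backup_events_total{outcome="canceled"}[30m]) > 0
        for: 0m
        labels:
          severity: info
        annotations:
          summary: "Backup canceled"
          description: "A {{ $labels.type }} backup for routine {{ $labels.routine }} was canceled in the last 30 minutes."
```

Because Prometheus may miss the final increment during a graceful shutdown, this alert cannot be relied on to report shutdown-related cancellations.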

### Deprecated metrics

The following metrics are deprecated as of ABS 3.0. Use the recommended replacement metrics instead.

| Name | Description | Replacement |
| --- | --- | --- |
| `aerospike_backup_service_runs_total` | Successful backup runs counter | `aerospike_backup_service_backup_events_total` |
| `aerospike_backup_service_incremental_runs_total` | Successful incremental backup runs counter | `aerospike_backup_service_backup_events_total` |
| `aerospike_backup_service_skip_total` | Full backup skip counter | `aerospike_backup_service_backup_events_total` |
| `aerospike_backup_service_incremental_skip_total` | Incremental backup skip counter | `aerospike_backup_service_backup_events_total` |
| `aerospike_backup_service_failure_total` | Full backup failure counter | `aerospike_backup_service_backup_events_total` |
| `aerospike_backup_service_incremental_failure_total` | Incremental backup failure counter | `aerospike_backup_service_backup_events_total` |
| `aerospike_backup_service_duration_millis` | Full backup duration in milliseconds | `aerospike_backup_service_backup_duration_seconds` |
| `aerospike_backup_service_incremental_duration_millis` | Incremental backup duration in milliseconds | `aerospike_backup_service_backup_duration_seconds` |

## Example PromQL queries

Monitor and alert on backup performance with the following queries in Grafana panels or the Prometheus expression browser.

-   Number of successful full and incremental backups for a specific routine:
    
    ```promql
    sum by (type) (aerospike_backup_service_backup_events_total{routine="daily-ns1", outcome="success"})
    ```
    
-   Number of failed backups per routine:
    
    ```promql
    sum by (routine) (aerospike_backup_service_backup_events_total{outcome="failure"})
    ```
    
-   Number of canceled backups per routine:
    
    ```promql
    sum by (routine) (aerospike_backup_service_backup_events_total{outcome="canceled"})
    ```
    
-   Average backup duration per routine:
    
    ```promql
    sum by (routine) (rate(aerospike_backup_service_backup_duration_seconds_sum[5m]))
      / sum by (routine) (rate(aerospike_backup_service_backup_duration_seconds_count[5m]))
    ```
    
-   Time since last full backup for a routine:
    
    ```promql
    time() - aerospike_backup_service_last_successful_backup_timestamp{routine="daily-ns1", type="full"}
    ```
    
-   Time since most recent backup for a routine regardless of backup type:
    
    ```promql
    time() - max(aerospike_backup_service_last_successful_backup_timestamp{routine="daily-ns1"})
    ```
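
If dashboards or alerts evaluate these queries often, they can be precomputed as Prometheus recording rules. The following is a sketch under stated assumptions: the group name, evaluation interval, and rule names are illustrative and are not names that ABS or the pre-built Grafana dashboard defines.

```yaml
groups:
  - name: abs-recording-rules      # hypothetical group name
    interval: 1m                   # assumed evaluation interval
    rules:
      # Failed backups per routine over the last hour.
      - record: routine:aerospike_backup_service_backup_failures:increase1h
        expr: sum by (routine) (increase(aerospike_backup_service_backup_events_total{outcome="failure"}[1h]))

      # Seconds since the most recent successful backup of any type, per routine.
      - record: routine:aerospike_backup_service_last_successful_backup_age_seconds
        expr: time() - max by (routine) (aerospike_backup_service_last_successful_backup_timestamp)
```

Grouping with `by (routine)` keeps the `routine` label on the recorded series, so a single panel can show every routine.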
    

## Example Prometheus alerts

Integrate ABS metrics into your Prometheus alerting pipeline to stay informed of job failures and stale backups.

-   Detect backup job failures recorded within the last 15 minutes with this alert.
    
    ```yaml
    - alert: BackupJobFailureDetected
      expr: increase(aerospike_backup_service_backup_events_total{outcome="failure"}[15m]) > 0
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: "Backup job failure detected"
        description: "A backup failure was detected in the last 15 minutes for routine {{ $labels.routine }}."
    ```
    
-   Ensure backup continuity with this alert, which fires when a specific routine has not completed a successful backup within the last 24 hours.
    
    ```yaml
    - alert: BackupTooOld
      expr: time() - max(aerospike_backup_service_last_successful_backup_timestamp{routine="daily-ns1"}) > 86400
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: "Backup is older than 24 hours"
        description: "The last successful backup for routine daily-ns1 was more than 24 hours ago."
    ```
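
Alert definitions like these are list items that belong under a rule group in a rules file. The following minimal sketch shows that wrapper, assuming a hypothetical file path and group name. It also generalizes the staleness check to every routine by grouping on the `routine` label; the single 24-hour threshold is an assumption that may not suit routines with longer intervals.

```yaml
# /etc/prometheus/rules/abs-alerts.yml (hypothetical path)
groups:
  - name: abs-backup-alerts        # hypothetical group name
    rules:
      - alert: BackupTooOld
        # One result per routine, so the alert fires separately for each stale routine.
        expr: time() - max by (routine) (aerospike_backup_service_last_successful_backup_timestamp) > 86400
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Backup is older than 24 hours"
          description: "The last successful backup for routine {{ $labels.routine }} finished {{ $value | humanizeDuration }} ago."
```

Prometheus loads the file when it is listed under `rule_files` in `prometheus.yml`:

```yaml
rule_files:
  - /etc/prometheus/rules/abs-alerts.yml   # hypothetical path, matching the file above
```

Running `promtool check rules` against the file before reloading Prometheus catches syntax errors early.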
    

## Process and Go runtime metrics

ABS also exposes standard process and Go runtime metrics on `/metrics`. Use them to detect resource saturation and runtime behavior changes before backup failures occur.

| Metric | Description | What to watch |
| --- | --- | --- |
| `process_cpu_seconds_total` | Cumulative user and system CPU time in seconds, summed across all cores. | `rate(process_cpu_seconds_total[5m]) * 100` gives CPU usage as a percentage of one core. A sustained value above 100 means ABS uses more than one core on average. |
| `process_resident_memory_bytes` | Resident set size (RSS) in bytes. | Keep below container memory limits and watch for sustained growth during large backup or restore windows. |
| `process_open_fds` | Open file descriptors held by ABS. | Compare with `process_max_fds`; a sustained high ratio indicates descriptor pressure and possible `"too many open files"` errors. |
| `process_max_fds` | Maximum number of open file descriptors allowed for the ABS process. | Track `process_open_fds / process_max_fds` and alert when the ratio approaches 1.0 for multiple scrape intervals. |
| `go_goroutines` | Active Go goroutines. | In a stable workload the count stays bounded. A monotonic increase usually indicates stalled background tasks or leaked goroutines. |
| `go_memstats_heap_alloc_bytes` | Go heap bytes allocated for live objects. | Compare with `process_resident_memory_bytes` to separate Go heap growth from non-heap memory pressure. |

Prometheus query examples:

-   `rate(process_cpu_seconds_total[5m]) * 100`
    
    -   The `process_cpu_seconds_total` counter tracks total CPU time since the process started. This query turns that running total into a per-second average over the last 5 minutes, then multiplies by 100 to get a percentage. In this scale, 100 means one full CPU core is busy and 200 means two cores. If the result stays above your expected core budget for several minutes, compare the spike with `aerospike_backup_service_backup_progress_pct` to identify which routine is running. To reduce CPU usage, stagger routine schedules so fewer backups overlap, or increase the CPU resources allocated to the ABS instance.
-   `process_resident_memory_bytes / 1024 / 1024 / 1024`
    
    -   Converts the resident memory (RSS) of the ABS process from bytes to gibibytes, which is easier to compare against any memory limits. A good practice is to alert at roughly 80% of the memory limit. If memory keeps growing, increase the container or host memory limit, or reduce the number of backup routines that run concurrently.
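
The guidance above can be turned into alerting rules. The following is a sketch under stated assumptions: the group and alert names are hypothetical, the CPU budget of two cores and the 4 GiB memory limit are placeholders for your own values, and the `job` selector assumes the scrape job name used earlier on this page.

```yaml
groups:
  - name: abs-resources            # hypothetical group name
    rules:
      - alert: AbsHighCpu
        # 200 means two full cores; replace with your expected core budget.
        expr: rate(process_cpu_seconds_total{job="aerospike-backup-service"}[5m]) * 100 > 200
        for: 10m
        labels:
          severity: warning

      - alert: AbsHighMemory
        # 80% of an assumed 4 GiB memory limit.
        expr: process_resident_memory_bytes{job="aerospike-backup-service"} > 0.8 * 4 * 1024 * 1024 * 1024
        for: 10m
        labels:
          severity: warning

      - alert: AbsFileDescriptorPressure
        # Ratio of open to allowed descriptors; 0.8 is an assumed threshold.
        expr: process_open_fds{job="aerospike-backup-service"} / process_max_fds{job="aerospike-backup-service"} > 0.8
        for: 5m
        labels:
          severity: warning
```

The `job` selector matters because other scrape targets commonly expose metrics with the same `process_*` names.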

For details about these collectors in the Prometheus Go client, see the [collectors package documentation](https://pkg.go.dev/github.com/prometheus/client_golang/prometheus/collectors).

## Endpoints

| Name | Description |
| --- | --- |
| `/metrics` | Exposes metrics for Prometheus to check performance of the backup service. |
| `/health` | Allows monitoring systems to check the service health. |
| `/ready` | Checks whether the service is able to handle requests. |
| `/version` | Returns the application version, commit hash, and build time. |
| `/api-docs` | Serves the API documentation in Swagger UI format. |

For more information about using these endpoints, see the official [Kubernetes documentation](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) on liveness, readiness, and startup probes and the [Prometheus documentation](https://prometheus.io/docs/prometheus/latest/getting_started/).
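
As an illustration of how the `/health` and `/ready` endpoints might be wired into Kubernetes probes, the following container fragment is a minimal sketch. Mapping `/health` to the liveness probe and `/ready` to the readiness probe is an assumption, as are the container name, the `<ABS_IMAGE>` placeholder, and the timing values. Port `8080` is the default HTTP port described earlier.

```yaml
containers:
  - name: aerospike-backup-service   # hypothetical container name
    image: <ABS_IMAGE>               # placeholder; use your ABS image reference
    ports:
      - name: http
        containerPort: 8080
    livenessProbe:
      httpGet:
        path: /health
        port: http
      initialDelaySeconds: 10        # assumed timing values
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /ready
        port: http
      periodSeconds: 5
```

If you change `service.http.port`, update `containerPort` to match.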