What AI monitoring requires in production

AI systems fail quietly. Learn what effective AI monitoring measures, from LLM observability to agentic tracing, and how to build infrastructure that holds under load.

January 8, 2026 | 18 min read
Alexander Patino
Solutions Content Leader

AI systems fail quietly. Unlike a server crash or a broken API endpoint, a degrading language model still returns responses, but they are wrong, slow, or off-target. A fraud scoring model drifts over months until it starts approving transactions it shouldn't. An agentic AI pipeline stalls, not because any one component went down, but because latency compounded across a dozen chained operations until the response window closed.

This is the challenge of AI monitoring: failure modes are often invisible until they're expensive.

Organizations using AI in customer-facing workflows, real-time inference paths, and multi-agent systems need a different framework for visibility than traditional application performance monitoring provides: one that determines whether the system produces outputs that are accurate, timely, and fit for their purpose.

Why traditional monitoring isn’t enough for AI systems

A standard application performance monitoring (APM) stack tells you whether an endpoint returned a 200 status code, but can’t tell you whether the response was correct. This distinction explains why AI monitoring is different. 

Traditional monitoring assumes that system behavior is predictable: Given the same input, a conventional application produces the same output. Alerts are binary: the service is running, or it isn't. Latency thresholds are fixed. Failures are caused by a broken dependency, a crashed process, or an exceeded memory limit.

AI systems, particularly large language models (LLMs) and agentic AI pipelines, don't work that way. They are non-deterministic: the same prompt may produce different answers. Model performance degrades gradually through data drift, context window saturation, retrieval quality degradation in retrieval-augmented generation (RAG) pipelines, or fine-tuning regressions, without any system-level signal that would trigger a traditional alert. A status code says nothing about hallucination rate or output relevance.

That's why, when AI applications behave incorrectly in production, the teams running them are often not the first to know. Users notice when the chatbot starts giving irrelevant answers, the recommendation engine stops finding relevant content, or the fraud model lets more bad transactions through. By the time the degradation shows up in business metrics, the problem has been accumulating for hours, days, or longer.

A 2025 McKinsey Global AI survey found that 51% of organizations using AI experienced at least one negative consequence, with nearly one-third of all respondents reporting consequences stemming from AI inaccuracy.[1] Effective AI monitoring helps address this by using the content and quality of AI outputs as signals, alongside the operational metrics that traditional monitoring already covers.

What AI monitoring measures

AI monitoring runs across two layers that traditional tooling conflates or ignores.

Operational performance metrics

The first layer covers what infrastructure monitoring already tracks, adapted for AI-specific workloads. Latency in an AI system is more complex than a single response time figure. For LLM-based applications, relevant metrics include time to first token, time per output token, and total generation time. Each matters differently: time to first token determines perceived responsiveness and time per output token affects throughput under concurrent load, while total generation time determines how much can be produced before reaching the latency limit.
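As a rough illustration of how these three figures relate, the sketch below derives them from the arrival timestamps of streamed tokens. The `StreamTiming` class and `summarize_stream` function are hypothetical names for this example, and the timestamps are made up; a real client would record them as tokens arrive.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamTiming:
    """Latency metrics for one streamed LLM response (all values in seconds)."""
    time_to_first_token: float
    time_per_output_token: float
    total_generation_time: float


def summarize_stream(request_sent_at: float, token_arrival_times: List[float]) -> StreamTiming:
    """Derive the three latency metrics from raw token arrival timestamps."""
    if not token_arrival_times:
        raise ValueError("no tokens were received")
    first, last = token_arrival_times[0], token_arrival_times[-1]
    n_tokens = len(token_arrival_times)
    # Time per output token is averaged over the tokens that follow the first one.
    tpot = (last - first) / (n_tokens - 1) if n_tokens > 1 else 0.0
    return StreamTiming(
        time_to_first_token=first - request_sent_at,
        time_per_output_token=tpot,
        total_generation_time=last - request_sent_at,
    )


# Illustrative timestamps: request sent at t=0, five tokens arrive afterwards.
timing = summarize_stream(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(timing)
```

Keeping the three values separate matters because they can move independently: a queueing problem shows up in time to first token, while contention under concurrent load shows up in time per output token.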

Token usage, throughput, error rates, and GPU or compute utilization round out this layer. For API-based AI deployments, token consumption determines cost, and without monitoring it at the request and workflow level, organizations routinely discover months into production that one workflow is responsible for most of their budget. 

Output quality and behavioral metrics

The second layer is where AI monitoring diverges from conventional monitoring. Unlike a database query result, output quality cannot be checked against a known correct answer in real time. For generative AI systems, quality assessment requires heuristics, user feedback signals, and often a secondary AI evaluation layer.

Metrics in this layer include hallucination rate, response relevance, toxicity detection, prompt injection attempts, and output faithfulness relative to retrieved context in RAG architectures. Drift detection belongs here too: tracking whether the distribution of outputs is shifting over time, certain query types are returning lower-quality results, or a model update made the model worse at certain tasks.

For agentic AI systems, this layer extends further still. One user interaction can trigger dozens of dependent tool calls, memory reads, retrieval steps, and reasoning chains. Monitoring must trace the entire execution path, not just the final response, to identify where failures originated and why. Amazon has published one of the more detailed public evaluation architectures for production agentic systems. It measures quality at three levels: whether the final answer was correct, whether the agent completed the assigned task, and whether individual tool invocations were executed with the right parameters. The framework spans more than 20 distinct metrics in total.[2]

Agentic AI creates a monitoring challenge 

Agentic AI systems make monitoring harder.  Traditional AI monitoring, even for LLMs, focuses primarily on individual calls: Did this request return a quality response in an acceptable time? Agentic systems introduce dependencies that make this more complicated. 

When an AI agent turns a user request into a chain of tool calls, retrieval steps, and reasoning loops, the overall user experience is determined by the slowest and least reliable component in that chain. One slow database lookup or a retrieval quality issue in a mid-chain step can invalidate the entire workflow. Moreover, when agents coordinate with other agents at machine speed, even a 100-millisecond slip in one component can break the reasoning chain before it reaches a conclusion.

This creates two monitoring requirements that don't exist in simpler AI deployments.

End-to-end tracing across agent steps

Each step in an agentic workflow, including tool invocation, memory read, retrieval, and model call, must be monitored independently. Without step-level tracing, it's impossible to diagnose whether a degraded response resulted from poor retrieval quality, a slow external API, a model reasoning failure, or a prompt injection in an intermediate step. The trace must capture inputs, outputs, latency, and any errors at each point. 
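A minimal sketch of step-level tracing, using only the Python standard library, might look like the following. It is not the OpenTelemetry API: `WorkflowTrace`, `StepRecord`, and the step names are hypothetical, chosen only to show one record per step with inputs, outputs, latency, and errors under a shared trace ID.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Optional


@dataclass
class StepRecord:
    trace_id: str
    step: str
    kind: str  # e.g. "tool_call", "memory_read", "retrieval", "model_call"
    inputs: Dict[str, Any]
    outputs: Dict[str, Any] = field(default_factory=dict)
    latency_ms: float = 0.0
    error: Optional[str] = None


class WorkflowTrace:
    """Collects one record per agent step under a shared trace ID."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex
        self.steps: List[StepRecord] = []

    @contextmanager
    def step(self, name: str, kind: str, **inputs: Any) -> Iterator[StepRecord]:
        record = StepRecord(self.trace_id, name, kind, dict(inputs))
        start = time.perf_counter()
        try:
            yield record
        except Exception as exc:
            # Record the failure on the step, then let it propagate to the caller.
            record.error = f"{type(exc).__name__}: {exc}"
            raise
        finally:
            record.latency_ms = (time.perf_counter() - start) * 1000.0
            self.steps.append(record)


trace = WorkflowTrace()
with trace.step("search_docs", "retrieval", query="refund policy") as rec:
    rec.outputs["doc_ids"] = ["doc-17", "doc-42"]  # stand-in for a real retrieval call
try:
    with trace.step("lookup_order", "tool_call", order_id="A-1001"):
        raise TimeoutError("order service timed out")  # simulated tool failure
except TimeoutError:
    pass

for s in trace.steps:
    print(s.step, s.kind, f"{s.latency_ms:.2f} ms", s.error)
```

Because the failed step is still appended in the `finally` branch, the trace shows where the chain broke rather than simply ending early.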

OpenTelemetry is becoming the common way to collect and standardize monitoring data for AI systems, and many tools now agree on how to label and interpret that data.

Finding anomalies across compound workflows

Because agentic AI behavior is inherently variable, anomaly detection requires statistical baselining rather than fixed thresholds. A tool call that normally completes in 50 milliseconds and suddenly takes 800 milliseconds is an anomaly, but only if you have a baseline to compare against. Similarly, an agent that normally completes a workflow in four steps and suddenly requires twelve is showing behavioral drift that may indicate a prompt quality problem, a data quality problem, or a model regression.
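One simple way to express baselining is a z-score test against recent history, sketched below. The function name, the sample values, and the threshold of three standard deviations are assumptions for illustration; latency distributions are usually skewed, so a production system might prefer a median-based or percentile-based baseline.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(baseline: Sequence[float], observed: float, z_threshold: float = 3.0) -> bool:
    """Flag a value that sits more than z_threshold standard deviations from the baseline mean."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any different value is a deviation.
        return observed != mu
    return abs(observed - mu) / sigma > z_threshold


# Tool-call latency in milliseconds (illustrative values centred on 50 ms).
baseline_latencies_ms = [48, 52, 50, 47, 53, 51, 49, 50]
print(is_anomalous(baseline_latencies_ms, 800))  # far outside the baseline
print(is_anomalous(baseline_latencies_ms, 54))   # within normal variation

# The same test applies to behavioral signals such as steps per workflow.
baseline_step_counts = [4, 4, 5, 4, 4, 3, 4, 4]
print(is_anomalous(baseline_step_counts, 12))
```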

McKinsey research found that roughly eight in ten companies attempting to scale agentic AI identified data-related gaps as an obstacle to production success.[3] Without visibility into how data flows through agentic pipelines, including what was retrieved, when, and with what relevance score, teams cannot distinguish between a model failure and a retrieval failure.

LLM observability: monitoring what you can't measure

LLM observability is a subfield of AI monitoring focused on the specific challenges of language model behavior in production. The problem is that LLM performance cannot be assessed the way traditional machine learning model performance is.

For a classification model, ground truth eventually becomes available: you predicted fraud, and the transaction either turned out to be fraudulent or it didn't. That feedback loop supports drift detection against known outcomes. For a generative AI application, ground truth is often unavailable or ill-defined. Whether a summarization is "accurate" or a chatbot response is "helpful" depends on context, user expectations, and domain-specific standards that vary by use case.

What LLM observability tracks

LLM observability addresses this by adding richer signals to the full request-response cycle than infrastructure monitoring provides. The operational baseline (latency per step, token counts, cost per query, and intermediate prompt and response logs) gives teams the raw material to diagnose performance issues. But that layer alone can't tell you whether the answers are any good.

Quality assessment sits on top of the operational layer. Relevance and faithfulness scores are determined with rule-based heuristics, reference evaluation sets, or a secondary model running as a judge, evaluating whether each response answers what was asked and whether its claims are supported by the retrieved context. Those scores, tracked over time, help find drift: a sustained decline in average faithfulness across a rolling window points to something upstream having changed.
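The rolling-window idea can be sketched as a small monitor that compares the mean of the most recent scores against a baseline established before launch. The class name, window size, and drop threshold are illustrative assumptions, and the scores stand in for whatever a heuristic or judge model would emit.

```python
from collections import deque
from typing import Deque


class RollingDriftMonitor:
    """Flags drift when the rolling mean score falls too far below a baseline."""

    def __init__(self, baseline_mean: float, window: int, max_drop: float) -> None:
        self.baseline_mean = baseline_mean
        self.window = window
        self.max_drop = max_drop
        self.scores: Deque[float] = deque(maxlen=window)

    def add(self, score: float) -> bool:
        """Record a score; return True once a full window has drifted below baseline."""
        self.scores.append(score)
        if len(self.scores) < self.window:
            return False  # not enough data yet to judge
        window_mean = sum(self.scores) / len(self.scores)
        return (self.baseline_mean - window_mean) > self.max_drop


# Assumed baseline faithfulness of 0.90; alert on a sustained drop of more than 0.10.
monitor = RollingDriftMonitor(baseline_mean=0.90, window=4, max_drop=0.10)
faithfulness_scores = [0.92, 0.88, 0.90, 0.91, 0.70, 0.70, 0.70, 0.70]
flags = [monitor.add(score) for score in faithfulness_scores]
print(flags)
```

A single low score does not trigger the alert here; the flag only turns on once low scores dominate the window, which matches the goal of detecting a sustained decline rather than one bad response.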

User behavior is another signal, one that automated scoring alone misses. When someone immediately rephrases a question, abandons a session, or escalates to a human, those actions carry information about output quality that no judge model captures. Correlating these behavioral patterns against quality scores, across the same time windows, helps find failures that aren't quite hallucinations but still degrade the user experience enough to change how people behave.

Model drift and the case for continuous monitoring

Production AI systems degrade without any code change, infrastructure failure, or explicit trigger. A model that passes rigorous pre-deployment evaluation may behave differently six weeks later because the users asking questions have changed, because the data is no longer up to date, or because a provider shipped an updated model under the same API endpoint. This is a common risk in production AI: the system was evaluated and launched without a process to detect when real-world behavior stopped matching what was tested.

Unlike infrastructure failures, which typically cause specific incidents, model drift is gradual. Without continuous monitoring that tracks output quality over time, teams discover degradation through user complaints rather than through alerts.

Metrics for effective AI monitoring

Not all metrics matter equally, and the most valuable ones are often the least obvious. Here are examples of what to track.

Latency distribution, not average latency

Average latency is misleading for AI workloads under real-world conditions. What matters is the shape of the distribution, specifically the tail. A system with a median latency of 200 milliseconds but a P99 of 4 seconds produces inconsistent user experiences far more often than the average suggests. For agentic workflows where one user action triggers dozens of dependent operations, a high P99 on any individual step gets amplified across the chain.

Monitoring should track P50, P95, P99, and P99.9 latency separately, with alerts on the tail rather than the average.
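The sketch below shows why the tail and the average tell different stories, using a nearest-rank percentile and a made-up sample of 1,000 latencies. It also estimates how often a multi-step chain hits at least one slow step, under the simplifying assumption that step latencies are independent; the function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]


def latency_summary(samples_ms: Sequence[float]) -> Dict[str, float]:
    targets = (("p50", 50), ("p95", 95), ("p99", 99), ("p99.9", 99.9))
    return {name: percentile(samples_ms, p) for name, p in targets}


def chance_of_slow_step(steps: int, per_step_quantile: float = 0.99) -> float:
    """Probability that at least one of `steps` independent steps exceeds its own P99."""
    return 1.0 - per_step_quantile ** steps


# Illustrative sample: most requests are fast, a small tail is very slow.
samples_ms = [200] * 980 + [900] * 15 + [4000] * 5
summary = latency_summary(samples_ms)
average = sum(samples_ms) / len(samples_ms)
print(summary, f"average={average} ms")
print(f"30-step chain hits a P99-slow step about {chance_of_slow_step(30):.0%} of the time")
```

In this sample the average sits close to the median, while P99.9 is twenty times higher. With 30 independent steps, roughly a quarter of workflows would include at least one step slower than its own P99.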

Token usage and cost attribution

Without token-level monitoring, AI infrastructure costs become unpredictable. Long prompts, looping agents, or malformed inputs cause spikes in token consumption that don't show up until a billing cycle. The practical approach is to establish usage baselines per workflow and per user segment during early production, then set alerts on deviations from those baselines. This makes cost spikes detectable in near real-time rather than afterwards.
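A baseline comparison of this kind can be sketched in a few lines. The workflow names, token counts, and the 1.5x tolerance are assumptions for illustration; real baselines would come from early production traffic.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Each request: (workflow name, prompt tokens, completion tokens).
Request = Tuple[str, int, int]


def tokens_by_workflow(requests: Iterable[Request]) -> Dict[str, int]:
    """Attribute total token usage to the workflow that generated it."""
    totals: Dict[str, int] = defaultdict(int)
    for workflow, prompt_tokens, completion_tokens in requests:
        totals[workflow] += prompt_tokens + completion_tokens
    return dict(totals)


def over_baseline(totals: Dict[str, int], baselines: Dict[str, int], tolerance: float = 1.5) -> List[str]:
    """Return workflows whose usage exceeds tolerance x baseline.

    Workflows with no recorded baseline are treated as baseline 0 and always flagged.
    """
    return sorted(w for w, used in totals.items() if used > tolerance * baselines.get(w, 0))


requests = [
    ("support_chat", 800, 200),
    ("support_chat", 900, 300),
    ("report_agent", 6000, 2500),
    ("report_agent", 7000, 3000),
]
totals = tokens_by_workflow(requests)
baselines = {"support_chat": 2000, "report_agent": 8000}  # hypothetical per-interval baselines
print(totals)
print(over_baseline(totals, baselines))
```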

Hallucination and output accuracy rates

Hallucination rate is the measure most directly tied to user trust and downstream decision quality, and it's the hardest to monitor reliably. The current best practice is a multi-layer approach: automated scoring using a judge model that evaluates faithfulness to source documents, spot-checking by domain experts against representative samples, and user feedback aggregation to identify response patterns with high correction or abandonment rates.

No one method is enough. Automated scoring misses subtle hallucinations; expert review doesn't scale; user feedback is inconsistent and not always reliable. The combination, when correlated across the same time windows, produces a more reliable signal than any individual method.

Retrieval quality for RAG architectures

For AI applications built on RAG, retrieval quality is often the primary source of output failure and the most neglected monitoring target. Relevant signals are: 

  • Retrieval precision (Did the retrieved documents actually contain relevant information?)

  • Retrieval latency (Is the retrieval step adding unacceptable time to the overall response?)

  • Embedding drift (Are the semantic representations of queries and documents remaining stable over time, or is the gap between them widening?)
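Two of the signals above can be approximated with simple calculations, sketched here. Precision at k needs relevance labels, which this example assumes exist for a sample of queries. The centroid comparison is only one rough proxy for embedding drift, and the two-dimensional vectors are toy values.

```python
import math
from typing import List, Sequence, Set


def precision_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are labeled relevant."""
    top_k = list(retrieved_ids)[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of a set of embedding vectors."""
    if not vectors:
        raise ValueError("no vectors")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cannot compare a zero vector")
    return 1.0 - dot / norm


# Retrieval precision for one labeled query.
print(precision_at_k(["d1", "d2", "d3", "d4"], {"d1", "d4", "d9"}, k=4))

# Drift proxy: how far the centroid of recent query embeddings moved from the baseline centroid.
baseline_queries = [[1.0, 0.0], [1.0, 0.0]]
recent_queries = [[0.0, 1.0], [0.0, 1.0]]
drift = cosine_distance(centroid(baseline_queries), centroid(recent_queries))
print(drift)
```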

Tool call success rates for agentic systems

In agentic AI workflows, tools are how reasoning links to action. To diagnose agentic failures, monitor tool invocation success rates, error types, execution latency, and parameter accuracy. A tool that fails silently by returning a malformed response that the agent accepts without error corrupts a workflow chain without any obvious system-level failure.
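One way to catch silent failures is to validate every tool response against an expected shape before the agent consumes it, and to count the outcomes. The schema, field names, and `ToolStats` class below are hypothetical, meant only to show the pattern.

```python
from collections import Counter
from typing import Any, Dict, Optional

# Hypothetical contract for an order-lookup tool: field name -> expected type.
REQUIRED_FIELDS: Dict[str, type] = {"order_id": str, "status": str, "amount": float}


def validate_tool_output(payload: Any) -> Optional[str]:
    """Return None if the payload matches the expected shape, else an error type."""
    if not isinstance(payload, dict):
        return "not_an_object"
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in payload:
            return f"missing_{name}"
        if not isinstance(payload[name], expected_type):
            return f"bad_type_{name}"
    return None


class ToolStats:
    """Tracks success rate and error types for one tool."""

    def __init__(self) -> None:
        self.calls = 0
        self.errors: Counter = Counter()

    def record(self, payload: Any) -> None:
        self.calls += 1
        error = validate_tool_output(payload)
        if error is not None:
            self.errors[error] += 1

    @property
    def success_rate(self) -> float:
        if self.calls == 0:
            return 1.0
        return 1.0 - sum(self.errors.values()) / self.calls


stats = ToolStats()
stats.record({"order_id": "A-1", "status": "shipped", "amount": 12.5})
stats.record({"order_id": "A-2", "status": "pending", "amount": 40.0})
stats.record({"order_id": "A-3", "amount": 9.99})                        # status missing
stats.record({"order_id": "A-4", "status": "shipped", "amount": "12.50"})  # amount is a string
print(stats.success_rate, dict(stats.errors))
```

None of the four calls raised an exception, yet half of them returned data the agent should not trust. That gap is what a plain error-rate metric misses.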

AI governance and the regulatory context

AI monitoring is no longer purely an internal engineering choice. Regulatory frameworks are beginning to mandate it.

The EU AI Act, which entered into force on August 1, 2024, makes monitoring a legal obligation rather than an engineering best practice for many AI deployments. Under the Act, providers must maintain active post-market monitoring systems, and organizations that run high-risk AI systems bear responsibility for ongoing operational oversight, not just at launch but throughout the system's production life.[4] Deployers of high-risk systems must retain automatically generated logs for at least six months and are expected to report serious incidents to the relevant authorities without delay.[5]

This means organizations in regulated markets or deploying AI in high-risk categories such as credit decisions, healthcare, employment, or critical infrastructure need AI monitoring. The monitoring must be documented, auditable, and capable of finding incidents in time to report them to the relevant authorities.

What AI governance monitoring requires

Beyond compliance, AI governance monitoring addresses the question of whether AI systems are behaving in accordance with intended policy, not just intended performance targets. This includes tracking outputs for bias patterns across user subgroups, detecting prompt injection attempts that could cause a model to violate policy boundaries, monitoring for sensitive data exposure in model outputs or logs, and maintaining audit trails of model decisions in high-stakes contexts.

The distinction between performance monitoring and governance monitoring is important. A model may perform well on quality metrics while systematically producing outputs that are biased against certain groups or that inadvertently expose sensitive data. Governance monitoring specifically targets these failure modes, which require different instrumentation and evaluation methods than performance monitoring does.

Data quality as a governance concern

Agentic AI's data problem goes beyond availability. Treating data quality as something you fix periodically, rather than something you monitor continuously in real time, keeps agents from delivering value. For AI systems that depend on live data to generate timely and accurate outputs, data quality is both a performance issue and a governance issue. Stale, incomplete, or inconsistent data coming into an AI system produces worse answers and introduces systematic errors that are difficult to find without data monitoring running alongside model monitoring.
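A continuous data check can be as simple as testing each record for staleness and completeness at the moment it is handed to the model. The sketch below assumes records carry an `updated_at` timestamp; the field names and the five-minute freshness limit are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List, Sequence


def data_quality_issues(
    record: Dict[str, Any],
    now: datetime,
    max_age: timedelta,
    required_fields: Sequence[str],
) -> List[str]:
    """Return a list of issues (stale, missing fields) for one input record."""
    issues: List[str] = []
    updated_at = record.get("updated_at")
    if updated_at is None:
        issues.append("missing_timestamp")
    elif now - updated_at > max_age:
        issues.append("stale")
    for name in required_fields:
        if record.get(name) in (None, ""):
            issues.append(f"missing_{name}")
    return issues


now = datetime(2026, 1, 8, 12, 0, tzinfo=timezone.utc)  # fixed clock for a repeatable example
fresh = {"account_id": "acct-1", "balance": 120.0, "updated_at": now - timedelta(seconds=30)}
stale = {"account_id": "acct-2", "balance": None, "updated_at": now - timedelta(hours=6)}

for record in (fresh, stale):
    print(record["account_id"], data_quality_issues(record, now, timedelta(minutes=5), ["account_id", "balance"]))
```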

Best practices for AI monitoring in production

Effective AI monitoring does not come from adopting one tool and configuring dashboards. It requires architecture decisions made before production deployment.

Instrument from the start

Monitoring infrastructure should be built into the AI application from the beginning, not bolted on after issues emerge. Retroactively adding observability to an AI system already in production is harder than designing for it up front. This means three things: selecting an instrumentation framework, typically OpenTelemetry, before writing the first production inference call; establishing what constitutes a "good" output during the evaluation phase so baselines exist when the system goes live; and defining quality metrics before deployment rather than constructing them in response to incidents.

Separate operational and quality monitoring

Telemetry pipelines for infrastructure metrics and output quality signals should be separate, even when they use a unified dashboard. 

Infrastructure metrics such as latency, throughput, and error rates are high-frequency, structured, and work well with time-series databases and existing APM infrastructure. 

Output quality signals such as relevance scores, hallucination rates, and user feedback are lower-frequency, detailed, and specific, and often require secondary processing before they can be stored as metrics. Combining the two in one pipeline leads to either undersampling infrastructure data or overwhelming the pipeline with data that isn't ready to be measured.

Build evaluation into deployment gates

Model updates, prompt template changes, and updates to the knowledge base should all be tested and checked before they are used in production. The evaluation suite should represent the actual production input distribution, not a curated test set that reflects best-case conditions.[6]

The practical standard that has emerged across the AI engineering community is a layered evaluation architecture. Deterministic checks, such as format validation, schema compliance, and regex-based rules, run on every pull request and act as a fast gate before any change reaches production. These checks run in milliseconds and are designed to catch structural failures immediately, before slower evaluation methods are invoked.[7]
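A deterministic gate of this kind might look like the sketch below, which checks format, required keys, and one regex rule on a model's raw output. The output contract (`answer` and `sources` keys) and the no-email rule are hypothetical; each application would substitute its own.

```python
import json
import re
from typing import List

# Illustrative rule: the answer text must not contain an email address.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
# Hypothetical output contract for this example.
REQUIRED_KEYS = {"answer", "sources"}


def deterministic_checks(raw_output: str) -> List[str]:
    """Return the list of failed checks; an empty list means the output passes the fast gate."""
    try:
        parsed = json.loads(raw_output)  # format validation
    except json.JSONDecodeError:
        return ["invalid_json"]
    if not isinstance(parsed, dict):
        return ["not_an_object"]

    failures: List[str] = []
    missing = REQUIRED_KEYS - parsed.keys()  # schema compliance
    if missing:
        failures.append("missing_keys:" + ",".join(sorted(missing)))
    if not isinstance(parsed.get("sources", []), list):
        failures.append("sources_not_list")
    if EMAIL_PATTERN.search(str(parsed.get("answer", ""))):  # regex-based rule
        failures.append("email_in_answer")
    return failures


good = '{"answer": "Refunds take 5 days.", "sources": ["doc-17"]}'
bad = '{"answer": "Mail bob@example.com for help."}'
print(deterministic_checks(good))
print(deterministic_checks(bad))
print(deterministic_checks("not json at all"))
```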

A deeper regression suite then runs against a golden dataset: a curated set of real inputs and expected behaviors that reflects how the application actually fails, not how it performs under ideal conditions. Any pull request that touches a prompt template, model version, or retrieval configuration triggers this suite, and a change that regresses quality past an acceptable threshold does not merge.[8]
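The merge gate itself reduces to comparing a candidate's pass rate on the golden dataset with the current baseline. In this sketch the substring match is a crude stand-in for real scoring, `candidate_model` is a fake generator, and the two-point regression allowance is an assumed threshold.

```python
from typing import Callable, List, Tuple

# (input prompt, substring the output is expected to contain)
GoldenCase = Tuple[str, str]


def pass_rate(generate: Callable[[str], str], golden: List[GoldenCase]) -> float:
    """Share of golden cases whose output contains the expected substring."""
    if not golden:
        raise ValueError("golden dataset is empty")
    passed = sum(1 for prompt, expected in golden if expected.lower() in generate(prompt).lower())
    return passed / len(golden)


def gate(candidate_rate: float, baseline_rate: float, max_regression: float = 0.02) -> bool:
    """Allow the merge only if quality has not regressed past the allowed margin."""
    return candidate_rate >= baseline_rate - max_regression


golden_set: List[GoldenCase] = [
    ("How long do refunds take?", "5 business days"),
    ("Can I change my shipping address?", "before the order ships"),
    ("Do you ship internationally?", "yes"),
    ("What is the return window?", "30 days"),
]


def candidate_model(prompt: str) -> str:
    """Stand-in for the changed prompt or model; it gets one golden case wrong."""
    answers = {
        "How long do refunds take?": "Refunds arrive within 5 business days.",
        "Can I change my shipping address?": "Yes, any time before the order ships.",
        "Do you ship internationally?": "Yes, to most countries.",
        "What is the return window?": "Returns are accepted for two weeks.",
    }
    return answers[prompt]


candidate_rate = pass_rate(candidate_model, golden_set)
print(candidate_rate, gate(candidate_rate, baseline_rate=1.0))
```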

The third layer operates continuously in production, sampling live traffic, scoring outputs against quality dimensions, and alerting when those scores drift from established baselines. This is the only layer that catches degradation caused by changes outside the team's control: shifts in user input distribution, upstream model updates, or retrieval corpus staleness.

Trace agentic workflows end-to-end

For multi-agent and multi-step AI systems, tracing must capture the entire execution graph rather than individual model calls. Every tool invocation, memory access, retrieval step, and model call should be instrumented with a shared trace context to track down the root cause of any failures. Without end-to-end tracing, debugging an agentic AI failure requires reconstructing what happened manually, which is expensive and unreliable. 

Set latency budgets at the workflow level

For user-facing AI applications, set latency budgets at the end-to-end workflow level rather than at individual component levels. One model call that takes 500 milliseconds may be acceptable. When that call is embedded in a workflow with eight other steps, each adding its own latency, the combined delay may exceed what users will tolerate. Workflow-level latency budgets help teams determine which steps can tolerate latency and which can't.
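A workflow-level budget check can be sketched as follows. It assumes the steps run sequentially so their latencies add up; the step names, per-step figures, and the 1,000-millisecond budget are illustrative.

```python
from typing import Dict, List, Tuple


def budget_report(step_latency_ms: Dict[str, float], budget_ms: float) -> Tuple[float, float, List[str]]:
    """Return (total latency, overage, steps ordered from slowest to fastest)."""
    total = sum(step_latency_ms.values())
    overage = max(0.0, total - budget_ms)
    slowest_first = sorted(step_latency_ms, key=step_latency_ms.get, reverse=True)
    return total, overage, slowest_first


# Illustrative per-step latencies for one sequential workflow, in milliseconds.
steps = {
    "retrieve": 120.0,
    "rerank": 80.0,
    "model_call": 500.0,
    "tool_lookup": 150.0,
    "summarize": 400.0,
}
total, overage, slowest_first = budget_report(steps, budget_ms=1000.0)
print(f"total={total} ms, over budget by {overage} ms, slowest first: {slowest_first}")
```

The 500-millisecond model call is within reason on its own, but the workflow as a whole exceeds its budget. The ordered list shows which steps to target first when trimming.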

The data layer problem in AI monitoring

One of the most difficult aspects of production AI monitoring is the data layer. AI monitoring creates a lot of data, such as traces, logs, quality scores, and user feedback signals, that must be stored, queried, and analyzed quickly to be actionable. Alerts that take minutes to fire on a degrading model often arrive too late to prevent user impact.

This creates a conflict between how much monitoring data is produced and how fast alerts need to be delivered. A monitoring system that buffers telemetry for batch processing finds trends but cannot prevent individual incidents. Real-time alerting means telemetry has to be ingested, processed, and evaluated against thresholds at the same speed as the work the AI system is doing. 

The problem is worse in agentic AI systems, because one user session can generate dozens of trace events, tool call records, and quality scores within seconds. With thousands of concurrent sessions, the monitoring system may produce more data than the system it is monitoring.

For AI teams, the data infrastructure supporting monitoring is often as important as the AI infrastructure being monitored. Finding anomalies in real time requires read latencies in single-digit milliseconds. Write throughput must handle bursty telemetry streams without resorting to sampling, which would cause the monitoring system to drop events just when monitoring the production system matters most.

Storage architectures that perform well under predictable, steady-state query loads but degrade when they’re busy or under sudden telemetry bursts create the same class of failure as the AI systems they're meant to monitor: They appear to work until they don't, and they fail at the worst time.

Predictability

Production AI monitoring fails in a specific way: not necessarily when the AI system breaks, but when the monitoring system itself becomes unreliable when it's needed most.

When a production AI application starts generating alerts, such as model quality degrading, latency spiking, or an agentic workflow failing silently, the monitoring infrastructure itself comes under load. It is ingesting telemetry from a system behaving abnormally, often in higher volume than normal. This is when monitoring systems built on data infrastructure that degrades under pressure begin to drop events, buffer telemetry, or produce alerts with timing that varies unpredictably from incident to incident. By the time the alert arrives, the damage is already done.

So, the AI monitoring system’s reliability is limited by the tail latency behavior of the infrastructure beneath it. A telemetry store that delivers consistent read latency under normal load but degrades at high utilization produces inconsistent alert timing. An ingestion pipeline that handles steady-state write volume but backs up under bursty telemetry streams drops events. In both cases, the monitoring system looks like it’s still working, but it isn’t. 

The standard that matters is not whether a monitoring system works when everything is fine, but whether it remains instrumented, accurate, and responsive when the production environment is under stress, which is when that standard gets tested.

Aerospike and AI monitoring infrastructure

Teams running production AI systems with real-time inference paths, agentic pipelines, and feature stores routinely discover that the bottleneck is the data layer underneath them: the telemetry stores, context stores, and retrieval systems that must remain fast and responsive when production load is highest and access patterns are least predictable.

Aerospike is built for this operating environment. Its architecture maintains consistent behavior as workload changes, as utilization increases, and as systems run near capacity rather than comfortably below it. These are the conditions that break conventional data infrastructure.

If the data infrastructure requirements that production AI creates are relevant to what you're building, explore how Aerospike handles them in practice.

Frequently asked questions about AI monitoring

Find answers to common questions below to help you learn more and get the most out of Aerospike.

AI monitoring continuously observes an AI system's performance, behavior, and outputs in production to detect issues, measure quality, and ensure the system is operating as intended. Unlike traditional application monitoring, which focuses on infrastructure health, AI monitoring also tracks the content and quality of model outputs, such as hallucination rates, output relevance, model drift, and response latency across complex, multi-step workflows.

Monitoring tells you whether predefined performance targets are being met. Observability combines monitoring with tracing and evaluation to explain why problems occur. For AI systems, monitoring might tell you that response latency has increased. Observability shows you which step in the inference chain caused the increase, what the input looked like, and whether the output quality also degraded. Both are necessary for production AI systems.

LLM observability instruments large language model applications in production to track not just infrastructure health, such as latency, token usage, and error rates, but also output quality signals that indicate whether the model is producing accurate, relevant, and safe responses. Because LLM behavior is unpredictable and drifts over time without failing outright, LLM observability requires a combination of automated quality scoring, user feedback signals, and drift detection that doesn't exist in standard APM tooling.

The most important metrics depend on the application type, but include:

  • End-to-end latency (especially P99 and P99.9)
  • Token usage and cost per workflow
  • Hallucination or error rate
  • Retrieval quality for RAG applications
  • Tool call success rates for agentic systems
  • Output quality scores from automated evaluation

For user-facing applications, connecting these signals to user experience metrics such as session abandonment, feedback rates, and re-query rates also helps.

Agentic AI systems require end-to-end tracing across all steps in the workflow, including every tool call, retrieval step, memory access, and model invocation, rather than per-request monitoring. Because one user interaction triggers dozens of dependent operations, failures must be diagnosed at the step level rather than the workflow level. Latency problems are also more severe: a step that adds 100 milliseconds by itself may add 800 milliseconds of user-visible delay when it's embedded in a ten-step chain.

The EU AI Act requires providers of AI systems to maintain post-market monitoring systems and report serious incidents. Organizations running high-risk AI systems must monitor system operation continuously and retain automatically generated logs for at least six months. High-risk categories include AI systems used in critical infrastructure, employment, credit, healthcare, and law enforcement applications. Full enforcement applies to most high-risk systems from August 2026.

Model drift detection requires establishing quality baselines during the evaluation phase before deployment, then monitoring production outputs against those baselines continuously. Effective drift detection tracks output quality scores over rolling time windows, tracks how user queries change over time to see if those changes might make the model work worse or differently, and uses embedding-based analysis to find when user questions start meaning something different from the kind of data or tasks the model was trained or tuned for. Automated judge models, user feedback aggregation, and retrieval quality metrics for RAG systems help track drift.

AI failures often emerge gradually, through model drift, lower data quality, or changes in the kinds of inputs a system receives over time, rather than through specific incidents. Without real-time monitoring, it may take days or weeks between when a failure begins and when it becomes visible in business outcomes. Real-time monitoring reduces that gap, helping teams intervene before degraded AI outputs affect more people or lead them to make bad decisions. For agentic AI systems where individual interactions are important, real-time visibility is even more important.

Footnotes

  1. McKinsey & Company, "The State of AI," QuantumBlack by McKinsey, https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai

  2. Vinnie Smandava, "The Agentic AI Infrastructure Landscape in 2025–2026: A Strategic Analysis for Tool Builders," Medium, https://medium.com/@vinniesmandava/the-agentic-ai-infrastructure-landscape-in-2025-2026-a-strategic-analysis-for-tool-builders-b0da8368aee2

  3. McKinsey & Company, "Building the Foundations for Agentic AI at Scale," McKinsey Technology insights, https://www.mckinsey.com/capabilities/mckinsey-technology/our-insights/building-the-foundations-for-agentic-ai-at-scale

  4. European Commission, "Regulatory Framework for AI," Shaping Europe's Digital Future, https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai

  5. European Union, "Article 26: Obligations of Deployers of High-Risk AI Systems," EU AI Act, https://artificialintelligenceact.eu/article/26/

  6. Galtea, "LLM Evaluation: The Complete Guide," Galtea blog, https://galtea.ai/blog/llm-evaluation-complete-guide

  7. Latitude, "The Ultimate CI/CD LLM Evaluation Guide," Latitude blog, https://latitude.so/blog/ultimate-ci-cd-llm-evaluation-guide

  8. Adaline, "The Complete Guide to LLM and AI Agent Evaluation in 2026," Adaline blog, https://www.adaline.ai/blog/complete-guide-llm-ai-agent-evaluation-2026