AI agents in production: What they cost, when they fail, and when not to use them

Learn what AI agents cost, how they fail in production, and when a workflow or single model call is the better choice. A practical guide for builders.

December 4, 2025 | 17 min read

Alexander Patino

Solutions Content Leader

An AI agent is a system in which a large language model directs its own process. It decides which tools to call, in what order, and when the task is done, rather than following a sequence a developer wrote in advance.

An agent uses artificial intelligence, but the term is narrower than "AI assistant" or "chatbot." Where an assistant responds to each prompt in turn, an agent pursues a goal across many steps. Workflows orchestrate models and tools through predefined code paths, while agents let the model control how it accomplishes the task.

The distinction matters because it is architectural. A script that calls an LLM is still a script. What makes an AI system agentic is the transfer of control over the path from the developer to the model.

Agentic AI does this in three ways:

The model plans, breaking a goal into steps.
It calls tools, reaching outside its training data to search, query, or act.
It carries some memory of what it has already done, so later steps build on earlier ones.
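To make the transfer of control concrete, here is a minimal sketch contrasting a workflow with an agent loop. Everything in it is hypothetical: the tool names, the `scripted_model` stand-in for an LLM call, and the step limit are assumptions chosen so the example runs without any external service.

```python
from typing import Callable

# Hypothetical tool registry: names and behaviors are illustrative only.
TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda query: f"results for {query!r}",
    "summarize": lambda text: f"summary of {text!r}",
}


def workflow(topic: str) -> str:
    """Workflow: the developer fixes the path (search, then summarize)."""
    return TOOLS["summarize"](TOOLS["search"](topic))


def scripted_model(goal: str, memory: list[str]) -> tuple[str, str]:
    """Stand-in for an LLM call. Returns (action, argument).

    A real model would choose the next action from the goal and memory;
    this stub walks a fixed plan so the example is runnable.
    """
    if not memory:
        return "search", goal
    if len(memory) == 1:
        return "summarize", memory[-1]
    return "finish", memory[-1]


def agent(goal: str, model=scripted_model, max_steps: int = 10) -> str:
    """Agent: the model picks each tool and decides when the task is done."""
    memory: list[str] = []      # record of what earlier steps produced
    for _ in range(max_steps):  # hard cap so the loop always ends
        action, argument = model(goal, memory)
        if action == "finish":
            return argument
        memory.append(TOOLS[action](argument))
    raise RuntimeError("step limit reached before the model finished")
```

With the scripted stub, both functions return the same result. The difference is who owns the path: in `workflow` it is written in code, while in `agent` it is whatever the model returns at each step.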

If you’re considering whether to build one, remember that "agent" describes a design choice with consequences. It’s important to know when those consequences are worth it.

Agent, workflow, and the problem of agent-washing

The most common question practitioners ask is whether a given system is actually an AI agent. The answer determines what the system does, what it costs to run, and how it fails.

A useful rule of thumb is to use an agent when the path to the goal is uncertain, and a workflow when the path is known.¹ If you can write down the steps, write them down. A specific pipeline is cheaper, faster, easier to test, and easier to debug than a model improvising its way to the same result. The agent is most useful when the steps cannot be specified in advance, and when the system has to decide what to do next based on what it finds.

Consider a translation feature that takes text, sends it to a model, and returns the result. That is roughly ten lines of code against an API. Turning it into an AI agent adds planning and tool selection, which bring extra cost and new failure modes to a task that the simple version already solved.

Now consider an open-ended research task: gather information on a topic, follow leads as they appear, and synthesize a result. The path here cannot be written in advance, because each finding affects the next query. This is where AI agent complexity is worthwhile.

Between these two cases is agent-washing. Vendors sometimes apply the term to simple scripts and fixed workflow builders because it sounds better. The result is a crowd of platforms that all carry the same label, whatever they do underneath. Buyers fear paying agent prices for workflow capability, and builders fear over-engineering a problem that a loop and an API call would have solved.

So, before accepting that a system needs to be an agent, ask whether the path is uncertain. If it is not, use the simpler architecture.

The five agent types

AI agent taxonomy runs from simple to sophisticated.

Reflex agents, sometimes called reactive agents, act on the current input against fixed rules.
Model-based reflex agents maintain an internal picture of a world they cannot fully observe.
Goal-based agents plan toward an objective.
Utility-based agents weigh competing options against a measure of value.
Learning agents improve from experience.

Each type is more capable and autonomous than the one before it.

That doesn’t mean it’s more reliable. Research from METR measuring how AI agents work and handle tasks of increasing length found that gains in raw capability translate into only modest gains in reliability.² A more capable AI agent attempts more, but the additional autonomy that lets it attempt more also means it fails in less predictable ways.

That means selecting a more sophisticated agent type is not a path to a more trustworthy system. A goal-based agent with broad latitude produces a wider range of outcomes than a constrained reactive agent, and a wider range of outcomes includes a wider range of failures. When evaluating a design, treat capability and reliability as separate questions.

Why your demo works and production fails

The biggest complaint about agentic AI is that it performs in a demo and fails in production. An AI agent that works on stage falls apart when a production system introduces a condition that the demo never exercised.³

Here’s why. AI agents are non-deterministic, which means the same input can produce different outputs on different runs. A demo is one run that worked. Production is thousands of runs, and the variation that wasn’t noticeable in the demo shows up during production. Traditional continuous integration assumes that a passing test stays passing, but when the output of these autonomous agents changes from run to run, a test that passed once tells you little about the next one.

The problem gets worse as tasks lengthen. METR found that the maximum length of task an AI agent completes with roughly 50% reliability has been increasing exponentially. The original March 2025 paper estimated this capability doubled about every 212 days, while a January 2026 update revised the estimate to about 131 days. This is progress.⁴

But success rates fall sharply as tasks stretch from minutes into hours. A long task is a chain of steps, and if the steps fail independently, the probability of completing the chain is the product of the probabilities of each link. Small per-step error rates become large end-to-end failure rates. An AI agent that is right 95% of the time at each step will complete a 20-step task only about 36% of the time (0.95^20 ≈ 0.36).
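The compounding is easy to check. The sketch below assumes every step succeeds independently with the same probability, which is a simplification: real steps are often correlated, and some errors are recoverable. The function names are illustrative.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independence."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step reliability needed to hit an end-to-end target."""
    return target ** (1 / steps)


for steps in (5, 20, 50):
    print(steps, round(chain_success(0.95, steps), 3))
# 5 0.774
# 20 0.358
# 50 0.077
```

Run in reverse, the same arithmetic shows how demanding long tasks are: `required_per_step(0.90, 20)` comes out at roughly 0.995, so a 90% end-to-end target on 20 steps needs about 99.5% reliability at every step.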

A demo proves the system succeeds, but it does not measure how often it will, or how badly it fails when it does not. Those are the questions production asks, and they require a different kind of measurement.

Evaluating a system that behaves differently every run

If the same input results in different outputs, one accuracy number doesn’t mean much. It records what happened on the runs you measured, not what will happen on the runs you ship. Unlike a static artificial intelligence model that returns one answer to one query, an AI agent runs a variable path each time, so evaluating it requires measuring properties other than mean success rate.

Evaluation is based on four properties:

Consistency is how much the output varies across runs of the same input.
Robustness is how well behavior holds when conditions shift slightly, when a prompt is rephrased, or a model is updated.
Predictability is whether the range of outcomes is known in advance, even when the specific outcome is not.
Failure severity is how bad the outcome is when the system goes wrong, from the mildest failure to the worst.

Together, these describe operational reliability, which is different from average correctness.
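Consistency is the easiest of the four to measure directly: run the same input many times and look at the spread. The sketch below is one possible shape for that check, not a standard metric. `flaky_agent` is a hypothetical stand-in for a real agent, and comparing outputs by exact string match is an assumption that real evaluations usually relax by normalizing or grading outputs first.

```python
import random
from collections import Counter
from typing import Callable

_rng = random.Random(0)  # seeded so the stub is repeatable


def flaky_agent(task: str) -> str:
    """Hypothetical agent stub that usually, but not always, answers '42'."""
    return _rng.choice(["42", "42", "42", "forty-two", "error"])


def consistency_report(run: Callable[[str], str], task: str, trials: int = 50) -> dict:
    """Run the same task repeatedly and summarize how much the output varies."""
    outputs = Counter(run(task) for _ in range(trials))
    modal_output, modal_count = outputs.most_common(1)[0]
    return {
        "trials": trials,
        "distinct_outputs": len(outputs),     # 1 means fully consistent
        "modal_output": modal_output,         # the most frequent output
        "modal_share": modal_count / trials,  # share of runs that agree with it
    }


print(consistency_report(flaky_agent, "What is 6 * 7?"))
```

A deterministic function reports one distinct output and a modal share of 1.0. Anything below that is variation a single demo run would never surface.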

Failure severity is particularly important because success rate doesn’t measure it. A formatting error and an unauthorized destructive action both register as a failed run, but they are not the same thing. An agent that occasionally returns a misformatted answer is an inconvenience. An agent that occasionally deletes a file or takes an action it was not authorized to take is a liability, regardless of how rarely it happens. Weight outcomes by what they cost rather than counting them as equal events.
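One way to weight outcomes by cost is to score each run against a severity table rather than a pass/fail flag. The outcome labels and cost weights below are invented for illustration; a real deployment would set them from its own risk assessment.

```python
# Hypothetical severity weights: relative cost of each outcome type.
SEVERITY_COST = {
    "ok": 0,
    "format_error": 1,
    "wrong_answer": 10,
    "unauthorized_action": 1000,
}


def failure_rate(outcomes: list[str]) -> float:
    """Plain failure rate: every non-ok run counts the same."""
    return sum(outcome != "ok" for outcome in outcomes) / len(outcomes)


def expected_cost(outcomes: list[str]) -> float:
    """Average severity-weighted cost per run."""
    return sum(SEVERITY_COST[outcome] for outcome in outcomes) / len(outcomes)


agent_a = ["ok"] * 90 + ["format_error"] * 10        # fails often, mildly
agent_b = ["ok"] * 99 + ["unauthorized_action"] * 1  # fails rarely, badly

print(failure_rate(agent_a), expected_cost(agent_a))  # 0.1 0.1
print(failure_rate(agent_b), expected_cost(agent_b))  # 0.01 10.0
```

By failure rate alone, the second agent looks ten times better. Weighted by these illustrative costs, it is a hundred times worse.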

A system that posts a higher mean score is not necessarily the better choice if its failures are more severe or its outputs are less consistent. You need to define, for the specific deployment, which of the four properties matters most, and evaluate against that. An agent that drafts internal summaries tolerates variation that an agent that approves expense reports cannot.

What agents actually cost

The economics of modern AI agents surprise teams. While a chatbot answers a question in one exchange, an AI agent runs a loop, calling the model repeatedly as it plans, acts, observes, and revises. Each pass through the loop spends tokens.

This adds up. Anthropic reports that one agent typically uses about four times the tokens of a chat interaction, and that a multi-agent system uses about 15 times as many.⁵

For coding agents specifically, research from the Stanford Digital Economy Lab found that agentic tasks used as many as a thousand times more tokens than ordinary code chat or reasoning, and that two runs of the same task could differ in total token use by as much as 30 times.⁶

Moreover, spending more tokens did not reliably produce more accurate results.⁷ The last finding is the one that undermines the natural instinct to throw more processing at a struggling agent: It increases variance as readily as it increases quality.

In practice, a proof of concept that costs $50 to run can become a monthly bill ten times larger than the team projected.

Two related mistakes actually make it worse:

Reasoning in circles: An agent that fails to recognize when it is done will keep calling tools, burning tokens on a loop that produces nothing.
Treating a large context window as a fix: Filling a context window costs money and latency, and beyond a point, it costs accuracy as well, because models are less reliable with information buried in the middle of a long input.
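The first mistake can be contained with guards around the loop itself. This is a minimal sketch, assuming the agent's proposed tool calls can be represented as hashable `(tool, argument)` pairs. The thresholds are arbitrary, and `next_call` is a hypothetical stand-in for asking the model what to do next.

```python
from collections import Counter
from typing import Callable, Optional

ToolCall = tuple[str, str]  # (tool name, argument)


def run_with_guards(
    next_call: Callable[[int], Optional[ToolCall]],
    max_steps: int = 20,
    max_repeats: int = 2,
) -> tuple[str, int]:
    """Drive an agent loop, stopping on a step cap or on repeated identical calls.

    Returns (reason, steps_used). `next_call` returns None when the agent
    declares itself done.
    """
    seen: Counter = Counter()
    for step in range(max_steps):
        call = next_call(step)
        if call is None:
            return "done", step
        seen[call] += 1
        if seen[call] > max_repeats:
            # The same tool with the same argument again: likely a circle.
            return "stopped: repeated call", step
        # A real loop would execute the tool call here.
    return "stopped: step limit", max_steps


# A stuck agent that keeps issuing the same search is cut off early.
print(run_with_guards(lambda step: ("search", "same query")))
# ('stopped: repeated call', 2)
```

Exact-match repeat detection misses loops that rephrase the same query, so the step cap remains the backstop.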

So how do you keep costs under control?

Model token consumption per workflow before the architecture is fixed, not after the bill arrives.
Estimate how many model calls a task requires and how large each call's input grows as the agent accumulates context.
Set budget limits and alerts.
Route easy subtasks to smaller models and reserve the expensive model for the steps that need it.
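A back-of-the-envelope model covers the first three steps above. The sketch assumes each model call re-sends the full accumulated context with no prompt caching, and the token counts and per-million-token prices are placeholders rather than any provider's real rates.

```python
def estimate_task_tokens(
    steps: int,
    base_input: int,
    output_per_step: int,
    observation_per_step: int,
) -> tuple[int, int]:
    """Total (input, output) tokens for one task.

    Each call sends the base prompt plus everything accumulated so far:
    earlier model outputs and the tool observations they produced.
    """
    total_input = 0
    total_output = 0
    context = base_input
    for _ in range(steps):
        total_input += context
        total_output += output_per_step
        context += output_per_step + observation_per_step
    return total_input, total_output


def task_cost(total_input: int, total_output: int,
              input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Cost of one task, given prices per million tokens."""
    return (total_input * input_price_per_mtok
            + total_output * output_price_per_mtok) / 1_000_000


def within_budget(cost_per_task: float, tasks_per_month: int, budget: float) -> bool:
    """True if projected monthly spend stays at or under the budget."""
    return cost_per_task * tasks_per_month <= budget


tokens_in, tokens_out = estimate_task_tokens(
    steps=10, base_input=2_000, output_per_step=300, observation_per_step=700)
cost = task_cost(tokens_in, tokens_out,
                 input_price_per_mtok=3.0, output_price_per_mtok=15.0)
print(tokens_in, tokens_out, round(cost, 2))  # 65000 3000 0.24
print(within_budget(cost, tasks_per_month=10_000, budget=1_000.0))  # False
```

In this example the ten-step task sends 65,000 input tokens, against 2,000 for a single call with the same prompt. Input cost grows roughly with the square of the step count because every call carries the context of all earlier ones.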

Protecting against prompt injection

Security changes character when an AI agent gains the ability to act. A generative AI chatbot manipulated into saying something it should not say has produced bad content. An AI agent with tool access, manipulated the same way, has sent data, deleted a record, or made a request on the user's authority. This is what separates agentic systems from the models they are built on, and the vulnerability that enables the abuse is prompt injection. It resists tidy fixes because of how language models read.

A model receives instructions and data in the same stream of tokens and has no reliable way to tell which is which. Text that arrives as data (a web page the agent fetches, a document it reads, an email message it processes) can carry instructions that the model follows as if they came from its operator. Security researchers have compared this to cross-site scripting, the web vulnerability class that comes from confusing code with content, and the comparison is apt: in both cases, the system cannot distinguish the trusted from the untrusted because they share a channel. The OWASP project that catalogs language-model risks ranks prompt injection at the top of its list and notes that neither retrieval augmentation nor fine-tuning fully mitigates these vulnerabilities.⁸

Two properties make the agentic version worse than the chatbot version.

Confused-deputy structure: The agent acts with privileges the attacker does not have, so an injected instruction borrows the agent's authority.
Amplification: One injected instruction triggers many actions at machine speed, turning one poisoned input into thousands of operations before anyone notices. Indirect injection, where the malicious text is planted in a source that the agent will later read rather than typed by the attacker directly, makes the delivery hard to anticipate.

Because the vulnerability is architectural, mitigations are architectural too, and they are about containment rather than prevention:

Grant the agent the least privilege its task requires, so a successful injection commands less.
Separate untrusted content from privileged instructions wherever the design allows.
Require human intervention and explicit approval before any irreversible or high-impact action.
Constrain tools to allow lists, and log what the agent does so that an incident can be reconstructed.

None of these stops injection outright. What they do is limit the damage, which, for an architectural vulnerability, is the realistic goal. Treating prompt injection in agentic AI as a patchable defect means waiting for a fix that the structure does not permit.
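Several of the containment measures listed above can live in one place: a gate that every tool call must pass through. The sketch below is illustrative only. The tool names, the allow list, and the `approve` callback (standing in for a human approval step) are all assumptions, and a real system would enforce the same limits at the credential level, not only in application code.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("agent.audit")

# Hypothetical policy: which tools this agent may use at all,
# and which of those need a human to sign off first.
ALLOWED_TOOLS = {"read_record", "send_email"}
NEEDS_APPROVAL = {"send_email"}


class ToolDenied(Exception):
    """Raised when a tool call is blocked by policy or by the approver."""


def gated_call(
    tool: str,
    args: dict[str, Any],
    registry: dict[str, Callable[..., Any]],
    approve: Callable[[str, dict[str, Any]], bool],
) -> Any:
    """Run a tool call only if it passes the allow list and any approval step."""
    if tool not in ALLOWED_TOOLS or tool not in registry:
        logger.warning("denied tool=%s args=%r reason=not_allowed", tool, args)
        raise ToolDenied(f"{tool} is not on the allow list")
    if tool in NEEDS_APPROVAL and not approve(tool, args):
        logger.warning("denied tool=%s args=%r reason=not_approved", tool, args)
        raise ToolDenied(f"{tool} was not approved")
    logger.info("calling tool=%s args=%r", tool, args)  # audit trail
    return registry[tool](**args)
```

Even if an injected instruction convinces the model to request `delete_record`, the gate refuses it because the tool is not on the list. The injection still happened, but it commands less.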

Single agent vs. multi-agent: The most expensive default

Multi-agent systems, in which several AI agents divide a task and coordinate, seem more sophisticated, which leads teams to reach for the multi-agent design first. The consensus among practitioners who have shipped both is that starting with multi-agent is the more expensive path.

There’s evidence on both sides, so the choice should follow the shape of the work. On one hand, multi-agent systems outperform single agents on the right kind of work. A research system using a lead agent coordinating subagents beat its single-agent configuration by a wide margin on an internal evaluation, due primarily to the amount of work the parallel agents could do, according to Anthropic.⁹

On the other hand, Anthropic also warns that domains requiring all agents to share context, or involving many dependencies between agents, are not a good fit for multi-agent systems today because LLM agents are not yet good at coordinating and delegating to other agents in real time.¹⁰

A multi-agent system is good for work that breaks into independent, parallel pieces with little need for the agents to stay synchronized, such as read-heavy research, broad search, and other tasks where each worker operates independently without waiting on the others. Multiple AI agents struggle when the pieces are tightly coupled and when they must share a common understanding that drifts as each maintains its own, because coordination overhead uses up more reasoning capacity than multi-agent adds.
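The shape that suits multi-agent work looks like a plain fan-out: independent subtasks, no shared mutable state, and one merge at the end. The sketch below uses a thread pool to show that shape. `research_subtask` is a hypothetical stand-in for a full subagent run, and the three research angles are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def research_subtask(question: str) -> str:
    """Stand-in for one subagent run; a real one would call a model and tools."""
    return f"findings for {question!r}"


def fan_out(
    questions: list[str],
    worker: Callable[[str], str] = research_subtask,
    max_workers: int = 4,
) -> list[str]:
    """Run independent subtasks in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, questions))


def lead_agent(topic: str) -> str:
    """Split a topic into independent questions, fan out, then merge once."""
    questions = [f"{topic}: {angle}" for angle in ("history", "costs", "risks")]
    return "\n".join(fan_out(questions))


print(lead_agent("AI agents"))
```

No worker reads another worker's output, so nothing has to stay synchronized. Once subtasks need each other's intermediate results, this structure stops applying, and the coordination cost described above takes over.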

What matters most is specification quality. Autonomous agents working in separate contexts diverge unless the task is specified clearly enough that they do not have to negotiate what it means.

A better strategy is to build the single-agent version first, measure it, and add agents only when the evaluation data shows the work is parallelizable and the single agent is the bottleneck. Reaching for multiple agents because they’re cool, before the data justifies it, adds coordination cost and correlated failure.

The ROI reality: Why most pilots stall

The question underneath all the others, for anyone funding this work, is whether AI agents deliver a return or the category is running on hype.

Let’s start with a number that has become the rallying point for skeptics. A 2025 report from MIT's NANDA initiative, studying gen AI deployments in business, found that 95% of pilots delivered no measurable impact on profit and loss, and that only a few integrated systems created significant value.¹¹

The number is stark, but there are a couple of points to keep in mind:

The figure covers generative AI pilots broadly rather than AI agents alone.
It comes from just one report.

With those caveats in place, the report's diagnosis is useful. It attributes the failures not to weak models but to gaps in integration and organizational learning. The pilots did not fail because the technology could not do the task. They failed because the surrounding system (the data plumbing, the workflow fit, and the institutional capacity to use a new tool) was not ready.

The same body of work found that internally built systems succeeded at roughly half the rate of systems sourced from specialized vendors. This implies a default for teams without deep in-house experience: Buy the narrow capability rather than build the general one, at least until the organization has learned enough to know what it needs. And the expectation of a six-to-12-month payback is itself a cause of failure, because it leads teams to declare defeat before the integration work that determines success has been done.

The failure rate is a statement about readiness, not the technology. Systems that succeed tend to pick one narrow, high-pain process, keep human oversight in the loop while the system earns trust, set a realistic horizon, and measure leading indicators, such as the rate at which pilots reach production and the depth of adoption, rather than demanding immediate profit.

When not to use an agent

In many cases, the right number of agents is zero. The strongest move available to a team is often to recognize that the problem in front of it does not need an agent at all. These are the signals:

If the path to the goal is known, a workflow is cheaper and more reliable.
If the task is short and well-defined, one model call or a deterministic script will outperform an agent on cost, latency, and predictability.
If failure severity is high and cannot be contained, the unpredictability that defines agentic AI is a liability rather than a feature.
If the budget cannot handle token consumption several times that of a chat system, an agent is not cost-effective.
If the task requires several agents to share a constantly changing context, the multi-agent design won’t work well.

Each of these is a reason to choose the simpler path.

Here’s the way to go:

Start with the cheapest architecture that could plausibly solve the problem: no model, then one call, then a fixed workflow with a model in it, then one agent, and only then a multi-agent system.
Move up a step only when evidence from the step below shows it is insufficient.

This keeps cost, potential failures, and debugging difficulty as low as the problem allows, and it means any added complexity is paying for itself with measured improvement.
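The escalation rule can be written down as a short selection loop. This is a sketch under stated assumptions: the tier names mirror the ladder above, `evaluate` is a hypothetical function that returns a measured score for a tier on your own evaluation set, and the scores in the example are made up.

```python
from typing import Callable, Optional

# Architectures ordered from cheapest to most expensive.
LADDER = [
    "no model",
    "single model call",
    "fixed workflow",
    "single agent",
    "multi-agent system",
]


def choose_architecture(
    evaluate: Callable[[str], float],
    target: float,
) -> Optional[tuple[str, float]]:
    """Return the cheapest tier whose measured score meets the target.

    Tiers are evaluated lazily, so expensive tiers are never built or
    measured once a cheaper one is good enough. Returns None if no tier
    reaches the target.
    """
    for tier in LADDER:
        score = evaluate(tier)
        if score >= target:
            return tier, score
    return None


# Made-up evaluation scores, for illustration only.
measured = {
    "no model": 0.40,
    "single model call": 0.78,
    "fixed workflow": 0.93,
    "single agent": 0.95,
    "multi-agent system": 0.96,
}
print(choose_architecture(measured.__getitem__, target=0.90))
# ('fixed workflow', 0.93)
```

In this example the agent tiers score slightly higher, but the fixed workflow already clears the bar, so the extra cost and unpredictability would buy nothing the target requires.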

This way, you build fewer agents, but ship more working systems. Agents are a capable architecture for a specific situation, and the discipline that separates the deployments that work from the ones that stall is knowing what that situation is, and not using the architecture when it is absent.

Predictability as the real frontier

Originally, we wanted to know what agents can do, and the demos answered it. Now the question is whether agentic AI can do it predictably, affordably, and safely at production scale. That is a harder question, and it is as much an infrastructure question as a model one.

Systems that cross from pilot to production are not the ones with the most capable models. They are the ones built on an honest accounting of cost, an evaluation discipline that measures consistency and failure severity rather than an accuracy number, a security posture that treats injection as a containment problem, and the judgment to deploy agentic AI only where the path is uncertain. What separates the few that deliver value from the majority that stall is rarely the model, but the operational foundation underneath it.

Aerospike and AI agents

Several of the problems described here trace back to the data layer the agent uses. The high token bills from agents re-reading context they should already hold, the system whose behavior drifts between runs because its state is not durable, the multi-agent design that falls apart when shared context goes stale, and the tail latency that compounds across fan-out are all questions about how an agent stores, retrieves, and trusts its own state.

Infrastructure cannot make an agent reason correctly. What it determines is whether the agent keeps working correctly at production scale.

An agent is only as reliable as the memory and feature data it reads on each step, and that data has to stay correct and fast under production conditions: volatile, non-repeating access patterns; unpredictable fan-out across tool calls; and working sets that outgrow memory. Cache hit rates deteriorate under highly dynamic access patterns, which is the problem a caching-first serving layer has when agent traffic stops repeating.

And while much agent memory tolerates eventual consistency, the cases that do not are the high-stakes ones: tool coordination, transactional state, and any output that feeds a decision or a user-facing action that cannot be taken back.

Aerospike is a real-time database built for that situation. It holds agent state, session context, feature data, and embeddings with strong consistency and predictable latency at the tail and not just the average, and it does so on a patented Hybrid Memory Architecture that avoids the cost of keeping everything in RAM. It’s used by teams running real-time decisioning under strict latency budgets, including fraud engines that must return a verdict within a fixed window. What separates the systems crossing from pilot to production is a data layer that stays predictable when the layers above it do not.