Blog

Predictable performance in fan‑out architectures

Learn why request fan-out amplifies tail latency and how predictable data layer behavior keeps user experiences consistent as systems scale and dependency depth grows.

December 19, 2024 | 12 min read
Alexander Patino
Solutions Content Leader

Fan‑out architecture is a design pattern where one request or event is distributed into multiple parallel operations or messages. In practical terms, one user action triggers many calls to different backend services or database nodes. This pattern is common in distributed systems. 

For example, the Amazon e-commerce website reportedly calls on the order of a hundred microservices to assemble one page. Netflix has noted that each API request from a client triggers an average of about half a dozen calls to its backend services. 

These examples illustrate that fan‑out is normal for large applications. By fanning out work, systems consult many data sources or perform tasks concurrently, which is often essential for speed and scale. This approach also supports decoupling: different services or components handle pieces of a request independently, improving modularity. In event-driven architectures, fan-out also appears in publish/subscribe messaging patterns; one event gets fanned out to multiple subscriber services. 

In short, enterprises rely on fan‑out architectures to gather data from disparate sources, to increase throughput via parallelism, and to build more responsive experiences by doing many things at once. However, along with these benefits come challenges that must be addressed in the architectural design.

The performance challenge of fan‑out

While fan‑out supports high throughput and rich functionality, it also introduces a performance challenge: the more operations one request depends on, the greater the chance that one of them will be slow or will fail. The overall response time seen by a user is determined by the slowest of the parallel operations. Even if each backend call is fast on average, a large number of calls means there’s a high probability of hitting a tail latency outlier on one of them. 

For example, if a service has a 99th-percentile latency of 1 second (meaning 1 in 100 requests is that slow) and an application needs to wait on 100 such calls in parallel, it’s likely that some call in almost every user request will be an outlier. In fact, mathematically, about 63% of those composite requests would exceed 1 second in this scenario. 
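A quick back-of-the-envelope check confirms this. The sketch below (plain Python, no external dependencies) computes the chance that at least one of n independent parallel calls lands in the slowest 1%:

```python
# Each call independently lands in the slowest 1% with probability 0.01,
# so a request that waits on n such calls hits at least one outlier with
# probability 1 - 0.99**n.
def p_tail_hit(n: int, p_slow: float = 0.01) -> float:
    return 1 - (1 - p_slow) ** n

print(f"{p_tail_hit(1):.3f}")    # 0.010
print(f"{p_tail_hit(100):.3f}")  # 0.634
```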

In a real Google system, researchers observed that increasing the fan-out of a query caused the 99th-percentile latency to balloon from 10 ms for one backend to 140 ms when hundreds of parallel calls were required, with a few slow responses contributing disproportionately to the delay. 

In general, as fan‑out grows, even rare performance hiccups, such as slow disk access, a brief network blip, or a stalled process, degrade a number of user requests. This tail-amplification effect makes latency unpredictable: One user action might complete in 50 ms, and the next action, even hitting the same systems under similar load, could take ten times longer due to one sluggish dependency. 

High fan‑out also multiplies the system load, because an influx of users means even more internal operations. Without careful design, this leads to queuing delays, saturated resources, or timeouts cascading through the system. Fan‑out architectures tend to be only as fast and reliable as their slowest constituent part, so they require strategies to keep every part performing predictably under heavy parallel workloads.

Customer story: Wayfair

Wayfair is one of the world’s largest online destinations for furniture and home goods, offering more than 30 million items from over 20K suppliers. With millions of daily visitors and a dynamic fulfillment model, Wayfair’s competitive edge depends on providing fast, accurate delivery promises at the top of the customer funnel.

See how Wayfair delivers accurate promises at scale.

Architectural strategies for high fan‑out systems

Designing for high fan‑out requires a focus on maintaining predictable, low latency across many parallel operations. Equally important is controlling the expansion of load and complexity that comes with spreading work across numerous components. Here are several strategies and principles that enterprise architects use to address the challenges of fan‑out.

Design for predictable low latency per operation

When one user request involves dozens or hundreds of calls, each of those calls must exhibit consistently fast response times. Any variability will be magnified when they’re combined. Therefore, a foundation for fan‑out architecture is choosing technologies and designs that reduce latency jitter and tail latency. This involves using high-performance data stores and networks optimized for consistency under load. 

For instance, systems should be engineered to avoid long garbage-collection pauses, lock contention, or other events that cause occasional stalls. The goal is to tighten each component's latency distribution: not just a low average, but a narrow spread, so that 99th- and even 99.9th-percentile response times stay low and the system remains tail-tolerant. 

Some platforms do this by bounding queues and prioritizing interactive requests, by scheduling background tasks in ways that reduce interference, and by using real-time optimizations to reduce variance. Even hardware choices, such as high-speed SSDs or keeping important data in memory or near-memory storage, help keep any given data access from being a slow outlier. Every layer, from CPU scheduling to the database engine, should be optimized to deliver consistent performance. By keeping the slowest operations as rare and as short as possible, the fan-out pipeline becomes more predictable for the user. 

Reduce the number of dependent operations

Another way to reduce the effect of fan‑out is to avoid doing unnecessary fan‑out in the first place. In practice, this means designing data models and APIs that retrieve or update what you need with fewer calls. 

One approach is collocating data that is often accessed together. For example, instead of a user profile request triggering separate lookups to user info, preferences, and settings services, those pieces of data might be stored or cached together so one query returns all of them. Databases offer features such as collection data types or document models that store related items in one record or document, which reduces the number of reads required. 

Batch APIs and multi-get requests follow a similar logic: They allow a client to request many items in one round trip instead of issuing one call per item. By collapsing what would have been dozens of back-and-forth operations into one operation, fan-out is less wide, reducing the overhead of network hops and context switches. 
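As an illustrative sketch, the following compares per-item reads against a multi-get on a toy in-memory store; InMemoryKV and its method names are hypothetical stand-ins, not any particular client API:

```python
# Sketch: collapsing several single-key reads into one multi-get round trip.
# InMemoryKV is an illustrative stand-in; many databases expose similar
# batch-read APIs (multi-get, batch reads).
class InMemoryKV:
    def __init__(self, data):
        self.data = data
        self.round_trips = 0  # counts what would be network hops

    def get(self, key):
        self.round_trips += 1          # one hop per call
        return self.data.get(key)

    def batch_get(self, keys):
        self.round_trips += 1          # all keys in a single hop
        return [self.data.get(k) for k in keys]

kv = InMemoryKV({
    "user:7:info": {"name": "Ada"},
    "user:7:prefs": {"theme": "dark"},
    "user:7:settings": {"beta": True},
})

# Naive: three round trips for one logical read.
naive = [kv.get(f"user:7:{p}") for p in ("info", "prefs", "settings")]

# Batched: the same data in one round trip.
batched = kv.batch_get([f"user:7:{p}" for p in ("info", "prefs", "settings")])

print(kv.round_trips)  # 4 total: 3 naive calls + 1 batch call
```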

In cases where business logic requires conditional or iterative calls, pushing that logic closer to the data also helps. Rather than fetching a large dataset into an application server and then filtering or aggregating it, which incurs many calls or a large data transfer, the system can execute that logic in the data store through stored procedures, user-defined functions, or query languages that apply filter conditions where the data lives. This eliminates additional fan-out that would occur if the application had to fetch intermediate results and then make further calls based on those. By doing more work per call, the overall number of calls shrinks.
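A minimal example of this idea, using the standard library's sqlite3 module as a stand-in for any engine that evaluates predicates server-side:

```python
# Sketch: pushing a filter to the data store instead of fetching rows and
# filtering in the application. sqlite3 stands in for any database engine.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "open", 20.0), (2, "shipped", 35.0), (3, "open", 12.5)])

# Anti-pattern: pull everything, filter in the app (larger transfer).
all_rows = conn.execute("SELECT id, status, total FROM orders").fetchall()
open_in_app = [r for r in all_rows if r[1] == "open"]

# Better: let the engine evaluate the predicate; only matches cross the wire.
open_in_db = conn.execute(
    "SELECT id, status, total FROM orders WHERE status = 'open'").fetchall()

print(open_in_db)  # [(1, 'open', 20.0), (3, 'open', 12.5)]
```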

A related high-level strategy is fan-out on write instead of fan-out on read, in scenarios where many readers need the same computed data. One example is generating a social media feed. Fan-out on read computes a user’s feed by pulling posts from all their followers whenever the user opens the app, which results in hundreds of queries at read time. On the other hand, fan-out on write computes and stores updates to each follower’s feed as soon as someone posts, so reading the feed is a simple lookup. 

The write-heavy approach precomputes results, trading off more work at update time to reduce read fan-out. This results in consistently fast reads because much of the work is done ahead of time. 

However, it pushes the fan-out problem to the writes. In the feed example, a celebrity with millions of followers generates many writes on each new post. Real-world systems often adopt a hybrid model: they fan out on write for typical cases, but for extreme cases, such as users with large audiences, they might fall back to computing some parts on read. 
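The tradeoff can be sketched in a few lines; the data structures here are illustrative, not a production feed design:

```python
# Sketch of fan-out on write vs. fan-out on read for a social feed.
from collections import defaultdict

followers = {"alice": ["bob", "carol"]}   # author -> follower list
posts_by_author = defaultdict(list)
feeds = defaultdict(list)                 # precomputed per-reader feeds

def post_fanout_on_write(author, text):
    # Work happens at write time: one append per follower.
    posts_by_author[author].append(text)
    for reader in followers.get(author, []):
        feeds[reader].append((author, text))

def read_feed_fanout_on_read(reader, following):
    # Work happens at read time: one lookup per followed account.
    return [(a, t) for a in following for t in posts_by_author[a]]

post_fanout_on_write("alice", "hello")
print(feeds["bob"])  # [('alice', 'hello')] -- the read is a plain lookup
```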

In all cases, the guiding principle is to keep the fan-out bounded and manageable. By structuring how and when data is fanned out, architects keep any one request, whether a read or a write, from ballooning into many operations that could overwhelm the system.

Webinar: Powering real-time personalization with recommendation engines

Want to turn real-time data into real-time conversions? Watch our webinar, Powering real-time personalization with recommendation engines, and discover how teams at Wayfair, PayPal, and Myntra are using live behavioral signals and machine learning to deliver sub-millisecond personalized experiences, boosting engagement, loyalty, and revenue across every touchpoint. Learn how to scale personalization with lightning-fast performance.

Use asynchronous processing and isolation

Not every task triggered by a user action needs to be completed while the user waits. A common pattern in high-scale systems is to offload certain fan-out operations to run asynchronously, using messaging and event-driven workflows. 

For instance, consider an e-commerce order placement that needs to trigger downstream updates: inventory updates, sending a confirmation email message, logging analytics, and updating recommendations. Instead of the checkout service calling each of those services synchronously, and making the user wait for all of them, it publishes an “Order Placed” event to a message broker. Multiple consumers, such as an inventory service, email service, and analytics service, subscribe, and each reacts to that event on their own time. 
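A toy in-process broker makes the decoupling concrete. In this sketch the handlers run inline for simplicity; a real broker would queue the event and deliver it to each subscriber asynchronously. The topic and handler names are invented:

```python
# Minimal pub/sub fan-out sketch: one event, many independent subscribers.
from collections import defaultdict

class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        # Toy version: handlers run inline. A real broker queues the event
        # and delivers it asynchronously, so the publisher returns as soon
        # as the event is handed off.
        for handler in self.subscribers[topic]:
            handler(event)

broker = Broker()
handled = []
broker.subscribe("order.placed", lambda e: handled.append(("inventory", e["id"])))
broker.subscribe("order.placed", lambda e: handled.append(("email", e["id"])))
broker.publish("order.placed", {"id": 42})
print(handled)  # [('inventory', 42), ('email', 42)]
```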

This pub/sub fan-out decouples the originator from the downstream processes. The user-facing request of placing the order completes quickly, and the fan-out of side tasks happens in the background. Using message queues and topics for fan-out also adds buffering and resilience; if one consumer service is slow or briefly offline, messages queue up rather than slow down the origin. Asynchronous fan-out preserves responsiveness and isolates failures or slowness to the affected component. 

Enterprises often use managed messaging systems or streaming platforms to do this, because these systems handle the delivery to multiple subscribers reliably and at scale. The result is a more resilient architecture because high fan-out workflows are handled outside the critical path of user interactions, and services are loosely coupled through the event broker, reducing direct dependencies. This strategy does not eliminate fan-out, but moves it to a layer where it is managed and scaled independently, protecting the user experience from variability. 

In designing such systems, it’s important to set up monitoring and back-pressure mechanisms. For example, if a downstream consumer falls far behind, the system might need to throttle event publishing or spawn additional consumer instances, but these are solvable problems with well-established patterns. With asynchrony, architects avoid holding up the user; instead, many fan-out operations happen “offline” from the user’s perspective, improving perceived performance and reliability.

Graceful degradation and tail-tolerance techniques

Even with optimizations, in a highly fanned-out system, there will be times when one of the many paths is slow or fails. Robust architectures plan for this with graceful degradation and tail-tolerance techniques. Graceful degradation means the system delivers a useful response to the user even if some components don’t respond in time. 

For example, if an e-commerce page’s recommendation service is running slowly, the page gets returned without personalized recommendations rather than making the user wait or causing an error. This is often implemented with timeouts and fallback logic: Each backend call is given a strict time budget, and if it doesn’t complete, the system uses a default or skips that portion of the result. By capping waiting time for each sub-request, overall response time becomes more predictable and does not get held up by one straggler. 
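One way to sketch a per-call time budget with a fallback, using only the standard library; fetch_recommendations simulates a straggling downstream service:

```python
# Sketch: strict time budget per sub-request, with a default on timeout.
import concurrent.futures
import time

def fetch_recommendations(user_id):
    time.sleep(2)  # simulated straggler
    return ["personalized", "items"]

def page_data(user_id, budget_s=0.1):
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fetch_recommendations, user_id)
    try:
        recs = future.result(timeout=budget_s)  # strict time budget
    except concurrent.futures.TimeoutError:
        recs = []  # graceful degradation: render the page without recs
    pool.shutdown(wait=False)  # don't block the response on the straggler
    return {"user": user_id, "recommendations": recs}

print(page_data(7))  # {'user': 7, 'recommendations': []}
```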

Similarly, circuit breaker patterns stop routing requests to a persistently failing service, preventing cascading slowdowns across hundreds of threads waiting on something that’s not working. These approaches mean that one problematic component has a limited impact.
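A minimal circuit-breaker sketch, with invented names and thresholds: after a run of consecutive failures the breaker opens and subsequent calls fail fast instead of waiting on the broken dependency.

```python
# Sketch of the circuit-breaker pattern. Thresholds are illustrative.
import time

class CircuitBreaker:
    """After `threshold` consecutive failures, fail fast for `reset_after_s`."""

    def __init__(self, threshold=3, reset_after_s=30.0):
        self.threshold = threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # any success resets the count
        return result
```

In use, callers wrap each downstream invocation in `breaker.call(...)`; once the breaker trips, threads get an immediate error instead of piling up behind a dead service.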

Another set of techniques reduces the effect of tail latency. One such method is hedged requests, where the system preemptively issues duplicate requests to multiple backend nodes, or re-issues a request if it’s slow, and uses the first response that comes back. This reduces the tail of the latency distribution at the cost of a bit of extra work. 

For instance, Google found that sending an additional request for straggling calls in a large distributed query reduced the 99.9th percentile latency by orders of magnitude while only slightly increasing the total number of requests. Similarly, techniques such as tied requests send parallel requests to two servers, but once one server begins processing, the other abandons the task, “racing” two workers to get results faster without doing double the work. These patterns make the system more tail-tolerant by turning the unpredictability of individual nodes into a more predictable overall response. They require more infrastructure support for coordination and redundancy, so they tend to be used in ultra-low-latency services where every millisecond counts.
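A hedged request can be sketched with two simulated replicas: if the primary has not answered within a short hedge delay, the same request goes to a second replica, and the first response wins. Latencies and names here are invented:

```python
# Sketch of a hedged request using the standard library.
import concurrent.futures
import time

def query_replica(name, latency_s):
    time.sleep(latency_s)
    return f"result-from-{name}"

def hedged_get(hedge_after_s=0.05):
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
    primary = pool.submit(query_replica, "primary", 0.5)  # straggling node
    done, _ = concurrent.futures.wait([primary], timeout=hedge_after_s)
    futures = [primary]
    if not done:  # primary missed the hedge deadline: try a second replica
        futures.append(pool.submit(query_replica, "secondary", 0.01))
    done, _ = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED)
    pool.shutdown(wait=False)
    return next(iter(done)).result()  # first response wins
```

Because the hedge fires only after the primary misses its deadline, the extra load stays small: most requests never issue the duplicate.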

Finally, consider speed versus quality tradeoffs in certain domains. For example, web search and feed algorithms often choose to return a “good enough” result quickly rather than the perfect result slowly. In a fan-out context, this might mean that if 95 out of 100 parallel calls return promptly with relevant data, the system may ignore the 5 slow responses and proceed with the majority of the data. By doing so, the tail doesn’t hold up the result. The user sees a fast response and likely doesn’t notice the omission of those few pieces. This concept only applies when the work done by the slowest paths has diminishing returns, of course, but when applicable, it’s a good way to reduce tail latency. 
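This “proceed with the majority” idea can be sketched by waiting a fixed budget and collecting whatever completed; shard latencies are simulated:

```python
# Sketch: wait a fixed budget for parallel calls, then proceed with the
# results that arrived, ignoring slow stragglers.
import concurrent.futures
import time

def query_shard(i):
    time.sleep(1.0 if i % 20 == 0 else 0.01)  # 5 of 100 shards straggle
    return f"shard-{i}"

def gather(n=100, budget_s=0.2):
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=n)
    futures = [pool.submit(query_shard, i) for i in range(n)]
    done, not_done = concurrent.futures.wait(futures, timeout=budget_s)
    pool.shutdown(wait=False)  # stragglers finish in the background
    return [f.result() for f in done]  # "good enough" partial result

print(len(gather()))  # 95
```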

The overarching principle across these strategies is resilience: Assume that in a fan-out architecture, some fraction of components will misbehave at any given time, and build in ways to mitigate that, whether by failing fast, doing redundant work, or adapting the output. With careful engineering, enterprises turn fan‑out from a scalability issue into a routine part of a high-performance system, delivering both the rich functionality and the consistency that users expect.

Webinar: Big billion scale - Scaling high-performance platforms at Flipkart

Flipkart relies on Aerospike as its datastore and caching solution for critical, low-latency use cases like search, recommendations, inventory, pricing, and offers. During sales, the platform handles 90 million QPS across 350+ clusters on a shared, bare-metal Kubernetes environment powered by the Aerospike Kubernetes Operator.

In this session, Aditya Goyal and Sahil Jain share Flipkart’s journey, detailing the strategies, challenges, and optimizations behind operating Aerospike reliably at “Big Billion” scale.

Aerospike and fan-out architecture

High-performance data architectures like those at Aerospike are built to excel in fan‑out scenarios. Aerospike is a real-time data platform designed for predictable, low-latency operation even as workloads scale and fan out. Its unique approach, from its patented Hybrid Memory Architecture to features that support multi-record processing in one call, addresses fan-out challenges. By consistently maintaining sub-millisecond responses and tightly controlling latency variability, Aerospike keeps large fan‑out request patterns from compromising the user experience.

Systems that rely on large-scale request fan‑out succeed or fail based on how they behave under real-world conditions: volatile load, deep dependency chains, and inevitable variability across components. In these environments, the limiting factor is rarely peak throughput. It is whether end-to-end behavior remains predictable as fan‑out grows and conditions change.

Aerospike is built for this reality. By tightly bounding per-operation latency and reducing variability at the data layer, it allows fan‑out-heavy systems to remain responsive and controllable as scale, complexity, and demand evolve. The result is not just faster responses in ideal conditions, but a consistent user experience when it matters most: in production, under pressure, and over time.

Try Aerospike Cloud

Break through barriers with the lightning-fast, scalable, yet affordable Aerospike distributed NoSQL database. With this fully managed DBaaS, you can go from start to scale in minutes.