Inside Flipkart’s journey to 90 million QPS: Scaling with Aerospike and Kubernetes
During the Aerospike Bangalore Summit, Aditya Goyal and Sahil Jain of Flipkart, India's leading e-commerce platform, shared how they use Aerospike to power a billion-scale shopping experience.
Every year, Flipkart’s Big Billion Days sale pushes its systems to the limit, with millions of concurrent app sessions and billions of product views hitting its backend in real time. Keeping performance predictable at that scale requires a database that maintains sub-millisecond latency and bounded tail performance, even during massive traffic spikes.
At the Aerospike Bangalore Summit, Aditya Goyal and Sahil Jain from Flipkart’s engineering team pulled back the curtain on how they transformed Aerospike from a collection of self-managed clusters into a unified internal service layer capable of handling India’s largest e-commerce traffic peaks.
How Flipkart delivers real-time experiences at a massive scale
When Flipkart’s Big Billion Days sale goes live, the platform becomes one of the busiest digital ecosystems in the world. Every tap, search, and scroll triggers a cascade of reads and writes across hundreds of services, with Aerospike handling the real-time data that powers them.
Today, Flipkart runs over 50 distinct use cases on Aerospike, including everything from the search bar on the homepage to the systems behind recommendations, ads, pricing, and inventory. Together, these clusters process an aggregate 90 million queries per second across three data centers.
That level of throughput is maintained by a core engineering team of fewer than ten developers, responsible not only for uptime but for evolving Aerospike into an internal service layer. Each use case, from caching to source-of-truth systems, depends on predictable latency measured in microseconds.
From self-managed clusters to centralized service
Early in Flipkart’s Aerospike journey, every team built and managed its own cluster. Product search had one, ads had another, recommendations had theirs. All were tuned differently and monitored separately, with each team provisioning infrastructure, handling scaling, performing upgrades, and recovering from outages independently.
That autonomy helped Flipkart experiment quickly, but over time it became unsustainable at scale. Different teams repeatedly solved the same infrastructure problems, such as managing replication factors, balancing nodes, and troubleshooting storage issues. Meanwhile, some clusters sat over-provisioned while others were starved for capacity.
“The real challenge… was managing infrastructure, scaling, and recovery across so many independent teams,” Goyal said.
And so, a small core team was formed with a clear mission: to build Aerospike as a shared service for the entire organization. This centralization had several goals:
Subject-matter expertise: A single team responsible for Aerospike best practices, performance tuning, and troubleshooting
Version consistency: All clusters running on verified builds with certified upgrades and resilience testing
Developer productivity: Application teams could now request clusters through a unified interface and focus on features, not infrastructure
Simplified disaster recovery: Centralized replication and failover strategy instead of fragmented recovery playbooks
The result was a cultural shift. Instead of building and maintaining isolated clusters, Flipkart teams could consume Aerospike the same way they used other internal services: on-demand, predictable, and governed by shared standards.
Evolving Aerospike into a cloud-native service
Centralizing Aerospike management was only half the challenge. To truly scale, Flipkart needed a system that could deploy, monitor, and heal itself without engineers logging into virtual machines or manually resizing clusters.
After evaluating multiple approaches, the team found that Kubernetes offered the flexibility and control they needed to automate Aerospike deployments as code. With it came the Aerospike Kubernetes Operator (AKO), a control layer that automated everything from provisioning to version upgrades.
“AKO helped us with deployment. Everything starts with a manifest file that defines pods, memory, CPU, and configuration,” Goyal explained. “We could define pod shape, memory, storage, CPU, and let the system handle the orchestration.”
The architecture runs on two logical planes: a control layer that automates scaling, monitoring, and management, and a data layer that hosts Aerospike clusters across Kubernetes namespaces with built-in load balancing and fault tolerance.
This design turned what used to be manual, error-prone operations into reproducible workflows. A new cluster could be created in minutes by simply updating a configuration file in GitHub. Meanwhile, a CI/CD pipeline merged that file with Helm charts to generate manifests, which AKO automatically deployed.
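To make that workflow concrete, here is a minimal sketch of what a generated manifest and its hand-off to AKO can look like, using the Kubernetes Python client and AKO's AerospikeCluster custom resource (API group asdb.aerospike.com). The cluster name, namespace, sizes, and config fields below are illustrative assumptions, not Flipkart's actual settings, and schema details should be checked against the AKO version in use:

```python
# Minimal sketch: declaring an Aerospike cluster as code and handing it to AKO.
# Assumes the Aerospike Kubernetes Operator is installed and watching the target
# namespace; all names and sizes below are illustrative, not Flipkart's values.
from kubernetes import client, config

# A pared-down AerospikeCluster custom resource. Real manifests also carry
# storage, access-control, and network sections, omitted here for brevity.
cluster_manifest = {
    "apiVersion": "asdb.aerospike.com/v1",
    "kind": "AerospikeCluster",
    "metadata": {"name": "demo-cache", "namespace": "aerospike"},
    "spec": {
        "size": 3,  # number of Aerospike pods
        "image": "aerospike/aerospike-server-enterprise:7.0.0.0",
        "podSpec": {"multiPodPerHost": False},
        "aerospikeConfig": {
            "service": {"feature-key-file": "/etc/aerospike/secret/features.conf"},
            "namespaces": [
                {
                    "name": "cache",
                    "replication-factor": 2,
                    "storage-engine": {"type": "memory", "data-size": 1073741824},
                }
            ],
        },
    },
}

config.load_kube_config()  # or load_incluster_config() inside a pipeline pod
api = client.CustomObjectsApi()
api.create_namespaced_custom_object(
    group="asdb.aerospike.com",
    version="v1",
    plural="aerospikeclusters",
    namespace="aerospike",
    body=cluster_manifest,
)
```

In practice a CI/CD job, not an engineer, would run this step after Helm renders the manifest from the team's configuration file.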
Flipkart also introduced namespace-level isolation to manage diverse workloads. High-performance, latency-sensitive systems run on newer hardware in dedicated namespaces, while less critical or hybrid workloads reuse older infrastructure. Separate namespaces for in-memory and hybrid topologies ensure each use case gets the right balance of throughput and durability.
Even scaling, traditionally one of the hardest problems in database operations, became nearly effortless. Horizontal scaling could be triggered in response to load; vertical scaling, which Kubernetes doesn't natively support, was handled by AKO through parallel deployments that shifted workloads automatically.
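In this declarative model, a load-triggered scale-out reduces to patching one field. A minimal sketch, assuming the same illustrative AerospikeCluster resource as above:

```python
# Sketch: horizontal scale-out by patching the desired cluster size.
# AKO notices the change, adds pods, and Aerospike rebalances partitions.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

api.patch_namespaced_custom_object(
    group="asdb.aerospike.com",
    version="v1",
    namespace="aerospike",
    plural="aerospikeclusters",
    name="demo-cache",
    body={"spec": {"size": 5}},  # scale from 3 to 5 nodes declaratively
)
```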
All of this meant the Aerospike team could focus on observability, optimization, and innovation, rather than maintenance. “In this whole journey, the code or the actual virtual machines are never touched,” Goyal said.
Inside Flipkart’s governance and ecosystem
Flipkart also needed a governance layer to ensure hundreds of clusters across different workloads, hardware generations, and teams could run predictably and safely. That meant unifying how clusters were deployed, monitored, and optimized.
At the center of this system sits the Aerospike DBaaS portal, a web-based interface that acts as the front door for every Flipkart team using Aerospike. Internal tenants can register their use cases, define latency and throughput goals, and submit configuration requirements directly through the portal. Behind the scenes, the platform automatically determines the best-fit hardware and topology, whether in-memory for ultra-low latency or hybrid for persistence and efficiency.
“Through the DBaaS portal, teams can just register their use case and performance needs, and the system takes care of provisioning the right hardware,” Jain said.
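The talk doesn't detail the portal's placement logic, but the decision it describes, in-memory for ultra-low latency versus hybrid for persistence and efficiency, can be sketched as a simple rule over the tenant's declared goals. Everything in this sketch, field names and thresholds included, is hypothetical:

```python
# Hypothetical sketch of the kind of placement rule a DBaaS portal could apply
# when a tenant registers a use case; thresholds and fields are invented here,
# not Flipkart's actual logic.
from dataclasses import dataclass

@dataclass
class UseCaseRequest:
    name: str
    p99_latency_ms: float     # tenant's tail-latency goal
    peak_qps: int             # expected peak throughput
    needs_persistence: bool   # source-of-truth vs. cache semantics

def pick_topology(req: UseCaseRequest) -> dict:
    # Ultra-low-latency, cache-style workloads land on in-memory namespaces;
    # anything that must survive restarts goes to hybrid (memory + SSD).
    if req.p99_latency_ms < 1.0 and not req.needs_persistence:
        topology = "in-memory"
    else:
        topology = "hybrid"
    # Crude capacity rule: one node per 500k QPS, minimum of 3 for quorum.
    nodes = max(3, -(-req.peak_qps // 500_000))  # ceiling division
    return {"use_case": req.name, "topology": topology, "nodes": nodes}

print(pick_topology(UseCaseRequest("homepage-search", 0.8, 2_000_000, False)))
# -> {'use_case': 'homepage-search', 'topology': 'in-memory', 'nodes': 4}
```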
Unified observability
To manage more than 200 active clusters, Flipkart built a comprehensive monitoring and alerting stack combining Aerospike and Kubernetes metrics into a single view.
Each Aerospike pod runs sidecar containers that export performance data via Prometheus, aggregated into Grafana dashboards. This unified observability allows the team to distinguish whether latency spikes originate from the database layer or infrastructure layer, a distinction that’s critical at scale.
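As a hedged illustration of how that distinction can be automated, the sketch below queries Prometheus's HTTP API for a database-side series and an infrastructure-side series scoped to the same pod. The Prometheus address and the Aerospike metric name are placeholder assumptions; the cAdvisor throttling metric is standard, but confirm the exact series your exporters emit:

```python
# Sketch: compare database-layer and infrastructure-layer signals for one pod
# via Prometheus's HTTP API. Metric names are placeholders; substitute the
# series your Aerospike exporter and node exporters actually emit.
import requests

PROM = "http://prometheus.monitoring:9090"  # assumed in-cluster address

def instant_query(promql: str) -> list:
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": promql}, timeout=5)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

pod = "demo-cache-0"
# Database side: read latency reported by the Aerospike exporter (placeholder name).
db_latency = instant_query(f'aerospike_latencies_read_ms_bucket{{pod="{pod}"}}')
# Infrastructure side: CPU throttling from cAdvisor for the same pod.
throttling = instant_query(
    f'rate(container_cpu_cfs_throttled_seconds_total{{pod="{pod}"}}[5m])'
)
# If reads are slow while the pod is being CPU-throttled, the spike is likely
# infrastructure, not the database itself.
print(len(db_latency), len(throttling))
```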
Custom alerting pipelines flag anomalies automatically, and asynchronous logging streams everything to a central store, ensuring that database events, pod restarts, and network metrics share a single audit trail. The result is a live operational view that combines memory usage, namespace capacity, disk I/O, and network bandwidth.
Reliability through testing
To maintain resilience, Flipkart practices continuous chaos testing, where engineers deliberately kill pods or simulate partial failures to validate recovery behavior. The system tracks metrics such as data loss, latency during degradation, and recovery time for both the database and Kubernetes layers.
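The shape of such a chaos run can be sketched in a few lines, assuming the Kubernetes Python client and the illustrative pod names from earlier; real chaos tooling adds failure variety, blast-radius limits, and client-side metric capture:

```python
# Sketch: kill one Aerospike pod and time how long the cluster takes to report
# it Ready again. Names are illustrative; production chaos runs would also
# record client-visible latency and any data loss during the window.
import time
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

NS, POD = "aerospike", "demo-cache-1"

core.delete_namespaced_pod(name=POD, namespace=NS)
start = time.monotonic()
time.sleep(5)  # give the deletion a moment to take effect

def pod_ready(name: str, namespace: str) -> bool:
    try:
        pod = core.read_namespaced_pod(name=name, namespace=namespace)
    except client.exceptions.ApiException:
        return False  # pod is still being recreated
    conds = pod.status.conditions or []
    return any(c.type == "Ready" and c.status == "True" for c in conds)

while not pod_ready(POD, NS):
    time.sleep(2)

print(f"Recovery took {time.monotonic() - start:.1f}s")
```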
The same ethos extends to disaster recovery. After extensive benchmarking, the team determined that Aerospike Cross Datacenter Replication (XDR) offers better recovery point objectives (RPO) for high-availability use cases, while backup-and-restore workflows deliver faster recovery time objectives (RTO) and stronger protection against data corruption.
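The RPO half of that comparison is observable at runtime: XDR reports a per-destination lag statistic that bounds how much un-shipped data a site loss could cost. A sketch with the Aerospike Python client follows; the info command and stat name follow Aerospike's documented XDR stats interface for server 5.0+, but verify both against the server version in use:

```python
# Sketch: read XDR lag (seconds of un-shipped writes) as a live RPO signal.
# The info command and 'lag' stat name should be verified against your
# Aerospike server version before relying on them.
import aerospike

client = aerospike.client({"hosts": [("aerospike.internal", 3000)]}).connect()

# Ask every node for its XDR stats toward a named destination DC.
responses = client.info_all("get-stats:context=xdr;dc=dc2")

for node, (err, stats) in responses.items():
    if not stats:  # node returned an error instead of stats
        continue
    # Response is a semicolon-separated key=value list; pull out 'lag'.
    pairs = dict(kv.split("=", 1) for kv in stats.strip().split(";") if "=" in kv)
    print(node, "xdr lag seconds:", pairs.get("lag"))

client.close()
```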
Reliability at record scale
The impact of Flipkart’s Aerospike transformation extends far beyond infrastructure automation.
Deployments that once took days now complete in minutes, with zero-downtime rolling upgrades
Over 200 clusters operate with consistent sub-millisecond reads, even during Big Billion Days peaks
XDR replication and continuous chaos testing ensure that recovery and failover are predictable, measurable, and verifiable
Operational overhead has dropped sharply, freeing engineers to focus on optimization rather than maintenance
Lessons from Flipkart’s experience
By rethinking Aerospike as a service layer rather than a set of individually managed clusters, Flipkart has eliminated fragmentation, reduced operational overhead, and given its developers the freedom to innovate without worrying about infrastructure. Governance and automation have become the quiet enablers of agility, ensuring every team, from search to ads to recommendations, can move fast without breaking reliability.
What’s next for Flipkart
The next frontier is intelligence: building a system that can anticipate, adjust, and optimize itself long before an engineer needs to intervene.
That vision is already taking shape. The Aerospike platform’s governance layer is evolving into a predictive engine, powered by telemetry from hundreds of clusters. By analyzing fragmentation reports, usage trends, and real-time performance data, the system can forecast when workloads will outgrow their current hardware and preemptively allocate resources.
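The forecasting model itself isn't described in the talk, but the core idea, fit a trend to usage telemetry and flag clusters before they cross capacity, needs nothing exotic. A hypothetical sketch with synthetic data and a simple linear fit:

```python
# Hypothetical sketch: forecast when a cluster's storage usage will cross its
# capacity by fitting a linear trend to daily telemetry. Real systems would use
# richer models and seasonality; the data here is invented.
import numpy as np

capacity_gb = 1000.0
days = np.arange(30, dtype=float)                         # last 30 days
usage_gb = 600 + 8.0 * days + np.random.normal(0, 5, 30)  # synthetic telemetry

slope, intercept = np.polyfit(days, usage_gb, 1)          # GB-per-day growth

if slope > 0:
    days_until_full = (capacity_gb - (slope * days[-1] + intercept)) / slope
    if days_until_full < 14:  # assumed provisioning lead time
        print(f"Pre-provision now: ~{days_until_full:.0f} days of headroom left")
    else:
        print(f"~{days_until_full:.0f} days of headroom")
else:
    print("Usage flat or shrinking; no action")
```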
Flipkart is also building an AI-driven DevOps engine, an automation layer that handles routine operational tasks such as AKO upgrades, cluster version rollouts, and performance validation. Instead of manually running scripts, engineers can visualize, approve, or roll back updates through a simple interface.
From edge reliability to predictive operations, Flipkart’s Aerospike journey shows how real-time infrastructure can evolve as intelligently as the business it powers.