Snap: Developing faster, cheaper, and highly scalable solutions at scale
Srini Srinivasan:
Hello, welcome to the session on SNAP. It's my great pleasure to introduce Saral Jain, who's the senior director of engineering at SNAP. Saral has over 15 years of experience leading engineering organizations. Currently, he leads the infrastructure, data, and IT teams at SNAP. He's here to talk to us about how, at SNAP, he and his team have helped develop faster, cheaper, and highly scalable solutions at scale. Saral.
Saral Jain:
Thanks, Srini, for the warm welcome. Hey everyone, I'm Saral. As Srini mentioned, I'm a senior director of engineering at SNAP and I lead our enterprise and cloud services department, which covers our infrastructure, data, and IT organizations. I'm super excited to be here to chat about how we use Aerospike at SNAP.
Before I talk about Aerospike, let me quickly introduce Snapchat. Snapchat is an app that hundreds of millions of people around the world use to communicate with their close friends. Our focus is on close friendships and we enable the fastest way to communicate through features like personalized maps, chat, camera, stories, and spotlight. Snapchat over the last decade has built some of the most popular features for our friends to communicate with each other, including ephemeral messaging, which is one of the first things Snapchat got really popular with. In addition to that, we have built features like Bitmoji, personalized maps, vertical videos, and spotlight to enable friends to communicate with each other. We have over 300 million daily active users on the Snapchat platform, and as you can imagine, that puts an insane burden on our infrastructure just because of the sheer scale our infrastructure has to deal with for our global community of users.
Let me quickly talk about infrastructure at SNAP. Back in 2016, Snapchat was a monolithic architecture. We had one server application running the majority of the features on Snapchat. It was one monolithic deployment on Google App Engine, which contained hundreds of endpoints, and it had a shared Google App Engine Memcache and a shared Google App Engine datastore. While this architecture worked really well for us to initially scale our app to millions of users, we eventually started seeing some bottlenecks with this architecture. First was the performance. Because of our global user base, the performance really suffered when users had to come all the way to the US to make requests. Second was operational excellence: because all of the features were in one monolithic application, the blast radius was really high. And third, this particular architecture was really expensive for us.
In the last few years, we have been re-architecting our Snapchat backend to be more performant, cost-efficient, and operationally reliable. We have gone from using a single cloud provider to using multiple cloud providers. We have gone from a monolithic architecture to a service-oriented architecture, which contains thousands of microservices running across both our cloud providers. We have gone from an unreliable, operationally expensive architecture to a completely operationally reliable architecture with things like service availability SLAs, and an OE program that ensures we have a culture of strong operational ownership at SNAP.
We have gone from a single region to a completely regionalized architecture, where all of the media is served from the regions around the world closest to our users, and most of our latency-critical services are completely regionalized to provide seamless performance for our customers.
And finally, we have gone from a platform-as-a-service architecture to more of an infrastructure-as-a-service architecture with sufficient managed platforms, which the infrastructure team has built to solve some of the SNAP-specific problems like media delivery and efficiently synchronizing data from our mobile devices back to the cloud.
With this architectural evolution, in 2022, we have a multi-cloud, service-oriented, operationally reliable, cost-efficient, regionalized architecture, which runs on top of an infrastructure-as-a-service solution. Let me talk a little bit about the scale of infrastructure at SNAP.
In the last five years, we have by many metrics grown by multiple orders of magnitude in terms of our infrastructure. We have thousands of microservices and thousands of Kubernetes clusters running. We have over 300 million daily active users using the platform. And that translates to a traffic of over 15 million queries per second on most of our tier-zero and tier-one services. That translates to over 20 million QPS on many of our databases, which they have to support in real time. And many of these databases actually store upwards of hundreds of petabytes of information as well, so it's an extremely large-scale architecture which we have built. And next I would love to chat about how Aerospike has helped us enable this architectural scale.
I should quickly talk about a few requirements that any database, including Aerospike, has to meet at Snapchat to support the massive scale we have. First, performance is absolutely the topmost priority, so that we can give the right operationally reliable and performant experience to our customer base, the end users of Snapchat.
Second, we never risk regressions or downtime. Our database system should never bring the app down so operational reliability is the next most important thing. We require our systems to be highly available and fault tolerant by default.
The third thing is multi-cloud. We prefer to adopt existing solutions, but we build and leverage open source and portable solutions so that we avoid vendor lock-in. We do not want to be locked in to any single cloud provider.
The fourth is developer experience. We never want to have our database solutions be a bottleneck in executing on new strategic initiatives. We aim to make Snapchat engineering teams more efficient by providing shared infrastructure and frameworks.
And then finally, operational excellence is the key to our success. We prefer solutions that are simple to operate and leverage the capabilities of underlying cloud providers.
So these are some of the requirements we put not just on Aerospike as a solution, but on all databases which we use at Snapchat. With this, I would love to talk about a few workload categories that we use Aerospike for at Snapchat. The first is the pure cache workload. The second is a high-throughput, low-durability workload, and the third is a high-durability or strongly consistent workload.
The first, as I mentioned, is the pure cache workloads. These are typically optimistic low-latency storage systems that are built on top of a durable system of record. We typically use things like Memcache, Google Memorystore, AWS ElastiCache, or we host our own Memcache or Redis clusters to satisfy the pure cache workloads.
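To illustrate the pure cache pattern described here (a low-latency cache sitting in front of a system of record), the following is a minimal cache-aside sketch. The class name `CacheAside`, the in-memory dictionaries standing in for both the cache tier and the system of record, and the 60-second TTL are assumptions made for illustration; none of this reflects Snap's actual code or any particular cache client's API.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class CacheAside:
    """Hypothetical cache-aside wrapper; a dict stands in for the cache tier."""

    def __init__(
        self,
        load_from_record: Callable[[str], Optional[str]],
        ttl_seconds: float = 60.0,  # assumed TTL, not a figure from the talk
    ) -> None:
        self._load = load_from_record
        self._ttl = ttl_seconds
        self._cache: Dict[str, Tuple[float, str]] = {}  # key -> (expiry, value)
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Optional[str]:
        entry = self._cache.get(key)
        if entry is not None and entry[0] > time.monotonic():
            # Fresh entry: serve from the cache without touching the record store.
            self.hits += 1
            return entry[1]
        # Miss or expired entry: fall back to the system of record.
        self.misses += 1
        value = self._load(key)
        if value is not None:
            self._cache[key] = (time.monotonic() + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        # Call after writing to the system of record so stale data is not served.
        self._cache.pop(key, None)


# A plain dict plays the role of the system of record in this sketch.
system_of_record = {"user:1": "alice"}
cache = CacheAside(system_of_record.get)
first = cache.get("user:1")   # miss, loaded from the system of record
second = cache.get("user:1")  # hit, served from the cache
```

Because the cache is optimistic, losing an entry only costs one extra read against the system of record, which is why this workload category can tolerate low durability in the cache tier.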
The benefit we see from Aerospike is immense in this particular workload category. Aerospike provides SSD storage that can help us reduce price per GB of storage. Aerospike provides zonal affinity that reduces cross-zone traffic. Aerospike also provides a lower operational burden than self-hosted Redis clusters. And then finally, Aerospike provides cross-cloud, cross-region replication that helps us do much less work on the application layer. With this, Aerospike provides significant benefits on all of the different requirements we have on our databases, and it's a very popular choice [inaudible 00:08:03] Snapchat for pure cache workloads.
In terms of high-throughput, low-durability use cases, these are use cases where we use databases as a system of record, but with loose consistency and durability requirements. We typically used to use solutions like Google Bigtable, GCP Memorystore, AWS ElastiCache Redis or self-hosted Redis for these kinds of use cases. We found Aerospike to be extremely beneficial for these kinds of use cases as well because we can exploit the loose business requirements for more cost efficiency. We still see a much lower latency profile when using Aerospike, and we can again simplify our application logic related to cache inconsistency and replication. We don't have to build those things on the application layer and we can just rely on Aerospike to provide those guarantees for us.
And then finally, the high-durability or strong consistency requirements. This is where the databases are the authoritative system of record where we can have consistent data and almost no tolerable data loss. We use things like Google Datastore, AWS DynamoDB or Google Spanner for these kinds of use cases. Even here Aerospike shines in terms of its lower latency profile and easier cloud-agnostic data migration across clouds, and we can leverage hard disk drives for, again, lower cost. So across all of these different types of workload categories, we have found that Aerospike is the logical choice for all of them and we use it across the board.
Next, I would love to talk about a very specific use case where we evaluated Aerospike against Memcache. Aerospike was faster, cheaper, and offered zonal failover for this particular workload. And this is a representative workload; all of our other workloads saw similar kinds of advantages with Aerospike as well. This workload is a pure cache workload with a hundred-to-one read-write ratio, 10 to 30 KB object size, 250,000 read QPS and 5,000 write QPS. It's a representative workload for pure cache workloads at SNAP. Here we saw Aerospike giving us less than one millisecond P50 read latencies, less than one millisecond P90 read latencies, and three milliseconds P99 latencies, which is far superior to Memcache, and we see this across write latencies as well.
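As a rough back-of-envelope check on the workload figures quoted above, the sketch below turns the stated QPS and object sizes into payload throughput. It assumes decimal kilobytes (1 KB = 1,000 bytes) and ignores protocol and replication overhead; the function and variable names are hypothetical. Note that the quoted QPS figures work out to a 50:1 read-write ratio, so the hundred-to-one figure in the talk is presumably an approximate characterization.

```python
KB = 1000  # assumption: decimal kilobytes

# Figures as quoted in the talk.
READ_QPS = 250_000
WRITE_QPS = 5_000
OBJECT_KB_MIN, OBJECT_KB_MAX = 10, 30


def throughput_gb_per_s(qps: int, object_kb: int) -> float:
    """Payload bytes per second, expressed in decimal gigabytes."""
    return qps * object_kb * KB / 1e9


read_write_ratio = READ_QPS / WRITE_QPS
read_range = (
    throughput_gb_per_s(READ_QPS, OBJECT_KB_MIN),
    throughput_gb_per_s(READ_QPS, OBJECT_KB_MAX),
)
write_range = (
    throughput_gb_per_s(WRITE_QPS, OBJECT_KB_MIN),
    throughput_gb_per_s(WRITE_QPS, OBJECT_KB_MAX),
)

print(f"read:write ratio from QPS: {read_write_ratio:.0f}:1")
print(f"read payload:  {read_range[0]:.2f} to {read_range[1]:.2f} GB/s")
print(f"write payload: {write_range[0]:.2f} to {write_range[1]:.2f} GB/s")
```

Under these assumptions the read path carries roughly 2.5 to 7.5 GB/s of payload, which helps explain why per-GB storage cost and cross-zone bandwidth matter so much for this workload.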
On top of that, we saw that we were able to run Aerospike at one third the cost compared to Memcache, which is huge at our scale, those kinds of cost optimizations. And at the same time, we were able to store one copy per AZ, which gave us zonal failover, which is something Memcache was not able to give us at all. So we clearly saw all the advantages of using Aerospike for this representative workload and across other workloads at Snapchat as well.
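The idea of keeping one copy per availability zone can be sketched as a zone-aware read policy: prefer the replica in the caller's own zone, which avoids a cross-zone hop, and fall back to a healthy replica elsewhere when the local zone is down. This is an illustrative sketch only; the function `pick_replica_zone`, the zone names, and the node names are hypothetical, and it is not a description of how Aerospike implements this internally.

```python
from typing import Dict, Optional, Set


def pick_replica_zone(
    local_zone: str,
    replicas: Dict[str, str],  # zone name -> node holding that zone's copy
    healthy_zones: Set[str],
) -> Optional[str]:
    """Return the zone to read from, preferring the caller's own zone."""
    # Zonal affinity: a local read avoids cross-zone latency and bandwidth cost.
    if local_zone in replicas and local_zone in healthy_zones:
        return local_zone
    # Zonal failover: any other healthy zone still holds a full copy.
    for zone in sorted(replicas):
        if zone in healthy_zones:
            return zone
    return None  # no healthy copy is reachable


# Hypothetical layout with one copy in each of three zones.
replicas = {"az-a": "node-1", "az-b": "node-2", "az-c": "node-3"}
normal = pick_replica_zone("az-a", replicas, {"az-a", "az-b", "az-c"})
failover = pick_replica_zone("az-a", replicas, {"az-b", "az-c"})
```

In the normal case the read stays in `az-a`; when that zone is unhealthy the read moves to `az-b`, trading one cross-zone hop for continued availability.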
In terms of adoption, at this point, Aerospike forms the backbone of Snapchat architecture. It powers all of the five screens on Snapchat, including our Discover and Spotlight features, our monetization features, our core infrastructure including our user data and the friend graph for Snapchat itself, our growth features as well as our creative tools features. These features are deployed across multiple regions across the world, and they're deployed on both Google Cloud as well as AWS Cloud.
And then finally, I wanted to recap all of the business impact Aerospike has helped us realize in the last few years. First, on the performance side, Aerospike offers really low latencies as compared to other databases, especially on the tail latency side, which is really important at our scale.
Second, it helps us minimize hops on data redistribution, and again, in terms of performance, it minimizes the cross-AZ hops, which really helps end user latencies.
The second primary business impact is on cost efficiency. Aerospike helps us get a substantially lower price per GB for storage. It helps us with lower bandwidth costs, and then finally, it is viable for low-consistency, high-throughput use cases, which I talked about in one of the previous slides.
And then finally, very important for our business, Aerospike helps us get cloud-agnostic databases, so we are able to do consistent cross-cloud caching and XDR for cross-region and cross-cloud replication without rewriting all of these features in our application layer, which helps us a lot from a business perspective, but also from a developer efficiency perspective. So across performance, cost efficiency and cloud agnosticism, Aerospike has served us really well in the last few years and it has provided a lot of business impact to our Snapchat architecture.
So those are all the main benefits I wanted to chat about in terms of how Aerospike has helped us scale by orders of magnitude across the thousands of microservices we have, which power the Snapchat backend, and I would like to invite Srini to have an open Q&A and I'm happy to answer any questions.
Srini Srinivasan:
Thank you, Saral. That was a really good presentation. Definitely enjoyed hearing about the benchmarks as well as the various ways in which you're solving these hard problems. I know that essentially you have a lot of real-time business challenges at SNAP that you have overcome over the last several years. Can you give our audience some insights into how your teams, the infrastructure teams and data teams, help to solve these problems, these business challenges for your company?
Saral Jain:
Absolutely. Let me start by describing some of the business challenges, which are real-time. First is the engagement itself. Ultimately, Snapchat is built for Snapchatters. We want them to have superior performance and a seamless experience when they're chatting with their close friends on the Snapchat app. That can only happen through features like real-time ranking, real-time personalization of data, and that's really where all of the business challenges come in to be able to provide those features at our scale.
Secondly, for our business, it's really important for us to monetize the data on Snapchat and we have lots of very real-time signals, which we use to ensure we provide the right ads to the right customers at the right time so that we can monetize the app.
And then finally, it's about performance. We have customers around the world and we want to provide them a seamless experience. So based on real-time data, we are able to change the app so that it can be seamless and performant for our customer base. And my teams, infrastructure and data, are able to be the underlying platform which all of the feature engineering teams at Snapchat can rely on to do all of the undifferentiated heavy lifting to provide better customer engagement, better performance and better monetization capabilities on our platform. We do this by providing insights such as compute, networking, databases, caching and operational excellence as a feature to all of the feature engineering teams so that they can focus on [inaudible 00:15:18].
Srini Srinivasan:
So can you speak to how Aerospike and other such real-time systems have been helpful to your team in addressing these challenges?
Saral Jain:
Absolutely. Aerospike has been a critical part of our infrastructure, especially in terms of providing lower latencies, providing more cost efficiency, and providing a better performance for our infrastructure layer as we deal with these large-scale challenges.
Srini Srinivasan:
One of the things which has always fascinated me in terms of working with you and your teams at SNAP is how much Snapchat has grown; year over year, the growth is just amazing. So my question to you is, do you have any tips and advice for technologists who might be interested in undertaking the scaling to the order of a hundred x that you have been privileged to work on over the last several years?
Saral Jain:
That's actually a great question. It makes me really think about the journey we have had the last five years as we have grown multiple orders of magnitude in terms of our infrastructure usage.
One thing which I would definitely advise everybody who's going through that journey is to always know what that end state looks like. When we started this journey, we always knew that we wanted to be a multi-cloud regionalized service-oriented architecture, and we were able to create a vision which rallied all of our teams behind that vision to execute in the last few years to be able to adapt to the hundred x growth, which we have had. As part of that, always keep your end customers in mind. What is the best possible experience for your customers is the best possible decision for your business.
In addition to that, we have also tackled cultural challenges. We always have to ensure that, as a culture, we are thinking about hundred x growth in every single decision we are taking. And that has been instrumental for us to be able to think long-term and think big rather than taking small tactical decisions at any given point in time.
Srini Srinivasan:
Yeah, that's really good that you're going through this change and handling it successfully. As a final question, let me ask you, what are maybe three trends that you see are going to be important for us to pay attention to over the next two to three years?
Saral Jain:
That's a great question. There's a lot. Our industry's moving at an extremely fast pace at this point. From my vantage point, I feel like we will continue this path towards a multi-cloud architecture where most companies would be relying on 1, 2, 3, or several cloud providers depending on the business needs that they have.
In addition to that, I believe data governance and data compliance and data locality are going to be huge in the next few years. As more and more governments introduce data governance requirements in terms of policies and laws, we will have to comply with those.
And then finally, connectivity is the third thing which comes to my mind. More and more of the world is getting connected with things like 4G and 5G internet and we will be able to leverage that to bring our business and our opportunities to a larger set of users across the world. And I'm really excited about all three of these challenges and how we can continue to partner together with Aerospike to solve all of these going forward.
Srini Srinivasan:
Okay. Well, we've come to the end of this session, so thank you very much Saral for presenting so well, but also answering some of our questions here which came up. And wish you and your team all the best in continuing to scale up as you go forward. Thank you.
About this webinar
Snap, the parent company to the global social media app, Snapchat, faces multiple challenges with its globally popular, real-time applications. Hear how Saral Jain, Senior Director of Engineering, Enterprise and Cloud Services at Snap, is overcoming these challenges with Aerospike to build not just the proverbial “faster/cheaper”, but to do so at massive scale and growth rates.
Watch this webinar to learn how Snap leverages Aerospike to power important features such as:
– Monetization
– Core infrastructure
– Growth management
– Creative tools
– Discovery/spotlight features (i.e. search/advertising)
Plus, get an overview of Snap’s infrastructure evolution, workloads, and overall business impact with Aerospike.