A story about scale
The 75th minute of the 2022 Qatar World Cup Final is ticking away, and Messi’s Argentina is ahead 2-0 against France. Alex’s hopes of a France win are slowly slipping away. They decide to turn off the TV and start the laundry, which has been overdue for the past week. Continuing to watch the game is too hard, and hearing those three final whistles would hurt too much.
They turn off the TV in the grumpiest possible way and begin to set up the washer. Suddenly, they receive a notification on their smartwatch: Kylian Mbappe scored a penalty, and France is now just one goal behind Argentina. Alex’s hopes are back, but they’ve only just started the washer. They might as well finish quickly and get back to watching the game. A couple of seconds after the washer’s door locks, another notification reaches their phone: Mbappe scored again, and the game is now tied, 10 minutes before extra time. Alex rushes back to the TV, ecstatic to be able to continue watching the final.
It was crucial for Alex to receive these notifications in a timely manner so they could witness the rest of the game. The same is true not just for Alex but for the other 400M users receiving World Cup updates, and the many more receiving other notifications. Delivering them reliably and quickly is at the core of Firebase Cloud Messaging’s (FCM) engineering goals.
The biggest challenge in maintaining such a service is scale. Events like the World Cup bring together a huge number of fans like Alex all over the world, and the traffic they generate can be extremely spiky: a very steady traffic volume can surge the moment Mbappe scores the goal that ties the game, and suddenly FCM needs to notify 250M users about it.
To prepare for an event of this size, Firebase Messaging implemented several changes and best practices that we want to share with the rest of the engineering community. Hopefully, lessons learned here will help you prepare your infrastructure for the next big event.
Understanding the traffic
It is imperative to have a solid understanding of the type of traffic you will need to handle. In particular, you should understand:
- Where the traffic is coming from and where it needs to go.
- The different byte sizes of requests.
- How many users to deliver traffic to and what the per-user traffic pattern will be.
- Expected QPS fluctuations throughout events. This is especially important for spiky traffic where the peak QPS is more challenging than the baseline.
- Whether there will be concurrent events.
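A back-of-the-envelope model can make these questions concrete. The sketch below estimates the peak QPS you would need to absorb if a single goal triggers a notification fanout on top of steady baseline traffic; the 10-second fanout window is a made-up figure for illustration.

```python
# Back-of-the-envelope peak-QPS model for a notification fanout.
# The fanout window below is a hypothetical figure, not an FCM number.

def peak_qps(subscribers: int, fanout_seconds: float, baseline_qps: float) -> float:
    """Estimate total QPS when `subscribers` must be notified
    within `fanout_seconds`, on top of the regular baseline."""
    spike_qps = subscribers / fanout_seconds
    return baseline_qps + spike_qps

# e.g. notifying 250M users within 10s on top of an 18M QPS baseline
print(peak_qps(250_000_000, 10, 18_000_000))  # 43000000.0
```

Even a crude model like this tells you whether the peak is 1.5x or 10x your baseline, which changes the rest of the preparation dramatically.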
Using the past to predict the future
Sometimes you can rely on past events to predict future ones. In the case of the World Cup, figures from past competitions, combined with predicted organic growth, gave us a good estimate of the type of traffic and load we could expect.
To shed some light on the scale of the 2022 World Cup: we delivered over 115B notifications to over 400M users across the globe. During the Argentina - France kick-off, we reached a peak of 46M QPS, up from our average baseline of 18M, in a matter of seconds.
Now that you have a clear picture of the traffic you can expect, you should focus on the infrastructure that will handle it. First, determine whether your current infrastructure can handle that traffic safely. For events the size of the World Cup, that’s usually not the case.
You will likely need more capacity. How much is determined by the size and shape of the expected traffic relative to your baseline, as well as by your redundancy requirements. You should also set up additional capacity and rate-limiting controls in case your traffic estimate is off. This helps ensure the overall success of the event while minimizing the impact on your regular customers.
Obtaining all of these resources takes time, so it is crucial to plan ahead and account for buffer time in case the provider takes longer than expected to provision what was requested. Whenever possible, a good relationship with your cloud provider can also smooth certain processes and help you better identify what’s needed.
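A rough capacity sketch can anchor the conversation with your provider. In the example below, the per-replica throughput, utilization target, and redundancy factor are all hypothetical placeholders; only the peak figure comes from the numbers discussed above.

```python
import math

def replicas_needed(peak_qps: float, qps_per_replica: float,
                    utilization_target: float = 0.6,
                    redundancy_factor: float = 1.25) -> int:
    """How many replicas to provision for a peak, leaving headroom.

    utilization_target: run replicas at e.g. 60% of their max at peak.
    redundancy_factor: extra capacity to survive a cluster drain or a
    traffic estimate that is off. Both defaults are illustrative.
    """
    base = peak_qps / (qps_per_replica * utilization_target)
    return math.ceil(base * redundancy_factor)

# 46M QPS peak, assuming a hypothetical 5k QPS per replica
print(replicas_needed(46_000_000, 5_000))
```

The point is not the exact numbers but that headroom and redundancy are explicit inputs you can revisit, rather than implicit guesses.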
Cold potato routing
Most cloud providers allow capacity to be distributed across cells worldwide. Whenever feasible, try to provision infrastructure as close as possible to the source and destination of the traffic, so you can get requests in and out of your system more quickly while controlling the QoS and latencies in between. This practice is sometimes called Cold Potato Routing.
If you have latency targets you want to meet, cross-reference them with the ones guaranteed by your provider. Remember that spiky traffic pushes latencies to their limits, so 50th-percentile or average latencies are not what you should be looking at. Instead, look at 90th or even 95th-percentile latencies.
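Tail latencies are what matter here. As a quick illustration with made-up latency samples, a simple nearest-rank percentile shows how a healthy-looking median can hide a painful tail:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the value below which ~p% of samples fall."""
    s = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[k]

# Hypothetical per-request latencies in milliseconds
latencies_ms = [12, 15, 14, 90, 13, 16, 120, 14, 15, 13]
print(percentile(latencies_ms, 50))  # 14 — the median looks healthy
print(percentile(latencies_ms, 90))  # 90 — the tail tells the real story
```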
You should also prepare for the worst-case scenarios: hardware malfunction and natural disasters might bring your infrastructure down, and sometimes you may have to drain a cluster due to an issue. You should always provision infrastructure redundantly across clusters that share no dependencies to prepare for these scenarios. Also try to provision in cloud domains separate from the ones you’re currently in. This will isolate the traffic generated by the event from the regular traffic you usually serve, and it will enable you to safely perform load tests (more on that later). Most of these ideas are discussed in the SRE book.
When selecting the locations in which you will be provisioning, make sure that your dependencies don’t perform poorly in the same proximity. Load balancers will usually send your traffic to the nearest cluster hosting a dependency (say a storage service). Check that the average and 95th percentile for availability and latency are consistent across cloud domains.
Now that you have provisioned the underlying infrastructure, you need to configure how it will receive traffic. A load balancer will most likely take care of this, but you should ensure it is configured to behave strategically.
First, try to use the load balancer to separate new traffic from existing traffic. If you know that only certain targets or URLs will receive new traffic, instruct the load balancer to serve these on the newly provisioned infrastructure. After this, ensure that no regular traffic is served there.
This limits the risk that the new traffic poses to your regular service, but it’s not always simple to achieve. Downstream services that are global and/or stateful are harder to isolate. Sometimes, isolating the traffic entry point to your system will be sufficient.
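As a sketch of this separation, assuming a hypothetical URL prefix for event-only endpoints and illustrative pool names, the routing decision might look like:

```python
# Hypothetical routing rule: serve event-only endpoints from the newly
# provisioned pool so spikes cannot spill onto regular infrastructure.
# The URL prefix and pool names below are made up for illustration.

EVENT_PATHS = ("/v1/worldcup/",)

def pick_backend_pool(path: str) -> str:
    """Route event traffic to isolated capacity, everything else as usual."""
    if path.startswith(EVENT_PATHS):
        return "event-pool"    # new, isolated capacity
    return "regular-pool"      # untouched baseline capacity

print(pick_backend_pool("/v1/worldcup/notify"))  # event-pool
print(pick_backend_pool("/v1/send"))             # regular-pool
```

In practice this rule lives in the load balancer’s configuration rather than application code, but the decision it encodes is the same.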
Spraying the traffic
Spiky traffic is generally difficult to balance. Some spikes go from baseline to peak too quickly for the load balancer to react and spill traffic before it overloads the primary compute cluster. A very simple but effective mitigation strategy is to instruct the load balancer to spray traffic equally across “close” cells. What “close” means is highly dependent on the latency you can afford: sometimes you can only spread traffic between cells that are at most 10ms apart; other times, you can afford to spread traffic evenly across a continent. Ask yourself: can I afford an extra 20ms so that I can fan out 10x more traffic simultaneously? If so, spread that traffic and win at scale!
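A sketch of this spraying policy, with made-up cell names and round-trip times: pick every cell within the latency budget, then round-robin requests across them to flatten the spike.

```python
# Spray traffic evenly across all cells within a latency budget.
# Cell names and RTTs are hypothetical, for illustration only.
import itertools

CELL_RTT_MS = {"cell-a": 2, "cell-b": 8, "cell-c": 10, "cell-d": 35}

def eligible_cells(budget_ms: float):
    """Cells 'close' enough, i.e. within the extra latency we can afford."""
    return sorted(c for c, rtt in CELL_RTT_MS.items() if rtt <= budget_ms)

def spray(requests, budget_ms: float):
    """Round-robin requests across eligible cells to flatten the spike."""
    cells = itertools.cycle(eligible_cells(budget_ms))
    return [(req, next(cells)) for req in requests]

print(eligible_cells(10))   # a 10ms budget admits three cells
print(spray(range(4), 10))
```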
You should also be able to do the following for live mitigations:
- Limit traffic in certain cloud domains
- Softly isolate traffic coming from certain senders to given backends
- Spread traffic evenly to “close” areas, where you can control how “close” backends have to be
Defense in depth
While load balancers help spread traffic across your jobs safely, they cannot address every traffic pattern. For a more comprehensive defense strategy, instruct your replicas to reject requests when they are overloaded. This practice is sometimes called server-side throttling, and it can be a very effective last-resort option to protect your job’s resource consumption.
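A minimal sketch of server-side throttling, using an in-flight request cap (the class and the limit are illustrative; real systems often shed load based on CPU, memory, or queue depth instead):

```python
# Minimal server-side throttling sketch: shed load once in-flight work
# exceeds what one replica can safely handle. The limit is illustrative.

class Replica:
    def __init__(self, max_inflight: int = 100):
        self.max_inflight = max_inflight
        self.inflight = 0

    def try_accept(self) -> bool:
        """Reject (e.g. with HTTP 503) when overloaded; the caller can
        retry elsewhere with backoff instead of piling on here."""
        if self.inflight >= self.max_inflight:
            return False
        self.inflight += 1
        return True

    def finish(self):
        self.inflight -= 1

r = Replica(max_inflight=2)
print(r.try_accept(), r.try_accept(), r.try_accept())  # True True False
```

Rejecting a request cheaply is far better than accepting it and timing out after consuming resources the replica no longer has.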
Protect the good citizens
Despite being effective, load-balancing and throttling treat all your server requests in the same way and will not help you target harmful users. If a certain sender is overloading your service and is triggering server-side throttling, it would be unfair for other well-behaved users to be targeted by the same blocks. Quota checks address this issue and allow you to reject requests from harmful users while keeping other traffic mostly unharmed. Usually, cloud providers offer centralized quota services that help you manage quotas across users.
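One common way to implement per-sender quota checks is a token bucket. The sketch below, with illustrative rates and made-up sender names, shows how a noisy sender exhausts its own bucket while other senders are untouched:

```python
# Per-sender token-bucket quota sketch: a noisy sender exhausts its own
# bucket without consuming other senders' quota. Rates are illustrative.

class TokenBucket:
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, 0.0

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {}  # one bucket per sender id

def check_quota(sender: str, now: float) -> bool:
    bucket = buckets.setdefault(sender, TokenBucket(rate=10, burst=2))
    return bucket.allow(now)

print([check_quota("noisy-app", 0.0) for _ in range(3)])  # third call rejected
print(check_quota("well-behaved-app", 0.0))               # unaffected
```

A centralized quota service does the same bookkeeping, but shared across all replicas so a sender cannot dodge its quota by spreading requests around.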
An increase in traffic often just requires a quasi-linear increase in the number of server replicas. Another very useful approach is to reshape your replicas, increasing or decreasing the CPU and/or memory available to each one. Most cloud providers let you flexibly set the size of a replica, which makes it possible to scale vertically or horizontally without changing the overall amount of resources used.
Vertical scaling is the practice of reducing the number of replicas while also making them larger in CPU/memory/etc. Some advantages include using warmer caches and better code profiling.
Horizontal scaling is the practice of increasing the number of replicas while also making them smaller in CPU/memory/etc. Some advantages include reducing pressure on per-replica bottlenecks, such as the JVM Event Manager. This type of scaling is the one we had to adopt in preparation for the World Cup final.
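The trade-off can be sketched arithmetically: with a fixed resource budget (the totals below are made up), reshaping replicas changes the count without changing the overall footprint.

```python
# Same total resources, two shapes. Reshaping replicas trades per-replica
# pressure against cache warmth; the budget and sizes are illustrative.

TOTAL_CPU = 1200  # hypothetical cores available to the service

def shape(cpu_per_replica: int):
    replicas = TOTAL_CPU // cpu_per_replica
    return {"replicas": replicas, "cpu_per_replica": cpu_per_replica}

vertical = shape(cpu_per_replica=24)   # fewer, larger replicas
horizontal = shape(cpu_per_replica=6)  # more, smaller replicas
print(vertical)     # {'replicas': 50, 'cpu_per_replica': 24}
print(horizontal)   # {'replicas': 200, 'cpu_per_replica': 6}
```

The horizontal shape quadruples the replica count for the same budget, spreading load off any per-replica bottleneck at the cost of colder caches.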
Last but not least, you should test whether your entire setup can manage the expected load, which you can do via load tests. Plan these well before the event, as load testing is generally difficult: problems will most likely arise with the tests themselves, such as difficulty generating the right traffic, sustaining the load, or interpreting the results.
As a general rule, you should never perform load tests against production infrastructure: doing so will most likely harm your regular traffic and can create various problems. FCM does not support load tests against its servers for this reason. We previously discussed the importance of provisioning new, separate infrastructure, which we were able to leverage for load testing ahead of the World Cup.
Test traffic vs. real traffic
The most important part of a load test is ensuring its traffic follows a pattern very similar to the expected one, and that requests are similarly distributed in size and complexity. Often, we just test that the infrastructure can handle the peak load for a sustained period, and we forget to test how the load balancer reacts to spikes, the impact on our dependencies, or the handling of very sporadic bursts of requests in a short period of time.
For example, you could be testing for a peak load of 10M QPS sustained for 20s and gradually attained from a baseline of 2M QPS over a period of 10s. Instead, during the event you experience a peak of 5M QPS, reached from a baseline of 1M QPS in less than a second; then, the traffic goes back to 1M QPS in less than a second, back to 5M QPS, and so on.
This will probably cause some of your clusters to get overloaded. Hopefully, the replicas will then be able to throttle the requests, which will reduce the damage but still cause delays and server errors.
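A load-test driver can encode such a profile explicitly instead of a smooth ramp. The sketch below, reusing the section’s example figures, emits a per-second QPS target that alternates abruptly between baseline and peak:

```python
# Sketch of a load-test profile that jumps between baseline and peak
# almost instantly, mimicking goal-driven spikes rather than a smooth
# ramp. Figures mirror the hypothetical 1M/5M QPS example above.

def spiky_profile(baseline_qps: int, peak_qps: int,
                  spike_secs: int, rest_secs: int, cycles: int):
    """One target QPS value per second of the test."""
    profile = []
    for _ in range(cycles):
        profile += [peak_qps] * spike_secs + [baseline_qps] * rest_secs
    return profile

print(spiky_profile(1_000_000, 5_000_000, spike_secs=2, rest_secs=3, cycles=2))
```

Driving a test from a profile like this exercises exactly the behaviors a sustained-peak test misses: the load balancer’s reaction time and the replicas’ throttling under repeated abrupt transitions.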
Managing high spikes in traffic is certainly no easy task. But equipped with the right best practices and the most effective mitigation strategies, you will be in the best position to handle large traffic volumes.
The most critical themes to keep in mind are:
- Preparation, in the form of capacity planning and load tests
- Defense, of your regular traffic and of your replicas from spikes and harmful senders
With good preparation you too can scale your infrastructure to deliver wins to your customers for the next World Cup!