How Etsy Prepared for Historic Volumes of Holiday Traffic in 2020
For Etsy, 2020 was a year of unprecedented and volatile growth. (Simply on a human level it was also a profoundly tragic year, but that’s a different article.) In the second quarter, when lockdowns went into widespread effect, our site traffic leapt by an amount that would normally have taken several years to achieve.
When we looked ahead to the holiday season, we knew we faced real uncertainty. It was difficult to predict how large-scale economic, social and political changes would affect buyer behavior. Would social distancing accelerate the trends toward online shopping? Even if our traffic maintained a normal month-over-month growth rate from August to December, given the plateau we had reached, that would put us on course deep into uncharted territory.
Graph of weekly traffic for 2016-2020. This was intimidating!
If traffic exceeded our preparation, would we have enough compute resources? Would we hit hidden scalability bottlenecks? If we over-scaled, we risked wasting money and environmental resources. If we under-scaled, the consequences could have been much worse for our sellers.
For context about Etsy, as of 2020 Q4 we had 81 million active buyers and over 85 million items for sale.
Modulating Our Pace of Change
When we talk about capacity, we have to recognize that Etsy’s community of sellers and employees is ultimately its source. The holiday shopping season is always high-stakes, so each year we adapt our normal working order to avoid over-burdening that capacity.
- Many sellers operate their businesses at full stretch for weeks during the season. It’s unfair to ask them to adapt to changes at a time when they’re already maxed out.
- Our Member Services support staff are busy when our sellers are busy. If we push changes that increase the demand for support, we undermine our own efforts to provide excellent care.
- Our infrastructure teams are on alert to scale our systems against novel peaks and to maintain overall system health. Last-minute changes could easily compromise their efforts.
In short, for a few weeks we discourage deploying changes that might be expected to disrupt sellers, support, or infrastructure. Necessary changes can still get pushed, but the standards for preparation and communication are higher. We call this period “Slush” because it’s not a total freeze; we’ve written about it before.
For more than a decade, we have maintained Slush as an Etsy holiday tradition. Each year, however, we iterate and adapt the guidance, trying to strike the right balance of constraints. It’s a living tradition.
Modeling History To Inform Capacity Planning
Even though moving to Google Cloud Platform (GCP) in 2018 vastly streamlined our preparations for the holidays, we still rely on good old-fashioned capacity planning. We look at historical trends and system resource usage and communicate our infrastructure projections to GCP many weeks ahead of time to ensure they will have the right types of machines in the right locations. For this purpose, we share upper-bound estimates, because this may become the maximum available to us later. We err on the side of over-communicating with GCP.
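As a rough illustration of what this kind of upper-bound projection involves (the function, growth rate, multipliers, and headroom factor below are purely hypothetical placeholders, not Etsy’s actual model or numbers):

```python
def project_peak(current_rps: float,
                 monthly_growth: float,
                 months_ahead: int,
                 holiday_multiplier: float,
                 headroom: float = 1.25) -> float:
    """Return an upper-bound requests-per-second estimate.

    Compound the current baseline forward by a month-over-month
    growth rate, apply a holiday peak multiplier, then pad with
    headroom so the estimate shared with the cloud provider errs
    on the high side.
    """
    baseline = current_rps * (1 + monthly_growth) ** months_ahead
    return baseline * holiday_multiplier * headroom
```

Because the number communicated to GCP may become the maximum available later, the headroom factor is deliberately generous.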
As we approached the final weeks of preparation and the ramp up to Cyber Monday, we wanted to understand whether we were growing faster or slower than a normal year. Because US Thanksgiving is celebrated on the fourth Thursday of November rather than a set date, it moves around in the calendar and it takes a little effort to perform year-to-year comparisons. My teammate Dany Daya built a simple model that looked at daily traffic but normalized the date using “days until Cyber Monday.”
This model would come in very handy later as a stable benchmark of normal trends when customer patterns shifted in unusual ways.
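The date normalization at the heart of that comparison can be sketched in a few lines. US Thanksgiving is the fourth Thursday of November, and Cyber Monday is the Monday after it, so the benchmark date is computable directly (a minimal sketch of the normalization only; the actual model isn’t shown in this article):

```python
from datetime import date, timedelta

def cyber_monday(year: int) -> date:
    # US Thanksgiving is the fourth Thursday of November;
    # Cyber Monday is the following Monday.
    nov1 = date(year, 11, 1)
    # Monday is weekday 0, so Thursday is 3.
    first_thursday = nov1 + timedelta(days=(3 - nov1.weekday()) % 7)
    thanksgiving = first_thursday + timedelta(weeks=3)
    return thanksgiving + timedelta(days=4)

def days_until_cyber_monday(d: date) -> int:
    # Normalized axis for year-over-year traffic comparisons.
    return (cyber_monday(d.year) - d).days
```

Plotting daily traffic against `days_until_cyber_monday` rather than the calendar date lines up each year’s ramp to the holiday peak, even though Thanksgiving moves around in the calendar.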
Adapting Our “Macro” Load Testing
Though we occasionally write synthetic load tests to better understand Etsy’s scalability, I’m generally skeptical about what we can learn using macro (site-wide) synthetic tests. When you take a very complex system and make it safe to test at scale, you’re often looking at something quite different from normal operation in production. We usually get the most learning per effort by testing discrete components of our site, such as we do with search.
Having experienced effectively multiple years of growth during Q2, and knowing consumer patterns could change unexpectedly, performance at scale was now an even bigger concern. We decided to try a multi-team game day with macro load testing and see what we could learn. We wanted an open-ended test that could help expose bottlenecks as early as possible.
We gathered the 15-plus teams that are primarily responsible for scaling systems, talked about how much traffic we wanted to simulate, and asked how the proposed load tests might affect their systems and whether they carried any significant risks.
The many technical insights of these teams deserve their own articles. Many of their systems can be scaled horizontally without much fuss, but the work is a craft: predicting resource requirements, anticipating bottlenecks, and safely increasing/decreasing capacity without disrupting production. All teams had at least one common deadline: request quota increases from GCP for their projects by early September.
We were ready to practice together as part of a multi-team “scale day” in early October. We asked everyone to increase their capacity to handle 3x the traffic of August and then we ran load tests by replaying requests (our systems served production traffic plus synthetic replays). We gradually ramped up requests, looking for signs of increased latency, errors, or system saturation.
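A stepped ramp of that kind might look something like the following sketch. The `set_replay_rate` and `error_rate` hooks are hypothetical stand-ins for the real replay tooling and monitoring, which this article doesn’t detail:

```python
import time

def ramp_load(set_replay_rate, error_rate,
              target_multiplier=3.0, steps=10,
              step_seconds=300, max_error_rate=0.01):
    """Gradually ramp replayed traffic toward a target multiple
    of baseline, backing off if error rates indicate saturation.

    set_replay_rate(m) -- sets combined load to m times baseline
    error_rate()       -- returns the current fraction of errors
    """
    for i in range(1, steps + 1):
        # Linear ramp from 1.0x (production only) to the target.
        multiplier = 1.0 + (target_multiplier - 1.0) * i / steps
        set_replay_rate(multiplier)
        time.sleep(step_seconds)  # let the system settle at this step
        if error_rate() > max_error_rate:
            set_replay_rate(1.0)  # back off to the production baseline
            return False  # saturation or errors observed
    return True
```

The key property is the same one the game day relied on: load increases in small, observable steps, so latency, errors, or saturation surface before the synthetic traffic can endanger real production traffic.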
While there are limitations to what we can learn from a general site-wide load test, scale day helped us build confidence.
- We confirmed all projects had enough quota from GCP
- We confirmed our many scaling tools worked as intended (Terraform, GKE, instance group managers, Chef, etc.)
- We exposed bottlenecks in the configuration of some components, for example some Memcache clusters and our StatsD relays, both of which were quickly addressed
But crucially, we confirmed many systems looked like they could handle scale beyond what we expected at the peak of 2020.
Cresting The Peak
Let’s skip ahead to Cyber Monday, which is typically our busiest day of the year. Throughput on our sharded MySQL infrastructure peaked around 1.5 million queries per second. Memcache throughput peaked at over 20 million requests per second. Our internal HTTP API cluster served over 300k requests per second.
Normally, no one deploys on Cyber Monday. Our focus is on responding to any emergent issues as quickly as possible. But 2020 threw us another curve: postal service interruptions meant that our customers were facing widespread package delivery delays. It took only a small code change to better inform our buyers about the issue, but we’d be deploying it at the peak hour of our busiest day. And since it would be the first deployment of that day, the entire codebase would need to be compiled from scratch, in production, on more than 1000 hosts.
We debated waiting till morning to push the change, but that wouldn’t have served our customers, and we were confident we could push at any time. Still, as Todd Mazierski and Anu Gulati began the deploy, we started nominating each other for hypothetical Three Armed Sweaters Awards. But the change turned out to be sublimely uneventful. We have been practicing continuous deployment for more than a decade. We have invested in making deployments safe and easy. We know our tools pretty well and we have confidence in them.
We have long maintained a focus on scalability at Etsy, but we all expected to double traffic over a period of years, not just a few months. We certainly did not expect to face these challenges while working entirely distributed during a pandemic.
We made it to Christmas with fewer operational issues than we’ve experienced in recent memory. I think our success in 2020 underscored some important things about Etsy’s culture and technical practices.
We take pride in operational excellence, meaning that every engineer takes responsibility not just for their own code, but for how it actually operates in production for our users. When there is an outage, we always have more than enough experts on hand to mitigate the issue quickly. When we hold a Blameless Postmortem, everyone shares their story candidly. When we discover a technical or organizational liability, we try to acknowledge it openly rather than hide it. All of this helps to keep our incidents small.
Our approach to systems architecture values long-term continuity and a focus on a small number of well-understood tools, and that gave us the ability to scale with confidence. So while 2020 had more than its share of surprising circumstances, we could still count on minimal surprises from our tools.