How Etsy Uses Code “Slush” to Manage Development During the Holidays
Note: This article was adapted from an internal Etsy newsletter earlier this year. As the holidays roll around, it seemed like a timely opportunity to share what we do with a larger audience.
As the calendar year draws to a close, people’s thoughts often turn to fun activities like spending time with family and friends and enjoying pumpkin or peppermint flavored treats. But for retailers, the holiday season is an intense and critically important period for the business.
November and December account for nearly a fifth of all US retail sales, and pretty much every retailer needs to undertake special measures during the holidays, from running big sales promotions and stocking up on popular items to hiring additional staff to stock inventory and reduce wait times at checkout.
A lot of these measures apply as well to digital retailers, with the added risk of the entire site running slowly or not at all. In 2015, Neiman Marcus experienced an extended outage on Black Friday and Target and PayPal were intermittently down on Cyber Monday.
Etsy is no stranger to this holiday intensity. This is the biggest shopping season of the year for us and we typically receive more site visits than at other times, which translate into more orders. Over the years, our product, engineering, and member-facing organizations have developed practices and approaches to support our community during the intensity of the holidays.
How Etsy Handles the Holidays
The increase in site traffic and transactions impacts many areas of the business. Inbound support emails and Non-Delivery Cases reach a peak in December, and the Trust and Safety team ramps down outside project work and hiring efforts to focus on providing exceptional support.
“Emotions tend to be more heightened around the holidays,” says Corinne Haxton Pavlovic, head of the trust and safety team at Etsy. “There’s a lot on the line for everyone this time of year – highs can feel higher and lows can feel lower. We have to really dig into our EQ and make sure that we’re staying neutral and empathetic.”
For our sellers, the holidays can be an exciting but scary time: “the Etsy sales equivalent of Laird Hamilton surfing a 70-ft wave off Oahu’s North Shore” says Joseph Stanek, seller account manager. Stanek works with a portfolio of top Etsy sellers to advise and support their business growth. He’s found that many sellers spend an enormous amount of effort on holiday sales promotion, and are then hit with a record number of orders. They’re pushed to increase their shipping and fulfillment capabilities, which can serve “as a kind of graduation” as they level up to a new tier of business.
With huge numbers of buyers browsing for the perfect gift, and sellers working hard to manage orders, it’s critically important for Etsy’s platform to be as clear and reliable as possible. That’s why the period from mid-October through December 31st is a time to be exceptionally careful and more conservative than usual about how we make changes to the site. We affectionately refer to this period as “Slush.”
The Origins of Slush
The actual term “Slush” is a play on the phrase “code freeze,” which is when a piece of software is deemed finished and thus no new changes are being made to the code base. “Code freezes help to ensure that the system will continue to operate without disruptions,” says Robert Tekiela at CTO Insights. Freezes are a way to prevent new bugs from being introduced and are “commonly used in the retail industry during holiday shopping season when systems load is at a peak.”
Since we still push code during the holidays at Etsy, just more carefully, it’s not a true code freeze but a cold, melty mixture of water and ice. Hence, Slush.
According to Jason Wong, Slush at Etsy got started sometime after current CEO Chad Dickerson became CTO in the fall of 2008. As an engineering director of infrastructure, Jason has been a key part of Etsy’s platform stability since joining in 2010. Back then, Etsy’s infrastructure was less robust and the team was still figuring out how to effectively support the already high levels of traffic on Etsy.com. There was not yet a process for managing code deployments during the holidays and the site experienced more crashes than it does today.
Said Wong, “the question was: during the holiday season, a high traffic, high visibility time when we made a significant portion of our [gross merchandise sales], how do we stabilize the site? That’s where Slush got started.”
Here’s the slightly redacted email from Chad that kicked off the idea of Slush.
From: Chad Dickerson
Date: Fri, Oct 31, 2008 at 4:08 PM
Subject: holiday “slush” (i.e. freeze) — need your input
To: adminName and adminName
Cc: adminName, adminName, and adminName
adminName / adminName,
adminName, adminName, and I met recently to discuss a holiday “freeze” beginning at end of day on November 14. We’re calling it a “slush” because there are certain types of projects that we can still do without making critical changes to the database or introducing bugs. The goal with setting this freeze is to eliminate the distractions of any projects not on the must-do list so we can focus on the most important projects exclusively.
adminName, adminName, adminName, and I met earlier this week to mutually agree on what projects we need to complete before the freeze/slush beginning at the end of the day on 11/14. We came up with a list of “must do” projects that need to get done by 11/14, and “nice to have” projects if we complete the must-dos:
[ Link to Document Detailing Slush Plan ]
There are couple of projects that we’ve discussed already on the list, like Super Etsy Mini and BuyHandmade blog. I wanted to make sure that any “must do” projects in your worlds are reflected and prioritized. On Monday, we’re going to start doing daily standups at 11:30am to track progress against the agreed-upon list of projects leading up to 11/14. Since there will be 9.5 working days to execute, we need to freeze the list itself by Monday. Can you review the list and let us know if you have additions that we are missing that we should discuss? Thanks.
Learning to Build Safely
In the early days, Slush was far more strict, in part because Etsy’s infrastructure was not as robust and mature as it is today. We operated off of a federated database model, which in theory was meant to prevent one database crash from affecting another, but in practice, it was hard to keep clusters from affecting one another and site stability suffered as a result. This technical approach also made it hard to understand what went wrong and how the team could fix it.
Engineers went from deploying five times a day to once a day. Feature flags were tested thoroughly so that major features like convos or add to cart could be turned off without shutting down the site.
Over the past few years, a major effort was made to move Etsy’s one really big box, called master, onto a sharded database model. With a sharded system, data is distributed across a series of smaller active-active pairs so that if a single database goes down, there is an exact replica with the data intact. This system is faster, more scalable, and more resilient than the prior method of simply storing all the data in one really big box. In 2016, we successfully migrated all of our key databases, including receipts, transactions, and many others, “to the shards” and decommissioned the old master database.
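To make the idea of active-active pairs concrete, here is a minimal sketch in Python. It is an illustration only, not Etsy’s implementation: the class names, the modulo routing by user ID, and the choice to write to both sides at once are all assumptions made for the example (a real system would replicate between the two sides rather than double-write).

```python
from dataclasses import dataclass, field


@dataclass
class Database:
    """One side of a pair. Hypothetical stand-in for a real database server."""
    name: str
    up: bool = True
    rows: dict = field(default_factory=dict)


class ShardPair:
    """Two databases that hold the same data, so either one can serve traffic."""

    def __init__(self, name):
        self.sides = [Database(f"{name}-a"), Database(f"{name}-b")]

    def write(self, key, value):
        live = [db for db in self.sides if db.up]
        if not live:
            raise RuntimeError("both sides of the pair are down")
        # Assumption: write to every live side directly. Real replication
        # between the sides is more involved than this.
        for db in live:
            db.rows[key] = value

    def read(self, key):
        # Serve the read from the first side that is still up.
        for db in self.sides:
            if db.up:
                return db.rows[key]
        raise RuntimeError("both sides of the pair are down")


class ShardedStore:
    """Spreads data across several pairs instead of one really big box."""

    def __init__(self, num_shards):
        self.pairs = [ShardPair(f"shard{i}") for i in range(num_shards)]

    def pair_for(self, user_id):
        # Assumption: simple modulo routing, chosen only for illustration.
        return self.pairs[user_id % len(self.pairs)]


store = ShardedStore(num_shards=4)
pair = store.pair_for(42)
pair.write("receipt:42", {"total": 18.5})

# Simulate one database in the pair going down; the data is still readable.
pair.sides[0].up = False
surviving_copy = pair.read("receipt:42")
```

In this sketch, losing one database affects only one side of one pair: the other side keeps serving that shard’s data, and the remaining shards are untouched.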
Developing continuous deployment was also a major feat which allowed Etsy to develop A/B testing and feature flagging. These technical efforts, in conjunction with our culture of examining failures through blameless postmortems, have allowed Etsy to get better at building safely. Today our engineering staff and systems are studied by organizations around the world.
Slush Today and Tomorrow
Within the engineering organization, there’s often a senior staff member who helps organize Slush. The role is an informal one, meant to share best practices and encourage the product org to be mindful of the higher stakes of the holidays. Tim Falzone, an engineering manager on Infrastructure, took on this role in 2015 and presented a few slides at the September Engineering All-Hands which highlight the way we handle Slush today.
Today, Slush means that major buyer and seller facing feature changes are put on hold, or pushed “dark,” where they are hidden behind config flags and not shown publicly. Additionally, engineers get more people to review their pull requests of code changes. These extra precautions are taken to ensure that the site runs quickly and with minimal errors or downtime even with the increased traffic.
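As a rough illustration of what pushing a change “dark” behind a config flag can look like, here is a small Python sketch. The flag names, the three states, and the `is_enabled` helper are hypothetical and chosen for the example; they are not Etsy’s actual configuration format or API.

```python
# Hypothetical flag configuration. "on" is visible to everyone, "dark" is
# deployed but hidden from the public, and "off" is disabled for everyone.
FLAGS = {
    "add_to_cart": "on",
    "convos": "on",
    "new_checkout_flow": "dark",
}


def is_enabled(flag, user_is_staff=False, flags=FLAGS):
    """Return True if the feature should be shown to this user."""
    state = flags.get(flag, "off")  # unknown flags default to off
    if state == "on":
        return True
    if state == "dark":
        # Assumption: dark features are visible to staff only, for testing.
        return user_is_staff
    return False


def listing_page_features(user_is_staff=False, flags=FLAGS):
    """List the optional features that would render on a listing page."""
    candidates = ["add_to_cart", "convos", "new_checkout_flow"]
    return [name for name in candidates if is_enabled(name, user_is_staff, flags)]


public_view = listing_page_features()
staff_view = listing_page_features(user_is_staff=True)

# Turning a single feature off without shutting down the rest of the page.
degraded_view = listing_page_features(flags={**FLAGS, "convos": "off"})
```

The point of the design is that the code for a new feature can be deployed ahead of time while the decision to show it stays a one-line config change, and an existing feature can be switched off the same way if it misbehaves under holiday load.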
Falzone says that now, Slush is less about not breaking the site and more about preventing disruptions for members. “You could easily make something that works flawlessly but tanks conversion or otherwise sends buyers away, or is really slow,” he explained. For sellers, managing a huge wave of orders means relying on muscle memory of how Etsy works, which means that the holidays are a bad time to change the workflow or otherwise add friction for our sellers, who often become profitable on their business for the year during this time.
As Etsy grows, Slush will continue to evolve. A more powerful platform also means more points of integration. More traffic means more pressure on parts of the platform.
Even as we work to secure more headroom for our infrastructure and develop tooling to stress-test our systems, we will always be challenged in new ways. Though we’ve come a long way, Slush will continue to be a helpful reminder to move safely during a critical time of the year for our members and our organization.