Frank Talk About Site Outages
(Note: this post is a largely non-technical discussion about technical issues. It’s mainly written for the shoppers and sellers on Etsy, but I’m posting it on our engineering blog because I think engineers might be interested.)
The other night we had an outage.
Put frankly: outages suck. They suck for shoppers, and sellers, and they suck for us who work here at Etsy. When our members come to the site — to manage their business, favorite an item they like, post in the forums, create a treasury, or send a convo — and the site is down, it’s hugely frustrating. One look at Twitter during an outage will give you an idea of exactly how frustrating it is for them. This is their business we’re talking about here.
It’s also our business. The availability and performance of Etsy.com (and Etsy’s API) are of paramount importance to us. That’s the reason for this long-overdue post. We need to be better about communicating what’s going on during an outage, what happened after an outage, and what we’re doing to prevent a repeat in the future. When we communicate well, we’ll further the confidence that Etsy members have in our ability to grow and be a stable place for them to buy and sell the things that they’re passionate about.
You may or may not want to get some coffee here, because I’m going to be a little verbose. At the risk of boring you, I hope to gain your confidence that we take these topics very seriously. Making the site fast and resilient to failure isn’t a side project; it’s core to what we’re doing. When we build new servers and site features, or make changes to existing features, we think about worst-case scenarios. We’re not just trying to avoid them. We’re also figuring out how we’ll respond when something goes wrong that we didn’t imagine.
How We Use Metrics and Alerts
We gather ongoing metrics about all of our servers, networks, and the usage of the site so we won’t be blind to problems. Some of our metrics are set up to trip alarms and alerts so that problems can be fixed right away. Today, we gather a little over 30,000 metrics, on everything from CPU usage, to network bandwidth, to the rate of listings and re-listings done by Etsy sellers.
Some of those metrics are gathered every 20 seconds, 24 hours a day, 365 days a year. About 2,000 of them are set to alert someone on our operations staff (we have an on-call rotation), waking them in the middle of the night if necessary to fix a problem.
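For the engineers reading along, here is a minimal sketch of what a threshold-style alert on one of those metrics might look like. It is an illustration only, not our actual alerting code: the metric name, the threshold, and the rule of requiring several consecutive breaches before paging are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List

SAMPLE_INTERVAL_SECONDS = 20  # the collection interval mentioned above


@dataclass
class AlertRule:
    metric: str             # hypothetical metric name
    threshold: float        # hypothetical limit for this metric
    sustained_samples: int  # consecutive breaches required before paging


def should_page(rule: AlertRule, samples: List[float]) -> bool:
    """Return True when the most recent `sustained_samples` readings all exceed the threshold."""
    if len(samples) < rule.sustained_samples:
        return False
    recent = samples[-rule.sustained_samples:]
    return all(value > rule.threshold for value in recent)


# Hypothetical rule: page if CPU stays above 90% for three samples (about a minute).
rule = AlertRule(metric="web.cpu_percent", threshold=90.0, sustained_samples=3)
print(should_page(rule, [40.0, 95.0, 96.0, 97.0]))  # three breaches in a row
print(should_page(rule, [95.0, 96.0, 50.0, 97.0]))  # a dip resets the streak
```

Requiring a short run of bad readings, rather than a single one, is one common way to avoid waking someone up for a momentary spike.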
Let’s say all our servers are running fine, but people in a certain region of the world suddenly aren’t able to get to Etsy.com. An issue like that might be out of our immediate control (say, a fiber cable was cut in the Middle East) but it’s still something we want to be aware of so we can alert our members. So, we use a service called Gomez and run tests to Etsy.com from their many thousands of machines around the globe, every minute, to make sure we’ll detect any patterns of slowness.
As we build out new infrastructure and features, we’re always adding to and adjusting these metrics and alerts; it’s a work in progress. We also use metrics to predict the technical capacity we’re going to need as we grow as a site. Right now we’re evaluating where and when we’ll likely need capacity in the next 8 months, as well as further streamlining the process of provisioning new capacity as we need it.
Outage Coordination and Communication
When we have an outage or issue that affects a measurable portion of the site’s functionality, we quickly group together to coordinate our response. We follow the same basic approach as most incident response teams. We assign some people to address the problem and others to update the rest of the staff and post to http://etsystatus.com to alert the community. Changes that are made to mitigate the outage are largely done one at a time, and we track both our time-to-detect and our time-to-resolve, for use in a follow-up meeting after the outage, called a “post-mortem” meeting. Thankfully, over the past year our average time-to-detect for outages and major site issues has been on the order of 2 minutes, mostly because we continually tune our alerting system.
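As a rough illustration of the two numbers mentioned above, the sketch below computes time-to-detect and time-to-resolve from an incident's timestamps. The `Incident` class, its field names, and the sample timestamps are made up for the example, and measuring time-to-resolve from the start of the problem (rather than from detection) is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    started: datetime   # when the problem began
    detected: datetime  # when an alert fired or someone noticed
    resolved: datetime  # when the site was declared stable again

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected - self.started

    @property
    def time_to_resolve(self) -> timedelta:
        # Assumption: measured from the start of the problem, not from detection.
        return self.resolved - self.started


def average_time_to_detect(incidents: List[Incident]) -> timedelta:
    """Average the detection delay across a list of incidents."""
    total = sum((i.time_to_detect for i in incidents), timedelta())
    return total / len(incidents)


# Invented timestamps, for illustration only.
incidents = [
    Incident(datetime(2010, 6, 1, 2, 0), datetime(2010, 6, 1, 2, 1), datetime(2010, 6, 1, 2, 30)),
    Incident(datetime(2010, 7, 9, 14, 0), datetime(2010, 7, 9, 14, 3), datetime(2010, 7, 9, 14, 20)),
]
print(average_time_to_detect(incidents))
```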
Post-Mortems and Historical Context
After any outage, we meet to gather information about the incident. We reconstruct the timeline of events: when we knew of the outage, what we did to fix it, and when we declared the site to be stable again. We do a root cause analysis to characterize why the outage happened in the first place. We make a list of remediation tasks to be done shortly thereafter, focused on preventing the root cause from happening again. These tasks can be as simple as fixing a bug, or as complex as putting in new infrastructure to increase the fault-tolerance of the site. We document this process, for use as a reference point in measuring our progress.
Single Point of Failure (SPOF) Reduction
As Etsy has grown from a tiny little start-up to the mission-critical service it is today, we’ve had to outgrow some of our infrastructure. One reason for this evolution is to avoid depending on any single piece of hardware being up and running all of the time. Servers can fail at any time, and Etsy.com should be able to keep working if a single server dies. To do that, we have to put our data in multiple places, keep the copies in sync, and make sure our code can route around any individual failures.
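To make "route around any individual failures" a little more concrete, here is a minimal sketch of the idea: try each copy of the data in turn and use the first one that answers. The function names and the fake servers are hypothetical, and real failover logic has to deal with much more (timeouts, stale copies, retries), so treat this as a toy.

```python
from typing import Callable, Optional, Sequence


def read_with_failover(replicas: Sequence[Callable[[str], str]], key: str) -> str:
    """Try each replica in order and return the first successful read."""
    last_error: Optional[Exception] = None
    for fetch in replicas:
        try:
            return fetch(key)
        except ConnectionError as error:
            last_error = error  # remember the failure and move on to the next copy
    raise RuntimeError(f"all replicas failed for {key!r}") from last_error


def dead_server(key: str) -> str:
    # Hypothetical stand-in for a server that has died.
    raise ConnectionError("server is down")


def healthy_server(key: str) -> str:
    # Hypothetical stand-in for a working replica.
    return f"listing data for {key}"


print(read_with_failover([dead_server, healthy_server], "listing:42"))
```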
So we’ve been working a lot this year to reduce those “single points of failure,” and to put in redundancy as fast as we safely can. Some of this means being very careful (paranoid) as we migrate data from the single instances to multiple or replicated instances. As you can imagine, it’s a bit of a feat to move that volume of data around while still seeing a peak of 15 new listings per second, all the while not interrupting the site’s functionality.
Change Management and Risk
We’re constantly making improvements to Etsy.com. Changing an existing working website can be scary. How do you know that the change you’re going to make won’t crash the site? The fact is, not all changes are created equal; some are riskier than others. But in 2010, the majority of site issues have not been caused by changes we’ve made on the site. Instead, they’ve been hardware-related. We put a lot of work into making it safe to make changes of all kinds to Etsy.com, and treat every change as a possible breaking point. For every type of technical change, we have answers to questions like:
- What problem does the change solve?
- Has this kind of change happened before? Is there a successful history?
- When is the change going to start? When is it expected to end?
- What is the expected effect of this change on the Etsy community? Is a downtime required for the change?
- What is the rollback plan, if something goes wrong?
- What test is needed to make sure that the change succeeded?
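The questions above could be captured as a simple record that flags anything left unanswered before a change goes ahead. This is a sketch under assumptions, not a description of our actual tooling: the `ChangeRequest` class and its field names are invented for the example.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class ChangeRequest:
    # One field per question in the list above; the names are hypothetical.
    problem_solved: str = ""
    prior_history: str = ""
    start_and_end: str = ""
    community_impact: str = ""
    rollback_plan: str = ""
    verification_test: str = ""


def unanswered(change: ChangeRequest) -> List[str]:
    """Return the names of the questions that still have no answer."""
    return [f.name for f in fields(change) if not getattr(change, f.name).strip()]


change = ChangeRequest(
    problem_solved="Replace a failing disk controller",
    prior_history="Done several times before without incident",
    start_and_end="Tuesday 06:00 to 06:30",
    community_impact="No downtime expected",
)
print(unanswered(change))
```

In this example the rollback plan and the verification test are still missing, so the change would not be ready to start.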
As with all change, the risk involved and the answers to these questions are largely dependent on the judgment of the person at the helm. At Etsy, we believe that if we understand the likely failures, and if there’s a plan in place to fix any unexpected issues, we’ll make progress.
Just as important, we also track the results of changes. We have an excellent history with respect to the number of successful changes. This is a good record that we plan on keeping.
If you’ve actually read this far – congratulations. Like I said, this post was meant to be detailed about how we take site availability and outages seriously. Etsy is growing insanely fast, and keeping up with growth can sometimes be like changing the tires on a moving car. Downtime is something that happens on every web service, but that’s not an excuse to accept it.
It’s a privilege to build and grow this site for such an involved and passionate community. In the same way that it’s our job as engineers to build, grow, and evolve our code and infrastructure, it’s also our job to communicate with you, our members, about that work. And in times of distress (i.e., site outages) that communication is even more important.
We can always be better at that.