Frank Talk About Site Outages

Posted by on September 17, 2010

Walter's Mystery Control Panel, by Telstar Logistics, on Flickr

(Note: this post is a largely non-technical discussion about technical issues. It’s mainly written for the shoppers and sellers on Etsy, but I’m posting it on our engineering blog because I think engineers might be interested.)

The other night we had an outage.

Put frankly: outages suck. They suck for shoppers, and sellers, and they suck for us who work here at Etsy. When our members come to the site — to manage their business, favorite an item they like, post in the forums, create a treasury, or send a convo — and the site is down, it’s hugely frustrating. One look at Twitter during an outage will give you an idea of exactly how frustrating it is for them. This is their business we’re talking about here.

It’s also our business. The availability and performance of Etsy.com (and Etsy’s API) are of paramount importance to us. That’s the reason for this long-overdue post. We need to be better about communicating what’s going on during and after an outage, and what we’re doing to prevent a repeat in the future. When we communicate well, we’ll further the confidence that Etsy members have in our ability to grow and be a stable place for them to buy and sell the things that they’re passionate about.

You may or may not want to get some coffee here, because I’m going to be a little verbose. At the risk of boring you, I hope to gain your confidence that we take these topics very seriously. Making the site fast and resilient to failure isn’t a side project, it’s core to what we’re doing. When we build new servers and site features, or make changes to existing features, we think about worst-case scenarios. We’re not just trying to avoid them, we’re also figuring out how we’ll respond when something goes wrong that we didn’t imagine.

How We Use Metrics and Alerts

We gather ongoing metrics about all of our servers, networks, and the usage of the site so we won’t be blind to problems. Some of our metrics are set up to trip alarms and alerts so that problems can be fixed right away. Today, we gather a little over 30,000 metrics, on everything from CPU usage, to network bandwidth, to the rate of listings and re-listings done by Etsy sellers.

Some of those metrics are gathered every 20 seconds, 24 hours a day, 365 days a year. About 2,000 metrics will alert someone on our operations staff (we have an on-call rotation) to wake up in the middle of the night to fix a problem.
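
For engineers who want to picture how a metric turns into a page, here is a minimal Python sketch. It is illustrative only and not Etsy’s actual alerting code: the metric name, the 90% threshold, and the "three breaches in a row" rule are all assumptions made up for the example. Only the 20-second sampling interval comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    metric: str       # hypothetical metric name
    threshold: float  # values above this count as unhealthy (assumed)
    consecutive: int  # breaches in a row required before paging (assumed, >= 1)


def should_page(rule: AlertRule, samples: list[float]) -> bool:
    """Return True when the last `rule.consecutive` samples all exceed the threshold."""
    if len(samples) < rule.consecutive:
        return False
    recent = samples[-rule.consecutive:]
    return all(value > rule.threshold for value in recent)


# With one sample every 20 seconds, three consecutive breaches is one minute.
rule = AlertRule(metric="web.cpu_percent", threshold=90.0, consecutive=3)

isolated_spikes = [40.0, 95.0, 42.0, 91.0]  # breaches are not consecutive
sustained_load = [40.0, 92.0, 95.0, 97.0]   # last three samples all breach

print(should_page(rule, isolated_spikes))  # False
print(should_page(rule, sustained_load))   # True
```

Requiring several breaches in a row is one common way to avoid waking someone up for a single noisy sample; the right number depends on the metric.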

Let’s say all our servers are running fine, but people in a certain region of the world suddenly aren’t able to get to Etsy.com. An issue like that might be out of our immediate control (say, a fiber cable was cut in the Middle East) but it’s still something we want to be aware of so we can alert our members. So, we use a service called Gomez and run tests to Etsy.com from their many thousands of machines around the globe, every minute, to make sure we’ll detect any patterns of slowness.

As we build out new infrastructure and features, we’re always adding to and adjusting these metrics and alerts; it’s a work in progress. We also use metrics to predict the technical capacity we’re going to need as we grow as a site. Right now we’re evaluating where and when we’ll likely need capacity in the next 8 months, as well as further streamlining the process of provisioning new capacity as we need it.

Outage Coordination and Communication

When we have an outage or issue that affects a measurable portion of the site’s functionality, we quickly group together to coordinate our response. We follow the same basic approach as most incident response teams. We assign some people to address the problem and others to update the rest of the staff and post to http://etsystatus.com to alert the community. Changes made to mitigate the outage are largely done one at a time, so we can tell which change had which effect. We track both our time-to-detect and our time-to-resolve, for use in a follow-up meeting after the outage, called a “post-mortem” meeting. Thankfully, our average time-to-detect has been on the order of 2 minutes for outages and major site issues in the past year. This is mostly due to continually tuning our alerting system.
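
As a rough illustration of the two numbers mentioned above, here is a small Python sketch that computes time-to-detect and time-to-resolve from incident timestamps. Everything in it is an assumption for the example: the `Incident` record is hypothetical, the timestamps are invented rather than taken from any real outage, and time-to-resolve is measured here from the start of the incident (some teams measure it from detection instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    started: datetime   # when the problem actually began
    detected: datetime  # when an alert fired or someone noticed
    resolved: datetime  # when the site was declared stable again

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected - self.started

    @property
    def time_to_resolve(self) -> timedelta:
        # Assumption: measured from the start of the incident.
        return self.resolved - self.started


def mean_time_to_detect(incidents: list[Incident]) -> timedelta:
    """Average time-to-detect across a non-empty list of incidents."""
    total = sum((i.time_to_detect for i in incidents), timedelta())
    return total / len(incidents)


# Invented timestamps, for illustration only.
incidents = [
    Incident(
        started=datetime(2010, 1, 1, 3, 0, 0),
        detected=datetime(2010, 1, 1, 3, 1, 30),
        resolved=datetime(2010, 1, 1, 3, 30, 0),
    ),
    Incident(
        started=datetime(2010, 1, 2, 14, 0, 0),
        detected=datetime(2010, 1, 2, 14, 2, 30),
        resolved=datetime(2010, 1, 2, 14, 45, 0),
    ),
]

print(mean_time_to_detect(incidents))  # 0:02:00
```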

Post-Mortems and Historical Context

After any outage, we meet to gather information about the incident. We reconstruct the timeline of events: when we knew of the outage, what we did to fix it, and when we declared the site to be stable again. We do a root cause analysis to characterize why the outage happened in the first place. We make a list of remediation tasks to be done shortly thereafter, focused on preventing the root cause from happening again. These tasks can be as simple as fixing a bug, or as complex as putting in new infrastructure to increase the fault-tolerance of the site. We document this process, for use as a reference point in measuring our progress.

Single Point of Failure (SPOF) Reduction

As Etsy has grown from a tiny little start-up to the mission-critical service it is today, we’ve had to outgrow some of our infrastructure. One reason for this evolution is to avoid depending on any single piece of hardware being up and running all of the time. Servers can fail at any time, and Etsy.com should be able to keep working if a single server dies. To do that, we have to put our data in multiple places, keep the copies in sync, and make sure our code can route around any individual failures.
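
For the engineers reading, "routing around an individual failure" can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of Etsy’s code: the replica functions, the `AllReplicasFailed` exception, and the listing key are all hypothetical, and a real system would also need timeouts, health checks, and a way to keep the copies in sync.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when every copy of the data is unreachable."""


def read_with_failover(replicas: Sequence[Callable[[str], str]], key: str) -> str:
    """Try each replica in order and return the first successful read."""
    failures = 0
    for fetch in replicas:
        try:
            return fetch(key)
        except ConnectionError:
            failures += 1  # this server is down; move on to the next copy
    raise AllReplicasFailed(f"{failures} replicas failed for key {key!r}")


# Hypothetical stand-ins for two servers holding the same data.
def dead_server(key: str) -> str:
    raise ConnectionError("server down")


def healthy_server(key: str) -> str:
    return f"listing data for {key}"


# The first server is down, so the read is served by the second one.
print(read_with_failover([dead_server, healthy_server], "listing-42"))
```

With only one server in the list, a single failure becomes an outage for that data, which is exactly the single point of failure described above.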

So we’ve been working a lot this year to reduce those “single points of failure,” and to put in redundancy as fast as we safely can. Some of this means being very careful (paranoid) as we migrate data from the single instances to multiple or replicated instances. As you can imagine, it’s a bit of a feat to move that volume of data around while still seeing a peak of 15 new listings per second, all the while not interrupting the site’s functionality.

Change Management and Risk

We’re constantly making improvements to Etsy.com. Changing an existing working website can be scary. How do you know that the change you’re going to make won’t crash the site? The fact is, not all changes are created equal; some are riskier than others. But in 2010, the majority of site issues have not been caused by changes we’ve made on the site. Instead, they’ve been hardware-related. We put a lot of work into making it safe to make changes of all kinds to Etsy.com, and treat every change as a possible breaking point.  For every type of technical change, we have answers to questions like:

As with all change, the risk involved and the answers to these questions are largely dependent on the judgment of the person at the helm. At Etsy, we believe that if we understand the likely failures, and if there’s a plan in place to fix any unexpected issues, we’ll make progress.

Just as important, we also track the results of changes. We have an excellent history with respect to the number of successful changes. This is a good record that we plan on keeping.

If you’ve actually read this far – congratulations. Like I said, this post was meant to be detailed about how we take site availability and outages seriously. Etsy is growing insanely fast, and keeping up with growth can sometimes be like changing the tires on a moving car. Downtime is something that happens on every web service, but that’s not an excuse to accept it.

It’s a privilege to build and grow this site for such an involved and passionate community. In the same way that it’s our job as engineers to build, grow, and evolve our code and infrastructure, it’s also our job to communicate with you, our members, about that work. And in times of distress (i.e. site outages) it’s even more important.

We can always be better at that.

Category: infrastructure, operations, people, philosophy

21 Comments

I work in the IT world and I can totally relate to frustrated users.

Thank you for always trying to make Etsy better for all of us. Keep up the good work!

I was hoping this post would include a photo of you standing up on a desk, yelling at everyone to shut the fuck up if they’re not working on the outage 😀

In seriousness, I think a detailed “prose” write up of one of your outages and what happened would be a fascinating read. I think infrastructure is discussed openly enough these days that one could write a fairly informative outage narrative without giving away any trade secrets that aren’t already out there. I know I have plenty of hilarious/interesting war stories from my year at Flickr – that means you could probably write your next book with yours.

Sure appreciate all you guys do! One day last week I wondered if it was me — every time I hit “Reply” to a convo the site went down 🙂 Your analogy of changing the tires of a moving vehicle is good! Thank you for not only keeping up with the rapid growth of Etsy, but also for the continued site improvements.

Nice post, but what exactly happened?

This theme is so dark that I can’t see the outlines of the comment fields. Could someone turn on a light in here?

What @Salman said. How about an equally (technically) deep post about the root cause, symptoms, troubleshooting, and takeaways from this specific outage? Either here or etsystatus.com.

[…] what they are doing to better prevent these. You can find out more about these outages over on Code As Craft. […]

“How about an equally (technically) deep post about the root cause, symptoms, troubleshooting, and takeaways from this specific outage?”

Agreed; where’s the beef? This is like a policy description, not a forensics document. A sort-of dodge, if you ask me.

Mission accomplished! The communication of the engineering world to the rest of us was effective.
Thank you for taking time to write this so well.

“You may or may not want to get some coffee here, because I’m going to be a little verbose”…

Oh fuck.

[…] is of paramount importance to us. That’s why I’ve written a long-overdue post at our Code as Craft engineering blog, which I encourage you to […]

..THE CHRISTMAS SEASON IS UPON US!!

I see by these articles…. that no one at Etsy is willing to explain what the problem is.

WE ARE ALL BUSINESS PEOPLE HERE ON ETSY… why are we being treated like children on a “need to know basis… and we don’t need to know!?”

Is the problem… that no one at Etsy knows what the problem is?…. and they have been sold a bill of goods by their tech team and their new software provider?

…They have really taken us way backwards with many of the new system software functions they have installed for us to use.

______________________________________________

While I really like the new page changes Etsy made …. the working parts and functions of “moving our stores around” are now so primitive I am insulted at having to put up with using them.

For weeks …I am no longer able to successfully use the “leave comments function” for any treasury that has been kind enough to show my artisan designs. I have to leave Etsy and go through craftopolis to get this done.

I am really getting mad that the one FREE MARKETING TOOL I have by leaving “comments” in treasury collections and on the front page of etsy …. is no longer working from my store!!!
Ridiculous!!!

_____________________________________________
Questions…

~ what is the problem with fixing this… new software?
~ what is the problem with the Etsy tech department?
And is anyone fixing this?

~ I get the feeling that in order to fix the problems they are having… they expect us to muddle through the Christmas buying season… so Etsy has the money in the new year to make the fixes!! Wouldn’t that just stink!

Statement…
I am so tired of dragging my photos five times before they stick where I want them… I just can’t take it any more!

… it actually is making me late for my real job every day!

__________________________________________

I don’t know…. does everyone need to put their stores
“out on vacation” until the problem is fixed?

Radical yes… but everywhere I look…. technology is getting better, easier to work… with higher standards of graphic technologies and visual appeal!

    Hi there. Here are some suggestions that may help explain or solve the issues regarding moving photos and commenting on Treasuries.

    Photos: There are a few possible scenarios:

    1) You can move them, but when you look at the actual item in your shop, it hasn’t moved yet. So, you try again and again. Eventually enough time passes where it takes. This is likely due to the way things cache on the site. When things get really busy on Etsy it can take a bit of time for all of the servers to synch up to changes in a shop. Announcements and other shop updates are affected by this. If you wait about 10 or so minutes, you will see your change reflected.

    2) It could be JavaScript related. Something in your browser/security setup might be stalling JavaScript; you can check at http://isjavascriptenabled.com. It may also be that you have an OS/browser combo we don’t support.

    3) It could also merely be speed of your network connection (though I’m guessing it’s not this).

    Comments:
    1) Again it could be an OS/Browser combo.
    2) As you know, you must be logged in. Assuming you are, it may be that there are lots of comments about that treasury and your comment is not appearing on the first couple pages.

    Hope this helps!

Thanks for the work you do at Etsy. I expect you are doing all that you can, with what you have. I too get frustrated with not getting everything I want when I think I need it but life has never worked that way for me or the people I know. There are many people at Etsy who make my life easier and better.
Thank-you.

Nice write up but it certainly feels like a precursor to an RFO rather than one in itself. Good to see you’re thinking about your mistakes, analysing and learning from them though. The more you fail the less fail you’ll become.

[…] got you down? Frank talk about site outages (via @etsy’s Code As Craft […]

Cool post. I agree with other commentators, it would be interesting to see a write up on a specific incident.

Thank you for writing this post. I work in IT for a small organization. Our team is built of 10 people. I’ve started realizing we need change management and tools in place to help us resolve and track issues.

You’ve given me a starting point but can you provide any in-depth processes and tools to help maintain public facing sites and internal infrastructure?

[…] downtime somewhere on the interwebs, this seems to be the season for some of my most dependable and favorite web apps (not just Twitter) to be dealing with suffering some pretty disruptive […]

I found this blog by accident but wanted to say I’m super happy with Etsy, been a seller since 2007 and always think that the tech department works quickly resolving any issues. With all servers and all companies there are problems, but overall with Etsy I can count the hiccups in four going on five years on one hand, so that’s something for you guys to be proud of!