How does Etsy manage development and operations?
I’ve been loving using Quora these past few months, and have been amazed at the level of behind-the-scenes detail people are providing about really complex and specific things (like how Facebook does automated testing).
Recently, someone asked, “How does Etsy manage development and operations?” with these comments: “Etsy seems to have scaled far and fast, whilst continuing to add new features; how is all this managed – is there a strictly-defined process within which engineers operate, or is it a case of hiring clever people and letting them get on with it (Facebook-style)?”
First of all, I love the team and am proud of the work that they do. It’s an amazing group and none of this would work or be as fun as it is without them.
So, here’s the answer I just posted:
In 2010, we grew the engineering team pretty fast, going from 20 to about 70, and the rest of the company grew quickly, too. As we’ve grown, overall speed has been really important to us, and we’ve continually tuned our processes, tools, and culture to support that. I wrote about some of the principles behind all of it on my blog over the summer: http://bit.ly/foHPc1
Right now, developers are divided up into a number of small teams, usually 3-7 engineers. These teams are paired with a product manager and a designer, and there is some movement across teams as needed. All designers at Etsy code, and product managers code at various levels, too. Ops and dev work really closely together, and we have one development team that is very ops-like and straddles both domains. Everyone in the company uses IRC. Lots of ideas are worked out on a wiki (we use Confluence), and people around the company comment on those ideas and plans. Some projects form organically, and others are more top-down.
We generally plan in 60-day chunks and divide the deliverables up into 2-week periods (though we’re not officially using capital-A Agile). The 60-day cycle has no special significance; we just felt it was a reasonable timeframe for planning near-term deliverables. The 60-day plans go through a review, we set goals, and we publish the plans on the wiki. Rob Kalin, our founder, CEO, and head of product, participates in these reviews and stays in close contact with the product and engineering teams throughout. In general, the teams have a lot of autonomy in how they get their work done within a set of architectural principles we’ve established (a subject for another post) and our overall design approach. Specs are typically very light, and the focus is on building working features.
We onboard engineers quickly, and their first goal is simple: deploy on your first day. The point is to constantly emphasize shipping and to get over any deployment fears early. Engineers get productive very quickly. The level of cooperation between developers and ops is also really high (see our engineering blog for more: http://etsy.me/hMtu1A).
We practice continuous deployment and make small changes to the site frequently. We use what we call “config flags,” which are more or less an exact copy of what Flickr does (see the Flickr engineering blog: http://bit.ly/dZZzfY). A lot of the code for features runs “dark” for days or weeks, and a feature launch means flipping a switch in the code. We have a lot of Flickr DNA in the company (John Allspaw, our VP of Ops, ran ops at Flickr, and Kellan Elliott-McCrea was architect at Flickr).
In January (a month in which we did over a billion page views), code committed by 76 unique individuals was deployed to production by 63 different folks a total of 517 times. Product managers make changes and do deploys (here’s Jenn Vargas, one of our newest product managers, tweeting about it), and we have trained aspiring developers on our support team to make small changes with our help and guidance, too. Our deployment environment requires a lot of trust, transparency, communication, coordination, and discipline across the team.
We’ve invested a lot in our automated unit and functional testing (we have a team devoted just to this), tooling for deployment (see our blog post about Deployinator: http://etsy.me/c6RJD7), and metrics and monitoring (see “Tracking Every Release”: http://etsy.me/e1ULhO). Key system-level and business-level metrics (like checkout/listing/registration/sign-in rates) are projected on screens in the office, and we have a number of internal dashboards that the team uses (we mainly use Ganglia and Graphite). We also have lots of switches and knobs to help us roll features out to percentages of users and ramp them up slowly or quickly. Features are used and tested by us here at Etsy for some period of time before they are rolled out publicly.
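To make the config-flag idea concrete, here is a minimal sketch in Python of a flag check that supports dark code, internal-only testing, and a percentage ramp-up. It is an illustration under assumptions, not Etsy’s or Flickr’s actual code: the names Flag, bucket, is_enabled, and the new_checkout flag are all hypothetical, and hashing the user ID into 100 buckets is just one common way to keep each user’s experience stable as the percentage rises.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Flag:
    """A hypothetical config flag (illustrative names, not Etsy's schema)."""
    enabled: bool = False  # master switch; code stays "dark" while False
    percent: int = 0       # share of public users (0-100) who get the feature
    staff: bool = False    # if True, internal users always get the feature


def bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0-99."""
    key = f"{flag_name}:{user_id}".encode("utf-8")
    return int(hashlib.sha256(key).hexdigest(), 16) % 100


def is_enabled(flags: dict, flag_name: str, user_id: str,
               is_staff: bool = False) -> bool:
    """Decide whether this user should see the feature behind flag_name."""
    flag = flags.get(flag_name)
    if flag is None or not flag.enabled:
        return False  # unknown or switched-off flags are always dark
    if is_staff and flag.staff:
        return True   # internal testing before any public rollout
    # A user's bucket never changes, so raising percent only adds users.
    return bucket(flag_name, user_id) < flag.percent


# Assumed example: the feature is live for staff only, 0% of the public.
FLAGS = {"new_checkout": Flag(enabled=True, percent=0, staff=True)}

print(is_enabled(FLAGS, "new_checkout", "user-1", is_staff=True))  # True
print(is_enabled(FLAGS, "new_checkout", "user-1"))                 # False
```

In this sketch, ramping up is a config change only: raising percent from 0 to 10 to 100 exposes more users without a code deploy, and setting enabled back to False turns the feature off for everyone at once.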
When we make mistakes, we conduct blameless post-mortems and assign remediation items to the appropriate team members. Engineers frequently post in our community forums when we have issues, and we maintain a status blog (http://www.etsystatus.com/). I think that interacting with Etsy members gives everyone a deeper sense of responsibility for the code we’re writing. We also write about the mistakes we make pretty openly (http://etsy.me/hgZ4qh).
Engineers are treated as creative collaborators in the overall process with design and product, and products are worked out and iterated on with engineers instead of simply being handed to them for implementation. Rob (our founder and head of product) likes working with engineers, and the engineers spend a lot of time interacting with Rob. Our ability to work this way has as much to do with the personalities of the people involved and the culture as with the technologies involved. We’re always learning and adjusting, and we’ll continue to evolve as time goes on.