Tracking Every Release

Posted by Mike Brittain | Filed under data, engineering, operations

We spend a lot of time gathering metrics for our network, servers, and many things going on within the code that drives Etsy. It’s no secret that this is one of our keys to moving fast. We use a variety of monitoring tools to help us correlate issues across our architecture. But what most monitoring tools achieve is correlating the effects of change, rather than the causes.

Changes to application code (deploys) are opportunities for failure. Tweaking pages and features on your web site causes ripples throughout the metrics you monitor, including database load, cache requests, web server requests, and outgoing bandwidth. When you break something on your site, those metrics will typically start to skew up or down.

Something obviously happened here… but what was it? We might correlate this sudden spike in PHP warnings with a drop in member logins or a drop in traffic on our web servers, but these point to effects and not to a root cause.

We need to track changes that we make to the system.

Different companies track change in ways that reflect their release cycle. A company that only releases new software or services once or twice a year might literally do this by distributing a press release. Companies that move more quickly and release new products every few weeks might rely on company-wide emails to track changes. The faster the iteration schedule, the smaller and less formal the announcement becomes.

When you reach the point of releasing changes a couple of times a day, this tracking needs to be automated and distributed to places where it is quickly accessible, such as your monitoring tools and IRC channels. At Etsy, we are releasing changes to code and application configs over 25 times a day. When the system metrics we monitor start to skew, we need to be able to immediately identify whether this is a human-induced change (application code) or not (hardware failure, third-party APIs, etc.). We do this by tracking the time of every single change we ship to our production servers.

We’ve been using Graphite for monitoring application-level metrics for nearly a year now. These include things like numbers of new registrations, shopping carts, items sold, images uploaded, forum posts, and application errors. Getting metrics into Graphite is simple: you send a metric name, a value, and the current Unix timestamp. To track time-based events, the value sent for the metric can simply be “1”. Erik Kastner added this right into our code deployment tool so that every single deploy is automatically tracked. You didn’t think we did this by hand, did you?

events.deploy.website 1 1287106599
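
A deploy tool can emit that line over Graphite's plaintext protocol. What follows is a minimal Python sketch, not the actual hook in Etsy's deployment tool: the function names are made up for illustration, the host is a placeholder, and port 2003 assumes Carbon's default plaintext listener.

```python
import socket
import time


def format_event(metric, timestamp=None, value=1):
    """Build one plaintext-protocol line: 'name value timestamp' plus newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{metric} {value} {int(timestamp)}\n"


def send_event(metric, host="graphite.example.com", port=2003, timestamp=None):
    """Send a single event data point to Carbon over TCP (host/port are assumptions)."""
    line = format_event(metric, timestamp)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
    return line


if __name__ == "__main__":
    # Rebuilds the example line from the post without touching the network.
    print(format_event("events.deploy.website", 1287106599), end="")
```

Calling something like `send_event("events.deploy.website")` as the last step of a deploy script would record the event at the moment the deploy finishes.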

The trick to displaying events in Graphite is to apply the drawAsInfinite() function. This displays events as a vertical line at the time of the event. (Hat tip: Mark Lin, since this is not well documented.) The result looks like this:

http://graphite.example.com/render/
?target=drawAsInfinite%28events.deploy.website%29
&from=-24hours

Graphite has a wonderfully flexible URL API that allows for mixing together multiple data sets in a single graph. We can mix our code deployments right into the graph of PHP warnings we saw above.
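
As a rough sketch of how such a combined URL could be put together in Python: the render API accepts a repeated `target` parameter, one per series. The warnings metric name below is hypothetical (the real one is not given in this post), and `overlay_url` is only an illustrative helper.

```python
from urllib.parse import urlencode


def overlay_url(base, targets, window="-24hours"):
    """Build a render URL with one 'target' parameter per series."""
    params = [("target", t) for t in targets] + [("from", window)]
    return f"{base}?{urlencode(params)}"


if __name__ == "__main__":
    url = overlay_url(
        "http://graphite.example.com/render/",
        [
            "errors.php.warnings",  # hypothetical metric name
            "drawAsInfinite(events.deploy.website)",  # deploys as vertical lines
        ],
    )
    print(url)
```

Keeping the deploy series as a separate target means the same event line can be dropped onto any other graph by adding one more parameter.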

Ah-ha! A code change occurred right after 4 PM that set off the warnings. And you can see that a second deploy was made about 10 minutes later that fixed most of the warnings, and a third deploy that squashed anything remaining.

We maintain a battery of thousands of tests that run against our application code before every single deploy, and we're adding more every day. Combined with engineers pairing up for code reviews, we catch most issues before they get deployed. Tracking every deploy allows us to quickly detect any bugs that we missed.

Equally useful is the reassurance we have that we can deploy many times a day without disrupting core functionality on the site. Across the 16 code deploys shown below, not a single one caused an unexpected blip in our member logins.

These tools highlight the good events along with the bad. Ian Malpass, who works on our customer support tools, uses Graphite to monitor the number of new posts written in our forums, where Etsy members discuss selling, share tips, report bugs, and ask for help. When we correlate these with deploys, you can see the flurry of excitement in our forums after one of our recent product launches.

Automated tracking of code deploys is essential for teams who practice Continuous Deployment. Monitoring every aspect of your server and network architecture helps detect when something has gone awry. Correlating the times of each and every code deploy helps to quickly identify human-triggered problems and greatly cut down on your time to resolve them.


26 responses to Tracking Every Release

  • Found a link to this post from HackerNews, and decided to give it a look as I have bought several items from Etsy (good interface, I just find “awful” the fact that payment has to be split among Etsy stores, but I guess it is how it should be) in the past.

    The interesting part is that I had never thought about how this kind of problem gets managed in mid-sized companies (i.e. not Amazon, eBay, Google or anything that big). How to track changes, how to see when something goes foul and so on. Thanks for sharing this little bit of information.

    Cheers,

    Ruben

  • memo says:

    It seems that something like mixpanel could solve the big problem here?

  • Ori Lahav says:

    Good one.
    Outbrain is also running on continuous deployment and I definitely agree such a tool is essential.
    Production changes are not only code deploys; they can be many infrastructure changes (adding systems, DNS changes, load balancing, network changes, etc.) that usually do not go through the code deployment tool.
    Currently we are using a Yammer channel with the hashtag #prodchange to follow these manually, but that does not feed into event logging and graphing.
    Did Etsy solve that too?

  • Phill says:

    Another great post.. I’m still waiting for the one that explains how you guys use Matlab though!!

  • Q Wade says:

    At Kynetx, we are wanting to move in the direction of continuous deployments and this article has greatly helped my thinking about the complexity of achieving this goal.

    What other pearls of wisdom do you have for an organization who wants to move really fast….safely?

    Thanks for the post.

  • Dan McKinley says:

    Phill, hopefully we will write a post about it eventually, but the short version is that our recommendations use the Matlab toolchain on EC2.

  • [...] This type of framework can make the reporting and graphing work a lot simpler and quicker. No more worrying about graphing your data and, as Etsy found, you can mix-in other things like your deployment points. [...]

  • Nick says:

    Very interesting tool. Do you use any db monitoring tools and, if so, have you integrated with Graphite?

  • mikebrittain says:

    Nick, we use tools like Cacti and Ganglia to monitor the health of our DBs.

    We monitor and graph some bits of database content (e.g. total registered users) with Graphite so we can plot growth rates over time. This can be done with some lightweight scripts that execute an SQL query and then send the result (a number) to Graphite.
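
A lightweight script of that kind might look roughly like the sketch below. It uses an in-memory SQLite table purely so the example runs on its own; the table, the metric name `stats.users.total`, and the function names are all invented for illustration and are not Etsy's schema or scripts.

```python
import sqlite3
import time


def count_metric_line(conn, metric, query, timestamp=None):
    """Run a query that returns one number and format it as a Graphite line."""
    (value,) = conn.execute(query).fetchone()
    if timestamp is None:
        timestamp = int(time.time())
    return f"{metric} {value} {int(timestamp)}\n"


def demo_line():
    """Build a throwaway database with three users and report the count."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO users (id) VALUES (?)", [(1,), (2,), (3,)])
    return count_metric_line(
        conn, "stats.users.total", "SELECT COUNT(*) FROM users", 1287106599
    )


if __name__ == "__main__":
    print(demo_line(), end="")
```

The resulting line would then be written to Graphite in the same way as any other metric, for example from a cron job.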

  • Ryan says:

    I realize it’s been a while since you posted this, but could you share any details about how you collect the metrics like PHP warnings, logins, and forum posts and add them to Graphite?

  • Ryan, You’ve actually touched on three metrics that are collected in very different ways. I can tell you a high-level summary, but details will probably be in forthcoming posts.

    We aggregate metrics from various logs (e.g. our PHP errors, warnings, fatals, etc.) using a tool that is based originally on “Ganglia-logtailer” (https://bitbucket.org/maplebed/ganglia-logtailer). These are driven by a 1-minute cron job.

    In other places we will execute DB queries (e.g. count of users who logged-in within the last 60 seconds), and also drive that by a 1-minute cron. This one is probably the most straight-forward.

    Lastly, we have a daemon that we’ve written to listen for events that are sent from our application code. Every 10 seconds it aggregates everything it has collected and sends the results to Graphite.
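
To make the aggregate-then-flush idea concrete, here is a small Python sketch. It is not the daemon described above: the class and metric names are hypothetical, and the network listener and the 10-second timer are left out so that only the counting and flushing logic is shown.

```python
from collections import Counter


class EventAggregator:
    """Count events in memory and emit one Graphite line per metric on flush."""

    def __init__(self):
        self._counts = Counter()

    def record(self, name, count=1):
        """Register that an event happened (called once per incoming event)."""
        self._counts[name] += count

    def flush(self, timestamp):
        """Return plaintext lines for everything collected, then reset."""
        lines = [
            f"{name} {total} {int(timestamp)}\n"
            for name, total in sorted(self._counts.items())
        ]
        self._counts.clear()
        return lines


def demo_flush():
    """Three events across two metrics collapse into two lines."""
    agg = EventAggregator()
    agg.record("app.logins")
    agg.record("app.logins")
    agg.record("app.forum_posts")
    return agg.flush(1287106600)


if __name__ == "__main__":
    print("".join(demo_flush()), end="")
```

In a real daemon, a timer would call `flush()` at the chosen interval and write the returned lines to Graphite.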

    I’m sure there will be future posts about these. There are a lot of details.

  • [...] levels: network, machine, and application. (You can read more about our graphs in Mike’s Tracking Every Release [...]

  • Joachim De Lombaert says:

    Can you share your retention configuration from storage-schemas.conf for your events.* metrics? We’re trying to do something similar with drawAsInfinite() but the lines only show up if the time frame is set to less than 3 hours, and even recent events disappear if the displayed time window is larger than that.

  • Joachim, It looks like we capture a lot of 10 second data points for our “events.” The retention looks like this:

    10:120960,60:262974,600:262974
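
For anyone reading that retention string: assuming each archive is written as `seconds_per_point:number_of_points` (the older storage-schemas.conf notation), the time span each archive covers can be worked out with a few lines of Python. The helper name is made up for this example.

```python
def retention_spans(schema):
    """Return (seconds_per_point, points, days_covered) for each archive."""
    spans = []
    for archive in schema.split(","):
        seconds_per_point, points = (int(part) for part in archive.split(":"))
        days = seconds_per_point * points / 86400  # 86400 seconds in a day
        spans.append((seconds_per_point, points, days))
    return spans


if __name__ == "__main__":
    for step, points, days in retention_spans("10:120960,60:262974,600:262974"):
        print(f"{step}s resolution x {points} points = {days:.1f} days")
```

Under that reading, the 10-second archive covers 14 days, the 1-minute archive roughly half a year, and the 10-minute archive roughly five years.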

  • [...] there are things we can do, too, to notice things faster. One thing we are working to add is business metric graphs. We have useful data in Ganglia right now, but we will be using Graphite and Etsy‘s StatsD to [...]

  • [...] – The Second one is to the Nagios where it is registered and from now on it will be shown as vertical line on each of the Nagios graphs. So, In case there is a problem that will be visible in the nagios graphs – we will be able to attribute it to the deployment in proximity. The graphs are exposing 2 pieces of data: the revision number that can be searched in Yammer and the name of the engineer responsible for it so we can refer to him to ask questions. Note: this functionality was inspired from Etsy engineering – Track Every Deployment. [...]

  • Derek says:

    Thanks for some great insight into tracking deployments and monitoring error rates. The relationships between deploys and forum posts is pretty awesome. I’m sure you guys will be a guiding force in helping me grow as a developer.

  • [...] Jenkins instance only dates back to July) so we still have a lot of work to do in terms of building awesome dashboards. And in the future we’d like to harvest more realtime data from both the tests, and from [...]

  • [...] Track every release (Mike Brittain). Here, we write about the methods we use to track the success of every code deploy with application metrics. This is part of the not-so-secret sauce. [...]

  • [...] Ganglia, Munin, etc. What kind of graphs can you make with Graphite? Let's see some cool examples: Etsy’s PHP warnings correlated with code deploys. Other: May 22, 2006 — Myspace Friend Maps; July 31, [...]

  • [...] Mike Brittain on using monitoring to notice significant changes to their site [...]

  • yoavaner says:

    Great post. Are you aware of the Graphite `/events` interface? It allows you to post richer event data than just the “standard” metrics, so you can also add things like `what`, `tags` and `data`. We use it together with our giraffe Graphite dashboard (https://github.com/kenhub/giraffe), so we not only see deployment events, but also more info such as the commit message. We can filter events by tags and so on…

    Graphite supports `target=events('*')` as well as specifying multiple tags, but we found it’s better and more efficient to use its `events/get_data` interface to get the richer (and lighter-weight) JSON response.
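
As a hedged illustration of posting such an event, the sketch below builds (but does not send) a JSON POST to the `/events/` endpoint with the `what`, `tags` and `data` fields mentioned above. The optional `when` field, the helper name, and the example values are assumptions; whether `tags` should be a space-separated string or a list depends on the Graphite version, so check the documentation for yours.

```python
import json
from urllib.request import Request


def build_event_request(base_url, what, tags, data="", when=None):
    """Prepare a POST request describing one deploy event."""
    payload = {"what": what, "tags": tags, "data": data}
    if when is not None:
        payload["when"] = int(when)  # epoch seconds; assumed optional
    return Request(
        base_url.rstrip("/") + "/events/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_event_request(
        "http://graphite.example.com",
        what="deploy website",
        tags="deploy website",
        data="commit message goes here",
        when=1287106599,
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the prepared request to `urllib.request.urlopen` would actually submit it.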

  • [...] Tracking every release (Etsy, Code as Craft blog) [...]

  • [...] rely on a little trick that Etsy first showed off (codeascraft – track every release ) where you can graph any value in Graphite as a vertical line when the value is 1 or more.  If [...]

  • […] already visualize code deploys against the myriad graphs we generate, to lend context to whatever we’re measuring.  We use […]

  • […] Send-GraphiteMetric – submits metrics to the Graphite Carbon Daemon using UDP. This can be useful on its own, for example if you want to send a metric so you know when you are about to deploy a new patch from the developers. You can compare the time of this metric to environment performance and see if the patch caused any performance impacts. Etsy has a great article on how they do this with Graphite here: http://codeascraft.com/2010/12/08/track-every-release/ […]

    About

    Etsy At Etsy, our mission is to enable people to make a living making things, and to reconnect makers with buyers. The engineers who make Etsy make our living with a craft we love: software. This is where we'll write about our craft and our collective experience building and running the world's most vibrant handmade marketplace.

    Code as Craft is proudly powered by WordPress.com VIP and the SubtleFlux theme.

    © Copyright 2014 Etsy