Infrastructure upgrades with Chef

Posted by Daniel Schauenberg on August 2, 2013

Infrastructure overview

As we have written before, at Etsy we run our production stack exclusively on physical hardware. While this is less elastic when it comes to bringing up new hosts, it gives us the power to serve 1.5 billion page views a month on a relatively small number of machines. In addition to our production infrastructure, every engineer and designer also has their own VM to develop on the Etsy stack (see this post for details). This brings us to over 1000 hosts managed by Chef. They are all connected to one Chef server and we run the same cookbooks on all hosts. We have about 30 engineers regularly making changes to our cookbooks. In order to make this workflow smooth, we have built knife spork, a knife plugin that helps with versioning cookbooks, updating environments and interacting with the Chef server. For a full overview of our Chef workflow check out the Velocity workshop and this presentation at the Chef NYC meetup. And for some more background on our Chef setup, see this talk from ChefConf 2012. This workflow allows us to promote changes to our production infrastructure about 20 times on an average day.

How we roll out changes

The rather static nature of our infrastructure means that we don’t usually spin up new hosts to test our changes and then switch over, as doing that at this frequency would require a lot of extra infrastructure. Instead, we have two main ways of testing and rolling out changes to the live infrastructure.

Testing with knife spork and knife flip

In our Chef workflow we have three environments: production, development and testing. The production and development environments contain all of the nodes serving etsy.com and the nodes that make up our development infrastructure, respectively. Both environments explicitly pin cookbook versions, meaning that all nodes in them run exactly the specified version of a cookbook. The third environment – “testing” – doesn’t have any version constraints and has no fixed set of nodes. While we test changes to a cookbook, we move a node that would be affected by those changes into the testing environment. As soon as the cookbook version is bumped and the new cookbook is uploaded to the server, it will run on all hosts in the testing environment and thus also on the node we just moved there. So if we want to change something in the apache cookbook, we can depool a web node, move it from production into testing, trigger a Chef run (or wait 10 minutes, which is the interval at which we run chef-client) and test the changes. We have also created a knife plugin called knife flip which automates the environment change with a simple command. After the changes are tested on the web node, we flip it back to production and promote the cookbook changes to be used in the development and production environments.
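
To make the version pinning concrete, here is a rough sketch of what the environment files could look like in Chef’s environment DSL; the cookbook names and version numbers below are purely illustrative, not our actual pins.

# environments/production.rb -- every cookbook is pinned to an exact version
name "production"
description "Nodes serving etsy.com"
cookbook "apache", "= 1.4.2"
cookbook "php",    "= 2.0.1"

# environments/testing.rb -- no pins, so nodes here always run the latest uploaded cookbooks
name "testing"
description "Nodes temporarily flipped here to test cookbook changes"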

While this flip-to-testing approach is useful for short-lived testing, it essentially blocks everybody else from working on that cookbook. This is often fine, since people rarely have to work on the same cookbook at the same time. However, longer lasting ramp ups (for example rolling out a new version of PHP to all web nodes) would be impossible, as they would block the cookbook for weeks. That’s why we use a different approach for rolling out major changes to hosts.

Whitelisting hosts

Our whitelisting approach is based on data bags that contain a list of hosts allowed to receive the update. In the recipe we then check whether the current node is in this whitelist and, if it is, run a different branch of the recipe. This is very similar to how we do branching in our web code via the Feature API. To make this easier we created a library cookbook, which we are releasing as open source today.
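
Before we get to the library, here is a rough sketch of that underlying pattern. This is illustrative rather than the actual library code, and it assumes a data bag named “whitelist” containing an item “new_whitelist” with a “patterns” array of (possibly wildcarded) hostnames:

# Look up the list of allowed hosts and check the current node against it.
whitelist = data_bag_item("whitelist", "new_whitelist")
allowed = whitelist["patterns"].any? do |pattern|
  # File.fnmatch lets entries like "web*.example.com" match groups of hosts
  File.fnmatch(pattern, node["fqdn"])
end

if allowed
  # new hawtness
else
  # old way of doing things
end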

Introducing chef-whitelist

chef-whitelist is a simple library cookbook which we include as a dependency in all cookbooks that run a whitelisted change. Adding and using a whitelist can be accomplished in two simple steps. First, create a new data bag in the “whitelist” namespace; this is the default namespace and can be changed to whatever you want. The data bag has to contain at least a “patterns” key (the key name is also configurable) with an array of hostnames. To make it easier to whitelist groups of similar nodes, wildcard hostnames are also allowed. Then in your recipe you can do something like this:

if node.is_in_whitelist? "new_whitelist"
  # new hawtness
else
  # old way of doing things
end

Now every time this recipe runs, it checks whether the node is included in the “new_whitelist” whitelist and acts accordingly.
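
As a hypothetical usage example, ramping a new PHP package version behind a whitelist could look roughly like this; the “php54_rollout” data bag item, the package name and the version strings are all made up for illustration:

# Assumes a data bag item "php54_rollout" in the default "whitelist" data bag,
# e.g. with "patterns": ["web*.example.com"] -- the wildcard targets all web hosts.
php_version = if node.is_in_whitelist? "php54_rollout"
                "5.4.17-1"   # new hawtness
              else
                "5.3.27-1"   # old way of doing things
              end

package "php" do
  version php_version
end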

Verdict

For infrastructure changes we apply the same continuous deployment mentality with which we develop our web stack: small changes are built up and deployed continuously. To support this we have two ways to ramp up and test changes, one for small, short-lived changes and one for more elaborate ones that need to be rolled out more slowly. The whitelist library has lowered the bar for hiding a change behind a rollout flag and makes it easier for engineers who don’t work with Chef on a daily basis to safely roll out their changes.

How do you handle rolling out changes to your infrastructure? How does it tie into your development workflow? Let us know in the comments.

You can follow Daniel on Twitter at @mrtazz

Category: engineering, infrastructure, operations

8 Comments

Thanks Daniel, interesting topic for me while I’m building a bunch of Chef recipes. What tools do you use for testing the nodes that you’ve flipped?

Thanks,

Matt

    Hey Matt,

    We usually just do manual testing on the nodes we flip, making sure the changes are there and things are still working (and not alerting in Nagios).

Excellent write up Daniel. Have you thought of an approach which doesn’t require the Chef recipe to contain both the old and new ways? I’m thinking that with a tool like librarian-chef there may be some way to manage running different versions of the same cookbook on different nodes based on the data bag. This keeps your recipes short, easy to understand, and doesn’t require you to go back later and clean out the old way of doing things.

    We haven’t really considered such a solution, since we are not using a workflow based on librarian or Berkshelf. This way it is also easier to see what is going on in a recipe just by looking at the current git checkout; otherwise we’d have to jump between different versions to get the same information. The problem with cleaning out the old way is definitely there, and right now we just have to remember to do it when the rollout is done.

Daniel, thanks for sharing. Feature flags for infrastructure == awesome.

When you flip a node into the “testing” environment, how do you ensure that the cookbook running in that environment isn’t going to mutate the system in such a way that when you flip it back, it’s not the same as its original peers?

    Hi Julian,

    That’s a good question. Usually, once we have verified the changes on the node in “testing”, we promote the cookbook before flipping it back. That way all nodes “catch up” to the node in testing and they are all the same again when that node rejoins. If something goes wrong on the flipped node, our changes are usually small enough that we can just remove them by hand and re-run Chef once the node is back in its original environment.

[…] So how does Etsy avoid imploding when all those devs are making frequent changes to production at the same time? Their continuous delivery pipeline is made possible by Chef. […]
