Infrastructure upgrades with Chef
In 2020 we updated this post to adopt more inclusive language. Going forward, we’ll use “primary/replica” in our Code as Craft entries.
This post in particular names an open-source repository that we need to rename. We are actively working on that and will update this post ASAP.
As we have written before, at Etsy we run our production stack exclusively on physical hardware. Although this makes bringing up new hosts less elastic, it lets us serve 1.5 billion page views a month on a relatively small number of machines. In addition to our production infrastructure, every engineer and designer also has their own VM to develop on the Etsy stack (see this post for details). This brings us to over 1000 hosts managed by Chef. They are all connected to one Chef server and we run the same cookbooks on all hosts.
We have about 30 engineers regularly making changes to our cookbooks. To make this workflow smooth, we have built knife spork, a knife plugin that helps with versioning cookbooks, updating environments and interacting with the Chef server. For a full overview of our Chef workflow, check out the Velocity workshop and this presentation at the Chef NYC meetup. For some more background on our Chef setup, see this talk from ChefConf 2012. This workflow allows us to promote changes to our production infrastructure about 20 times on an average day.
How we roll out changes
The rather static nature of our infrastructure means that we don’t usually spin up new hosts to test our changes and then switch over, as doing this at our change frequency would require a lot of extra infrastructure. Instead, we have two main ways to test and roll out changes to the live infrastructure.
Testing with knife spork and knife flip
In our Chef workflow we have three environments: production, development and testing. The production and development environments contain all of our nodes serving etsy.com and the nodes which make up our development infrastructure, respectively. Those environments contain explicitly pinned cookbook versions, meaning that all nodes in those environments run exactly the specified version of a cookbook. The third environment, “testing”, doesn’t have any version constraints and has no fixed set of nodes. While we test changes to our cookbooks, we move a node that would be affected by those changes to the testing environment. As soon as the cookbook version is updated and the new cookbook is uploaded to the server, it runs on all hosts in the testing environment, including the node we just moved there. So if we want to change something in the apache cookbook, we can depool a web node, change it to be in the testing environment instead of production, trigger a Chef run (or wait 10 minutes, as this is the interval at which we run chef-client) and test the changes. We have also created a knife plugin called knife flip, which automates the environment change with a simple command. After the changes are tested on the web node, we flip it back to production and promote the cookbook changes to the development and production environments.
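To illustrate the difference between pinned and unconstrained environments, here is a minimal plain-Ruby sketch. It does not use Chef itself: the version list, the pin table and the `resolve_version` helper are all hypothetical, and the sketch assumes that a node gets the highest uploaded version that satisfies its environment's constraint.

```ruby
require "rubygems"

# Hypothetical cookbook versions that have been uploaded to the server.
AVAILABLE_VERSIONS = {
  "apache" => %w[1.0.0 1.1.0 1.2.0]
}.freeze

# Hypothetical environments: production and development pin an exact
# version, while testing carries no constraints at all.
ENVIRONMENT_PINS = {
  "production"  => { "apache" => "= 1.1.0" },
  "development" => { "apache" => "= 1.1.0" },
  "testing"     => {}
}.freeze

# Return the highest available version of a cookbook that satisfies the
# environment's constraint. A missing pin is treated as "any version".
def resolve_version(environment, cookbook)
  constraint  = ENVIRONMENT_PINS.fetch(environment).fetch(cookbook, ">= 0")
  requirement = Gem::Requirement.new(constraint)
  AVAILABLE_VERSIONS.fetch(cookbook)
                    .map { |v| Gem::Version.new(v) }
                    .select { |v| requirement.satisfied_by?(v) }
                    .max
end

puts resolve_version("production", "apache") # the pinned version
puts resolve_version("testing", "apache")    # the newest uploaded version
```

Under these assumptions, a node flipped into testing picks up the freshly uploaded 1.2.0 on its next run, while every node left in production stays on the pinned 1.1.0 until the change is promoted.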
While this is useful for short-lived testing, it essentially blocks everybody else from working on this cookbook. This is often fine, since people rarely have to work on the same cookbook at the same time. However, longer-lasting ramp-ups (for example, rolling out a new version of PHP to all web nodes) would be impossible, as they would block the cookbook for weeks. That’s why we use a different approach for rolling out major changes to hosts.
The basic pattern of our allowlisting approach is based on data bags which contain a list of hosts that are allowed to receive the update. In the recipe we then test whether or not the current node is in this allowlist and, if it is, run a different branch of the recipe. This is very similar to how we branch in our web code via the Feature API. To make this easier, we created a library cookbook which we are releasing as open source today.
chef-whitelist is a simple library cookbook which we include as a dependency in all cookbooks that run an allowlisted change. Adding and using an allowlist can be accomplished in two simple steps. First, create a new data bag in the “whitelist” namespace. This is the default namespace and can be changed to whatever you want. The data bag has to contain at least a “patterns” key (also configurable) with an array of hostnames. To make it easier to allowlist groups of similar nodes, wildcard hostnames are also allowed there. Then, in your recipe, you can do something like this:
if node.is_in_whitelist? "new_whitelist"
  # new hawtness
else
  # old way of doing things
end
Now every time this recipe is run, it checks the node for inclusion in the allowlist “new_whitelist” and then acts accordingly.
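As a rough illustration of what such a check involves, here is a plain-Ruby sketch of matching a hostname against a list of patterns that may contain wildcards. This is not the library's actual implementation: the hash stands in for a data bag item, the hostnames are made up, and `host_in_allowlist?` is a hypothetical helper that uses Ruby's `File.fnmatch?` for the wildcard matching.

```ruby
# Hypothetical stand-in for a data bag item in the "whitelist" namespace.
# The "patterns" key holds exact hostnames and wildcard hostnames.
NEW_WHITELIST = {
  "id"       => "new_whitelist",
  "patterns" => ["web0001.example.com", "db*.example.com"]
}.freeze

# Return true if the hostname matches any pattern in the item.
# The key name is a parameter because the post describes it as configurable.
def host_in_allowlist?(hostname, item, key = "patterns")
  item.fetch(key, []).any? { |pattern| File.fnmatch?(pattern, hostname) }
end

%w[web0001.example.com web0002.example.com db12.example.com].each do |host|
  branch = host_in_allowlist?(host, NEW_WHITELIST) ? "new branch" : "old branch"
  puts "#{host}: #{branch}"
end
```

With this kind of check, ramping up a change is a matter of adding hostnames or wildcard patterns to the data bag; the recipe itself does not need to change until the old branch is removed.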
For infrastructure changes we have the same continuous deployment mentality with which we develop on the web stack. Small changes are built up and continuously deployed. To support this, we have two ways to ramp up and test changes: one for smaller changes, and one for more elaborate changes which need to be rolled out more slowly. The whitelist library has lowered the bar for hiding a change behind a rollout flag and makes it easier for engineers who don’t work with Chef on a daily basis to safely roll out their changes.
How do you handle rolling out changes to your infrastructure? How does it tie into your development workflow? Let us know in the comments.
You can follow Daniel on Twitter at @mrtazz