Migrating to Chef 11

Posted on October 16, 2013

Configuration management is critical to maintaining a stable infrastructure. It helps ensure systems and software are configured in a consistent and deterministic way. For configuration management we use Chef.  Keeping Chef up-to-date means we can take advantage of new features and improvements. Several months ago, we upgraded from Chef 10 to Chef 11 and we wanted to share our experiences.

Prep

We started by setting up a new Chef server running version 11.6.0.  This was used to validate our Chef backups and perform testing across our nodes.  The general plan was to upgrade the nodes to Chef 11, then point them at the new Chef 11 server when we were confident that we had addressed any issues.  The first order of business: testing backups.  We’ve written our own backup and restore scripts and we wanted to be sure they’d still work under Chef 11.  Also, these scripts would come in handy to help us quickly iterate during break/fix cycles and keep the Chef 10 and Chef 11 servers in sync.  Given that we can have up to 70 Chef developers hacking on cookbooks, staying in sync during testing was crucial to avoiding time lost to troubleshooting issues related to cookbook drift.

Once the backup and restore scripts were validated, we reviewed the known breaking changes present in Chef 11.  We didn’t need much in the way of fixes other than a few attribute precedence issues and updating our knife-lastrun handler to use run_context.loaded_recipes instead of node.run_state().

Unforeseen Breaking Changes

After addressing the known breaking changes, we moved on to testing classes of nodes one at a time.  For example, we upgraded a single API node to Chef 11, validated that Chef ran cleanly against the Chef 10 server, then proceeded to upgrade the entire API cluster and monitor it before moving on to another cluster.  In the case of the API cluster, we found an undocumented breaking change that prevented those nodes from forwarding their logs to our log aggregation hosts.  This episode had us scratching our heads at first and warrants a little attention, as it may help others during their upgrade.

The recipe we use to configure syslog-ng sets several node attributes, for various bits and bobs.  The following line in our cookbook is where all the fun started:

if !node.default[:syslog][:items].empty?

That statement evaluated to false on the API nodes running Chef 11, resulting in a vanilla syslog-ng.conf file that didn’t direct the service to forward any logs.  Thinking that we could reference those nested attributes via the :default symbol (i.e., node[:default][:syslog]), we updated the cookbook.  The Chef 11 nodes were content, but all of the Chef 10 nodes then failed to converge because of the change.  It turns out that accessing default attributes via the node.default() method and via the node[:default] symbol are not equivalent.  To work around this, we updated the recipe to check for Chef 11 or Chef 10 behavior and assign our variables accordingly.  See the example below:

if node[:syslog].respond_to?(:has_key?)
  # Chef 11
  group = node[:syslog][:group] || raise("Missing group!")
  items = node[:syslog][:items]
else
  # Chef 10
  group = node.default[:syslog][:group] || raise("Missing group!")
  items = node.default[:syslog][:items]
end

In Chef 11, node[:syslog] returns the attribute data we need as an ImmutableHash, which responds to the .has_key?() method; in that case, we pull in the needed attributes Chef 11-style.  On a Chef 10 client that test fails, so we pull in the attributes using the .default() method.
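The same dispatch can be sketched outside of Chef with a stand-in object.  Note that FakeChef10Attr and syslog_items below are made up for illustration, not real Chef classes or methods; the stand-in only mimics an accessor that lacks a Hash interface:

```ruby
# A made-up class mimicking an attribute accessor with no has_key? method.
class FakeChef10Attr
  def [](key)   # attribute-style access only; no Hash interface
    nil
  end
end

# Choose the access path based on whether the object behaves like a Hash,
# mirroring the respond_to?(:has_key?) check in the recipe above.
def syslog_items(attrs)
  if attrs.respond_to?(:has_key?)
    attrs[:items] || []   # Chef 11-style: the attribute behaves like a Hash
  else
    []                    # Chef 10-style fallback path
  end
end

puts syslog_items({ items: %w[auth cron] }).inspect  # ["auth", "cron"]
puts syslog_items(FakeChef10Attr.new).inspect        # []
```

The point of respond_to? here is that it dispatches on capability rather than on a version string, so the same recipe converges on both client versions.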

Migration

Once we had upgraded all of our nodes and addressed any issues, it was time to migrate to the Chef 11 server.  To be certain that we could recreate the build and that our Chef 11 server cookbooks were in good shape, we rebuilt the Chef 11 server before proceeding.  Since we use a CNAME record to refer to our Chef server in the nodes’ client.rb config file, we thought that we could simply update our internal DNS systems and break for an early beer.  To be certain, however, we ran a few tests by pointing a node at the FQDN of the new Chef server.  It failed its Chef run.

Chef 10, by default, communicates with the server via HTTP; Chef 11 uses HTTPS.  In general, Chef 11 Server redirects Chef 11 clients attempting to use HTTP to HTTPS.  However, this breaks down when the client requests cookbook versions from the server: the client receives an HTTP 405 response.  The reason is that the client sends a POST to the following API endpoint to determine which versions of the cookbooks in its run_list need to be downloaded:

/environments/production/cookbook_versions

If Chef is communicating via HTTP, the POST request is redirected to HTTPS.  No big deal, right?  Well, RFC 2616 is pretty clear that when a request is redirected, “[t]he action required MAY be carried out by the user agent without interaction with the user if and only if the method used in the second request is GET…”  When the Chef 11 client retries the /environments/production/cookbook_versions endpoint via GET, Chef 11 Server responds with an HTTP 405, as it only allows POST requests to that resource.

The fix was to update all of our nodes’ client configuration files to use HTTPS to communicate with the Chef Server.  dsh (distributed shell) made this step easy.
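The change itself was a one-liner in each node's client.rb (the hostname below is illustrative, not our actual server name):

```ruby
# /etc/chef/client.rb -- point the client at the server over HTTPS
chef_server_url "https://chef.example.com"   # previously "http://chef.example.com"
```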

Just before we finalized the configuration update, we put a freeze on all Chef development and used our backup and restore scripts to populate the new Chef 11 server with all the Chef objects (nodes, clients, cookbooks, data bags, etc) from the Chef 10 server.  After validating the restore operation, we completed the client configuration updates and shut down all Chef-related services on the Chef 10 server.  Our nodes happily picked up where they’d left off and continued to converge on subsequent Chef runs.

Post-migration

Following the migration, we found two issues with chef-client that required deep dives to understand and correct.  First, we had a few nodes whose chef-client processes were exhausting all available memory.  Initially, we switched to running chef-client in forking mode.  Doing so mitigated the issue to an extent (the forked child released its allocated memory when it completed and was reaped), but we still saw an unusual memory utilization pattern.  Those nodes were running a recipe that included nested searches for nodes.  Instead of returning node names and searching on those, we were returning whole node objects.  For a long-running chef-client process, this continued to consume available memory.  Once we corrected that issue, memory utilization fell to acceptable levels.

See the following screenshot illustrating the memory consumption for one of these nodes immediately following the migration and after we updated the recipe to return references to the objects instead:

[Screenshot: deploy host memory utilization immediately after the migration, and after updating the nested searches to return references]

Here’s an example of the code in the recipe that created our memory monster:

# find nodes by role, the naughty, memory hungry way
roles = search(:role, '*:*')    # NAUGHTY
roles.each do |r|
  nodes_dev = search(:node, "role:#{r.name} AND fqdn:*dev.*")    # HUNGRY
  template "/etc/xanadu/#{r.name.downcase}.cfg" do
    ...
    variables(
      :nodes => nodes_dev
    )
  end
end

Here’s the same code example, returning object references instead:

# find nodes by role, the community-friendly, energy-conscious way
search(:role, '*:*') do |r|
  fqdns = []
  search(:node, "role:#{r.name} AND fqdn:*dev.*") do |n|
    fqdns << n.fqdn
  end
  template "/etc/xanadu/#{r.name.downcase}.cfg" do
    ...
    variables(
      :nodes => fqdns
    )
  end
end

Second, we found an issue where a chef-client that failed to connect to the server would leave behind its PID file, preventing future instances of chef-client from starting.  This has been fixed in version 11.6.0 of chef-client.
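The failure mode can be sketched in plain Ruby (the pidfile path and helper below are illustrative, not Chef's actual implementation; as noted, chef-client 11.6.0 handles this cleanup itself):

```ruby
require 'tmpdir'

# Returns true when the pidfile names a process that no longer exists --
# the situation that blocked new chef-client instances from starting.
def stale_pidfile?(path)
  return false unless File.exist?(path)
  pid = File.read(path).to_i
  Process.kill(0, pid)   # signal 0: existence check only, sends nothing
  false                  # process is alive; the pidfile is legitimate
rescue Errno::ESRCH
  true                   # no such process: the pidfile is stale
rescue Errno::EPERM
  false                  # process exists but belongs to another user
end

pidfile = File.join(Dir.tmpdir, 'stale-chef-client.pid')
File.write(pidfile, 999_999_999)   # a PID no live process can hold
puts stale_pidfile?(pidfile)       # true
File.delete(pidfile)
```

A supervisor (or the client itself) that runs this check on startup can safely delete the stale file and proceed instead of refusing to start.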

Conclusion

Despite running into a few issues following the upgrade, thorough testing and Opscode’s documented breaking changes helped make our migration fairly smooth. Further, the improvements made in Chef 11 have helped us improve our cookbooks. Finally, because our configuration management system is updated, we can confidently focus our attention on other issues.



8 Comments

Hey Ryan,

Good article! Thanks for taking the time to write this up. You might want to update with a link to dsh in the body above…

Cheers,
Mikel

Ryan, any comments on Ansible?

Hey Ryan,
Really nice write up, thanks!
Can you elaborate on the nested searches idea? Maybe provide code samples of before and after, to display how you mitigated the memory consumption problem?

    Absolutely Mike!

    Using the search() method in recipes without a block returns entire objects. Doing this every 10 minutes (especially with nested searches) in a non-forked chef-client, where the results include thousands of objects, sooner or later uses all available system memory. However, passing a block to search() returns references to the objects, reducing the memory required to store them. Further, running chef-client in forking mode ensures that memory is freed when the process exits.

    I’ve updated the post with a code example and graph of the memory consumption.

    Thanks for the feedback.

Great read Ryan!

Wondering if you could elaborate a bit on how Etsy is hosting Chef (physical hardware? aws? node count etc… ). Sounds like you were running Enterprise in a HA configuration prior to the Chef 11 upgrade.

Been tinkering with getting a Chef 11 HA setup working on AWS and it’s been … interesting so far.

    Ed,

    We run Chef on bare metal. We have a primary Chef server and run daily backups to a secondary, warm Chef server. This setup was true pre-Chef 11 as well (though we used to supplement with backups sent to Opscode’s Hosted Chef platform for a while).

    I’d love to hear about your experiences with HA and Chef 11. Feel free to email me at rfrantz [AT] etsy [DOT] com.