Putting the Dev in Devops: Bringing Software Engineering to Operations Infrastructure Tooling
At Etsy, the vast majority of our computing happens on physical servers that live in our own data centers. Since we don’t do much in the cloud, we’ve developed tools to automate away some of the most tedious aspects of managing physical infrastructure. This tooling helps us take new hardware from initial power-on to production-ready in a matter of minutes, saving time and energy for both the data center technicians racking hardware and the engineers who need to bring up new servers. It was only recently, however, that this toolset started getting the love and attention that really exemplifies the idea of code as craft.
The Indigo Tool Suite
The original idea for this set of tools came from a presentation on Scalable System Operations that a few members of the ops team saw at Velocity in 2012. Inspired by the Collins system that Tumblr had developed but disappointed that it wasn’t yet (at the time) open source or able to work out of the box with our particular stack of infrastructure tools, the Etsy ops team started writing our own. In homage to Tumblr’s Phil Collins tribute, we named the first ruby script of our own operations toolset after his bandmate Peter Gabriel. As that one script grew into many, that naming scheme continued, with the full suite and all its components eventually being named after Gabriel and his songs.
While many of the technical details of the architecture and design of the tool suite as it exists today are beyond the scope of this post, here is a brief overview of the different components that currently exist. These tools can be broken up into two general categories based on who uses them. The first are components used by our data center team, who handle things like unboxing and racking new servers as well as hardware maintenance and upgrades:
- sledgehammer: a command-line tool for getting a machine set up in RackTables (our data center asset management system), formatting and partitioning disks as needed (such as creating RAID arrays), and configuring the out-of-band management interface. It includes a default Cobbler profile that machines boot to if they don’t already have an operating system, so the data center team can power on boxes and have them start installing automatically – we call this sledgehammer’s unattended mode.
- This default Cobbler profile loads a bootable disk image that runs what we call the sledgehammer executor, which performs initial setup steps such as loading various configuration files, then installs and runs another package we build that takes over for the rest of the process.
- This package is called the sledgehammer payload, which consists of some shared Indigo libraries and executable code that actually handles setting up the out-of-band management interface, configuring networking, and saving hardware details back to RackTables. This is built as a separate software package to avoid the friction of rebuilding the entire disk image as often.
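The executor’s job, then, boils down to a three-step pipeline: fetch configuration, install the payload package, and hand control to it. As a rough illustration only – the class name, config keys, and injected steps below are invented for this sketch, not Indigo’s actual code – that handoff might look like:

```ruby
# Hypothetical sketch of the executor-to-payload handoff. The package
# name and entry points are placeholders, not Indigo internals.
class Executor
  def initialize(fetch_config:, install_package:, run_payload:)
    @fetch_config = fetch_config       # e.g. pull config files from the API
    @install_package = install_package # e.g. install the payload RPM
    @run_payload = run_payload         # hand off to the payload's entry point
  end

  # Runs the three steps in order; each step is injected, which also
  # makes this flow easy to exercise in a test.
  def run
    config = @fetch_config.call
    @install_package.call(config.fetch("payload", "sledgehammer-payload"))
    @run_payload.call(config)
  end
end
```

Injecting each step as a callable keeps the orchestration logic separate from the messy details of package installation, which pays off later when we talk about testability.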
The other set of tools are primarily used by engineers working in the office, enabling them to take boxes that have already been set up by the data center team and sledgehammer and get them ready to be used for specific tasks:
- gabriel: a command-line tool for installing an operating system on a machine and getting it configured with Chef
- zaar: a command-line tool for decommissioning boxes and removing them from production
- indigo-sweeper: a daemon that makes sure out-of-band management commands sent to servers and Cobbler syncs aren’t duplicated between multiple users and multiple runs of the tools
- indigo-tailer: a daemon that allows for live-tailing of remote server build logs on the command line and on the web
- indigo-web: a web frontend and API. The web frontend provides a friendlier interface for non-ops engineers who might not know the ins and outs of the command line tool, by providing easier-to-understand form fields rather than relying on a series of command line arguments. The API provides various functionality including an endpoint with a mutex lock to prevent multiple simultaneous builds getting the same IP address assigned to them – this allows multiple boxes to be powered on and provisioned at once without worrying about race conditions.
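To make the race-condition concern concrete, here’s a minimal sketch of serializing IP assignment behind a mutex, assuming an in-process lock and an invented `IpAllocator` class – the real Indigo endpoint’s names and its source of free addresses (RackTables) differ:

```ruby
# Illustrative only: a mutex-guarded allocator so that concurrent builds
# never receive the same IP address. Names are invented for this sketch.
require "thread"

class IpAllocator
  def initialize(free_ips)
    @free_ips = free_ips # in the real system, loaded from RackTables
    @assigned = {}
    @lock = Mutex.new
  end

  # Called once per build request; safe to call from concurrent handlers.
  def allocate(hostname)
    @lock.synchronize do
      # Re-requesting for the same host returns the same address.
      return @assigned[hostname] if @assigned.key?(hostname)
      ip = @free_ips.shift or raise "no free IPs remaining"
      @assigned[hostname] = ip
    end
  end
end
```

With the lock in place, powering on and provisioning several boxes at once can never hand two builds the same address, because only one request at a time can take an IP off the free list.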
While many of the details of the inner workings of this automation tooling could be a blog post in and of themselves, the key aspect of the system for this post is how interconnected it is. Sledgehammer’s unattended mode, which has saved our data center team hundreds—if not thousands—of hours of adding server information to RackTables by hand, depends on the sledgehammer payload, sledgehammer executor, API, and the shared libraries that all these tools use all working together perfectly. If any one part of that combination isn’t working with the others, the whole thing breaks, which gets in the way of people, especially our awesome data center team, getting their work done.
Over the years, many features have been added to Indigo, and as members of the operations team worked to add them, they tried to avoid breaking things in the process. But testing had never been high on Indigo’s list of priorities – when people started working on it, they thought of it more as a collection of ops scripts that “just work” than as a software engineering project. Time constraints sometimes played a role as well – for example, sledgehammer’s unattended mode in all its complex glory was rolled out in a single afternoon, because a large portion of our recent data center move was scheduled for the next day and it was more important at that point to get the feature into the DC team’s hands than it was to write tests.
For years, the only way of testing Indigo’s functionality was to push changes to production and see what broke—certainly not an ideal process! A lack of visibility into what was being changed compounded the frustration with this process.
When I started working on Indigo, I was one of the first people with a formal computer science background to touch the code, so one of the first things I thought of was adding unit tests, like we have for so much of the other code we write at Etsy. I soon discovered that, because the majority of the Indigo code had been written without testability in mind, I would have to do some significant refactoring before I could even start writing unit tests. That meant we first had to lay some groundwork so we could refactor without being too disruptive to other users of these tools – refactoring without any way to test the impact of my changes on the data center team was just asking for everyone involved to have a bad time.
Adding Tests (and Testability)
Some of the most impactful changes we’ve made recently have been around finding ways to test the previously untestable unattended sledgehammer components. Our biggest wins in this area have been:
- Adding a test mode for unattended sledgehammer. The command-line sledgehammer tool has always created a configuration file specific to the server being built with that run of the command, and if the config file wasn’t present, the sledgehammer payload would run in unattended mode. By adding a flag to the command-line version, I was able to force the code to run in this unattended mode easily. This also allowed me to set up other options for testing with our existing command line tools, such as…
- Adding versioning to the sledgehammer payload. Previously, there was only ever one version of the payload rpm—the build script was hard-coded to just always use 0.1. I added an option to build a different version number so that people could test that it worked before creating a new production version. Running command-line sledgehammer in unattended mode could then be told to use this new version instead.
- Adding a test version of the sledgehammer disk image/Cobbler profile. By creating a new Cobbler profile and making a few minor changes to the script we use to build the sledgehammer executor, disk image, and payload, we added the ability to test changes to this part of the process as well. Though the executor code changes only infrequently, a breakage there would significantly disrupt the data center team’s work, so being able to test it makes life much better for them.
- Adding an option to run against a different API URL. Using the server-specific configuration files mentioned previously, I was able to allow people to specify the URL for the Indigo API they wanted to use. This allowed people to run a local version of the API to test changes to it and then test all the build tools against that—a much better solution than just deploying to production and hoping things worked! For example:
```yaml
# testhost.yml
payload: "sledgehammer-payload-0.5-test-1.x86_64.rpm"
unattended: "true"
unattended_run_recipient: "email@example.com"
indigo_url: "http://testindigo.etsy.com:12345/api/v1"
```
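Tying these options together, here’s a rough sketch of how a payload might derive its run settings from such a per-server config file. The default values and option names here are invented for illustration; only the config keys come from the example above:

```ruby
# Hypothetical sketch: resolving build options from a per-server config
# file. Defaults are placeholders, not Indigo's real values.
require "yaml"

DEFAULT_PAYLOAD = "sledgehammer-payload-0.1.x86_64.rpm"
DEFAULT_INDIGO_URL = "https://indigo.example.com/api/v1"

def build_options(config_path)
  # No per-server config file means the box netbooted on its own:
  # run fully unattended with the production defaults.
  unless File.exist?(config_path)
    return { unattended: true, payload: DEFAULT_PAYLOAD,
             indigo_url: DEFAULT_INDIGO_URL }
  end
  cfg = YAML.safe_load(File.read(config_path))
  {
    unattended: cfg["unattended"] == "true",  # flag set by command-line runs
    payload: cfg.fetch("payload", DEFAULT_PAYLOAD),           # test payload version
    indigo_url: cfg.fetch("indigo_url", DEFAULT_INDIGO_URL),  # e.g. a local API
  }
end
```

The important property is that every testing knob – forced unattended mode, an alternate payload version, a different API URL – flows through one config file, so a test run exercises the same code path as a real unattended build.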
With changes like these in place, we are able to have much more confidence that our changes won’t break the unattended sledgehammer tool that is so critical for our data center team. This enables us to more effectively refactor the Indigo codebase, whether that be to improve it in general or to make it more testable.
I gave a presentation at OpsSchool – our internal lecture series, inspired by opsschool.org, that covers a variety of operations-related topics – on how to change the Indigo code to make it better suited to unit testing. Unit testing itself is beyond the scope of this post, but for us this has meant things like changing method signatures so that objects that might be mocked or stubbed out can be passed in during tests, or splitting up large, gnarly methods that grew organically with the Indigo codebase over the past few years into smaller, more testable pieces. This way, other people on the team can help write unit tests for all of Indigo’s shared library code as well.
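A hypothetical before/after illustrates the kind of signature change described above; the `RacktablesClient` class and method names are invented for the example, not Indigo’s actual code:

```ruby
# Before: the method builds its own client internally, so a unit test
# can't intercept the network call to RackTables.
class HardwareRecorder
  def save_hardware_details(host)
    client = RacktablesClient.new("https://racktables.example.com")
    client.update(host, collect_details)
  end
end

# After: the collaborator is passed in, so tests can hand in a stub
# and assert on what would have been sent, with no network involved.
class TestableHardwareRecorder
  def initialize(client)
    @client = client
  end

  def save_hardware_details(host, details)
    @client.update(host, details)
  end
end
```

A test can then construct `TestableHardwareRecorder` with a stub object that records its `update` calls, verifying the behavior without ever touching RackTables.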
Deploying, Monitoring, and Planning
As mentioned previously, one of the biggest headaches with this tooling had been keeping all the different moving pieces in sync when people were making changes. To fix this, we decided to leverage the work that had already been put into Deployinator by our dev tools team. We created an Indigo deployinator stack that, among other things, ensures that the shared libraries, API, command line tools, and sledgehammer payload are all deployed at the same time. It keeps these deploys in sync, handles the building of the payload RPM, and restarts all the Indigo services to make sure that we never again run into issues where the payload stops working because it didn’t get updated when one of its shared library files did or vice versa.
Additionally, it automatically emails release notes to everyone who uses the Indigo toolset, including our data center team. These release notes, generated from the git commit logs for all the commits being pushed out with a given deploy, provide some much-needed visibility into how the tools are changing. Of course, this meant making sure everyone was on the same page with writing commit messages that will be useful in this context! This way the data center folks, geographically removed from the ops team making these changes, have a heads up when things might be changing with the tools they use.
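The note-generation step itself is simple enough to sketch. The formatting below is illustrative (the real Deployinator integration and email plumbing are omitted, and the header text is invented), but the `git log` format string in the comment is standard git:

```ruby
# Hypothetical sketch of turning a deploy's commit range into release
# notes. In a deploy stack, the commit lines could be collected with:
#   git log --pretty=format:'%h %an: %s' OLD_SHA..NEW_SHA
def release_notes(commit_lines, old_sha, new_sha)
  header = "Indigo deploy: #{old_sha[0, 7]}..#{new_sha[0, 7]}"
  body = commit_lines.map { |line| "  * #{line}" }.join("\n")
  "#{header}\n#{body}"
end
```

Because the notes are only as good as the commit subjects feeding them, this is exactly where well-written commit messages pay off for the readers downstream.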
Finally, we’re changing how we approach the continued development and maintenance of this software going forward. Indigo started out as a single ruby script and evolved into a complex interconnected set of tools, but for a while the in-depth knowledge of all the tools and their interconnections existed solely in the heads of a couple people. Going forward, we’re documenting not only how to use the tools but how to develop and test them, and encouraging more members of the team to get involved with this work to avoid having any individuals be single points of knowledge. We’re keeping testability in mind as we write more code, so that we don’t end up with any more code that has to be refactored before it can even be tested. And we’re developing with an eye for the future, planning what features will be added and which bugs are highest priority to fix, and always keeping in mind how the work we do will impact the people who use these tools the most.
Operations engineers may not think of ourselves as developers, but there’s a lot we can learn from our friends in the development world. Instead of writing code willy-nilly as needed, we should plan how best to develop the tooling we use, making sure to be considerate of future-us, who will have to maintain and debug this code months or even years down the line.
Tools to provision hardware in a data center need tests and documentation just as much as consumer-facing product code. I’m excited to show that operations engineers can embrace the craftsmanship of software engineering to make our tools more robust and scalable.