Putting the Dev in DevOps: Bringing Software Engineering to Operations Infrastructure Tooling

Posted by on February 22, 2016

At Etsy, the vast majority of our computing happens on physical servers that live in our own data centers. Since we don’t do much in the cloud, we’ve developed tools to automate away some of the most tedious aspects of managing physical infrastructure. This tooling helps us take new hardware from initial power on to production-ready in a matter of minutes, saving time and energy for both the data center technicians racking hardware and the engineers who need to bring up new servers. It was only recently, however, that this toolset started getting the love and attention that really exemplifies the idea of code as craft.

The Indigo Tool Suite

The original idea for this set of tools came from a presentation on Scalable System Operations that a few members of the ops team saw at Velocity in 2012. Inspired by the Collins system that Tumblr had developed, but disappointed that it wasn’t, at the time, open source or able to work out of the box with our particular stack of infrastructure tools, the Etsy ops team started writing our own. In homage to Tumblr’s Phil Collins tribute, we named the first Ruby script of our own operations toolset after his bandmate Peter Gabriel. As that one script grew into many, the naming scheme continued, with the full suite and all its components eventually being named after Gabriel and his songs.

While many of the technical details of the architecture and design of the tool suite as it exists today are beyond the scope of this post, here is a brief overview of the components that currently exist. They fall into two general categories based on who uses them. The first category is used by our data center team, who handle things like unboxing and racking new servers as well as hardware maintenance and upgrades.

The other set of tools is primarily used by engineers working in the office, enabling them to take boxes that have already been set up by the data center team and by sledgehammer and get them ready to be used for specific tasks.

The interface to install a new server with the Gabriel tool


While many of the details of the inner workings of this automation tooling could be a blog post in and of themselves, the key aspect of the system for this post is how interconnected it is. Sledgehammer’s unattended mode, which has saved our data center team hundreds, if not thousands, of hours of adding server information to RackTables by hand, depends on the sledgehammer payload, the sledgehammer executor, the API, and the shared libraries that all of these tools use working together perfectly. If any one part of that combination isn’t working with the others, the whole thing breaks, which gets in the way of people, especially our awesome data center team, getting their work done.
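To make the shape of that dependency concrete, here is a rough sketch, not the real payload code, of the kind of handoff the unattended flow relies on: the payload gathers a few hardware facts on a freshly powered-on box and reports them to the Indigo API, which records them in RackTables. The endpoint path and field names below are hypothetical.

require "json"
require "net/http"
require "uri"

# The API endpoint would normally come from the host's sledgehammer config;
# the /hosts path and the field names are illustrative, not the real API.
INDIGO_HOSTS_URL = URI("http://testindigo.etsy.com:12345/api/v1/hosts")

# Collect a couple of hardware facts the way a payload might on first boot.
facts = {
  "serial"   => `dmidecode -s system-serial-number`.strip,
  "hostname" => `hostname -f`.strip,
}

# Hand the facts off to the API, which is responsible for recording them.
response = Net::HTTP.post(INDIGO_HOSTS_URL, facts.to_json,
                          "Content-Type" => "application/json")
puts "Indigo API responded with #{response.code}"

If either side of that exchange changes without the other, new boxes quietly stop registering themselves, which is exactly the sort of breakage the rest of this post is about preventing.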

The Problem

Over the years, many, many features have been added to Indigo, and as members of the operations team worked to add those features, they tried to avoid breaking things in the process. But testing had never been high on Indigo’s list of priorities – when people started working on it, they thought of it more as a collection of ops scripts that “just work” than as a software engineering project. Time constraints sometimes played a role as well – for example, sledgehammer’s unattended mode, in all its complex glory, was rolled out in a single afternoon because a large portion of our recent data center move was scheduled for the next day, and at that point it was more important to get the feature into the DC team’s hands than it was to write tests.

For years, the only way of testing Indigo’s functionality was to push changes to production and see what broke—certainly not an ideal process! A lack of visibility into what was being changed compounded the frustration with this process.

When I started working on Indigo, I was one of the first people with a formal computer science background to have touched the code, so one of the first things I thought of was adding unit tests, as we have for so much of the other code we write at Etsy. I soon discovered that, because the majority of the Indigo code had been written without testability in mind, I would have to do some significant refactoring before I could even start writing unit tests. That meant we first had to lay some groundwork to be able to refactor without being too disruptive to other users of these tools: refactoring without any way to test the impact of my changes on the data center team was just asking for everyone involved to have a bad time.

Adding Tests (and Testability)

Some of the most impactful changes we’ve made recently have been around finding ways to test the previously untestable unattended sledgehammer components. Our biggest wins in this area have come from making the tools configurable enough to run against a test environment instead of production. For example, a testhost.yml file like the one below points sledgehammer at a test payload, a test Indigo API instance, and a test notification recipient:

testhost.yml

payload: "sledgehammer-payload-0.5-test-1.x86_64.rpm"
unattended: "true"
unattended_run_recipient: "testuser@etsy.com"
indigo_url: "http://testindigo.etsy.com:12345/api/v1"
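Once the tools read their settings from a file like this, a unit test can point them at the test values and stub out the network entirely. As a minimal sketch, assuming a hypothetical UnattendedRun class standing in for the real sledgehammer code, a Minitest test might look something like this:

require "minitest/autorun"
require "yaml"

# Hypothetical stand-in for the code under test: given a config hash and an
# API client, it decides whether to run unattended and whom to notify.
# This is illustrative only, not actual Indigo code.
class UnattendedRun
  def initialize(config, api_client)
    @config = config
    @api    = api_client
  end

  def unattended?
    @config["unattended"] == "true"
  end

  def notify_recipient
    @api.send_report(@config["unattended_run_recipient"])
  end
end

class UnattendedRunTest < Minitest::Test
  def setup
    # Load the test config rather than anything pointing at production.
    @config = YAML.load_file("testhost.yml")
  end

  def test_unattended_mode_is_enabled_by_the_test_config
    assert UnattendedRun.new(@config, nil).unattended?
  end

  def test_report_goes_to_the_test_recipient
    # Stub the API client so no real Indigo API is ever contacted.
    api = Minitest::Mock.new
    api.expect(:send_report, true, ["testuser@etsy.com"])

    UnattendedRun.new(@config, api).notify_recipient
    api.verify
  end
end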

With changes like these in place, we are able to have much more confidence that our changes won’t break the unattended sledgehammer tool that is so critical for our data center team. This enables us to more effectively refactor the Indigo codebase, whether that be to improve it in general or to make it more testable.

I gave a presentation at OpsSchool, our internal lecture series (inspired by opsschool.org) covering a variety of operations-related topics, on how to change the Indigo code to make it better suited to unit testing. Unit testing itself is beyond the scope of this post, but for us this has meant things like changing method signatures so that objects that might be mocked or stubbed out can be passed in during tests, and splitting up the large, gnarly methods that grew organically along with the Indigo codebase over the past few years into smaller, more testable pieces. This way, other people on the team are able to help write unit tests for all of Indigo’s shared library code as well.
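As a concrete illustration of the signature changes, here is a small before-and-after sketch. The names (RackTablesClient, register_host) are made up for the example rather than taken from the real Indigo code; the point is simply that a dependency constructed inside a method cannot be replaced in a test, while one passed in as a parameter can:

require "minitest/autorun"

# Illustrative stand-in, not real Indigo code: a client that talks to RackTables.
class RackTablesClient
  def initialize(url)
    @url = url
  end

  def add_host(hostname)
    # In real life this would make a network call.
    "added #{hostname} via #{@url}"
  end
end

# Before: the client is created inside the method, so a unit test can't run
# this without reaching out to a real RackTables instance.
def register_host_old(hostname)
  RackTablesClient.new("https://racktables.example.com").add_host(hostname)
end

# After: the client is a parameter with a default, so production callers are
# unchanged while a test can pass in a mock or stub instead.
def register_host(hostname, client = RackTablesClient.new("https://racktables.example.com"))
  client.add_host(hostname)
end

class RegisterHostTest < Minitest::Test
  def test_uses_the_injected_client
    client = Minitest::Mock.new
    client.expect(:add_host, true, ["web0001.example.com"])

    register_host("web0001.example.com", client)
    client.verify
  end
end

Splitting up the larger methods pays off in the same way: once a gnarly method is broken into smaller pieces, each piece can get a focused test like the one above.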

Deploying, Monitoring, and Planning

As mentioned previously, one of the biggest headaches with this tooling had been keeping all the different moving pieces in sync when people were making changes. To fix this, we decided to leverage the work that had already been put into Deployinator by our dev tools team. We created an Indigo deployinator stack that, among other things, ensures that the shared libraries, API, command line tools, and sledgehammer payload are all deployed at the same time. It keeps these deploys in sync, handles the building of the payload RPM, and restarts all the Indigo services to make sure that we never again run into issues where the payload stops working because it didn’t get updated when one of its shared library files did or vice versa.
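The stack definition itself is internal, so the sketch below is illustrative rather than actual Deployinator code; it only shows the property the stack gives us, namely that everything is built and rolled out as one ordered sequence:

# Illustrative only: not the actual Deployinator stack, just the ordering it enforces.
STEPS = [
  :build_payload_rpm,         # build the sledgehammer payload from the same commit
  :deploy_shared_libraries,   # push the shared library code everything depends on
  :deploy_api,                # roll out the Indigo API
  :deploy_command_line_tools, # roll out gabriel and friends
  :restart_indigo_services,   # restart services so nothing keeps running stale code
]

def run_deploy(steps)
  steps.each do |step|
    puts "running #{step}..."
    # Each step would shell out to the real build/deploy tooling here.
    # If any step failed we would stop, so the payload, API, libraries, and
    # CLI tools can never end up deployed out of sync with one another.
  end
end

run_deploy(STEPS)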

Deploying the various components of Indigo with Deployinator

Additionally, it automatically emails release notes to everyone who uses the Indigo toolset, including our data center team. These release notes, generated from the git commit logs for all the commits being pushed out with a given deploy, provide some much-needed visibility into how the tools are changing. Of course, this meant making sure everyone was on the same page about writing commit messages that would be useful in this context! This way the data center folks, geographically removed from the ops team making these changes, get a heads up when the tools they rely on are about to change.
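As a minimal sketch of the release-notes half of this, assuming the deploy knows the previously deployed SHA and the newly deployed one, the notes are essentially a formatted git log over that range (the mailing step is omitted, and the SHAs and repo path below are placeholders):

require "open3"

# Turn the range of commits going out with a deploy into email-ready notes.
def release_notes(old_sha, new_sha, repo_path)
  notes, status = Open3.capture2(
    "git", "-C", repo_path, "log", "--no-merges",
    "--pretty=format:* %s (%an)", "#{old_sha}..#{new_sha}"
  )
  raise "git log failed" unless status.success?
  notes
end

# Example usage with placeholder SHAs and repo path; the real values would
# come from the deploy itself.
puts release_notes("abc1234", "def5678", "/path/to/indigo")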

Email showing release notes for Indigo generated by Deployinator

Finally, we’re changing how we approach the continued development and maintenance of this software. Indigo started out as a single Ruby script and evolved into a complex, interconnected set of tools, but for a while the in-depth knowledge of all the tools and their interconnections existed solely in the heads of a couple of people. Going forward, we’re documenting not only how to use the tools but how to develop and test them, and encouraging more members of the team to get involved so that no individual becomes a single point of knowledge. We’re keeping testability in mind as we write more code, so that we don’t end up with any more code that has to be refactored before it can even be tested. And we’re developing with an eye to the future, planning which features will be added and which bugs are highest priority to fix, always keeping in mind how our work will impact the people who use these tools the most.

Conclusion

We operations engineers don’t always think of ourselves as developers, but there’s a lot we can learn from our friends in the development world. Instead of writing code willy-nilly as the need arises, we should plan how best to develop the tooling we use, being considerate of our future selves who will have to maintain and debug this code months or even years down the line.

Tools to provision hardware in a data center need tests and documentation just as much as consumer-facing product code. I’m excited to show that operations engineers can embrace the craftsmanship of software engineering to make our tools more robust and scalable.
