Scaling CI at Etsy: Divide and Concur, Revisited
In a past post, Divide and Concur, we described how we split our large test suite into smaller test suites by keeping similar tests together rather than dividing arbitrarily.
Dividing tests by common points of error made triaging systemic failures quick, and enticed everyone to write faster, more deterministic tests, but not all was perfect. Our Jenkins dashboard was quite verbose.
The numerous jobs on our dashboard were great for pinpointing where the failures were, but it was difficult to determine at which stage of the deploy pipeline the failures existed. Some tests were executed on every commit. Some tests were executed when the QA button was pushed. Some tests were executed against a freshly pushed Princess or Production build. We were using the Jenkins IRC plugin, and the number of messages per hour was drowning out necessary communication in the
We needed some way to communicate the test status at each stage of the deployment pipeline.
We considered using Downstream Jobs, but fingerprinting was awkward and difficult to set up, and all in all it wasn’t quite what we were looking for.
We also considered Matrix Jobs, but a Matrix Job is designed to execute several jobs with one or more parameters varied along configuration axes, e.g. build node, operating system, browser, or an arbitrary parameter. This was not a good fit because our jobs had wildly different configurations that could not be coerced into mere parameter differences.
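To make the mismatch concrete, here is a minimal Python sketch of what "varying parameters along configuration axes" amounts to: one job definition expanded into the cross product of its axis values. This is not Jenkins code; the function name `expand_matrix` and the axis names are hypothetical, chosen only for illustration.

```python
from itertools import product
from typing import Dict, List


def expand_matrix(axes: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Return one configuration per combination of axis values.

    Every configuration runs the same job; only the parameter values differ.
    """
    names = list(axes)
    value_lists = [axes[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


if __name__ == "__main__":
    # Hypothetical axes: two operating systems crossed with two browsers.
    for config in expand_matrix({"os": ["linux", "windows"],
                                 "browser": ["firefox", "chrome"]}):
        print(config)
```

Every entry in the result is the same job with different parameter values, which is exactly the shape our unrelated test jobs did not have.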
What we needed was a way to create a Jenkins job type that would execute a selection of arbitrary Jenkins jobs, wait for the jobs to finish, and report a single result while still making it possible to drill down to sub-jobs to determine the sources of failures.
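The behavior we were after can be sketched in a few lines of Python. This is an illustration of the idea only, not the plugin's implementation: the names `Result` and `run_master_build` are hypothetical, sub-jobs are modeled as plain callables, and the rule that the overall result is the worst sub-job result is an assumption made for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor
from enum import IntEnum
from typing import Callable, Dict, Tuple


class Result(IntEnum):
    # Ordered from best to worst so that max() picks the worst outcome.
    SUCCESS = 0
    UNSTABLE = 1  # yellow on the dashboard
    FAILURE = 2   # red on the dashboard


def run_master_build(
    sub_jobs: Dict[str, Callable[[], Result]]
) -> Tuple[Result, Dict[str, Result]]:
    """Run every sub-job, wait for all of them, and return (overall, per_job).

    Assumption for this sketch: the overall result is the worst sub-job result.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(sub_jobs))) as pool:
        futures = {name: pool.submit(job) for name, job in sub_jobs.items()}
        # result() blocks until that sub-job has finished.
        per_job = {name: future.result() for name, future in futures.items()}
    overall = max(per_job.values(), default=Result.SUCCESS)
    return overall, per_job


if __name__ == "__main__":
    # Hypothetical sub-jobs standing in for arbitrary Jenkins jobs.
    overall, per_job = run_master_build({
        "unit-tests": lambda: Result.SUCCESS,
        "integration-tests": lambda: Result.UNSTABLE,
    })
    print(overall.name)
    for name, result in per_job.items():
        print(name, result.name)
```

Returning the per-job results alongside the single overall result is what keeps drill-down possible: the dashboard can show one status per stage while the individual sub-job outcomes remain available.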
So we wrote a Jenkins plugin to achieve this, the Jenkins Master Project Plugin.
Now our Jenkins dashboard represents the deployment pipeline:
When a stage turns red (or yellow), you can click through to that particular Master Build, see which tests failed, and drill down through the results (or even rebuild).
The Triggering User and Master Project plugins are both integral to our latest version of Try.
We have also made our Nagios plugin for Jenkins readily available on the Etsy GitHub account. We used this for experimenting with alerting on the health of Jenkins.
All of these plugins are freely available on GitHub under the Etsy organization. Enjoy!