In the world of Ops, monitoring is a tough problem. It gets harder when you have lots and lots of critical moving parts, each requiring constant monitoring. At Etsy, we’ve got a bunch of tools that we use to help us monitor our systems. You might be familiar with some of them: Nagios, StatsD, Graphite, and Ganglia. Today, we’d like to introduce you to a new tool that we’ve been working on for the past few months.
This tool is designed to solve the problem of metrics overload. What kind of overload are we talking about? Well, at Etsy, we really love to make graphs. We graph everything! Anywhere we can slap a StatsD call, we do. As a result, we’ve found ourselves with over a quarter million distinct metrics. That’s far too many graphs for a team of 150 engineers to watch all day long! And even if you group metrics into dashboards, that’s still an awful lot of dashboards if you want complete coverage.
Of course, if a graph isn’t being watched, it might misbehave and no one would know about it. And even if someone caught it, lots of other graphs might be misbehaving in similar ways, and chances are low that folks would make the connection.
We’d like to introduce you to the Kale stack, which is our attempt to fix both of these problems. It consists of two parts: Skyline and Oculus. We first use Skyline to detect an anomalous metric. Then, we search for that metric in Oculus, to see if any other metrics look similar. At that point, we can make an informed diagnosis and hopefully fix the problem.
Skyline is an anomaly detection system. It shows all the current metrics that have been determined to be anomalous:
You can hover over all the metric names and view the graphs directly. Our algorithms do a good job filtering most of the non-anomalous metrics, but for now, they certainly aren’t as good as humans at pattern matching. So, our philosophy is to err on the side of noise: it’s very easy to scroll through a few more false positives if it means you get to catch all the real anomalies.
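To make the idea of "filtering out the non-anomalous metrics" a little more concrete, here is a minimal sketch in Python. This post doesn't describe Skyline's actual algorithms, so everything below is an assumption for illustration: the function names, the sample metric names, and the choice of a simple three-sigma rule (flag a metric when its latest datapoint sits more than three standard deviations from the mean of its earlier datapoints).

```python
from statistics import mean, stdev


def is_anomalous(series, sigmas=3.0):
    """Return True if the latest datapoint is far from the series' history.

    Illustrative three-sigma rule only; not Skyline's real detection logic.
    """
    if len(series) < 3:
        # Too little history to estimate a mean and standard deviation.
        return False
    history, latest = series[:-1], series[-1]
    mu = mean(history)
    sd = stdev(history)
    if sd == 0:
        # A perfectly flat history: any change at all stands out.
        return latest != mu
    return abs(latest - mu) > sigmas * sd


def anomalous_metrics(metrics, sigmas=3.0):
    """Return the sorted names of metrics whose latest point looks anomalous."""
    return sorted(
        name for name, series in metrics.items() if is_anomalous(series, sigmas)
    )


# Hypothetical metric names and values, purely for demonstration.
metrics = {
    "web.requests": [10, 11, 9, 10, 12, 10, 11, 50],
    "db.queries": [5, 6, 5, 6, 5, 6, 5, 6],
}
print(anomalous_metrics(metrics))
```

In a sketch like this, lowering `sigmas` is one way to "err on the side of noise": more metrics get flagged, a human scrolls past the false positives, and fewer real anomalies slip through.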
Once you’ve found a metric that looks suspect, you can click through to Oculus and analyze it for correlations with other metrics!
Oculus is the anomaly correlation component of the Kale system. Once you’ve identified an interesting or anomalous metric, Oculus will find all of the other metrics in your systems which look similar.
It lets you search for metrics, using your choice of two comparison algorithms…
and shows you other metrics which are similar for the same time period.
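As a rough illustration of what "looks similar" could mean, here is a small Python sketch. This post doesn't name Oculus's two comparison algorithms, so this is an assumption rather than a description of Oculus itself: it compares metrics by Euclidean distance after normalizing each series, so that two graphs with the same shape but different scales still match. The function names and sample metrics are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def normalize(series):
    """Rescale a series to zero mean and unit variance so shapes are comparable."""
    mu = mean(series)
    sd = pstdev(series)
    if sd == 0:
        # A flat line has no shape; represent it as all zeros.
        return [0.0] * len(series)
    return [(x - mu) / sd for x in series]


def distance(a, b):
    """Euclidean distance between two normalized series of equal length."""
    if len(a) != len(b):
        raise ValueError("series must cover the same time period")
    na, nb = normalize(a), normalize(b)
    return sqrt(sum((x - y) ** 2 for x, y in zip(na, nb)))


def most_similar(target_name, metrics, limit=5):
    """Return up to `limit` metric names, closest in shape to the target first."""
    target = metrics[target_name]
    scored = [
        (distance(target, series), name)
        for name, series in metrics.items()
        if name != target_name
    ]
    scored.sort()
    return [name for _, name in scored[:limit]]


# Hypothetical metrics sampled over the same time period.
metrics = {
    "web.errors": [1, 2, 3, 10, 3, 2],
    "db.slow_queries": [10, 20, 30, 100, 30, 20],
    "cache.hits": [5, 5, 5, 5, 5, 6],
}
print(most_similar("web.errors", metrics, limit=1))
```

Here the slow-query metric comes out as the closest match because it has the same shape as the error spike, even though its absolute values are ten times larger. A brute-force comparison like this would not scale to a quarter million metrics on its own; it is only meant to show the shape-matching idea.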
You can even save interesting metrics into a collection (complete with your own notes), for example if a particular set of related graphs occurred during a complex outage:
Oculus will then search through all of your saved metric collections along with your other metrics, and will show you any matches alongside your regular search results. What’s that? You’re seeing the same pattern of graphs as you saved during that last site outage you had? Gee, thanks Oculus!
Going further into the juicy technical depths of each system is a tad beyond the scope of this introductory post, but we’ll be sure to post more about it in the coming weeks. We will also be speaking about it at Velocity next week (see the abstract here), so if you’re around, come and say hi! In the meantime, we are open sourcing everything today, and we are looking forward to feedback from the community. Go ahead and try it out! You can find Skyline and Oculus at http://github.com/etsy/skyline and http://github.com/etsy/oculus.
Abe and Jon
Update: It came to our attention that the “Loupe” name is also used by a commercial product, and so we have changed the name to Kale.