Introducing Kale

Posted on June 11, 2013

In the world of Ops, monitoring is a tough problem. It gets harder when you have lots and lots of critical moving parts, each requiring constant monitoring. At Etsy, we’ve got a bunch of tools that we use to help us monitor our systems. You might be familiar with some of them: Nagios, StatsD, Graphite, and Ganglia. Today, we’d like to introduce you to a new tool that we’ve been working on for the past few months.

This tool is designed to solve the problem of metrics overload. What kind of overload are we talking about? Well, at Etsy, we really love to make graphs. We graph everything! Anywhere we can slap a StatsD call, we do. As a result, we’ve found ourselves with over a quarter million distinct metrics. That’s far too many graphs for a team of 150 engineers to watch all day long! And even if you group metrics into dashboards, that’s still an awful lot of dashboards if you want complete coverage.
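
For a sense of how these metrics accumulate, here’s a minimal sketch using a Python StatsD client – the host, metric names, and checkout function are purely illustrative:

```python
# Illustrative only: every one-line call like these creates another
# distinct metric that ends up graphed in Graphite.
import statsd

stats = statsd.StatsClient('statsd.example.com', 8125)

def checkout(cart):
    stats.incr('shop.checkout.attempted')        # a counter
    with stats.timer('shop.checkout.duration'):  # a timer
        total = sum(item['price'] for item in cart)
    stats.gauge('shop.checkout.cart_total', total)  # a gauge
    return total
```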

Of course, if a graph isn’t being watched, it might misbehave and no one would know about it. And even if someone caught it, lots of other graphs might be misbehaving in similar ways, and chances are low that folks would make the connection.

We’d like to introduce you to the Kale stack, which is our attempt to fix both of these problems. It consists of two parts: Skyline and Oculus. We first use Skyline to detect anomalous metrics. Then, we search for that metric in Oculus, to see if any other metrics look similar. At that point, we can make an informed diagnosis and hopefully fix the problem.

Skyline

Skyline is an anomaly detection system. It shows all the current metrics that have been determined to be anomalous:

[Screenshot: the Skyline dashboard listing currently anomalous metrics]

You can hover over all the metric names and view the graphs directly. Our algorithms do a good job of filtering out most of the non-anomalous metrics, but for now, they certainly aren’t as good as humans at pattern matching. So, our philosophy is to err on the side of noise – it’s very easy to scroll through a few more false positives if it means you get to catch all the real anomalies.
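
To give a flavor of what one of those checks looks like (this is not Skyline’s actual code – its algorithm suite lives in the repo), here’s a minimal three-sigma style sketch: flag the latest value if it falls more than three standard deviations from the mean of the preceding points.

```python
# A minimal three-sigma style check, illustrative only. Skyline itself runs
# a configurable set of algorithms and flags a metric when enough of them agree.
import statistics

def is_anomalous(series):
    """Flag the latest data point if it sits more than three standard
    deviations away from the mean of the preceding points."""
    history, latest = series[:-1], series[-1]
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return False
    return abs(latest - statistics.mean(history)) > 3 * stdev

print(is_anomalous([10, 11, 9, 10, 12, 10, 11, 10, 9, 48]))  # True
```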

Once you’ve found a metric that looks suspect, you can click through to Oculus and analyze it for correlations with other metrics!

Oculus

Oculus is the anomaly correlation component of the Kale system. Once you’ve identified an interesting or anomalous metric, Oculus will find all of the other metrics in your systems which look similar.

It lets you search for metrics, using your choice of two comparison algorithms…

[Screenshot: choosing between Oculus’s two comparison algorithms]

and shows you other metrics which are similar for the same time period.

[Screenshot: Oculus search results showing similar metrics over the same time period]
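
As a rough illustration of what that kind of comparison involves (this is not Oculus’s implementation – see its repo for the real algorithms), here’s a sketch that ranks candidate metrics by how closely their normalized shape matches a target series:

```python
# Illustrative shape-based ranking of time series; not Oculus's actual code.
import math

def normalize(series):
    """Rescale to zero mean and unit variance so that shape, rather than
    absolute magnitude, drives the comparison."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series)) or 1.0
    return [(x - mean) / std for x in series]

def distance(a, b):
    """Euclidean distance between two equal-length, normalized series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(normalize(a), normalize(b))))

def most_similar(target, candidates, top=5):
    """Return the names of the candidate series closest in shape to the target."""
    ranked = sorted(candidates, key=lambda name: distance(target, candidates[name]))
    return ranked[:top]

spike = [1, 1, 1, 9, 1, 1]
print(most_similar(spike, {
    'requests': [10, 10, 10, 90, 10, 10],   # same shape, different scale
    'cpu_idle': [50, 51, 49, 50, 52, 50],   # flat
}, top=1))  # ['requests']
```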

You can even save interesting metrics into a collection (complete with your own notes), for example if a particular set of related graphs occurred during a complex outage:

[Screenshot: saving a set of metrics into a collection]

Oculus will then search through all of your saved metric collections along with your other metrics, and will show you any matches alongside your regular search results. What’s that? You’re seeing the same pattern of graphs as you saved during that last site outage you had? Gee, thanks Oculus!

[Screenshot: a saved collection surfaced alongside regular search results]

Going further into the juicy technical depths of each system is a tad beyond the scope of this introductory post, but we’ll be sure to post more about each of them in the coming weeks. We will also be speaking about it at Velocity next week (see the abstract here), so if you’re around, come and say hi! In the meantime, we are open sourcing everything today, and we are looking forward to feedback from the community. Go ahead and try it out! You can find Skyline and Oculus at http://github.com/etsy/skyline and http://github.com/etsy/oculus.

monitoring <3,
Abe and Jon

Abe tweets at @abestanway, and Jon tweets at @jonlives

Update: It came to our attention that the “Loupe” name is also used by a commercial product, and so we have changed the name to Kale.

Category: data, monitoring, operations

19 Comments

Ah monitoring. Large screens mounted on walls. Brings back good memories!

Great job guys. Feels nostalgic as we also have a similar metric overload problem – will check out the tools.

Amazing! Can’t wait to try this out!

We also added a predictive part of this to https://github.com/wayfair/Graphite-Tattle
It monitors metrics against what was normal for the same time period 7, 14, 21 days (etc.) earlier, so a metric that changes throughout the day or hour but strays outside a defined number of standard deviations from that baseline will alert you.

Whoah! You guys at Etsy rock, again, as always. Anomaly detection and correlation – the essential part of monitoring that has always been missing. Can’t wait to try this out!

[…] Introducing Loupe — Etsy’s monitoring stack. It consists of two parts: Skyline and Oculus. We first use Skyline to detect anomalous metrics. Then, we search for that metric in Oculus, to see if any other metrics look similar. At that point, we can make an informed diagnosis and hopefully fix the problem. […]

Neat! Like hash tagging for metrics! Do you have a similar tool for finding and tagging metric definition anomalies?

    For now, all algorithms get applied to all metrics. That could change if people find that certain metrics require significantly different algorithmic treatment, but one of the goals of Skyline was minimal parameter configuration for algorithms across all metrics.

can skyline pull data from a db?

    It cannot, currently. You could write yourself a db streamer that queries the database and sends the results over the wire to the Horizon service, though.
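
    A rough sketch of what such a streamer might look like, assuming Horizon’s msgpack-over-UDP listener on its default port (check settings.py in your Skyline install – the port, message shape, database, and schema below are all assumptions):

```python
# Hypothetical db streamer for Skyline's Horizon service. The UDP port and
# the [metric, [timestamp, value]] message shape are assumptions - verify
# them against settings.py and the Horizon listener in your install.
import socket
import sqlite3
import msgpack

HORIZON_HOST, HORIZON_UDP_PORT = 'localhost', 2025  # assumed default

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
db = sqlite3.connect('metrics.db')  # made-up database and schema

for name, ts, value in db.execute('SELECT name, ts, value FROM samples'):
    sock.sendto(msgpack.packb([name, [int(ts), float(value)]]),
                (HORIZON_HOST, HORIZON_UDP_PORT))
```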

One thing I didn’t really hear mentioned is alerting and manual threshold settings for anomaly detection. Often that is useful, especially for detecting a queue exceeding a specific watermark, issuing a warning before a rate limit threshold is exceeded, or detecting the very first error condition in a critical system. Do these projects cover that aspect? There has yet to be a solid Graphite alerting tool that I have found.

    At the moment, Skyline doesn’t alert – our algorithms err on the side of noise and so alerting would be very noisy. It would not be hard to build manual thresholds into it, but if we did, we would have effectively re-written Nagios. You can use Nagios to check for thresholds in Graphite, as well – http://obfuscurity.com/2012/05/Polling-Graphite-with-Nagios.

    I can’t remember who said it, but I recall the quote “all monitoring software evolves towards becoming an implementation of Nagios.” Of course, there’s also Lett’s Law: all software evolves until it can send email – so alerting will most likely happen eventually :)

Love the tools, regardless tho! Keep up the great work, statsd is so simple and awesome!

Awesome tool guys !!

Looks nifty!

However it does look like Skyline will be pretty noisy — is that an alarm from a single anomalous data point?

My experience has been that automated detection of metric deviation from norms is prone to false positives; certain metrics can work well in this situation but the majority seem to deviate from norms pretty frequently (meaning that “norms” are complex to define).

Even Amazon’s order drop alarm tended to misfire frequently, and it was a carefully tended single alarm, e.g.: https://twitter.com/vijayravindran/status/252954520148140032

I’d love to hear how you get on with this…

[…] techniques. For example, techniques such as the three-sigma rule or the Grubbs score (check out kale, the most excellent tool introduced by @abestanway at Etsy) are only meaningful if your data has a […]

I just started sending 21000 distinct metrics to skyline. Now I’m just sitting back and waiting for the anomalies to be detected…

[…] warning systems get trickier to implement when one is tracking lots of metrics simultaneously. Etsy’s Kale stack targets the specific problem of monitoring lots of interdependent time-series. It includes tools […]

[…] Traditionally alerting has been done on current values, but anomaly detection and forecasting are becoming a reality thanks to some work done at Etsy. […]