Introducing nagios-herald

Posted by on June 6, 2014

Alert Design

Alert design is not a solved problem. And it interests me greatly.

What makes for a good alert? Which information is most relevant when a host or service is unavailable? While the answer to those, and other, questions depends on a number of factors (including what the check is monitoring, which systems and services are deemed critical, what defines good performance, etc.), at a minimum, alerts should contain some amount of appropriate context to aid an on-call engineer in diagnosing and resolving an alerting event.

When writing Nagios checks, I ask the following questions to help suss out what may be appropriate context:

On the last point, about automating work, I believe that computers can, and should, do as much work as possible for us before they have to wake us up. To that end, I’m excited to release nagios-herald today!

nagios-herald: Rub Some Context on It

nagios-herald was created from a desire to supplement an on-call engineer’s awareness of conditions surrounding a notifying event. In other words, if a computer is going to page me at 3AM, I expect it to do some work for me to help me understand what’s failing. At its core, nagios-herald is a Nagios notification script. The power, however, lies in its ability to add context to Nagios alerts via formatters.

One of the best examples of nagios-herald in action is comparing the difference between disk space alerts with and without context.

Disk Space Alert

disk_space_no_context

I’ve got a vague idea of which volume is problematic but I’d love to know more. For example, did disk space suddenly increase? Or did it grow gradually, only tipping the threshold as my head hit the pillow?

Disk Space Alert, Now *with* Context!

disk_space_with_context

In the example alert above, a stack bar clearly illustrates which volume the alert has fired on. It includes a Ganglia graph showing the gradual increase in disk storage over the last 24 hours. And the output of the df command is highlighted, helping me understand which threshold this check exceeded.

For more examples of nagios-herald adding context, see the example alerts page in the GitHub repo.

“I Have Great Ideas for Formatters!”

I’m willing to bet that at some point, you looked at a Nagios alert and thought to yourself, “Gee, I bet this would be more useful if it had a little more information in it…”  Guess what? Now it can! Clone the nagios-herald repo, write your own custom formatters, and configure nagios-herald to load them.

I look forward to feedback from the community and pull requests!

Ryan tweets at @Ryan_Frantz and blogs at ryanfrantz.com.

Posted by on June 6, 2014
Category: engineering, infrastructure, monitoring, operations, people

3 Comments

looks great!

Holy crap you just made nagios actually useful.

[…] StatsD, Kale and nagios-herald, Etsy has long established itself as the most prolific open-source contributor to the monitoring […]