Nagios, Sleep Data, and You

Posted by on September 28, 2013

Gettin’ Shuteye

Ian Malpass once commented that “[i]f Engineering at Etsy has a religion, it’s the Church of Graphs.”  And I believe!  Before I lay me down to sleep during an on-call shift, I say a little prayer that should something break, there’s a graph somewhere I can reference.  Lately, a few of us in Operations have begun tracking our sleep data via Jawbone UPs.  After a few months of this we got to wondering how this information could be useful, in the context of Operations.  Sleep is important.  And being on call can lead to interrupted sleep.  Even worse, after being woken up, the amount of time it takes to return to sleep varies by person and situation.  So, we thought, “why not graph the effect of being on call against our sleep data?”

Gathering and Visualizing Data

We already visualize code deploys against the myriad graphs we generate, to lend context to whatever we’re measuring.  We use Nagios to alert us to system and service issues.  Since Nagios writes consistent entries to a log file, it was a simple matter to write a Logster parser to ship metrics to Graphite when a host or service event pages out to an operations engineer.  Those data points can then be displayed as “deploy lines” against our sleep data.
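As a rough illustration of the parsing step, here is a minimal, standalone Python sketch. It is not our actual Logster parser and it does not use Logster's classes. It only shows the core idea: match Nagios notification entries, keep the ones sent to a pager contact, and emit lines in Graphite's plaintext format (`path value timestamp`). The contact name `ops-pager`, the metric prefix `nagios.pages`, and the function names are assumptions made for this example.

```python
import re
from collections import Counter

# Nagios writes notification entries in this shape:
#   [epoch] SERVICE NOTIFICATION: contact;host;service;STATE;command;output
#   [epoch] HOST NOTIFICATION: contact;host;STATE;command;output
NOTIFICATION = re.compile(
    r"^\[(?P<ts>\d+)\] (?P<kind>HOST|SERVICE) NOTIFICATION: (?P<fields>.*)$"
)


def parse_line(line):
    """Return (timestamp, kind, contact, state) for a notification, else None."""
    match = NOTIFICATION.match(line.strip())
    if match is None:
        return None
    kind = match.group("kind")
    fields = match.group("fields").split(";")
    # HOST entries carry the state in the third field, SERVICE in the fourth.
    state_index = 2 if kind == "HOST" else 3
    if len(fields) <= state_index:
        return None
    # Keep the state safe to use as part of a Graphite metric path.
    state = re.sub(r"[^a-z0-9]+", "_", fields[state_index].lower()).strip("_")
    return int(match.group("ts")), kind.lower(), fields[0], state


def page_metrics(lines, pager_contact, prefix="nagios.pages"):
    """Count notifications sent to pager_contact; return Graphite plaintext lines."""
    counts = Counter()
    latest = 0
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, kind, contact, state = parsed
        if contact != pager_contact:
            continue  # hypothetical filter: only count entries that page a human
        counts[(kind, state)] += 1
        latest = max(latest, ts)
    return [
        f"{prefix}.{kind}.{state} {count} {latest}"
        for (kind, state), count in sorted(counts.items())
    ]
```

Each returned line can be written to a Carbon plaintext listener, and the resulting series can then be drawn as vertical lines over any other graph.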

For the sleep data, we used and extended Aaron Parecki’s ‘jawbone-up’ gem, gathering summary and detail information on a daily basis via Jon Cowie’s handy ‘jawboneup_to_graphite’ script.  Those data are then displayed on personal dashboards (using Etsy’s Dashboard project).
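The real work here is done by the Ruby gem and script named above. Purely to illustrate the shape of the last step, here is a small Python sketch of turning sleep samples into Graphite plaintext lines and sending them to a Carbon listener. The `(timestamp, depth)` sample shape, the metric prefix `jawbone.sleep`, and the host `graphite.example.com` are all assumptions for this example, not details of the Jawbone API or of the script.

```python
import socket


def sleep_metric_lines(samples, user, prefix="jawbone.sleep"):
    """Format (epoch_seconds, depth) samples as Graphite plaintext lines.

    The sample shape is hypothetical; real sleep detail data may differ.
    """
    return [f"{prefix}.{user}.depth {depth} {int(ts)}" for ts, depth in samples]


def send_to_graphite(lines, host="graphite.example.com", port=2003):
    """Send metric lines to a Carbon plaintext listener (2003 is the usual port)."""
    payload = "\n".join(lines) + "\n"
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(payload.encode("ascii"))
```

Because each line carries its own timestamp, a once-a-day run can backfill the whole previous night in one pass.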

Results

So far, we’ve only just begun to collect and display this information.  As we learn more, we’ll be certain to share our findings.  In the meantime, here are examples from recent on-call shifts.

[Image: nagios_deploy_lines_prototype]

This engineer appeared to get some sleep!

[Image: laurie_on_call_sleep_detail]

Here, the engineer was alerted to a service in the critical state in the wee hours of the morning. From this graph we can tell that he was able to address the issue fairly quickly and, most importantly, get back to sleep fast.

NOTE:  Jawbone recently opened up their API.  Join the party and help build awesome apps and tooling around this device!

Category: monitoring, operations

15 Comments

I remember you talking about this at a recent lopsa-east meeting. Am curious to see where you’ll take this data. Maybe you can correlate it with response time or the like?

Though that said, 0200 is not the wee hours of the morning… it’s bedtime. 🙂

    With respect to response time, we do send metrics to Graphite when an ACK is received. However, tracking that is not the focus of this project.

I’m a pretty open person, but it would feel a little weird sharing my sleeping habits with my workplace hahaha

    Granted. We have a very high level of trust on the Ops team at Etsy. I’m willing to share a little information about myself, in return for being able to learn more about the impact to sleep from being on call.

Your graphs should also include the color of your engineer’s hair.

Seriously though, do you really need a graph to know the impact of being on call?

We all know it sucks, so why doesn’t Etsy hire overseas engineers to handle these events while they’re awake, and you’re sleeping?

    Do we need to graph it to know that it sucks? Definitely not. However, that is merely a qualitative observation. We want to quantify the effect it has. For example, if we learn which alerts are more severe, take more time to address and resolve, and therefore result in a longer time to get back to sleep for the engineer, we might decide to focus on improving relevant monitoring and tooling to handle those issues. Finally, hiring engineers just to deal with the fact that being on call sucks only moves the issue elsewhere. Our hope is that the more data we gather about being on call, the better we can make the experience.

Have you considered also how to track the impact to one’s day job when on call? How much productivity is lost from a rough on-call night?

Some interesting metrics to understand would be:
– Do schedules get missed?
– Does that person have to work more after their on-call is over to catch up on work?
– Are more bugs created when someone is on call?
– How is morale affected? Both from the burden of being on call and from when an on-call person is tired and grumpy (yes, it happens to all of us, though we know we shouldn’t act that way) and doesn’t play nice with co-workers?

I love the idea of the people best suited to fix the root problems being on call and responding to the problems, but I also wonder if that affects their ability to actually make the fix. Do you end up moving to the “classic” model of a 24×7 NOC-type group handling the low-hanging fruit, or can you avoid that path?

Out of curiosity, what is the on-call rotation? How often is someone on-call and for how long?

    These are all excellent questions. And I don’t have any of the answers, especially quantified! In short, the easy answer is “it depends”. Some on-call rotations are lighter than others, in terms of alert volume and severity; others are heavier. Some people need more or less sleep than others, so they are more or less affected by sleep disruption.

    Morale can definitely be affected while being on call. The Jawbone offers wearers emoticons that can be applied to sleep events; if we used that more consistently, I’m sure we could compare on-call rotations with mood.

    At Etsy, Operations members are on call for a week at a time and each team member goes on call about every 2 months.

This research has been conducted in other (more life-critical) professions. You may find this guide from U.S. Army Aeromedical Research Lab useful in informing your hypotheses:

“Leader’s Guide to Crew Endurance”
http://oai.dtic.mil/oai/oai?verb=getRecord&metadataPrefix=html&identifier=ADA433702
