Q2 2014 Site Performance Report

Posted on August 1, 2014

As the summer really starts to heat up, it’s time to update you on how our site performed in Q2. The methodology for this report is identical to the Q1 report. Overall it’s a mixed bag: some pages are faster and some are slower. We have context for all of it, so let’s take a look.

Server Side Performance

Here are the median and 95th percentile load times for signed in users on our core pages on Wednesday, July 16th:

Server Side Performance

A few things stand out in this data:

Median homepage performance improved, but the 95th percentile got slower. This is due to a specific variant we are testing which is slower than the current page. We made some code changes that improved load time for the majority of pageviews, but the one slower test variant brings up the higher percentiles.

The listing page saw a fairly large increase in both median and 95th percentile load time. There isn’t a single smoking gun for this, but rather a number of small changes over the last three months that each added a little to the page’s load time.

Search saw a significant decrease across the board. This is due to a dedicated memcached cluster that we rolled out to cache “listing cards” on the search results page. This brought our cache hit rate for listing related data up to 100%, since we automatically refresh the cache when the data changes. This was a nice win that will be sustainable over the long term.

The shop page saw a big jump at the 95th percentile. This is again due to experiments we are running on this page. A few of the variants we are testing for a shop redesign are slower than the existing page, which has a big impact on the higher percentiles. It remains to be seen which of these variants will win, and which version of the page we will end up with.

Overall we saw more increases than decreases on the backend, but we had a couple of performance wins from code/architecture changes, which is always nice to see. Looking ahead, we are planning on replacing the hardware in our memcached cluster in the next couple of months, and tests show that this should have a positive performance impact across the entire site.

Synthetic Front-end Performance

As a reminder, these tests are run with Catchpoint. They use IE9, and they run from New York, London, Chicago, Seattle, and Miami every two hours. The “Webpage Response” metric is defined as the time it took from the request being issued to receiving the last byte of the final element on the page. Here is that data:

Synthetic Performance

The render start metrics are pretty much the same across the board, with a couple of small decreases that aren’t really worth calling out due to rounding error and network variability. The “webpage response” numbers, on the other hand, are up significantly across the board. This is easily explained: we recently rolled out full site TLS, and changed our synthetic tests to hit https URLs. The added TLS negotiation time for all assets on the page bumped up the overall page load time everywhere. One thing we noticed with this change is that due to most browsers making six TCP connections per domain, we pay this TLS negotiation cost many times per page. We are actively investigating SPDY with the goal of sending all of our assets over one connection and only doing this negotiation once.

Real User Front-end Performance

As always, these numbers come from mPulse, and are measured via JavaScript running in real users’ browsers:

RUM Data

One change here is that we are showing an extra significant figure in the RUM data. We increased the number of beacons that we send to mPulse, and our error margin dropped to 0.00 seconds, so we feel confident showing 10ms resolution. We see the expected drop in search load time because of the backend improvement, and everything else is pretty much neutral. The homepage ticked up slightly, which is expected due to the experiment that I mentioned in the server side load time section.

One obvious question is: “Why did the synthetic numbers change so much while the RUM data is pretty much neutral?”. Remember, the synthetic numbers changed primarily because of a change to the tests themselves. The switch to https caused a step change in our synthetic monitoring, but for real users the rollout was gradual. In addition, real users that see more than one page have some site resources in their browser cache, mitigating some of the extra TLS negotiations. Our synthetic tests always operate with an empty cache, which is a bit unrealistic. This is one of the reasons why we have both synthetic and RUM metrics: if one of them looks a little wonky we can verify the difference with other data. Here’s a brief comparison of the two, showing where each one excels:

Synthetic Monitoring                                      | Real User Monitoring
----------------------------------------------------------|----------------------------------------------------------
Browser instrumentation                                   | Navigation Timing API
Consistent trending over time                             | Can be highly variable as browsers and networks change
Largely in your control                                   | Last mile difficulties
Great for identifying regressions                         | Great for comparing across geographies/browsers
Not super realistic from an absolute number point of view | “Real User Monitoring”
A/B tests can show outsized results due to empty caches   | A/B tests will show the real world impact


This report had some highs and some lows, but at the end of the day our RUM data shows that our members are getting roughly the same experience they were a few months ago performance wise, with faster search results pages. We’re optimistic that upgrading our memcached cluster will put a dent in our backend numbers for the next report, and hopefully some of our experiments will have concluded with positive results as well. Look for another report from us as the holiday season kicks off!


Just Culture resources

Posted on July 18, 2014

This is a (very) incomplete list of resources that may be useful to those wanting to find out more about the ideas behind Just Culture, blamelessness, complex systems, and related topics. It was created to support my DevOps Days Minneapolis talk Fallible Humans.

Human error and sources of failure

Just Culture

Postmortems and blameless reviews

Complex systems and complex system failure


Calendar Hacks

Posted on July 15, 2014

As an engineering manager, there’s one major realization you have: managers go to lots of meetings. After chatting with a bunch of fellow engineering managers at Etsy, I realized that people have awesome hacks for managing their calendars and time. Here are some of the best ones from a recent poll of Etsy engineering managers! We’ll cover tips on how to:

To access any of the Labs settings/apps:


Block out time

Create big unscheduled blocks every week; this allows for flexibility in your schedule. Some managers block out three afternoons a week as “office hours — don’t schedule unless you’re on my team”, which creates uninterrupted time when they’re *not* in meetings and are available on IRC. Others book time to work on specific projects and mark the event as private, and they’ll try to strategize the blocking to prevent calendar fragmentation.

Automatically decline events (Labs): Lets you block off times in your calendar when you are unavailable. Invitations sent for any events during this period will be automatically declined. After you enable this feature, you’ll find a “Busy (decline invitations)” option in the “Show me as” field.

Office Hours: Blocks for office hours allow you to easily say, “yes, I want to talk to you, but can we schedule it for a time I’ve already set aside?” Better than saying “No, I cannot talk to you, my calendar is too full.” (Which also has to happen from time to time.) When you create a new event on your calendar, choose the “Appointment slots” link in the top of the pop-up:


Then follow the instructions to create a bookable slot on your calendar. You’ll be able to share a link to a calendar with your bookable time slots:


Declining meetings: Decline meetings with a note. Unless you work with the organizer often, do this automatically if there’s no context for the meeting. One manager has declined 1 hour meetings, apologized for missing them, asked for a follow up, and found that he really wasn’t needed in the first place. Hours saved!

Change your defaults

Shorter meetings: Don’t end meetings on the hour; use 25/50 minute blocks. You can set this in Calendars>My Calendars>Settings>General>Default Meeting Length. If you set your calendar invites to 25 minutes or 55 minutes, you need to assert that socially at the beginning of the meeting, and then explicitly do a time check at 20 minutes (“we have 5 minutes left in this meeting”). If possible, start with the 25 (or 30) minute defaults rather than the hour-long ones.


Visible vs private: Some make their calendars visible rather than private by default. This lets other people see what they’re doing, so they have a sense of whether they can schedule against something or not (a team meeting: maybe; the Mad Men discussion group: no way).

Custom view: Create a custom 5- or 7-day view or use Week view.


Color coding:

Gentle Reminders (Labs): This feature replaces Calendar’s pop-ups: when you get a reminder, the title of the Google Calendar window or tab will happily blink in the background and you will hear a pleasant sound.

Event dimming: dim past events, and/or recurring future events, so you can focus.


Editability: Make all meetings modifiable, include this in the invite description to avoid emails about rescheduling (“you have modifying rights — reschedule as needed, my schedule is updated”), and ask your reports and your manager to do the same when they send invites. This can result in fewer emails back and forth to reschedule.

Rely on apps and automation

Sunrise: Sunrise for iOS does the right thing and doesn’t show declined events, which weaponizes the auto-decline block.

World Clock (Labs) helps you keep track of the time around the world. Plus: when you click an event, you’ll see the start time in each time zone as well. You could alternatively add an additional timezone (settings -> general -> Your Time Zone -> Additional Timezone).

Free or busy (Labs): See which of your friends are free or busy right now (requires that friends share their Google Calendars with you).

Next meeting (Labs): See what’s coming up next in your calendar.

Soft timer: Set a timer on your phone by default at most meetings, group or 1:1, and tell folks you’re setting an alarm to notify everyone when you have 5 minutes left. People love this, especially in intense 1:1s, because they don’t have to worry about the time. It can go really well as a reminder to end in “Speedy Meeting” time.

Do routine cleanup

Calendar review first thing in the morning. Review and clean up next three days. Always delete the “broadcast” invites you’re not going to attend.

Block off some empty timeslots first thing in the morning to make sure you have some breaks during the day—using Busy / Auto Decline technology.

People don’t seem to notice all-day events at the top of your calendar. If you’re going offsite, book an event from 9am-7pm that notes your location.

Think ahead (long-term)

Book recurring reminders: For instance, do a team comp review every three months. Or if there is a candidate we lost out on, make appointments to follow up with them in a few months.

Limit Monday recurring meetings: Holiday weekends always screw you up when you have to reschedule all of those for later in the week.

Track high-energy hours: “I tracked my high energy hours against my bright-mind-needed tasks and lo-and-behold, realized that mornings is when I need a lot of time to do low-urgency/high importance things that otherwise I wasn’t making time for or I was doing in a harried state. I thus time-blocked 3 mornings a week where from 8am to 11am I am working on these things (planning ahead, staring at the wall thinking through a system issue, etc). It requires a new 10pm to 6am sleeping habit, but it has been fully worth it, I feel like I gained a day in my way. This means I no longer work past 6pm most days, which was actually what was most draining to me.”

Create separate calendars

Have a shared calendar for the team for PTO tracking, bootcampers, time out of the office, team standups etc.

Subscribe to the Holidays in the United States Calendar so as to not be surprised by holidays: Calendar ID: en.usa#holiday@group.v.calendar.google.com

Subscribe to the conference room calendars if that’s something your organization has.

Create a secondary calendar for non-critical events, so they stay visible but don’t block your calendar. If there’s an event you are interested in, but haven’t decided on going or not, and don’t want other people to feel the need to schedule around it, you can go to the event and copy it, then remove it from your primary calendar. You can toggle the secondary calendar off via the side panel, and if someone needs to set something up with you, you’ll be available.

When using multiple calendars, there may be events that are important to have on multiple calendars, but this takes up a lot of screen real estate. In these cases, we use Etsy engineer Amy Ciavolino’s Event Merge for Google Calendar Chrome extension. It makes it easy to see those events without them taking up precious screen space.

And, for a little break from the office, be sure to check out Etsy analyst Hilary Parker’s Sunsets in Google Calendar (using R!).


Threat Modeling for Marketing Campaigns

Posted on July 7, 2014

Marketing campaigns and referrals programs can reliably drive growth. But whenever you offer something of value on the Internet, some people will take advantage of your campaign. How can you balance driving growth with preventing financial loss from unchecked fraud?

In this blog post, I want to share what we learned about how to discourage and respond to fraud when building marketing campaigns for e-commerce websites. I personally found there wasn’t a plethora of resources in the security community focused on the specific challenges faced by marketing campaigns, and that motivated me to try and provide more information on the topic for others — hopefully you find this helpful!


Since our experience came from developing a referrals program at Etsy, I want to describe our program and how we created terms of service to discourage fraud. Then rather than getting into very specific details about what we do to respond in real-time to fraud, I want to outline useful questions for anyone building these kinds of programs to ask themselves.

Our Referrals Program

We encouraged people to invite their friends to shop on the site. When the friend created an account on Etsy, we gave them $5.00 to spend on their first purchase. We also wanted to reward the person who invited them, as a thank-you gift.

Invite Your Friends - Recipient Promo

Of course, we knew that any program that gives out free money is an attractive one to try and hack!

Invite Your Friends - Sender Promo

Discourage Bad Behavior from the Start

First, we wanted to make our program sustainable through proactive defenses. When we designed the program we tried to bake in rules to make the program less attractive to attackers. However, we didn’t want these rules to introduce roadblocks in the product that made the program less valuable from users’ perspectives, or financially unsustainable from a business perspective.

In the end, we decided on the following restrictions. We wanted there to be a minimum spend requirement for buyers to apply their discount to a purchase. We hoped that requiring buyers to put in some of their own money to get the promotion would attract more genuine shoppers and discourage fraud.

We also put limits on the maximum amount we’d pay out in rewards to the person inviting their friends (though we’re keeping an eye out for any particularly successful inviters, so we can congratulate them personally). And we are currently only disbursing rewards to inviters after two of their invited friends make a purchase.

Model All Possible Scenarios

A key principle of Etsy’s approach to security is that it’s important to model out all possible approaches of attack, notice the ones that are easiest to accomplish or result in the biggest payouts if successful, and work on making them more difficult, so it becomes less economical for fraudsters to try those attacks.

In constructing these plans, we first tried to differentiate the ways our program could be attacked and by what kind of users. Doing this, we quickly realized that we wanted to respond differently to users with a record of good behavior on the site from users who didn’t have a good history or who were redeeming many referrals, possibly with automated scripts. We also wanted to have the ability to adjust the boundaries of the two categories over time.

So the second half of our defense against fraud consisted of plans on how to monitor and react to suspected fraud as well as how to prevent attackers in the worst-case scenario from redeeming large amounts of money.

Steps to Develop a Mitigation Plan

I have to admit upfront that I’m being a little ambiguous about what we’ve actually implemented, but I don’t think that matters much, since each situation will differ in the particulars. That being said, here are the questions that guided the development of our program; they could guide your thinking too.


# 1. How can you determine whether two user accounts represent the same person in real life?

This question is really key. In order to detect fraud on a referrals program, you need to be able to tell if the same person is signing up for multiple accounts.

In our program, we kept track of many measures of similarity. One important kind of relatedness was what we called “invite relatedness.” A person looking to get multiple credits is likely to have generated multiple invites off of one original account. To check for this, and other cases, we had to keep a graph data structure of whether one user had invited another via a referral and do a breadth first search to determine related accounts.
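As an illustrative sketch (the function and data layout here are hypothetical, not Etsy’s actual implementation), the invite-relatedness check boils down to a plain breadth-first search over an adjacency map of invite edges:

```python
from collections import deque

def related_accounts(root, invite_graph):
    """Walk the invite graph breadth-first from `root`, returning every
    account reachable through invite edges. `invite_graph` maps an
    account to the set of accounts linked to it (edges stored in both
    directions when you want "invited or was invited by")."""
    seen = {root}
    queue = deque([root])
    while queue:
        user = queue.popleft()
        for neighbor in invite_graph.get(user, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen - {root}
```

For example, with `{"SarahJ28": {"a", "b"}, "a": {"c"}}`, the walk from `SarahJ28` finds `a`, `b`, and the second-degree account `c`.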

Another important type of connection between accounts is often called “fingerprint relatedness.” We keep track of fingerprints (unique identifying characteristics of an account) and user to fingerprint relationships, so we can look up what accounts share fingerprints. There are a lot of resources available about how to fingerprint accounts in the security community that I would highly recommend researching!

Here’s an example of a very simple invite graph of usernames, colored by fingerprint relatedness. As you can see, the root user SarahJ28 might have invited three people, but two of them are related to her via other identifying characteristics.

Simple Invite Graph


# 2. At what point in time are all the different signals of identity discussed in the previous question available to you?

You don’t know everything about a user from the instant they land on your site. At that point in time, you might only have a little bit of information about their IP and browser.  You start to learn a bit more about them based on their email if they sign up for an account, and you certainly have more substantive information for analysis when they perform other actions on the site, like making purchases.

Generally, our ability to detect identity gets stronger the more engaged someone is with the site, or the closer they move towards making a purchase. However, if someone has a credit on their account that you don’t want them to have, you need to identify them before they complete their purchase. The level of control you have over the process of purchasing will depend on how you process credit card transactions on your site.


# 3. What are the different actions you could take on user accounts if you discover that one user has invited many related accounts to a referrals program?

There’s generally a range of different actions that can be taken against user accounts on a site. Actions that were relevant to us included: banning user accounts, taking away currently earned promotional credits, blocking promotional credits from being applied to a purchase transaction, and taking away the ability to refer more people using our program.

 Signals of identity become stronger with more site engagement


# 4. Do any actions need to be reviewed by a person or can they be automatic? What’s the cost of doing a manual review? On the other hand, how would a user feel if an automated action was taken against their account based on a false positive?

We knew that in some cases we would feel comfortable programming automated consequences to user accounts, while in other cases we wanted manual review of the suspected account first. It was really helpful for us to work with the teams who would be reviewing the suspected accounts on this from the beginning. They had seen lots of types of fraud in the past and helped us calibrate our expectations around false positives from each type of signal.

Luckily for us at Etsy, it’s quite easy to code the creation of tasks for our support team to review from any part of our web stack. I highly recommend architecting this ability if you don’t already have it for your site because it’s useful in many situations besides this one. Of course, we had to be very mindful that the real cost would come from continually reviewing the tasks over time.


# 5. How can you combine what you know about identity and user behavior (at each point in time) with your range of possible actions to come up with checks that you feel comfortable with? Do these checks need to block the action a user is trying to take?

We talked about each of these points where we evaluated a user’s identity and past behavior as a “check” that could either implement one of these actions I described or create a task for a person to review.

This meant we also had to decide whether the check needed to block the user from completing an action, like getting a promotional credit on their account, or applying a promotional credit to a transaction, and how that action was currently implemented on the site.

It’s important to note that if the user is trying to accomplish something in the scope of a single web request, there is a tradeoff between how thoroughly you can investigate someone’s identity and how quickly you can return a response to the user. After all, there are over 30 million user accounts on Etsy and we could potentially need to compare fingerprints across all of them. To solve this problem, we had to figure out how to kick off asynchronously running jobs at key points (especially checkout) that could do the analysis offline, but would nevertheless provide the protection we wanted.
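One way to sketch that decoupling, using Python’s standard-library queue and a worker thread as a stand-in for a real background-job system (this is illustrative, not Etsy’s actual infrastructure):

```python
import queue
import threading

fraud_checks = queue.Queue()

def enqueue_fraud_check(user_id, order_id):
    """Called inside the checkout request: cheap and non-blocking."""
    fraud_checks.put((user_id, order_id))

def fraud_worker(results):
    """Runs offline, outside the request path, where the expensive
    fingerprint/graph comparison can take as long as it needs."""
    while True:
        item = fraud_checks.get()
        if item is None:          # sentinel: shut the worker down
            break
        user_id, order_id = item
        # ... expensive identity analysis would go here ...
        results.append((user_id, order_id, "reviewed"))
        fraud_checks.task_done()
```

The checkout request returns immediately after `enqueue_fraud_check`, and the worker flags anything suspicious after the fact.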


# 6. How visible are account statuses throughout your internal systems?

Once you’ve taken an automated or manual action against an account, is that clearly marked on that user account throughout your company’s internal tools? A last point of attack may be that someone writes in complaining their promo code didn’t work. If this happens, it’s important for the person answering that email to know the account has been flagged as under investigation or deemed fraudulent.


# 7.  Do you have reviews of your system scheduled for concrete dates in the future? Can you filter fraudulent accounts from your metrics when you want to?

If your referrals campaign doesn’t have a concrete end date, then it’s easy to forget about how it’s performing, not just in terms of meeting business goals but in terms of fraud. It’s important to have an easy way to filter out duplicate user accounts to calculate true growth, as well as how much of an effect unchecked fraud would have had on the program and how much was spent on what was deemed an acceptable, remaining risk. If we had discovered that too much was being spent on allowing good users to get away with a few duplicate referrals, we could have tightened our guidelines and started taking away the ability to send referrals from accounts more frequently.
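As a toy illustration of that kind of segmentation (the field names and flagging scheme are invented for the example), separating flagged signups from genuine ones gives you true growth, fraud rate, and the spend at risk in one pass:

```python
def program_metrics(signups):
    """Split referral signups into genuine growth vs. suspected fraud.
    `signups` is a list of dicts like
    {"user": "alice", "credit": 5.00, "flagged": False}."""
    genuine = [s for s in signups if not s["flagged"]]
    flagged = [s for s in signups if s["flagged"]]
    return {
        "true_growth": len(genuine),
        "fraud_rate": len(flagged) / len(signups) if signups else 0.0,
        "credit_at_risk": sum(s["credit"] for s in flagged),
    }
```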

We found that when we opened up tasks for manual review, the team member reviewing them marked them as accurate 75% of the time. This was pretty good relative to other types of tasks we review. We were also pretty generous in the end in trusting that multiple members of a household might be using the same credit card info.


Our project revealed that some fraud is catastrophic and should absolutely be prevented, while other types of fraud, like a certain level of duplicate redemptions in marketing campaigns, are less dangerous and require a gentler response or even a degree of tolerance.

We have found it useful to review what could happen, design the program with rules to discourage all kinds of fraud while keeping value to the user in mind, have automated checks as well as manual reviews, and monitor in a way that includes the ability to segment the performance of our program by fraud rate.

Many thanks to everyone at Etsy, but especially our referrals team, the risk and integrity teams, the payments team, and the security team for lots of awesome collaboration on this project!


Device Lab Checkout – RFID Style

Posted on June 24, 2014

You may remember reading about our mobile device lab here at Etsy. Ensuring the site works and looks proper on all devices has been a priority for all our teams. As the percentage of mobile device traffic to the site continues to increase (currently it’s more than 50% of our traffic), so does the number of devices we have in our lab. Since it is a device lab after all, we thought it was only appropriate to trick it out with a device-oriented check-out system: something to make it easy for designers and developers to get their hands on the device they need quickly and painlessly, and a way to keep track of who had what device. Devices are now checked in and out with just two taps, or “bumps”, to an RFID reader. And before things got too high tech, we hid all the components inside a custom-made woodland stump created by an amazing local Etsy seller, Trish Czech.

Etsy RFID Check in/out system

“Bump the Stump” to check in and out devices from our Lab

If you’re able to make it out to Velocity Santa Clara 2014, you’ll find a full presentation on how to build a device lab. However, I’m going to focus on just the RFID aspect of our system and some of the issues we faced.

RFID – What’s the frequency

The first step in converting from our old paper checkout system to an RFID-based one was deciding on the correct type of RFID to use. RFID tags come in a number of different and incompatible frequencies. In most corporate environments, if you are using RFID tags already, they will probably be either high frequency (13.56 MHz) or low frequency (125 kHz) tags. While we had a working prototype with high frequency NFC-based RFID tags, we switched to low frequency, since all admins already carry a low frequency RFID badge around with them. However, our badges are not compatible with a number of off-the-shelf RFID readers. Our solution was to basically take one of the readers off a door and wire it up to our new system.

You will find that most low frequency RFID readers transmit card data using the Wiegand protocol over their wires. This protocol uses two wires, commonly labeled “DATA0” and “DATA1”, to transmit the card data. The number of bits each card will transmit can vary depending on your RFID system, but let’s say you had an 11-bit card number which was “00100001000”. If you monitored the Data0 line, you would see it drop from a high signal to a low signal for 40 microseconds for each “0” bit in the card number. The same thing happens on the Data1 line for each “1” bit. Thus if you monitor both lines at the same time, you can read the card number.
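What monitoring both lines amounts to can be sketched in a few lines of Python (a hypothetical illustration; real reader code has to sample the GPIO lines itself). Each pulse on Data0 contributes a 0 and each pulse on Data1 contributes a 1, so the card number is just the time-ordered join of the pulses:

```python
def decode_wiegand(pulses):
    """Rebuild a card number from Wiegand pulses. `pulses` is the
    time-ordered list of which line dropped low for each bit:
    "D0" contributes a 0, "D1" contributes a 1."""
    return "".join("0" if line == "D0" else "1" for line in pulses)
```

For the 11-bit example above, the pulse sequence D0, D0, D1, D0, D0, D0, D0, D1, D0, D0, D0 decodes back to “00100001000”.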

Logic Analyzer of Wiegand data

Wiegand data on the wire

We knew we wanted a system that would be low powered and compact.  Thus we wired up our RFID reader to the GPIO pins of a Raspberry Pi. The Raspberry Pi was ideal for this given its small form factor, low power usage, GPIO pins, network connectivity and USB ports (we later ported this to a BeagleBone Black to take advantage of its on-board flash storage).  Besides having GPIO pins to read the Data1 and Data0 lines, the Raspberry Pi also has pins for supplying 3.3 volts and 5 volts of power. Using this, we powered the RFID reader with the 5 volt line directly from the Raspberry Pi. However, the GPIO pins for the Raspberry Pi are 3.3 volt lines, thus the 5 volt Data1 and Data0 lines from the RFID reader could damage them over time. To fix this issue, we used a logic level converter to step down the voltage before connecting Data0 and Data1 to the Raspberry Pi’s GPIO pins.

RFID, Line Converter, LCD, and Raspberry Pi wiring


A Need for Speed

After that, it is fairly easy to write some Python code to monitor for transitions on those pins using the RPi.GPIO module. This worked great for us in testing; however, we started to notice a number of incorrect RFID card numbers once we released it. The issue appeared to be that the Python code would miss a bit of data from one of the lines or record the transition after a slight delay. Considering a bit is only 40 microseconds long and can arrive every 2 milliseconds while a card number is being read, there’s not a lot of time to read a card. While some have reportedly used more hardware to get around this issue, we found that rewriting the GPIO monitoring code in C boosted our accuracy (using a logic analyzer, we confirmed the correct data was coming into the GPIO pins, so it was an issue somewhere after that). Gordon Henderson’s WiringPi made this easy to implement. We also added some logical error checking in the code so we could better inform the user if we detected a bad RFID tag read: getting the correct number of bits within a time window, and ensuring the most significant bits matched our known values. With Python we saw up to a 20% error rate in card reads; while it’s not perfect, getting a little closer to the hardware with C dropped this to less than 3% (of detectable errors).
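Those error checks can be sketched roughly like this (the 26-bit length and fixed prefix are illustrative values, not our actual badge format):

```python
def validate_read(bits, expected_len=26, known_prefix="0000000"):
    """Sanity-check a card read before trusting it: the bit count must
    match what one card transmits, and the most significant bits must
    match the values shared by all of our badges."""
    if len(bits) != expected_len:
        return False      # missed or duplicated pulses during the read
    if not bits.startswith(known_prefix):
        return False      # most significant bits don't match our badges
    return True
```

A read that fails either check gets rejected and the user is prompted to bump the reader again, rather than checking a device out to the wrong badge.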

Dealing with Anti-Metal

One other issue we ran into was RFID tags attached to devices with metallic cases, which can interfere with reading the tags. There are a number of manufacturers who supply high frequency NFC-based tags to deal with this; however, I’ve yet to find low frequency tags that have this support and come in a small form factor. Our solution is a little bit of a Frankenstein, but has worked well so far: we’ve been peeling the shield off an on-metal high frequency tag and attaching it to the back of one of our standard low frequency tags. Strong fingernails, a utility knife, and a little two-sided tape help with this process.

Removing anti-metal backing from NFC tags

Removing anti-metal backing from NFC tags to attach to Low Frequency tags

Client Code

We’ve posted the client-side code for this project on GitHub (https://github.com/etsy/rfid-checkout) along with a parts list and wiring diagram. This system checks devices in and out against our internal staff directory on the backend. Look to the README file for ways to set up your system to handle these calls. We hope those of you looking to build a device lab, or perhaps an RFID system for checking out yarn, will find this helpful.


Opsweekly: Measuring on-call experience with alert classification

Posted on June 19, 2014

The Pager Life

On-call is a tricky thing. It’s a necessary evil for employees in every tech company, and the responsibility can weigh down on you.

And yet, the most common thing you hear is “monitoring sucks”, “on-call sucks” and so on. At Etsy, we’ve at least come to accept Nagios for what it is, and make it work for us. But what about on-call? And what are we doing about it?

You may have heard that at Etsy, we’re kinda big into measuring everything. “If it moves, graph it; if it doesn’t move, graph it anyway” is something that many people have heard me mutter quietly to myself in dark rooms for years. And yet, for on-call we were ignoring that advice. We just joined up once a week as a team to have a meeting and talk about how crappy our on-call was, and how much sleep we lost. No quantification, no action items. Shouldn’t we be measuring this?

Introducing Opsweekly

And so came Opsweekly. Our operations team was growing, and we needed a good place to finally formalise weekly reports: what is everyone doing? At the same time, we needed to start getting some graphs behind our on-call experiences. So, we disappeared for a few days, came back with some PHP, and Opsweekly was born.

What does Opsweekly do?

In its simplest form, Opsweekly is a great place for your team to get together, share their weekly updates, and then organise a meeting to discuss those reports (if necessary) and take notes. But the real power comes from Opsweekly’s built-in on-call classification and optional sleep tracking.

Every week, your on-call engineer visits Opsweekly, hits a big red “I was on call” button, and Opsweekly pulls in the notifications that they received in the last week. These can come from whichever data source you desire: maybe your Nagios instance logs to Logstash or Splunk, or you use PagerDuty for alerting.


An example of an on-call report, mostly filled in

The engineer can make a simple decision from a drop down about what category the alert falls into.


The list of alert classifications the user can choose from

We were very careful when we designed this list to ensure that every alert type was catered for, while also minimising the number of choices the engineer had to decide between.

The most important part here is the overall category choice on the left:

Action Taken vs No Action Taken

One of the biggest complaints about on-call was the noise. What is the signal-to-noise ratio of your alert system? Well, now we’re measuring that using Opsweekly.


Percentage of Action vs No Action alerts over the last year

This is just one of the many graphs and reports that Opsweekly can generate from the data that was entered, but it illustrates a key point for us: we’ve been doing this for a year, and we are seeing a steadily improving signal-to-noise ratio. Measuring, and making changes based on those measurements, can work for your on-call too.

The value of historical context

So how does this magic happen? By having to make conscious choices about whether each alert was meaningful, we can start to make improvements based on that data: move alerts to email-only if they’re not urgent enough to be dealt with immediately or during the night.

If the threshold needs adjusting, this serves as a reminder to actually go and adjust the threshold; you’re unlikely to remember or want to do it when you’ve just woken up, or you’ve been context switched. It’s all about surfacing that information.

Alongside the categorisation is a “Notes” field for every alert. A quick line of text in each of these boxes provides invaluable data to other people later on (or maybe yourself!) to gain context about that alert.

Opsweekly has search built in that allows you to go back and inspect the alert time(s) that alert fired, gaining that knowledge of what each previous person did to resolve the alert before you.

Sleep Tracking

A few months in, we were inspired by an Ignite presentation at Velocity Santa Clara about measuring humans. We were taken aback… How was this something we didn’t have?

Once we realised we could have graphs of our activity and sleep, we managed to go a whole two days before we got to the airport for the flight home and started purchasing the increasingly common off-the-shelf personal monitoring devices.

Ryan Frantz wrote here about making that data available for all to share on our dashboards, using conveniently accessible APIs, and it wasn’t long before it clicked that we could easily query that data when processing on-call notifications to get juicy stats about how often people are woken up. And so we did:

Report in Opsweekly showing sleep data for an engineer’s on-call week

Personal Feedback

The final step is helping your humans understand how they can use this data to make their lives better. Opsweekly has that covered too: a personal report for each person.


Available on Github now

For more information on how you too can start to measure real data about your on-call experiences, read more and get Opsweekly now on GitHub.

Velocity Santa Clara 2014

If you’re attending Velocity in Santa Clara, CA next week, Ryan and I are giving a talk about our Nagios experiences and Opsweekly, entitled “Mean Time to Sleep: Quantifying the On-Call Experience”. Come and find us if you’re in town!


Ryan Frantz and Laurie Denness now know their co-workers’ sleeping patterns a little too well…

1 Comment

Conjecture: Scalable Machine Learning in Hadoop with Scalding

Posted by on June 18, 2014 / 6 Comments


Predictive machine learning models are an important tool for many aspects of e-commerce.  At Etsy, we use machine learning as a component in a diverse set of critical tasks. For instance, we use predictive machine learning models to estimate click rates of items so that we can present high quality and relevant items to potential buyers on the site.  This estimation is particularly important when used for ranking our cost-per-click search ads, a substantial source of revenue. In addition to contributing to on-site experiences, we use machine learning as a component of many internal tools, such as routing and prioritizing our internal support e-mail queue.  By automatically categorizing and estimating an “urgency” for inbound support e-mails, we can assign support requests to the appropriate personnel and ensure that urgent requests are handled by staff more rapidly, helping to ensure a good customer experience.

To quickly develop these types of predictive models while making use of our MapReduce cluster, we decided to construct our own machine learning framework, which we have open-sourced under the name “Conjecture.” It consists of three main parts:

  1. Java classes which define the machine learning models and data types.

  2. Scala methods which perform MapReduce training using Scalding.

  3. PHP classes which use the produced models to make predictions in real-time on the web site.

This article is intended to give a brief introduction to predictive modeling, an overview of Conjecture’s capabilities, as well as a preview of what we are currently developing.

Predictive Models

The main goal in constructing a predictive model is to make a function which maps an input into a prediction.  The input can be anything – from the text of an email, to the pixels of an image, or the list of users who interacted with an item.  The predictions we produce currently are of two types: either a real value (such as a click rate) or a discrete label (such as the name of an inbox in which to direct an email).

The only prerequisite for constructing such a model is a source of training examples: pairs consisting of an input and its observed output. In the case of click rate prediction these can be constructed by examining historic click rates of items. For the classification of e-mails as urgent or not, we took the historic e-mails as the inputs, and as the outputs an indicator of whether the support staff had marked each email as urgent after reading its contents.


Feature Representation

Having gathered the training data, the next step is to convert it into a representation which Conjecture can understand. As is common in machine learning, we convert the raw input into a feature representation, which involves evaluating several “feature functions” of the input and constructing a feature vector from the results. For example, to classify emails the feature functions are things like: the indicator of whether the word “account” is in the email, whether the email is from a registered user or not, whether the email is a follow-up to an earlier email, and so on.

Additionally, for e-mail classification we also included subsequences of words which appeared in the email. For example the feature representation of an urgent email which was from a registered user, and had the word “time” in the subject, and the string “how long will it take to receive our item?” in the body may look like:

{
  "label": {"value": 1.0},
  "vector": {
    "subject___time": 1.0,
    "body___will::it::take": 1.0,
    "body___long::will": 1.0,
    "body___to::receive::our": 1.0,
    "is_registered___true": 1.0
  }
}
We make use of a sparse feature representation, which is a mapping of string feature names to double values.  We use a modified GNU trove hashmap to store this information while being memory efficient.

Many machine learning frameworks store features in a purely numerical format. While storing information as, for instance, an array of doubles is far more compact than our “string-keyed vectors”, model interpretation and introspection becomes much more difficult. The choice to store the names of features along with their numeric value allows us to easily inspect for any weirdness in our input, and quickly iterate on models by finding the causes of problematic predictions.
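As a rough sketch of this kind of feature extraction, assuming a simple whitespace tokenizer (the function name and exact feature-naming scheme are illustrative, not Conjecture’s actual API):

```python
def email_features(subject, body, is_registered):
    """Build a sparse, string-keyed feature vector from a raw email."""
    feats = {"is_registered___%s" % ("true" if is_registered else "false"): 1.0}
    for word in subject.lower().split():
        feats["subject___" + word] = 1.0
    words = body.lower().split()
    for n in (2, 3):  # word subsequences: bigrams and trigrams
        for i in range(len(words) - n + 1):
            feats["body___" + "::".join(words[i:i + n])] = 1.0
    return feats
```

Because every feature carries its human-readable name, a surprising weight in a trained model points directly back at the input that produced it.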

Model Estimation

Once a dataset has been assembled, we can estimate the optimal predictive model.  In essence we are trying to find a model that makes predictions which tend to agree with the observed outcomes in the training set.  The statistical theory surrounding “Empirical Risk Minimization” tells us that under some mild conditions, we may expect a similar level of accuracy from the model when applied to unseen data.

Conjecture, like many current libraries for performing scalable machine learning, leverages a family of techniques known as online learning. In online learning, the model processes the labeled training examples one at a time, making an update to the underlying prediction function after each observation. While the online learning paradigm isn’t compatible with every machine learning technique, it is a natural fit for several important classes of machine learning models such as logistic regression and large margin classification, both of which are implemented in Conjecture.

Since we often have datasets with millions of training examples, processing them sequentially on a single machine is infeasible; some form of parallelization is required. However, the traditional online learning framework does not directly lend itself to a parallel method for model training. Recent research into parallelized online learning in Hadoop gives us a way to perform sequential updates of many models in parallel across many machines, each separate process consuming a fraction of the total available training data. These “sub-models” are aggregated into a single model that can be used to make predictions or, optionally, fed back into another “round” of the process for further training. Theoretical results tell us that, when performed correctly, this process will result in a reliable predictive model similar to what would be generated had there been no parallelization. In practice, we find the models converge to an accurate state quickly, after a few iterations.


Conjecture provides an implementation of parallelized online optimization of machine learning models in Scalding, a Scala wrapper for Cascading, a package which plans workflows consisting of several MapReduce jobs. Scalding also abstracts over the key-value pairs required by MapReduce, and permits arbitrary n-ary tuples to be used as the data elements. In Conjecture, we train a separate model on each mapper using online updates, by making use of the map-side aggregation functionality of Cascading. Here, in each mapper process, data is grouped by common keys and a reduce function is applied; data is then sent across the network to reduce nodes which complete the aggregation. This map-side aggregation is conceptually the same as the combiners of MapReduce, though the work resides in the same process as the map task itself. The key to our distributed online machine learning algorithm is in the definition of appropriate reduce operations so that map-side aggregators will implement as much of the learning as possible. The alternative — shuffling the examples to reducer processes across the MapReduce cluster and performing the learning there — is slower as it requires much more communication.

During training, the mapper processes consume an incoming stream of training instances. The mappers take these examples and emit pairs consisting of the example and an “empty” (untrained) predictive model. This sequence of pairs is passed to the aggregator, where the actual online learning is performed. The aggregator implements a reduce function which takes two such pairs and produces a single pair with an updated model. Call the pairs a and b, each with a member model and a member called example. Pairs “rolled up” by this reduce process always contain a model, and an example which has not yet been used for training. When the reduce operation is consuming two models which both have some training, they are merged (e.g., by summing or averaging the parameter values); otherwise, we continue to train whichever model has some training already. Due to the order in which Cascading calls the reduce function in the aggregator, we end up building a single model on each machine; these are then shuffled to a single reducer where they are merged. Finally, we update the final model on the one labeled example which it has not yet seen. Note that this reduce function is not associative — the order in which the pairs are processed will affect the output model. However, the approach is robust in that it will produce a useful model irrespective of the ordering. The logic for the reduce function is:

train_reduce(a, b) = {

  // neither pair has a trained model:
  // update a model on one example,
  // emit that model and the other example
  if (a.model.isEmpty && b.model.isEmpty) {
    (b.model.train(a.example), b.example)
  }

  // one model is trained, the other isn't:
  // update the trained model, emit it
  // and the other example
  else if (!a.model.isEmpty && b.model.isEmpty) {
    (a.model.train(b.example), a.example)
  }

  // mirroring the second case
  else if (a.model.isEmpty && !b.model.isEmpty) {
    (b.model.train(a.example), b.example)
  }

  // both models are partially trained:
  // update one model, merge it with the
  // other, and emit with the
  // unobserved example
  else {
    (b.model.merge(a.model.train(a.example)), b.example)
  }
}
In Conjecture, we extend this basic approach of consuming one example at a time so that we can perform “mini-batch” training. This is a variation of online learning where a small set of training examples is used at once to perform a single update. This leads to more flexibility in the types of models we can train, and also lends better statistical properties to the resulting models (for example, by reducing variance in the gradient estimates in the case of logistic regression). What’s more, it comes at no increase in computational complexity over the standard training method. We implement mini-batch training by amending the reduce function to accumulate lists of examples, only performing the training when sufficiently many examples have been aggregated.
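For illustration, here is a minimal sketch of a mini-batch update for logistic regression over sparse, string-keyed feature vectors; it is a simplified stand-in for Conjecture’s Scala implementation, with hypothetical names:

```python
import math

def minibatch_update(weights, batch, learning_rate=0.1):
    """One mini-batch gradient step for logistic regression.

    weights and each example's features are sparse dicts mapping
    feature names to doubles; batch is a list of (label, features)
    pairs with labels in {0, 1}.
    """
    gradient = {}
    for label, features in batch:
        z = sum(weights.get(k, 0.0) * v for k, v in features.items())
        p = 1.0 / (1.0 + math.exp(-z))  # predicted probability
        for k, v in features.items():
            gradient[k] = gradient.get(k, 0.0) + (p - label) * v
    # Average the gradient over the batch, which reduces its variance
    # relative to single-example updates.
    for k, g in gradient.items():
        weights[k] = weights.get(k, 0.0) - learning_rate * g / len(batch)
    return weights
```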

Model Evaluation

We implement a parallelized cross validation so that we can get estimates of the performance of the output models on unseen inputs.  This involves splitting the data up into a few parts (called “folds” in the literature), training multiple models, each of which is not trained on one of the folds, and then testing each model on the fold which it didn’t see during training.  We consider several evaluation metrics for the models, such as accuracy, the area under the ROC curve, and so on.  Since this procedure yields one set of performance metrics for each fold, we take the appropriately weighted means of these, and the observed variance also gives an indication of the reliability of the estimate.  Namely if the accuracy of the classification is similar across all the folds then we may anticipate a similar level of accuracy on unobserved inputs.  On the other hand if there is a great discrepancy between the performance on different folds then it suggests that the mean will be an unreliable estimate of future performance, and possibly that either more data is needed or the model needs some more feature engineering.
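The fold construction itself can be sketched simply (a toy version; the real splitting and parallel training happen in Scalding):

```python
def make_folds(examples, k):
    """Split examples into k (train, test) pairs: example j is held out
    in fold j % k and used for training in every other fold."""
    return [
        ([x for j, x in enumerate(examples) if j % k != i],
         [x for j, x in enumerate(examples) if j % k == i])
        for i in range(k)
    ]
```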

Inevitably, building useful machine learning models requires iteration; things seldom work very well right out of the box. Often this iteration involves inspecting a model, developing an understanding as to why it operates the way it does, and then fixing any unusual or undesired behavior. Leveraging our detailed data representation, we have developed tools enabling the manual examination of the learned models. Such inspection should only be carried out for debugging purposes, or to ensure that features were correctly constructed, and not to draw conclusions about the data. It is impossible to draw valid conclusions about parameter values without knowing their associated covariance matrix — which we do not handle, since it is presumably too large.

Applying the Models

As part of the model training, we output a JSON-serialized version of the model. This allows us to load the models into any platform we’d like to use. In our case, we want to use our models on our web servers, which use PHP. To accomplish this, we deploy our model file (a single JSON string encoding the internal model data structures) to the servers where we instantiate it using PHP’s json_decode() function. We also provide utility functions in PHP to process model inputs into the same feature representations used in Java/Scala, ensuring that a model is correctly applied. An example of a json-encoded conjecture model is below:

{
  "argString": "--zero_class_prob 0.5 --model mira --out_dir contact_seller
                --date 2014_05_24 --folds 5 --iters 10
                --input contact_seller/instances
                --final_thresholding 0.001 --hdfs",
  "exponentialLearningRateBase": 1.0,
  "initialLearningRate": 0.1,
  "modelType": "MIRA",
  "param": {
    "freezeKeySet": false,
    "vector": {
      "__bias__": -0.16469815089457646,
      "subject___time": -0.01698080417481483,
      "body___will::it::take": 0.05834880357927012,
      "body___long::will": 0.0818060174986067991,
      "is_registered___true": -0.002215130480164454
    }
  }
}
Currently the JSON-serialized models are stored in a git repository and deployed to web servers via Deployinator (see the Code as Craft post about it here). This architecture was chosen so that we could use our established infrastructure for deploying files from git to the web servers to distribute our models. We can quickly prototype and iterate on our work, and revert to previous versions of a model at will. Our intention is to move to a system of automated nightly updates, rather than continue with the manually controlled Deployinator process.

Deployinator broadcasts models to all the servers running code that could reference Conjecture models, including the web hosts, and our cluster of gearman workers that perform asynchronous tasks, as well as utility boxes which are used to run cron jobs and ad hoc jobs. Having the models local to the code that’s referencing them avoids network overhead associated with storing models in databases; the process of reading and deserializing models, then making predictions is extremely fast.

Conclusions and Future Work

The initial release of Conjecture shares some of Etsy’s tooling for building classification and regression models at massive scale using Hadoop. This infrastructure is well-tested and practical, and we’d like to get it into the hands of the community as soon as possible. However, this release represents only a fraction of the machine learning tooling that we use to power many features across the site. Future releases of Conjecture will include tools for building cluster models and infrastructure for building recommender systems on implicit feedback data in Scalding. Finally, we will release “web code” written in PHP and other languages that can consume Conjecture models and make predictions efficiently in a live production environment.


Introducing nagios-herald

Posted by on June 6, 2014 / 3 Comments

Alert Design

Alert design is not a solved problem. And it interests me greatly.

What makes for a good alert? Which information is most relevant when a host or service is unavailable? While the answer to those, and other, questions depends on a number of factors (including what the check is monitoring, which systems and services are deemed critical, what defines good performance, etc.), at a minimum, alerts should contain some amount of appropriate context to aid an on-call engineer in diagnosing and resolving an alerting event.

When writing Nagios checks, I ask the following questions to help suss out what may be appropriate context:

On the last point, about automating work, I believe that computers can, and should, do as much work as possible for us before they have to wake us up. To that end, I’m excited to release nagios-herald today!

nagios-herald: Rub Some Context on It

nagios-herald was created from a desire to supplement an on-call engineer’s awareness of conditions surrounding a notifying event. In other words, if a computer is going to page me at 3AM, I expect it to do some work for me to help me understand what’s failing. At its core, nagios-herald is a Nagios notification script. The power, however, lies in its ability to add context to Nagios alerts via formatters.

One of the best examples of nagios-herald in action is comparing the difference between disk space alerts with and without context.

Disk Space Alert


I’ve got a vague idea of which volume is problematic but I’d love to know more. For example, did disk space suddenly increase? Or did it grow gradually, only tipping the threshold as my head hit the pillow?

Disk Space Alert, Now *with* Context!


In the example alert above, a stacked bar chart clearly illustrates which volume the alert fired on. The alert includes a Ganglia graph showing the gradual increase in disk usage over the last 24 hours. And the output of the df command is highlighted, helping me understand which threshold this check exceeded.

For more examples of nagios-herald adding context, see the example alerts page in the GitHub repo.

“I Have Great Ideas for Formatters!”

I’m willing to bet that at some point, you looked at a Nagios alert and thought to yourself, “Gee, I bet this would be more useful if it had a little more information in it…”  Guess what? Now it can! Clone the nagios-herald repo, write your own custom formatters, and configure nagios-herald to load them.

I look forward to feedback from the community and pull requests!

Ryan tweets at @Ryan_Frantz and blogs at ryanfrantz.com.


Q1 2014 Site Performance Report

Posted by on May 15, 2014 / 3 Comments

May flowers are blooming, and we’re bringing you the Q1 2014 Site Performance Report. There are two significant changes in this report: the synthetic numbers are from Catchpoint instead of WebPagetest, and we’re going to start labeling our reports by quarter instead of by month going forward.

The backend numbers for this report follow the trend from December 2013 – performance is slightly up across the board. The front-end numbers are slightly up as well, primarily due to experiments and redesigns. Let’s dive into the data!

Server Side Performance

Here are the median and 95th percentile load times for signed in users on our core pages on Wednesday, April 23rd:

Server Side Performance

There was a small increase in both median and 95th percentile load times over the last three months across the board, with a larger jump on the homepage. We are currently running a few experiments on the homepage, one of which is significantly slower than other variants, which is bringing up the 95th percentile. While we understand that this may skew test results, we want to get preliminary results from the experiment before we spend engineering effort on optimizing this variant.

As for the small increases everywhere else, this has been a pattern over the last six months, and is largely due to new features adding a few milliseconds here and there, increased usage from other countries (translating the site has a performance cost), and overall added load on our infrastructure.  We expect to see a slow increase in load time for some period of time, followed by a significant dip as we upgrade or revamp pieces of our infrastructure that are suffering. As long as the increases aren’t massive this is a healthy oscillation, and optimizes for time spent on engineering tasks.

Synthetic Front-end Performance

Because of some implementation details with our private WebPagetest instance, the data we have for Q1 isn’t consistent and clean enough to provide a true comparison between the last report and this one.  The good news is that we also use Catchpoint to collect synthetic data, and we have data going back to well before the last report.  This enabled us to pull the data from mid-December and compare it to data from April, on the same days that we pulled the server side and RUM data.

Our Catchpoint tests are run with IE9 only, and they run from New York, London, Chicago, Seattle, and Miami every two hours.  The “Webpage Response” metric is defined as the time it took from the request being issued to receiving the last byte of the final element on the page.  Here is that data:

Synthetic Performance - Catchpoint

The increase on the homepage is somewhat expected due to the experiments we are running and the increase in the backend time. The search page also saw a large increase in both Start Render and Webpage Response, but we are currently testing a completely revamped search results page, so this is also expected. The listing page had a modest jump in start render time, and again this is due to differences in the experiments that were running in December vs. April.

Real User Front-end Performance

As always, these numbers come from mPulse, and are measured via JavaScript running in real users’ browsers:

Real User Monitoring

No big surprises here: we see the same bump on the homepage and the search results page that we saw in the server side and synthetic numbers. Everything else is essentially neutral, and isn’t particularly exciting. In future reports we are going to consider breaking this data out by region, by mobile vs. desktop, or perhaps providing other percentiles beyond the median (which is the 50th percentile).


We are definitely in a stage of increasing backend load time and front-end volatility due to experiments and general feature development. The performance team has been spending the past few months focusing on some internal tools that we hope to open source soon, as well as running a number of experiments ourselves to try to find some large perf wins. You will be hearing more about these efforts in the coming months, and hopefully some of them will influence future performance reports!

Finally, our whole team will be at Velocity Santa Clara this coming June, and Lara, Seth, and I are all giving talks.  Feel free to stop us in the hallways and say hi!


Web Experimentation with New Visitors

Posted by on April 3, 2014 / 4 Comments

We strive to build Etsy with science, and therefore love how web experimentation and A/B testing help us drive our product development process. Several months ago we started a series of web experiments in order to improve Etsy’s homepage experience for first-time visitors. Testing against a specific population, like first-time visitors, allowed us to find issues and improve our variants without raising concerns in our community. This is how the page used to look for new visitors:


We established both qualitative and quantitative goals to measure improvements for the redesign. On the qualitative side, our main goal was to successfully communicate to new buyers that Etsy is a global marketplace made by people. On the quantitative side, we primarily cared about three metrics: bounce rate, conversion rate, and retention over time. Our aim was to reduce bounce rate (percentage of visits who leave the site after viewing the homepage) without affecting conversion rate (proportion of visits that resulted in a purchase) and visit frequency. After conducting user surveys, usability tests, and analyzing our target web metrics, we have finally reached those goals and launched a better homepage for new visitors. Here’s what the new homepage looks like:


Bucketing New Visitors

This series of web experiments marked the first time at Etsy that we tried to consistently run an experiment only for first-time visitors over a period of time. While identifying a new visitor is relatively straightforward, the logic to present that user with the same experience on subsequent visits is less trivial.

Bucketing a Visitor

At Etsy we use our open source Feature API for A/B testing. Every visitor is assigned a unique ID when they arrive at the website for the first time. In order to determine which bucket of a test a visitor belongs in, we generate a deterministic hash from the visitor’s unique ID and the experiment identifier. The main advantage of using this hash for bucketing is that we don’t have to create or manage multiple cookies every time we bucket a visitor into an experiment.
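A minimal sketch of deterministic hash bucketing, with an assumed hash function and layout (see the Feature API source for the real implementation):

```python
import hashlib

def in_variant(visitor_id, experiment_id, enabled_pct):
    """Deterministically bucket a visitor into an experiment: the same
    (visitor, experiment) pair always lands in the same bucket, so no
    per-experiment cookies are needed."""
    key = "%s:%s" % (visitor_id, experiment_id)
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < enabled_pct
```

Because the bucket is a pure function of the two IDs, re-evaluating it on every request gives the visitor a consistent experience with no extra state.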

Identifying New Visitors

One simple way to identify a new visitor is by the absence of etsy.com cookies in the browser. In our first set of experiments we checked for the existence of the __utma cookie from Google Analytics, which we also used to define visits in our internal analytics stack.

Returning New Visitors

Before we define a returning new visitor, we first need to describe the concept of a visit. We use the Google Analytics visit definition, where a visit is a group of user interactions on our website within a given time frame. One visitor can produce multiple visits on the same day, or over the following days, weeks, or months. In a web experiment, the difference between a returning visitor and a returning new visitor is the relationship between the experiment start time and the visitor’s first landing time on the website. To put it simply, every visitor who landed on the website for the first time after the experiment start date will be treated as a new visitor, and will consistently see the same test variant on their first and subsequent visits.

As I mentioned before, we used the __utma cookie to identify visitors. One advantage of this cookie is that it tracks the first time a visitor landed on the website. Since we have access to the first visit start time and the experiment start time, we can determine if a visitor is eligible to see an experiment variant. In the following diagram we show two visitors and their relation with the experiment start time.


Feature API

We added the logic to compare a visitor’s first landing time against an experiment start time as part of our internal Feature API. This way it’s really simple to set up web experiments targeting new visitors. Here is an example of how we set up an experiment configuration and an API entry point.

Configuration Set-up:

$server_config['new_homepage'] = [
    'enabled' => 50,
    'eligibility' => [
        'first_visit' => [
            'after' => TIMESTAMP,
        ],
    ],
];

API Entry Point:

if (Feature::isEnabled('new_homepage')) {
    $controller = new Homepage_Controller();
}

Unforeseen Events

When we first started analyzing the test results, we found that more than 10% of the visitors in the experiment had first-visit landing times prior to the experiment start day. This suggested that old, seasoned Etsy users were being bucketed into the experiment. After investigating, we were able to correlate those visits to a specific browser: Safari 4+. The visits were the result of the browser making requests to generate thumbnail images for its Top Sites feature. These requests are generated whenever the browser is open, even if the user never visits Etsy. On the web analytics side, each one created a visit consisting of a homepage view followed by an exit event. Fortunately, Safari marks these requests with an additional HTTP header, “X-Purpose: preview”. After filtering them out, we were able to correct this anomaly in our data. Below you can see that the experiment’s bounce rate decreased significantly once we removed these automated visits.
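The filter itself can be as simple as a case-insensitive check for the header before counting a visit (a minimal sketch; the function name and header-array shape are assumptions, not Etsy’s production code):

```php
// Return true when a request is a Safari Top Sites thumbnail fetch,
// identified by the "X-Purpose: preview" header. Header names and values
// are compared case-insensitively, since HTTP header names are not
// case-sensitive.
function is_safari_preview(array $headers) {
    foreach ($headers as $name => $value) {
        if (strtolower($name) === 'x-purpose' && strtolower($value) === 'preview') {
            return true;
        }
    }
    return false;
}
```

Requests matching this check would be excluded both from experiment bucketing and from visit counts, so they cannot inflate bounce rates.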


Although checking for the existence of cookies to determine whether a visitor is new may seem trivial, it is hard to be completely certain from this signal alone that a visitor has never been to your website before. One person can view the same website from multiple browsers and devices: mobile, tablet, a work or personal computer, or even a device borrowed from a friend. This is where deeper analysis comes in handy, such as filtering visits using attributes like user registration and signed-in events.


We are confident that web experimentation with new visitors is a good way to collect unbiased results while reducing product development concerns such as disrupting existing users’ experiences with experimental features. Overall, this approach allows us to drive change. Going forward, we will use what we learned from these experiments as we develop new iterations of the homepage for other subsets of our members. Now that the preparatory work is done, we can ramp up this experiment, for instance, to all signed-out visitors.

You can follow Diego on Twitter at @gofordiego