Building A Better Build: Our Transition From Ant To SBT

Posted by on September 30, 2014 / 5 Comments

A build tool is fundamental to any non-trivial software project.  A good build tool should be as unobtrusive as possible, letting you focus on the code instead of the mechanics of compilation.  At Etsy we had been using Ant to build our big data stack.  While Ant did handle building our projects adequately, it was a common source of questions and frustration for new and experienced users alike.  When we analyzed the problems users were having with the build process, we decided to replace Ant with SBT, as it was a better fit for our use cases.  In this post I’ll discuss the reasons we chose SBT as well as some of the details of the actual process of switching.

Why Did We Switch?

There were two perspectives we considered when choosing a replacement for Ant.  The first is that of a user of our big data stack.  The build tool should stay out of the way of these users as much as possible.  No one should ever feel it is preventing them from being productive, but instead that it is making them more productive.  SBT has a number of advantages in this regard:

  1. Built-in incremental compiler: We used the stand-alone incremental compiler zinc for our Ant build.  However, this required a custom Ant task, and both zinc and that task needed to be installed properly before users could start building.  This was a common source of questions for users new to our big data stack.  SBT ships with the incremental compiler and uses it automatically.
  2. Better environment handling: Because of the size of our code base, we need to tune certain JVM options when compiling it.  With Ant, these options had to be set in the ANT_OPTS environment variable.  Not having ANT_OPTS set properly was another common source of problems for users.  There is an existing community-supported SBT wrapper that solves this problem.  The JVM options we need are set in a .jvmopts file that is checked in with the code.
  3. Triggered execution: If you prefix any SBT command or sequence of commands with a tilde, it will automatically run that command every time a source file is modified.  This is a very powerful tool for getting immediate feedback on changes.  Users can compile code, run tests, or even launch a Scalding job automatically as they work.
  4. Build console: SBT provides an interactive console for running build commands.  Unlike running commands from the shell, this console supports tab-completing arguments to commands and allows temporarily modifying settings.

The other perspective is that of someone modifying the build.  It should be straightforward to make changes.  Furthermore, it should be easy to take advantage of existing extensions from the community.  SBT is also compelling in this regard:

  1. Build definition in Scala: The majority of the code for our big data stack is Scala.  With Ant, modifying the build requires a context switch to its XML-based definition.  An SBT build is defined using Scala, so no context switching is necessary.  Using Scala also provides much more power when defining build tasks.  We were able to replace several Ant tasks that invoked external shell scripts with pure Scala implementations.  Defining the build in Scala does introduce more opportunities for bugs, but you can use scripted to test parts of your build definition.
  2. Plugin system: To extend Ant, you have to give it a JAR file, either on the command line or by placing it in certain directories.  These JAR files then need to be made available to everyone using the build.  With SBT, all you need to do is add a line like
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

    to the build definition.  SBT will then automatically download that plugin and its dependencies.  By default SBT will download dependencies from Maven Central, but you can configure it to use only an internal repository:

    resolvers += "internal-repo" at "<repository URL>"
    externalResolvers <<= resolvers map { rs =>
        Resolver.withDefaultResolvers(rs, mavenCentral = false)
    }
  3. Ease of inspecting the build: SBT has an “inspect” command that will provide a lot of information about a setting or task: the type, any related settings or tasks, and forward and reverse dependencies.  This is invaluable when trying to understand how some part of the build works.
  4. Debugging support: Most SBT tasks produce verbose debug logs that are not normally displayed.  The “last” command will output those logs, making tracking down the source of an error much easier.
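
To give a flavor of what a build task written in Scala looks like, here is a minimal sketch in SBT 0.13 syntax. The task name writeBuildInfo and the file it produces are hypothetical, chosen only for illustration; they are not part of our actual build.

```scala
// build.sbt (SBT 0.13 syntax). The task and output file are hypothetical.
val writeBuildInfo = taskKey[File]("Writes the project name and version to a file")

writeBuildInfo := {
  // Settings and other tasks are read with .value inside the task body.
  val out = target.value / "build-info.txt"
  IO.write(out, s"name=${name.value}\nversion=${version.value}\n")
  streams.value.log.info(s"Wrote $out")
  out
}
```

Running "inspect writeBuildInfo" in the SBT console would then show the task's type and the settings it depends on, and "last writeBuildInfo" would show its most recent log output.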

This is not to say that SBT is perfect.  There are some aspects of SBT that are less than ideal:

  1. Terse syntax: The SBT build definition DSL is full of hieroglyphic symbols, such as the various assignment operators: <<=, <+=, and <++=.  This terse syntax can be intimidating for those that are not used to it.
  2. Support for compiling a subset of files: The Scala compiler can be slow, so our Ant build supported compiling only a subset of files.  This was very easy to add in Ant.  SBT required a custom plugin that delved into SBT’s internal APIs.  This does work, but we have already had to do some minor refactoring between SBT 0.13.1 and 0.13.2.  We have to accept that such incompatibility is likely to occur between future minor versions.
  3. Less mature: SBT is a relatively new tool, and the plugin ecosystem is still growing.  As such, there were several times we had to write custom plugins for functionality that would likely already be supported by another tool.  We also experienced a couple of bugs in SBT itself during the switch.

At this point it’s fair to ask why SBT, and not Maven, Gradle, or some other tool.  You can certainly get features like these in other tools.  We did not do a detailed comparison of multiple tools to replace Ant.  SBT addressed our pain points with the Ant build, and it was already popular with users of our big data stack.

The Transition Process

It may sound cliché, but the initial hurdle for actually switching to SBT was gaining an in-depth understanding of the tool.  SBT’s terminology and model of a build are very different from Ant’s.  SBT’s documentation is very good and very thorough, however.  Starting the switch without having read through it would have resulted in a lot of wasted time.  It’s a lot easier to find answers online when you know how SBT names things!

The primary goal when switching was to have the SBT build be a drop-in replacement for the Ant build.  This removed the need to modify any external processes that use the build artifacts.  It also allowed us to have the Ant and SBT builds in parallel for a short time in case of bugs in the SBT build.  Unfortunately, our project layout did not conform to SBT’s conventions — but SBT is fully configurable in this regard.  SBT’s “inspect” command was helpful here for discovering which settings to tweak.  It was also good that SBT supports using an external ivy.xml file for dependency management.  We were already using Ivy with our Ant build, so we were able to define our dependencies in one place while running both builds in parallel.
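
As a rough sketch of what this kind of configuration can look like in SBT 0.13, the fragment below overrides the source directories and reads dependencies from an ivy.xml file. The directory names are hypothetical and do not reflect our real layout, and externalIvyFile() is used with its defaults on the assumption that ivy.xml sits in the project's base directory.

```scala
// build.sbt (SBT 0.13 syntax). Directory names are hypothetical.

// Point SBT at a non-standard source layout.
scalaSource in Compile := baseDirectory.value / "src"

scalaSource in Test := baseDirectory.value / "test"

// Take dependency declarations from ivy.xml instead of libraryDependencies,
// so that the Ant and SBT builds share one definition.
externalIvyFile()
```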

With the configuration to match our project layout taken care of, actually defining the build with SBT was a straightforward task.  Unlike Ant, SBT has built-in tasks for common operations, like compiling, running tests, and creating a JAR.  SBT’s multi-project build support also cut down on the amount of configuration necessary.  We have multiple projects building together, and every single Ant task had to be defined to build them in the correct order.  Once we configured the dependencies between the projects in SBT, it automatically guaranteed this.
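
A minimal sketch of such a multi-project definition is shown below, in SBT 0.13 syntax. The project names (common, jobs) are made up for the example. Once dependsOn is declared, SBT compiles common before jobs without any explicit ordering.

```scala
// build.sbt (SBT 0.13 syntax). Project names and directories are hypothetical.
lazy val common = project.in(file("common"))

// jobs uses classes from common, so common is always built first.
lazy val jobs = project.in(file("jobs")).dependsOn(common)

// Running a task on root runs it on every aggregated project.
lazy val root = project.in(file(".")).aggregate(common, jobs)
```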

We also took advantage of SBT’s aliasing feature to make the switch easier on users.  The names of SBT’s built-in tasks did not always align with the names we had picked when defining the corresponding task in Ant.  It’s very easy to alias commands, however:

addCommandAlias("jar", "package")

With such aliases users were able to start using SBT just as they had used Ant, without needing to learn a whole set of new commands.  Aliases also make it easy to define sequences of commands without needing to create an entirely new task, such as

addCommandAlias("rebuild", ";clean; compile; package")

The process of switching was not entirely smooth, however.  We did run into two bugs in SBT itself.  The first is triggered when you define an alias that is the prefix of a task in the alias definition.  Tab-completing or executing that alias causes SBT to hang for some time and eventually die with a StackOverflowError.  This bug is easy to avoid if you define your alias names appropriately, and it is fixed in SBT 0.13.2.  The other bug only comes up with multi-module projects.  Even though our modules have many dependencies in common, SBT will re-resolve these dependencies for each module.  There is now a fix for this in SBT 0.13.6, the most recently released version, that can be enabled by adding

updateOptions := updateOptions.value.withConsolidatedResolution(true)

to your build definition.  We saw about a 25% decrease in time spent in dependency resolution as a result.

Custom Plugins

As previously mentioned, we had to write several custom plugins during the process of switching to SBT to reproduce all the functionality of our Ant build.  We are now open-sourcing two of these SBT plugins.  The first is sbt-checkstyle-plugin.  This plugin allows running Checkstyle over Java sources with the checkstyle task.  The second is sbt-compile-quick-plugin.  This plugin allows you to compile and package a single file with the compileQuick and packageQuick tasks, respectively.  Both of these plugins are available in Maven Central.


Switching build tools isn’t a trivial task, but it has paid off.  Switching to SBT has allowed us to address multiple pain points with our previous Ant build.  We’ve been using SBT for several weeks now.  As with any new tool, there was a need for some initial training to get everyone started.  Overall though, the switch has been a success!   The number of issues encountered with the build has dropped.  As users learn SBT they are taking advantage of features like triggered execution to increase their productivity.


Expanding Diversity Efforts with Hacker School

Posted by on September 25, 2014 / 2 Comments

Today we’re proud to announce that Etsy will provide $210,000 in Etsy Hacker Grants to Hacker School applicants in the coming year. These grants extend our support of Hacker School’s diversity initiatives, which first focused on the gender imbalance of applicants to their program and in the wider industry, and will now expand to support applicants with racial backgrounds that are underrepresented in software engineering.

The grants will provide up to $7,000 in support for at least 35 accepted applicants in the next three batches of Hacker School, and are designed to help with a student’s living expenses during their three-month curriculum in New York City.

Diversity and opportunity lie at the heart of what Etsy is about, a place where anyone can start a business for $0.20. Today we wrote a post talking about how we think about diversity at Etsy: More Than Just Numbers. As an engineering team, diversity and opportunity are core values for us as well — we believe a diverse environment is a resilient one. We love what we do and want to extend that opportunity to anyone who wants it.

This doesn’t mean the work to support diversity is easy or that we’ve mastered it — but we are committed to continuing to improve. Over the years, we’ve focused on educational programs looking at unconscious bias, bringing in speakers from NCWIT, and building out our internal leadership courses to support a broad swath of new leaders.

Hacker School grants have been one of our favorite and most effective programs since sponsoring our first batch of students in summer 2012. We’ve even given talks about how well it went. Hacker School’s selection process and environment combine to create a group of students diverse across a number of axes, including gender, race, experience and technical background, but that are also culturally and technically aligned with Etsy’s engineering team. Hacker School’s welcoming “programming retreat” approach produces the sort of broad, deep, tool-building and curious system engineers that work well in our distributed, iterative, transparent and scaled environment. We have Hacker School alums across almost every team in Etsy engineering and at nearly every level, from just starting their first job to very senior engineers.

We know that racial diversity is a complicated issue, and we are by no means the experts. But we believe that together with Hacker School we are making a small step in the right direction.

And we need your help. This program only works if qualified applicants hear that it’s happening, and know that we really want them to apply. If you know someone who you think would be great, please let them know, and encourage them to apply to an upcoming batch!


Come find Etsy at Velocity NY 2014

Posted by on September 10, 2014 / No Responses

Velocity is our kind of conference, and Velocity NY happens in our backyard, in that funny little borough on the other side of the river (Manhattan). Make sure to come find us; we’ll be there teaching, speaking, and keynoting:

Monday 9am “PostMortem Facilitation: Theory and Practice of “New View” Debriefings” – John Allspaw

Monday 1:30pm “Building a Device Lab” – Lara Swanson and Destiny Montague

Tuesday 1:15pm “A Woman’s Place is at the Command Prompt” – Lara Swanson (moderator), Katherine Daniels (Etsy), Jennifer Davis (Chef), Bridget Kromhout (DramaFever), Samantha Thomas (UMN)

Tuesday 3:30pm “It’s 3AM, Do You Know Why You Got Paged?” – Ryan Frantz

Tuesday 5pm “Etsy’s Journey to Building a Continuous Integration Infrastructure for Mobile Apps” – Nassim Kammah

Wednesday 11:20am “Unpacking the Black Box: Benchmarking JS Parsing and Execution on Mobile Devices” – Daniel Espeset

Holding office hours

Nassim Kammah of our continuous integration team, the Autobot

Ryan Frantz, our sleep and alert design obsessed operations engineer

John Allspaw, he should look familiar to you by now

Signing books

Jon Cowie will be signing his book “Customizing Chef”
Lara Swanson will be signing galleys of “Designing for Performance”
John Allspaw will be signing “Web Operations” and “Capacity Planning”


Teaching Testing: Our Testing 101 Materials

Posted by on August 20, 2014 / 2 Comments

Etsy engineers have a wide variety of backgrounds, strengths, and weaknesses, so there are no engineering skills we can take for granted. And there are things you can’t just assume engineers will learn for themselves because you throw a codebase and a workflow at them.

I work on Etsy’s continuous deployment team, which advises on automated testing of our code, and I felt that we could use some stronger means of teaching (and establishing as a conscious value) the skills of testing and design in code. To that end, I recently wrote two “Testing 101” materials for use by all Etsy engineers. They’re now both on our public GitHub: the Etsy Testing Best Practices Guide, and our Testing 101 Code Lab for hands-on practice applying its ideas. Both use PHP and PHPUnit.

We called it the “Testing Best Practices Guide” because we love misnomers. It’s more about design than testing, it describes few concrete practices, and we don’t call any of them “best”.

Within Etsy, we supplement mere documents with activities like team rotations (“bootcamps”) for new hires, technical talks, and dojos (collaborative coding exercises) to practice and have fun with coding topics as a group. And of course, we do code review.

Deeper technical considerations are often invisible in code, so you have to find a way, whether by process, tooling, or teaching, to make them visible.


Q2 2014 Site Performance Report

Posted by on August 1, 2014 / 3 Comments

As the summer really starts to heat up, it’s time to update you on how our site performed in Q2. The methodology for this report is identical to the Q1 report. Overall it’s a mixed bag: some pages are faster and some are slower. We have context for all of it, so let’s take a look.

Server Side Performance

Here are the median and 95th percentile load times for signed in users on our core pages on Wednesday, July 16th:

[Figure: Server Side Performance]

A few things stand out in this data:

Median homepage performance improved, but the 95th percentile got slower. This is due to a specific variant we are testing which is slower than the current page. We made some code changes that improved load time for the majority of pageviews, but the one slower test variant brings up the higher percentiles.
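
To see how a single slow variant can move the 95th percentile while the median improves, here is a small sketch using the nearest-rank percentile definition. The load times are invented for illustration and are not our measurements; the 90/10 traffic split is also an assumption.

```scala
object LoadTimes {
  // Nearest-rank percentile: the smallest sample such that at least
  // p percent of all samples are less than or equal to it.
  def percentile(samples: Seq[Double], p: Int): Double = {
    require(samples.nonEmpty, "need at least one sample")
    require(p > 0 && p <= 100, "p must be in (0, 100]")
    val sorted = samples.sorted
    val rank = math.ceil(p * sorted.size / 100.0).toInt
    sorted(rank - 1)
  }

  // Hypothetical load times in milliseconds.
  // Before: every request sees the same page.
  val before: Seq[Double] = Seq.fill(100)(200.0)

  // After: 90% of requests get a faster page, 10% get a slower test variant.
  val after: Seq[Double] = Seq.fill(90)(180.0) ++ Seq.fill(10)(400.0)
}
```

With these made-up numbers the median drops from 200 ms to 180 ms, while the 95th percentile rises from 200 ms to 400 ms, because the slow variant occupies the top tenth of the distribution.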

The listing page saw a fairly large increase in both median and 95th percentile load time. There isn’t a single smoking gun for this, but rather a number of small changes that each caused a small increase in load time over the last three months.

Search saw a significant decrease across the board. This is due to a dedicated memcached cluster that we rolled out to cache “listing cards” on the search results page. This brought our cache hit rate for listing related data up to 100%, since we automatically refresh the cache when the data changes. This was a nice win that will be sustainable over the long term.
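
The idea of refreshing the cache whenever the data changes can be sketched as a write-through cache. The example below is only an illustration: in-memory maps stand in for memcached and the database, and the names ListingCard and ListingCardCache are hypothetical, not our production code.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the data shown in a listing card.
final case class ListingCard(id: Long, title: String)

// Write-through cache: every update is written to the backing store and
// to the cache in the same step, so cached entries are never stale.
class ListingCardCache(store: mutable.Map[Long, ListingCard]) {
  private val cache = mutable.Map.empty[Long, ListingCard]
  var hits = 0
  var misses = 0

  def update(card: ListingCard): Unit = {
    store(card.id) = card // write to the source of truth
    cache(card.id) = card // refresh the cached copy immediately
  }

  def get(id: Long): Option[ListingCard] =
    cache.get(id) match {
      case hit @ Some(_) =>
        hits += 1
        hit
      case None =>
        // Fall back to the store and populate the cache for next time.
        misses += 1
        val loaded = store.get(id)
        loaded.foreach(card => cache(id) = card)
        loaded
    }
}
```

Because every write also refreshes the cache, reads for listings that have been written through this path always hit.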

The shop page saw a big jump at the 95th percentile. This is again due to experiments we are running on this page. A few of the variants we are testing for a shop redesign are slower than the existing page, which has a big impact on the higher percentiles. It remains to be seen which of these variants will win, and which version of the page we will end up with.

Overall we saw more increases than decreases on the backend, but we had a couple of performance wins from code/architecture changes, which is always nice to see. Looking ahead, we are planning on replacing the hardware in our memcached cluster in the next couple of months, and tests show that this should have a positive performance impact across the entire site.

Synthetic Front-end Performance

As a reminder, these tests are run with Catchpoint. They use IE9, and they run from New York, London, Chicago, Seattle, and Miami every two hours. The “Webpage Response” metric is defined as the time it took from the request being issued to receiving the last byte of the final element on the page. Here is that data:

[Figure: Synthetic Performance]

The render start metrics are pretty much the same across the board, with a couple of small decreases that aren’t really worth calling out due to rounding error and network variability. The “webpage response” numbers, on the other hand, are up significantly across the board. This is easily explained: we recently rolled out full site TLS, and changed our synthetic tests to hit https URLs. The added TLS negotiation time for all assets on the page bumped up the overall page load time everywhere. One thing we noticed with this change is that due to most browsers making six TCP connections per domain, we pay this TLS negotiation cost many times per page. We are actively investigating SPDY with the goal of sending all of our assets over one connection and only doing this negotiation once.

Real User Front-end Performance

As always, these numbers come from mPulse, and are measured via JavaScript running in real users’ browsers:

[Figure: RUM Data]

One change here is that we are showing an extra significant figure in the RUM data. We increased the number of beacons that we send to mPulse, and our error margin dropped to 0.00 seconds, so we feel confident showing 10ms resolution. We see the expected drop in search load time because of the backend improvement, and everything else is pretty much neutral. The homepage ticked up slightly, which is expected due to the experiment that I mentioned in the server side load time section.

One obvious question is: “Why did the synthetic numbers change so much while the RUM data is pretty much neutral?”. Remember, the synthetic numbers changed primarily because of a change to the tests themselves. The switch to https caused a step change in our synthetic monitoring, but for real users the rollout was gradual. In addition, real users that see more than one page have some site resources in their browser cache, mitigating some of the extra TLS negotiations. Our synthetic tests always operate with an empty cache, which is a bit unrealistic. This is one of the reasons why we have both synthetic and RUM metrics: if one of them looks a little wonky we can verify the difference with other data. Here’s a brief comparison of the two, showing where each one excels:

Synthetic Monitoring | Real User Monitoring
---------------------|---------------------
Browser Instrumentation | Navigation Timing API
Consistent Trending over time | Can be highly variable as browsers and networks change
Largely in your control | Last mile difficulties
Great for identifying regressions | Great for comparing across geographies/browsers
Not super realistic from an absolute number point of view | “Real User Monitoring”
A/B tests can show outsized results due to empty caches | A/B tests will show the real world impact


This report had some highs and some lows, but at the end of the day our RUM data shows that our members are getting roughly the same experience they were a few months ago performance wise, with faster search results pages. We’re optimistic that upgrading our memcached cluster will put a dent in our backend numbers for the next report, and hopefully some of our experiments will have concluded with positive results as well. Look for another report from us as the holiday season kicks off!


Just Culture resources

Posted by on July 18, 2014 / No Responses

This is a (very) incomplete list of resources that may be useful to those wanting to find out more about the ideas behind Just Culture, blamelessness, complex systems, and related topics. It was created to support my DevOps Days Minneapolis talk Fallible Humans.

Human error and sources of failure

Just Culture

Postmortems and blameless reviews

Complex systems and complex system failure


Calendar Hacks

Posted by on July 15, 2014 / 3 Comments

As an engineering manager, there’s one major realization you have: managers go to lots of meetings. After chatting with a bunch of fellow engineering managers at Etsy, I realized that people have awesome hacks for managing their calendars and time. Here are some of the best ones from a recent poll of Etsy engineering managers! We’ll cover tips on how to:

To access any of the Labs settings/apps:


Block out time

Create big unscheduled blocks every week. It allows for flexibility in your schedule. Some block out 3 afternoons/week as “office hours — don’t schedule unless you’re on my team”. It creates uninterrupted time when I’m *not* in meetings and available on IRC. Some book time to work on specific projects, and mark the event as private. They’ll try to strategize the blocking to prevent calendar fragmentation.

Automatically decline events (Labs): Lets you block off times in your calendar when you are unavailable. Invitations sent for any events during this period will be automatically declined. After you enable this feature, you’ll find a “Busy (decline invitations)” option in the “Show me as” field.

Office Hours: Blocks for office hours allow you to easily say, “yes, I want to talk to you, but can we schedule it for a time I’ve already set aside?” Better than saying “No, I cannot talk to you, my calendar is too full.” (Which also has to happen from time to time.) When you create a new event on your calendar, choose the “Appointment slots” link in the top of the pop-up:


Then follow the instructions to create a bookable slot on your calendar. You’ll be able to share a link to a calendar with your bookable time slots:


Declining meetings: Decline meetings with a note. Unless you work with the organizer often, do this automatically if there’s no context for the meeting. One manager has declined 1 hour meetings, apologized for missing them, asked for a follow up, and found that he really wasn’t needed in the first place. Hours saved!

Change your defaults

Shorter meetings: Don’t end meetings on the hour; use 25/50 minute blocks. You can set this in Calendars>My Calendars>Settings>General>Default Meeting Length. If you set your calendar invites to 25 minutes or 55 minutes, you need to assert that socially at the beginning of the meeting, and then explicitly do a time check at 20 minutes (“we have 5 minutes left in this meeting”). If possible, start with the 25 (or 30) minute defaults rather than the hour-long ones.


Visible vs private: Some make their calendar visible, not private, by default. This lets other people see what they’re doing, so they have a sense of whether they can schedule against something or not: a team meeting, maybe; a Mad Men discussion group, no way.

Custom view: Create a custom 5- or 7-day view or use Week view.


Color coding:

Gentle Reminders (Labs): This feature replaces Calendar’s pop-ups: when you get a reminder, the title of the Google Calendar window or tab will happily blink in the background and you will hear a pleasant sound.

Event dimming: dim past events, and/or recurring future events, so you can focus.


Editability: Make all meetings modifiable and include this in the invite description to avoid emails about rescheduling (“you have modifying rights: reschedule as needed, my schedule is updated”), and ask your reports and your manager to do the same when they send invites. This can result in fewer emails back and forth to reschedule.

Rely on apps and automation

Sunrise: Sunrise for iOS does the right thing and doesn’t show declined events, which weaponizes the auto-decline block.

World Clock (Labs) helps you keep track of the time around the world. Plus: when you click an event, you’ll see the start time in each time zone as well. You could alternatively add an additional timezone (settings -> general -> Your Time Zone -> Additional Timezone).

Free or busy (Labs): See which of your friends are free or busy right now (requires that friends share their Google Calendars with you).

Next meeting (Labs): See what’s coming up next in your calendar.

Soft timer: Set a timer on your phone by default at most meetings, group or 1:1s, and tell folks you’re setting an alarm to notify everyone when you have 5 minutes left. People love this, especially in intense 1:1s, because they don’t have to worry about the time. It can go really well as a reminder to end in “Speedy Meeting” time.

Do routine cleanup

Calendar review first thing in the morning. Review and clean up next three days. Always delete the “broadcast” invites you’re not going to attend.

Block off some empty timeslots first thing in the morning to make sure you have some breaks during the day—using Busy / Auto Decline technology.

People don’t seem to notice when you note all day events at the top of your calendar. If you’re going offsite,  book an event that’s 9am-7pm that will note your location.

Think ahead (long-term)

Book recurring reminders: For instance, do a team comp review every three months. Or if there is a candidate we lost out on, make appointments to follow up with them in a few months.

Limit Monday recurring meetings: Holiday weekends always screw you up when you have to reschedule all of those for later in the week.

Track high-energy hours: “I tracked my high energy hours against my bright-mind-needed tasks and lo-and-behold, realized that mornings is when I need a lot of time to do low-urgency/high importance things that otherwise I wasn’t making time for or I was doing in a harried state. I thus time-blocked 3 mornings a week where from 8am to 11am I am working on these things (planning ahead, staring at the wall thinking through a system issue, etc). It requires a new 10pm to 6am sleeping habit, but it has been fully worth it, I feel like I gained a day in my way. This means I no longer work past 6pm most days, which was actually what was most draining to me.”

Create separate calendars

Have a shared calendar for the team for PTO tracking, bootcampers, time out of the office, team standups etc.

Subscribe to the Holidays in the United States Calendar so as to not be surprised by holidays: Calendar ID:

Subscribe to the conference room calendars if that’s something your organization has.

Create a secondary calendar for non-critical events, so they stay visible but don’t block your calendar. If there’s an event you are interested in, but haven’t decided on going or not, and don’t want other people to feel the need to schedule around it, you can go to the event and copy it to the secondary calendar. Then remove it from your primary calendar. You can toggle the secondary calendar off via the side-panel, and if someone needs to set something up with you, you’ll be available.

When using multiple calendars, there may be events that are important to have on multiple calendars, but this takes up a lot of screen real estate. In these cases, we use Etsy engineer Amy Ciavolino’s Event Merge for Google Calendar Chrome extension. It makes it easy to see those events without them taking up precious screen space.

And, for a little break from the office, be sure to check out Etsy analyst Hilary Parker’s Sunsets in Google Calendar (using R!).


Threat Modeling for Marketing Campaigns

Posted by on July 7, 2014 / 1 Comment

Marketing campaigns and referrals programs can reliably drive growth. But whenever you offer something of value on the Internet, some people will take advantage of your campaign. How can you balance driving growth with preventing financial loss from unchecked fraud?

In this blog post, I want to share what we learned about how to discourage and respond to fraud when building marketing campaigns for e-commerce websites. I personally found there wasn’t a plethora of resources in the security community focused on the specific challenges faced by marketing campaigns, and that motivated me to try and provide more information on the topic for others — hopefully you find this helpful!


Since our experience came from developing a referrals program at Etsy, I want to describe our program and how we created terms of service to discourage fraud. Then rather than getting into very specific details about what we do to respond in real-time to fraud, I want to outline useful questions for anyone building these kinds of programs to ask themselves.

Our Referrals Program

We encouraged people to invite their friends to shop on the site. When the friend created an account on Etsy, we gave them $5.00 to spend on their first purchase. We also wanted to reward the person who invited them, as a thank-you gift.

[Image: Invite Your Friends - Recipient Promo]

Of course, we knew that any program that gives out free money is an attractive one to try and hack!

[Image: Invite Your Friends - Sender Promo]

Discourage Bad Behavior from the Start

First, we wanted to make our program sustainable through proactive defenses. When we designed the program we tried to bake in rules to make the program less attractive to attackers. However, we didn’t want these rules to introduce roadblocks in the product that made the program less valuable from users’ perspectives, or financially unsustainable from a business perspective.

In the end, we decided on the following restrictions. We wanted there to be a minimum spend requirement for buyers to apply their discount to a purchase. We hoped that requiring buyers to put in some of their own money to get the promotion would attract more genuine shoppers and discourage fraud.

We also put limits on the maximum amount we’d pay out in rewards to the person inviting their friends (though we’re keeping an eye out for any particularly successful inviters, so we can congratulate them personally). And we are currently only disbursing rewards to inviters after two of their invited friends make a purchase.
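
Rules like these are simple to express in code. The sketch below is illustrative only: apart from the $5.00 credit and the two-friend requirement mentioned above, every threshold is a hypothetical placeholder, and the object and method names are invented for this example.

```scala
object ReferralRules {
  // All amounts are in cents.
  val DiscountCents = 500          // the $5.00 credit for the invited friend
  val MinPurchasingFriends = 2     // rewards start after two friends purchase

  // The values below are assumptions for illustration, not real limits.
  val MinSpendCents = 1000         // assumed minimum order total
  val RewardPerFriendCents = 500   // assumed reward per purchasing friend
  val MaxRewardCents = 10000       // assumed cap on total inviter rewards

  // The buyer's discount applies only when the order meets the minimum spend.
  def discountFor(orderTotalCents: Int): Int =
    if (orderTotalCents >= MinSpendCents) DiscountCents else 0

  // The inviter earns nothing until enough friends have purchased,
  // and the total reward is capped.
  def inviterReward(purchasingFriends: Int): Int =
    if (purchasingFriends < MinPurchasingFriends) 0
    else math.min(purchasingFriends * RewardPerFriendCents, MaxRewardCents)
}
```

Keeping the thresholds in one place makes it easy to tighten or relax them as you learn how the program is being used.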

Model All Possible Scenarios

A key principle of Etsy’s approach to security is that it’s important to model out all possible approaches of attack, notice the ones that are easiest to accomplish or result in the biggest payouts if successful, and work on making them more difficult, so it becomes less economical for fraudsters to try those attacks.

In constructing these plans, we first tried to differentiate the ways our program could be attacked and by what kind of users. Doing this, we quickly realized that we wanted to respond differently to users with a record of good behavior on the site than to users who didn’t have a good history or who were redeeming many referrals, possibly with automated scripts. We also wanted to have the ability to adjust the boundaries of the two categories over time.

So the second half of our defense against fraud consisted of plans on how to monitor and react to suspected fraud as well as how to prevent attackers in the worst-case scenario from redeeming large amounts of money.

Steps to Develop a Mitigation Plan

I have to admit upfront, I’m being a little ambiguous about what we’ve actually implemented, but I believe it doesn’t really matter since each situation will differ in the particulars. That being said, here are the questions that guided the development of our program, that could guide your thinking too.


# 1. How can you determine whether two user accounts represent the same person in real life?

This question is really key. In order to detect fraud on a referrals program, you need to be able to tell if the same person is signing up for multiple accounts.

In our program, we kept track of many measures of similarity. One important kind of relatedness was what we called “invite relatedness.” A person looking to get multiple credits is likely to have generated multiple invites off of one original account. To check for this, and other cases, we had to keep a graph data structure of whether one user had invited another via a referral and do a breadth first search to determine related accounts.
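The invite-relatedness lookup can be sketched as follows. This is a minimal illustration, not our production code: it assumes invites are available as (inviter, invitee) pairs, and the usernames and function names are hypothetical.

```python
from collections import defaultdict, deque


def build_invite_graph(invites):
    """Build an undirected adjacency map from (inviter, invitee) pairs."""
    graph = defaultdict(set)
    for inviter, invitee in invites:
        graph[inviter].add(invitee)
        graph[invitee].add(inviter)
    return graph


def invite_related(graph, start):
    """Breadth-first search: every account reachable from `start` via invites."""
    seen = {start}
    queue = deque([start])
    while queue:
        user = queue.popleft()
        for neighbor in graph.get(user, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    seen.discard(start)  # an account is not "related" to itself
    return seen


# Hypothetical invite records
invites = [("alice", "bob"), ("alice", "carol"), ("bob", "dave"), ("erin", "frank")]
graph = build_invite_graph(invites)
related = invite_related(graph, "dave")
```

Treating the graph as undirected means a lookup from any account in a chain of invites finds the whole chain, including the original inviter.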

Another important type of connection between accounts is often called “fingerprint relatedness.” We keep track of fingerprints (unique identifying characteristics of an account) and user to fingerprint relationships, so we can look up what accounts share fingerprints. There are a lot of resources available about how to fingerprint accounts in the security community that I would highly recommend researching!

Here’s an example of a very simple invite graph of usernames, colored by fingerprint relatedness. As you can see, the root user SarahJ28 might have invited three people, but two of them are related to her via other identifying characteristics.

Simple Invite Graph
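A fingerprint lookup along these lines could be sketched as below. The fingerprint strings and the invitee usernames are made up for illustration, and a real system would keep this mapping in a datastore rather than in memory.

```python
from collections import defaultdict


def accounts_sharing_fingerprints(user_fingerprints, user):
    """Return accounts that share at least one fingerprint with `user`."""
    # Invert the user -> fingerprints mapping so each fingerprint lists its users.
    by_fingerprint = defaultdict(set)
    for name, prints in user_fingerprints.items():
        for fingerprint in prints:
            by_fingerprint[fingerprint].add(name)

    shared = set()
    for fingerprint in user_fingerprints.get(user, ()):
        shared |= by_fingerprint[fingerprint]
    shared.discard(user)
    return shared


# Hypothetical data: three invitees, two of whom share a fingerprint with the root user
user_fingerprints = {
    "SarahJ28": {"device:a1", "card:9f"},
    "invitee_one": {"device:a1"},
    "invitee_two": {"card:9f", "device:c3"},
    "invitee_three": {"device:d4"},
}
suspicious = accounts_sharing_fingerprints(user_fingerprints, "SarahJ28")
```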


# 2. At what point in time are all the different signals of identity discussed in the previous question available to you?

You don’t know everything about a user from the instant they land on your site. At that point in time, you might only have a little bit of information about their IP and browser.  You start to learn a bit more about them based on their email if they sign up for an account, and you certainly have more substantive information for analysis when they perform other actions on the site, like making purchases.

Generally, our ability to detect identity gets stronger the more engaged someone is with the site, or the closer they move towards making a purchase. However, if someone has a credit on their account that you don’t want them to have, you need to identify them before they complete their purchase. The level of control you have over the process of purchasing will depend on how you process credit card transactions on your site.


# 3. What are the different actions you could take on user accounts if you discover that one user has invited many related accounts to a referrals program?

There’s generally a range of different actions that can be taken against user accounts on a site. Actions that were relevant to us included: banning user accounts, taking away currently earned promotional credits, blocking promotional credits from being applied to a purchase transaction, and taking away the ability to refer more people using our program.

 Signals of identity become stronger with more site engagement


# 4. Do any actions need to be reviewed by a person or can they be automatic? What’s the cost of doing a manual review? On the other hand, how would a user feel if an automated action was taken against their account based on a false positive?

We knew that in some cases we would feel comfortable programming automated consequences to user accounts, while in other cases we wanted manual review of the suspected account first. It was really helpful for us to work with the teams who would be reviewing the suspected accounts on this from the beginning. They had seen lots of types of fraud in the past and helped us calibrate our expectations around false positives from each type of signal.

Luckily for us at Etsy, it’s quite easy to code the creation of tasks for our support team to review from any part of our web stack. I highly recommend architecting this ability if you don’t already have it for your site because it’s useful in many situations besides this one. Of course, we had to be very mindful that the real cost would come from continually reviewing the tasks over time.


# 5. How can you combine what you know about identity and user behavior (at each point in time) with your range of possible actions to come up with checks that you feel comfortable with? Do these checks need to block the action a user is trying to take?

We talked about each of these points where we evaluated a user’s identity and past behavior as a “check” that could either implement one of these actions I described or create a task for a person to review.

This meant we also had to decide whether the check needed to block the user from completing an action, like getting a promotional credit on their account, or applying a promotional credit to a transaction, and how that action was currently implemented on the site.

It’s important to note that if the user is trying to accomplish something in the scope of a single web request, there is a tradeoff between how thoroughly you can investigate someone’s identity and how quickly you can return a response to the user. After all, there are over 30 million user accounts on Etsy and we could potentially need to compare fingerprints across all of them. To solve this problem, we had to figure out how to kick off asynchronously running jobs at key points (especially checkout) that could do the analysis offline, but would nevertheless provide the protection we wanted.
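One way to picture this split between a fast in-request check and deferred analysis is the sketch below. Everything in it is an assumption for illustration: `JobQueue` stands in for whatever background job system you use, and the threshold and job name are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class CheckResult:
    allow: bool   # may the request proceed right now?
    reason: str


@dataclass
class JobQueue:
    """Stand-in for a real background job system (hypothetical)."""
    pending: deque = field(default_factory=deque)

    def enqueue(self, job_name, **payload):
        self.pending.append((job_name, payload))


def checkout_check(user, flagged_users, jobs):
    """Cheap in-request check; defers the expensive comparison to a job."""
    if user in flagged_users:
        return CheckResult(False, "account already flagged")
    # A full fingerprint comparison is too slow for a single web request,
    # so queue it and let the request continue.
    jobs.enqueue("compare_fingerprints", user=user)
    return CheckResult(True, "deferred analysis queued")


def run_pending(jobs, related_counts, threshold, review_tasks):
    """Drain queued jobs; open a manual-review task for suspicious users."""
    while jobs.pending:
        _job_name, payload = jobs.pending.popleft()
        user = payload["user"]
        if related_counts.get(user, 0) >= threshold:
            review_tasks.append(user)


jobs = JobQueue()
result = checkout_check("buyer42", {"spammer"}, jobs)
blocked = checkout_check("spammer", {"spammer"}, jobs)

review_tasks = []
run_pending(jobs, related_counts={"buyer42": 5}, threshold=3, review_tasks=review_tasks)
```

The design choice here is that the request only ever pays for a set lookup, while the slow comparison runs later and can feed either an automated action or a review task.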


# 6. How visible are account statuses throughout your internal systems?

Once you’ve taken an automated or manual action against an account, is that clearly marked on that user account throughout your company’s internal tools? A last point of attack may be that someone writes in complaining their promo code didn’t work. If this happens, it’s important for the person answering that email to know the account has been flagged as under investigation or deemed fraudulent.


# 7. Do you have reviews of your system scheduled for concrete dates in the future? Can you filter fraudulent accounts from your metrics when you want to?

If your referrals campaign doesn’t have a concrete end date, then it’s easy to forget about how it’s performing, not just in terms of meeting business goals but in terms of fraud. It’s important to have an easy way to filter out duplicate user accounts to calculate true growth, as well as how much of an effect unchecked fraud would have had on the program and how much was spent on what was deemed an acceptable, remaining risk. If we had discovered that too much was being spent on allowing good users to get away with a few duplicate referrals, we could have tightened our guidelines and started taking away the ability to send referrals from accounts more frequently.

We found that when we opened up tasks for manual review, the team member reviewing them marked them as accurate 75% of the time. This was pretty good relative to other types of tasks we review. We were also pretty generous in the end in trusting that multiple members of a household might be using the same credit card info.


Our project revealed that some fraud is catastrophic and should absolutely be prevented, while other types of fraud, like a certain level of duplicate redemptions in marketing campaigns, are less dangerous and require a gentler response or even a degree of tolerance.

We have found it useful to review what could happen, design the program with rules that discourage all kinds of fraud while keeping value to the user in mind, combine automated checks with manual reviews, and set up monitoring that includes the ability to segment the performance of our program based on fraud rate.

Many thanks to everyone at Etsy, but especially our referrals team, the risk and integrity teams, the payments team, and the security team for lots of awesome collaboration on this project!

Device Lab Checkout – RFID Style

Posted by on June 24, 2014 / 2 Comments

You may remember reading about our mobile device lab here at Etsy.  Ensuring the site works and looks right on all devices has been a priority for all our teams. As the percentage of mobile device traffic to the site continues to increase (currently it’s more than 50% of our traffic), so does the number of devices we have in our lab. Since it is a device lab after all, we thought it was only appropriate to trick it out with a device-oriented check-out system: something to make it easy for designers and developers to get their hands on the device they need quickly and painlessly, and a way to keep track of who has what device. Devices are now checked in and out with just two taps, or “bumps”, to an RFID reader. And before things got too high tech, we hid all the components in a custom-made woodland stump created by an amazing local Etsy seller, Trish Czech.

Etsy RFID Check in/out system

“Bump the Stump” to check in and out devices from our Lab

If you’re able to make it out to Velocity Santa Clara 2014, you’ll find a full presentation on how to build a device lab. However, I’m going to focus on just the RFID aspect of our system and some of the issues we faced.

RFID – What’s the frequency

The first step in converting from our old paper checkout system to an RFID-based one was deciding on the correct type of RFID to use. RFID tags come in a number of different and incompatible frequencies. In most corporate environments, if you are using RFID tags already, they will probably be either high frequency (13.56 MHz) or low frequency (125 kHz) tags. While we had a working prototype with high frequency NFC-based RFID tags, we switched to low frequency since all admin already carry a low frequency RFID badge around with them. However, our badges are not compatible with a number of off-the-shelf RFID readers. Our solution was basically to take one of the readers off a door and wire it up to our new system.

You will find that most low frequency RFID readers transmit card data using the Wiegand protocol over their wires. This protocol uses two wires, commonly labeled “DATA0” and “DATA1”, to transmit the card data. The number of bits each card will transmit can vary depending on your RFID system, but let’s say you had an 11-bit card number which was “00100001000”. If you monitored the Data0 line, you would see it drop from a high signal to a low signal for 40 microseconds for each “0” bit in the card number. The same thing happens on the Data1 line for each “1” bit.  Thus if you monitor both lines at the same time, you can read the card number.

Logic Analyzer of Wiegand data

Wiegand data on the wire
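To make the decoding concrete, here is a small simulation of the two-wire scheme described above. It does not touch real hardware: the pulse list stands in for what you would observe on the Data0 and Data1 lines, and the function names are hypothetical. Real card formats may also carry parity or facility bits, which this sketch ignores by treating the whole bit string as the number.

```python
def decode_wiegand(pulses):
    """Turn (timestamp_us, line) pulses into a bit string.

    A low pulse on "DATA0" is a 0 bit; a low pulse on "DATA1" is a 1 bit.
    """
    bits = []
    for _timestamp, line in sorted(pulses):  # process pulses in time order
        if line == "DATA0":
            bits.append("0")
        elif line == "DATA1":
            bits.append("1")
        else:
            raise ValueError(f"unknown line: {line}")
    return "".join(bits)


def simulate_pulses(card_bits, spacing_us=2000):
    """Generate the pulses a reader would emit for `card_bits` (illustration only)."""
    return [
        (i * spacing_us, "DATA1" if bit == "1" else "DATA0")
        for i, bit in enumerate(card_bits)
    ]


pulses = simulate_pulses("00100001000")
card_bits = decode_wiegand(pulses)
card_number = int(card_bits, 2)
```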

We knew we wanted a system that would be low powered and compact, so we wired up our RFID reader to the GPIO pins of a Raspberry Pi. The Raspberry Pi was ideal for this given its small form factor, low power usage, GPIO pins, network connectivity and USB ports (we later ported this to a BeagleBone Black to take advantage of its on-board flash storage).  Besides having GPIO pins to read the Data1 and Data0 lines, the Raspberry Pi also has pins for supplying 3.3 volts and 5 volts of power. Using this, we powered the RFID reader with the 5 volt line directly from the Raspberry Pi. However, the GPIO pins for the Raspberry Pi are 3.3 volt lines, so the 5 volt Data1 and Data0 lines from the RFID reader could damage them over time. To fix this issue, we used a logic level converter to step down the voltage before connecting Data0 and Data1 to the Raspberry Pi’s GPIO pins.

RFID, Line Converter, LCD, and Raspberry Pi wiring

A Need for Speed

After that, it is fairly easy to write some Python code to monitor for transitions on those pins using the RPi.GPIO module. This worked great for us in testing; however, we started to notice a number of incorrect RFID card numbers once we released it. The issue appeared to be that the Python code would miss a bit of data from one of the lines or record the transition after a slight delay. Considering a bit is only 40 microseconds long and can happen every 2 milliseconds while a card number is being read, there’s not a lot of time to read a card. While some have reportedly used more hardware to get around this issue, we found that rewriting the GPIO monitoring code in C boosted our accuracy (using a logic analyzer, we confirmed the correct data was coming into the GPIO pins, so the issue was somewhere after that). Gordon Henderson’s WiringPi made this easy to implement. We also added some logical error checking in the code so we could better inform the user if we happened to detect a bad RFID tag read. This included checking that we received the correct number of bits in a time window and ensuring the most significant bits matched our known values. With Python we saw up to a 20% error rate in card reads, and while it’s not perfect, getting a little closer to the hardware with C dropped this to less than 3% (of detectable errors).
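The error checks described above can be sketched roughly as follows. Our actual monitoring code is in C; this Python version only illustrates the logic, and the bit count, time window, and known prefix used here are hypothetical values, not those of our badge system.

```python
def validate_read(pulses, expected_bits, window_us, known_prefix):
    """Return (ok, detail) for one card read.

    pulses: time-ordered list of (timestamp_us, bit), where bit is "0" or "1".
    """
    # A missed transition shows up as too few bits.
    if len(pulses) != expected_bits:
        return False, "wrong number of bits"
    # All bits of one card should arrive within a bounded time window.
    duration = pulses[-1][0] - pulses[0][0]
    if duration > window_us:
        return False, "read took too long"
    # The most significant bits should match the values we expect.
    bits = "".join(bit for _timestamp, bit in pulses)
    if not bits.startswith(known_prefix):
        return False, "unexpected leading bits"
    return True, bits


good_read = [(i * 2000, bit) for i, bit in enumerate("00100001000")]
dropped_bit = good_read[:-1]  # simulate one missed transition

ok_result = validate_read(good_read, expected_bits=11, window_us=30000, known_prefix="001")
bad_result = validate_read(dropped_bit, expected_bits=11, window_us=30000, known_prefix="001")
```

Checks like these only catch detectable errors: a read with the right length and prefix but a flipped bit elsewhere would still pass.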

Dealing with Anti-Metal

One other issue we ran into was RFID tags attached to devices with metallic cases. These cases can interfere with reading the RFID tags. There are a number of manufacturers which supply high frequency NFC-based tags to deal with this; however, I’ve yet to find low frequency tags which have this support and come in a small form factor. Our solution is a little bit of a Frankenstein, but has worked well so far. We’ve been peeling off the shield from on-metal high frequency tags and then attaching it to the back of our standard low frequency tags. Strong fingernails, a utility knife, and a little double-sided tape help with this process.

Removing anti-metal backing from NFC tags

Removing anti-metal backing from NFC tags to attach to Low Frequency tags

Client Code

We’ve posted the client-side code for this project on GitHub, along with a parts list and wiring diagram. This system checks devices in and out from our internal staff directory on the backend. Look to the README file for ways to set up your system for handling these calls. We hope those of you looking to build a device lab, or perhaps an RFID system for checking out yarn, will find this helpful.


Opsweekly: Measuring on-call experience with alert classification

Posted by on June 19, 2014 / 1 Comment

The Pager Life

On-call is a tricky thing. It’s a necessary evil for employees in every tech company, and the responsibility can weigh down on you.

And yet, the most common thing you hear is “monitoring sucks”, “on-call sucks” and so on. At Etsy, we’ve at least come to accept Nagios for what it is, and make it work for us. But what about on-call? And what are we doing about it?

You may have heard that at Etsy, we’re kinda big into measuring everything. “If it moves, graph it, if it doesn’t move graph it anyway” is something that many people have heard me mutter quietly to myself in dark rooms for years. And yet, for on call we were ignoring that advice. We just joined up once a week as a team, to have a meeting and talk about how crappy our on-call was, and how much sleep we lost. No quantification, no action items. Shouldn’t we be measuring this?

Introducing Opsweekly

And so came Opsweekly. Our operations team was growing, and we needed a good place to finally formalise Weekly Reports: what is everyone doing? At the same time, we needed to progress and start getting some graphs behind our on-call experiences. So, we disappeared for a few days and came back with some PHP, and Opsweekly was born.

What does Opsweekly do?

In its simplest form, Opsweekly is a great place for your team to get together, share their weekly updates, and then organise a meeting to discuss those reports (if necessary) and take notes on it. But the real power comes from Opsweekly’s built-in on-call classification and optional sleep tracking.

Every week, your on-call engineer visits Opsweekly, hits a big red “I was on call” button, and Opsweekly pulls in the notifications that they have received in the last week. This can be from whichever data source you desire: maybe your Nagios instance logs into Logstash or Splunk, or you use PagerDuty for alerting.

An example of on-call report mostly filled in

The engineer can make a simple decision from a drop down about what category the alert falls into.

Alert categorisation choices

The list of alert classifications the user can choose from

We were very careful when we designed this list to ensure that every alert type was catered for, while also minimising the number of choices the engineer had to decide between.

The most important part here is the overall category choice on the left:

Action Taken vs No Action Taken

One of the biggest complaints about on call was the noise. What is the signal to noise ratio of your alert system? Well, now we’re measuring that using Opsweekly.

opsweekly works

Percentage of Action vs No Action alerts over the last year

This is just one of the many graphs and reports that Opsweekly can generate using the data that was entered, but this is one of the key points for us: we’ve been doing this for a year and we are seeing a steadily improving signal to noise ratio. Measuring and making changes based on that can work for your on-call too.
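The calculation behind a graph like this is simple. Opsweekly itself is written in PHP; the sketch below is only an illustration in Python, with a made-up data shape (week label plus a top-level category of "action" or "no_action") rather than Opsweekly’s real schema.

```python
from collections import Counter


def action_percentages(alerts):
    """Map each week to the percentage of its alerts that needed action."""
    totals = Counter()
    actioned = Counter()
    for week, category in alerts:
        totals[week] += 1
        if category == "action":
            actioned[week] += 1
    return {week: 100.0 * actioned[week] / totals[week] for week in totals}


# Hypothetical classified alerts
alerts = [
    ("2014-W01", "action"), ("2014-W01", "no_action"),
    ("2014-W01", "no_action"), ("2014-W01", "no_action"),
    ("2014-W02", "action"), ("2014-W02", "no_action"),
]
percentages = action_percentages(alerts)
```

A rising percentage of “action” alerts over time is the signal to noise improvement you are looking for.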

The value of historical context

So how does this magic happen? By having to make conscious choices about whether the alert was meaningful or not, we can start to make improvements based on that. For example, we can move alerts to email only if they’re not urgent enough to be dealt with immediately or during the night.

If the threshold needs adjusting, this serves as a reminder to actually go and adjust the threshold; you’re unlikely to remember or want to do it when you’ve just woken up, or you’ve been context switched. It’s all about surfacing that information.

Alongside the categorisation is a “Notes” field for every alert. A quick line of text in each of these boxes provides invaluable data to other people later on (or maybe yourself!) to gain context about that alert.

Opsweekly has search built in that allows you to go back and inspect the previous times an alert fired, so you can learn what each person before you did to resolve it.

Sleep Tracking

A few months in, we were inspired by an Ignite presentation at Velocity Santa Clara about measuring humans. We were taken aback… How was this something we didn’t have?

Once we realised we could have graphs of our activity and sleep, we lasted a whole two days, until we got to the airport for the flight home, before we started purchasing the increasingly common off-the-shelf personal monitoring devices.

Ryan Frantz wrote here about getting that data available for all to share on our dashboards, using conveniently accessible APIs, and it wasn’t long until it clicked that we could easily query that data when processing on call notifications to get juicy stats about how often people are woken up. And so we did:

Report in Opsweekly showing sleep data for an engineer’s on-call week
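The core of that query is an interval check: did a notification arrive while the engineer was asleep? A rough sketch, assuming sleep data has already been fetched as (start, end) periods; the function name and sample timestamps are hypothetical, and this is not Opsweekly’s actual code.

```python
from datetime import datetime


def alerts_during_sleep(alert_times, sleep_periods):
    """Return alerts whose timestamp falls inside any (start, end) sleep period."""
    return [
        alert for alert in alert_times
        if any(start <= alert < end for start, end in sleep_periods)
    ]


# Hypothetical data for one night of an on-call week
sleep_periods = [(datetime(2014, 6, 10, 23, 30), datetime(2014, 6, 11, 7, 0))]
alert_times = [
    datetime(2014, 6, 10, 21, 15),  # evening, awake
    datetime(2014, 6, 11, 3, 40),   # middle of the night
    datetime(2014, 6, 11, 9, 5),    # morning, awake
]
woken = alerts_during_sleep(alert_times, sleep_periods)
```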

Personal Feedback

The final step of this is helping your humans understand how they can make their lives better using this data. Opsweekly has that covered too, with a personal report for each person.


Available on GitHub now

For more information on how you too can start to measure real data about your on-call experiences, read more and get Opsweekly now on GitHub.

Velocity Santa Clara 2014

If you’re attending Velocity in Santa Clara, CA next week, Ryan and I are giving a talk about our Nagios experiences and Opsweekly, entitled “Mean Time to Sleep: Quantifying the On-Call Experience”. Come and find us if you’re in town!


Ryan Frantz and Laurie Denness now know their co-workers’ sleeping patterns a little too well…
