Device Lab Checkout – RFID Style

Posted on June 24, 2014

You may remember reading about our mobile device lab here at Etsy. Ensuring the site works and looks right on all devices has been a priority for all our teams. As the percentage of mobile device traffic to the site continues to increase (it’s currently more than 50% of our traffic), so does the number of devices in our lab. Since it is a device lab, after all, we thought it was only appropriate to trick it out with a device-oriented checkout system: something that makes it easy for designers and developers to get their hands on the device they need quickly and painlessly, and that keeps track of who has what device. Devices are now checked in and out with just two taps, or “bumps”, to an RFID reader. And before things got too high tech, we hid all the components inside a custom-made woodland stump created by Trish Czech, an amazing local Etsy seller.

Etsy RFID Check in/out system

“Bump the Stump” to check in and out devices from our Lab

If you’re able to make it out to Velocity Santa Clara 2014, you’ll find a full presentation on how to build a device lab. However, I’m going to focus on just the RFID aspect of our system and some of the issues we faced.

RFID – What’s the frequency

The first step in converting from our old paper checkout system to an RFID-based one was deciding on the correct type of RFID to use. RFID tags come in a number of different and incompatible frequencies. In most corporate environments that already use RFID tags, they will probably be either high frequency (13.56 MHz) or low frequency (125 kHz) tags. While we had a working prototype with high frequency NFC-based RFID tags, we switched to low frequency since all admins already carry a low frequency RFID badge with them. However, our badges are not compatible with a number of off-the-shelf RFID readers. Our solution was basically to take one of the readers off a door and wire it up to our new system.

You will find that most low frequency RFID readers transmit card data using the Wiegand protocol over their wires. This protocol uses two wires, commonly labeled “Data0” and “Data1”, to transmit the card data. The number of bits each card transmits can vary depending on your RFID system, but let’s say you had an 11-bit card number of “00100001000”. If you monitored the Data0 line, you would see it drop from a high signal to a low signal for 40 microseconds for each “0” bit in the card number. The same thing happens on the Data1 line for each “1” bit. Thus, if you monitor both lines at the same time, you can read the card number.
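
To make the two-wire scheme concrete, here is a minimal sketch of assembling card bits with the RPi.GPIO module mentioned below; the pin numbers and the 26-bit read length are illustrative assumptions, not our production code:

import time
import RPi.GPIO as GPIO

# Assumed BCM pin numbers for the reader's Data0/Data1 lines.
DATA0, DATA1 = 23, 24
bits = []  # card bits accumulate here as the reader pulses each line low

def on_data0(channel):
    bits.append(0)  # a low pulse on Data0 signals a "0" bit

def on_data1(channel):
    bits.append(1)  # a low pulse on Data1 signals a "1" bit

GPIO.setmode(GPIO.BCM)
for pin, handler in ((DATA0, on_data0), (DATA1, on_data1)):
    GPIO.setup(pin, GPIO.IN, pull_up_down=GPIO.PUD_UP)
    GPIO.add_event_detect(pin, GPIO.FALLING, callback=handler)

try:
    while True:
        time.sleep(0.1)  # wait for a full card's worth of bits to arrive
        if len(bits) >= 26:  # e.g. a 26-bit Wiegand read
            print("card bits: " + "".join(str(b) for b in bits))
            del bits[:]
finally:
    GPIO.cleanup()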

Logic Analyzer of Wiegand data

Wiegand data on the wire

We knew we wanted a system that would be low-powered and compact, so we wired our RFID reader to the GPIO pins of a Raspberry Pi. The Raspberry Pi was ideal for this given its small form factor, low power usage, GPIO pins, network connectivity, and USB ports (we later ported this to a BeagleBone Black to take advantage of its on-board flash storage). Besides the GPIO pins used to read the Data1 and Data0 lines, the Raspberry Pi also has pins supplying 3.3 volts and 5 volts of power, so we powered the RFID reader directly from the 5-volt pin. However, the Raspberry Pi’s GPIO pins are 3.3-volt lines, so the 5-volt Data1 and Data0 lines from the RFID reader could damage them over time. To fix this, we used a logic level converter to step down the voltage before connecting Data0 and Data1 to the Raspberry Pi’s GPIO pins.

RFID, Line Converter, LCD, and Raspberry Pi wiring

A Need for Speed

After that, it is fairly easy to write some Python code to monitor for transitions on those pins using the RPi.GPIO module. This worked great for us in testing; however, once we released it we started to notice a number of incorrect RFID card numbers. The issue appeared to be that the Python code would miss a bit of data from one of the lines, or record the transition after a slight delay. Considering a bit is only 40 microseconds long and bits can arrive every 2 milliseconds while a card number is being read, there’s not a lot of time to read a card. While some have reportedly used extra hardware to get around this issue, we found that rewriting the GPIO monitoring code in C boosted our accuracy (using a logic analyzer, we confirmed the correct data was coming into the GPIO pins, so the issue was somewhere after that). Gordon Henderson’s WiringPi made this easy to implement. We also added some logical error checking so we could better inform the user when we detected a bad RFID tag read: checking that we received the correct number of bits within a time window, and that the most significant bits matched our known values. With Python we saw up to a 20% error rate in card reads; while it’s not perfect, getting a little closer to the hardware with C dropped this to less than 3% (of detectable errors).
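
The validation logic itself is language-independent; sketched in Python, the kind of checks described above might look like this (the expected bit count and known prefix are placeholders, not our badge format):

EXPECTED_BITS = 26        # placeholder: bits per card read on your system
KNOWN_PREFIX = [0, 1, 0]  # placeholder: most significant bits all badges share

def validate_read(bits, elapsed_ms, max_window_ms=200):
    """Return True if a captured bit sequence looks like a clean card read."""
    if elapsed_ms > max_window_ms:
        return False  # bits trickled in too slowly: probably noise
    if len(bits) != EXPECTED_BITS:
        return False  # dropped or duplicated pulses
    if bits[:len(KNOWN_PREFIX)] != KNOWN_PREFIX:
        return False  # prefix doesn't match the known badge format
    return True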

Dealing with Anti-Metal

One other issue we ran into was RFID tags attached to devices with metallic cases, which can interfere with reading the tags. A number of manufacturers supply high frequency NFC-based tags that deal with this; however, I’ve yet to find low frequency tags that have this support and come in a small form factor. Our solution is a bit of a Frankenstein, but has worked well so far: we’ve been peeling the shield off on-metal high frequency tags and attaching it to the back of our standard low frequency tags. Strong fingernails, a utility knife, and a little two-sided tape help with this process.

Removing anti-metal backing from NFC tags

Removing anti-metal backing from NFC tags to attach to Low Frequency tags

Client Code

We’ve posted the client-side code for this project on github ( along with a parts list and wiring diagram. This system checks devices in and out from our internal staff directory on the backend. Look to the README file ways to setup your system for handling these calls. We hope those of you looking to build a device lab, or perhaps an RFID system for checking out yarn, will find this helpful.


Opsweekly: Measuring on-call experience with alert classification

Posted on June 19, 2014

The Pager Life

On-call is a tricky thing. It’s a necessary evil at every tech company, and the responsibility can weigh you down.

And yet, the most common thing you hear is “monitoring sucks”, “on-call sucks” and so on. At Etsy, we’ve at least come to accept Nagios for what it is, and make it work for us. But what about on-call? And what are we doing about it?

You may have heard that at Etsy, we’re kinda big into measuring everything. “If it moves, graph it; if it doesn’t move, graph it anyway” is something that many people have heard me mutter quietly to myself in dark rooms for years. And yet, for on-call we were ignoring that advice. We just met up once a week as a team to talk about how crappy our on-call was and how much sleep we lost. No quantification, no action items. Shouldn’t we be measuring this?

Introducing Opsweekly

And so came Opsweekly. Our operations team was growing, and we needed a good place to finally formalise weekly reports: what is everyone doing? At the same time, we needed to make progress and start getting some graphs behind our on-call experiences. So, we disappeared for a few days and came back with some PHP, and Opsweekly was born.

What does Opsweekly do?

In its simplest form, Opsweekly is a great place for your team to get together, share their weekly updates, and then organise a meeting to discuss those reports (if necessary) and take notes on them. But the real power comes from Opsweekly’s built-in on-call classification and optional sleep tracking.

Every week, your on-call engineer visits Opsweekly, hits a big red “I was on call” button, and Opsweekly pulls in the notifications that they received in the last week. This can come from whichever data source you desire: maybe your Nagios instance logs to Logstash or Splunk, or you use PagerDuty for alerting.

An example of on-call report mostly filled in

The engineer can then make a simple decision about which category the alert falls into from a drop-down.

Alert categorisation choices

The list of alert classifications the user can choose from

We were very careful when we designed this list to ensure that every alert type was catered for while minimising the number of choices the engineer had to decide between.

The most important part here is the overall category choice on the left:

Action Taken vs No Action Taken

One of the biggest complaints about on-call was the noise. What is the signal-to-noise ratio of your alert system? Well, now we’re measuring that using Opsweekly.

Percentage of Action vs No Action alerts over the last year

This is just one of the many graphs and reports that Opsweekly can generate using the data that was entered, but it illustrates one of the key points for us: we’ve been doing this for a year, and we are seeing a steadily improving signal-to-noise ratio. Measuring, and making changes based on those measurements, can work for your on-call too.
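
The measurement itself is simple once every alert is categorised; conceptually it is just the following (a sketch with a made-up record format, not Opsweekly’s schema):

def percent_actionable(alerts):
    """Percentage of alerts that required action, given records like
    {"category": "action_taken"} or {"category": "no_action_taken"}."""
    if not alerts:
        return 0.0
    acted = sum(1 for a in alerts if a["category"] == "action_taken")
    return 100.0 * acted / len(alerts)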

The value of historical context

So how does this magic happen? By making conscious choices about whether each alert was meaningful, we can start to make improvements based on that: for example, moving alerts to email only if they’re not urgent enough to be dealt with immediately or during the night.

If the threshold needs adjusting, this serves as a reminder to actually go and adjust the threshold; you’re unlikely to remember or want to do it when you’ve just woken up, or you’ve been context switched. It’s all about surfacing that information.

Alongside the categorisation is a “Notes” field for every alert. A quick line of text in each of these boxes provides invaluable data to other people later on (or maybe yourself!) to gain context about that alert.

Opsweekly has search built in, which allows you to go back and inspect the previous times an alert fired, gaining the knowledge of what each person before you did to resolve it.

Sleep Tracking

A few months in, we were inspired by an Ignite presentation at Velocity Santa Clara about measuring humans. We were taken aback… How was this something we didn’t have?

Once we realised we could have graphs of our activity and sleep, we managed to go a whole two days before we were at the airport for the flight home, buying the increasingly common off-the-shelf personal monitoring devices.

Ryan Frantz wrote here about getting that data available for all to share on our dashboards, using conveniently accessible APIs, and it wasn’t long until it clicked: we could easily query that data when processing on-call notifications to get juicy stats about how often people are woken up. And so we did:

Report in Opsweekly showing sleep data for an engineer’s on-call week

Personal Feedback

The final step of this is helping your humans understand how they can make their lives better using this data. Opsweekly has that covered too: a personal report for each person.


Available on Github now

For more information on how you too can start to measure real data about your on-call experience, read more and get Opsweekly on GitHub.

Velocity Santa Clara 2014

If you’re attending Velocity in Santa Clara, CA next week, Ryan and I are giving a talk about our Nagios experiences and Opsweekly, entitled “Mean Time to Sleep: Quantifying the On-Call Experience”. Come and find us if you’re in town!


Ryan Frantz and Laurie Denness now know their co-workers’ sleeping patterns a little too well…

Conjecture: Scalable Machine Learning in Hadoop with Scalding

Posted on June 18, 2014


Predictive machine learning models are an important tool for many aspects of e-commerce.  At Etsy, we use machine learning as a component in a diverse set of critical tasks. For instance, we use predictive machine learning models to estimate click rates of items so that we can present high quality and relevant items to potential buyers on the site.  This estimation is particularly important when used for ranking our cost-per-click search ads, a substantial source of revenue. In addition to contributing to on-site experiences, we use machine learning as a component of many internal tools, such as routing and prioritizing our internal support e-mail queue.  By automatically categorizing and estimating an “urgency” for inbound support e-mails, we can assign support requests to the appropriate personnel and ensure that urgent requests are handled by staff more rapidly, helping to ensure a good customer experience.

To quickly develop these types of predictive models while making use of our MapReduce cluster, we decided to construct our own machine learning framework, open-sourced under the name “Conjecture.”  It consists of three main parts:

  1. Java classes which define the machine learning models and data types.

  2. Scala methods which perform MapReduce training using Scalding.

  3. PHP classes which use the produced models to make predictions in real-time on the web site.

This article is intended to give a brief introduction to predictive modeling, an overview of Conjecture’s capabilities, as well as a preview of what we are currently developing.

Predictive Models

The main goal in constructing a predictive model is to make a function which maps an input into a prediction.  The input can be anything – from the text of an email, to the pixels of an image, or the list of users who interacted with an item.  The predictions we produce currently are of two types: either a real value (such as a click rate) or a discrete label (such as the name of an inbox in which to direct an email).

The only prerequisite for constructing such a model is a source of training examples: pairs consisting of an input and its observed output.  In the case of click rate prediction these can be constructed by examining historic click rates of items.  For the classification of e-mails as urgent or not, the inputs were historic e-mails, and the outputs were indicators of whether the support staff had marked each email as urgent after reading its contents.


Feature Representation

Having gathered the training data, the next step is to convert it into a representation which Conjecture can understand.  As is common in machine learning, we convert the raw input into a feature representation, which involves evaluating several “feature functions” of the input and constructing a feature vector from the results.  For example, to classify emails the feature functions are things like: an indicator of whether the word “account” is in the email, whether the email is from a registered user, whether the email is a follow-up to an earlier email, and so on.

For e-mail classification we also included subsequences of words which appeared in the email. For example, the feature representation of an urgent email which was from a registered user, had the word “time” in the subject, and the string “how long will it take to receive our item?” in the body may look like:

"label": {"value" : 1.0},
  "vector" : {
    "subject___time" : 1.0,
    "body___will::it::take" : 1.0,
    "body___long::will": 1.0,
    "body___to::receive::our" : 1.0,
    "is_registered___true" : 1.0,

We make use of a sparse feature representation, which is a mapping of string feature names to double values.  We use a modified GNU trove hashmap to store this information while being memory efficient.

Many machine learning frameworks store features in a purely numerical format. While storing information as, for instance, an array of doubles is far more compact than our “string-keyed vectors”, model interpretation and introspection become much more difficult. The choice to store the names of features along with their numeric values allows us to easily inspect our input for any weirdness, and to quickly iterate on models by finding the causes of problematic predictions.
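
As an illustration of how such string-keyed vectors might be produced, here is a sketch of a feature function for the email example above (the helper is ours, not part of Conjecture’s API):

def email_features(email):
    """Map a raw email to a sparse, string-keyed feature vector."""
    features = {}
    for word in email["subject"].lower().split():
        features["subject___" + word] = 1.0
    words = email["body"].lower().split()
    for n in (2, 3):  # short word subsequences from the body
        for i in range(len(words) - n + 1):
            features["body___" + "::".join(words[i:i + n])] = 1.0
    features["is_registered___" + str(email["is_registered"]).lower()] = 1.0
    return features

# email_features({"subject": "Time", "body": "how long will it take",
#                 "is_registered": True})
# yields keys like "subject___time", "body___long::will", "body___will::it::take"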

Model Estimation

Once a dataset has been assembled, we can estimate the optimal predictive model.  In essence we are trying to find a model that makes predictions which tend to agree with the observed outcomes in the training set.  The statistical theory surrounding “Empirical Risk Minimization” tells us that under some mild conditions, we may expect a similar level of accuracy from the model when applied to unseen data.

Conjecture, like many current libraries for performing scalable machine learning, leverages a family of techniques known as online learning. In online learning, the model processes the labeled training examples one at a time, making an update to the underlying prediction function after each observation.  While the online learning paradigm isn’t compatible with every machine learning technique, it is a natural fit for several important classes of machine learning models such as logistic regression and large margin classification, both of which are implemented in Conjecture.

Since we often have datasets with millions of training examples, processing them sequentially on a single machine is infeasible; some form of parallelization is required. However, the online learning framework does not traditionally lend itself to a parallel method for model training. Recent research into parallelized online learning in Hadoop gives us a way to perform sequential updates of many models in parallel across many machines, each separate process consuming a fraction of the total available training data. These “sub-models” are aggregated into a single model that can be used to make predictions or, optionally, fed back into another “round” of the process for further training. Theoretical results tell us that, when performed correctly, this process will result in a reliable predictive model, similar to what would be generated had there been no parallelization. In practice, we find the models converge to an accurate state quickly, after a few iterations.


Conjecture provides an implementation for parallelized online optimization of machine learning models in Scalding, a Scala wrapper for Cascading — a package which plans workflows consisting of several MapReduce jobs each.  Scalding also abstracts over the key-value pairs required by MapReduce, and permits arbitrary n-ary tuples to be used as the data elements. In Conjecture, we train a separate model on each mapper using online updates, by making use of the map-side aggregation functionality of Cascading. Here, in each mapper process, data is grouped by common keys and a reduce function is applied. Data is then sent across the network to reduce nodes which complete the aggregation. This map-side aggregation is conceptually the same as the combiners of MapReduce, though the work resides in the same process as the map task itself.  The key to our distributed online machine learning algorithm is in the definition of appropriate reduce operations so that map-side aggregators will implement as much of the learning as possible.  The alternative — shuffling the examples to reducer processes across the MapReduce cluster and performing learning there — is slower, as it requires much more communication.

During training, the mapper processes consume an incoming stream of training instances. The mappers take these examples and emit pairs consisting of the example and an “empty” (untrained) predictive model. This sequence of pairs is passed to the aggregator, where the actual online learning is performed. The aggregator process implements a reduce function which takes two such pairs and produces a single pair with an updated model.  Call the pairs a and b, each with a member called model and a member called example. Pairs “rolled up” by this reduce process always contain a model, and an example which has not yet been used for training.  When the reduce operation is consuming two models which both have some training, they are merged (e.g., by summing or averaging the parameter values); otherwise, we continue to train whichever model already has some training.  Due to the order in which Cascading calls the reduce function in the aggregator, we end up building a single model on each machine; these are then shuffled to a single reducer where they are merged.  Finally, we update the final model on the one labeled example which it has not yet seen.  Note that the reduce function we give below is not associative — the order in which the pairs are processed will affect the output model.  However, this approach is robust in that it will produce a useful model irrespective of the ordering. The logic for the reduce function is:

train_reduce(a, b) = {

  // neither pair has a trained model:
  // update a model on one example,
  // emit that model and the other example
  if (a.model.isEmpty && b.model.isEmpty) {
    (b.model.train(a.example), b.example)
  }

  // one model is trained, the other isn't:
  // update the trained model, emit that
  // and the other example
  else if (!a.model.isEmpty && b.model.isEmpty) {
    (a.model.train(b.example), a.example)
  }

  // mirroring the second case
  else if (a.model.isEmpty && !b.model.isEmpty) {
    (b.model.train(a.example), b.example)
  }

  // both models are partially trained:
  // update one model, merge that model
  // with the other, emit with the
  // unobserved example
  else {
    (b.model.merge(a.model.train(a.example)), b.example)
  }
}

In Conjecture, we extend this basic approach of consuming one example at a time so that we can perform “mini-batch” training.  This is a variation of online learning where a small set of training examples is used at once to perform a single update.  This leads to more flexibility in the types of models we can train, and also lends better statistical properties to the resulting models (for example, by reducing the variance of the gradient estimates in the case of logistic regression).  What’s more, it comes at no increase in computational complexity over the standard training method.  We implement mini-batch training by amending the reduce function to construct lists of examples, only performing the training when sufficiently many examples have been aggregated.
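
A rough Python analogue of that amended reduce function, with an illustrative batch size and learner callbacks (Conjecture’s real implementation is in Scala):

from collections import namedtuple

# model is None until the first update; examples have not yet been trained on
Pair = namedtuple("Pair", ["model", "examples"])
BATCH_SIZE = 16  # illustrative

def minibatch_reduce(a, b, train, merge):
    """Fold two (model, pending examples) pairs into one, training only when
    a full mini-batch has been aggregated.  train(model, batch) and
    merge(m1, m2) are supplied by the learner; train must accept model=None
    and create a fresh model."""
    if a.model is not None and b.model is not None:
        model = merge(a.model, b.model)  # both sides partially trained
    else:
        model = a.model if a.model is not None else b.model
    pending = list(a.examples) + list(b.examples)
    while len(pending) >= BATCH_SIZE:
        model = train(model, pending[:BATCH_SIZE])  # one mini-batch update
        pending = pending[BATCH_SIZE:]
    return Pair(model, pending)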

Model Evaluation

We implement a parallelized cross validation so that we can get estimates of the performance of the output models on unseen inputs.  This involves splitting the data up into a few parts (called “folds” in the literature), training multiple models, each of which is not trained on one of the folds, and then testing each model on the fold which it didn’t see during training.  We consider several evaluation metrics for the models, such as accuracy, the area under the ROC curve, and so on.  Since this procedure yields one set of performance metrics for each fold, we take the appropriately weighted means of these, and the observed variance also gives an indication of the reliability of the estimate.  Namely if the accuracy of the classification is similar across all the folds then we may anticipate a similar level of accuracy on unobserved inputs.  On the other hand if there is a great discrepancy between the performance on different folds then it suggests that the mean will be an unreliable estimate of future performance, and possibly that either more data is needed or the model needs some more feature engineering.
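
Aggregating the per-fold metrics then amounts to a weighted mean plus a look at the spread; a small sketch, assuming each fold reports a (metric value, fold size) pair:

import statistics

def summarize_folds(fold_results):
    """Combine per-fold evaluation metrics into a single estimate.
    fold_results is a list of (metric_value, fold_size) pairs."""
    total = sum(size for _, size in fold_results)
    mean = sum(value * size for value, size in fold_results) / total
    spread = statistics.pvariance([value for value, _ in fold_results])
    return mean, spread  # a large spread suggests the mean is unreliable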

Inevitably, building useful machine learning models requires iteration — things seldom work very well right out of the box. Often this iteration involves inspecting a model, developing an understanding of why it operates the way it does, and then fixing any unusual or undesired behavior. Leveraging our detailed data representation, we have developed tools enabling the manual examination of learned models.  Such inspection should only be carried out for debugging purposes, or to ensure that features were correctly constructed, and not to draw conclusions about the data.  It is impossible to draw valid conclusions about parameter values without knowing their associated covariance matrix — which we do not handle, since it is presumably too large.

Applying the Models

As part of the model training, we output a JSON-serialized version of the model. This allows us to load the models into any platform we’d like to use. In our case, we want to use our models on our web servers, which run PHP. To accomplish this, we deploy our model file (a single JSON string encoding the internal model data structures) to the servers, where we instantiate it using PHP’s json_decode() function. We also provide utility functions in PHP to process model inputs into the same feature representations used in Java/Scala, ensuring that a model is correctly applied. An example of a JSON-encoded Conjecture model is below:

  "argString": "--zero_class_prob 0.5 --model mira --out_dir contact_seller 
                --date 2014_05_24 --folds 5 --iters 10 
                --input contact_seller/instances 
                --final_thresholding 0.001 --hdfs",
  "exponentialLearningRateBase": 1.0,
  "initialLearningRate": 0.1,
  "modelType": "MIRA",
  "param": {
    "freezeKeySet": false,
    "vector": {
      "__bias__": -0.16469815089457646,
      "subject___time": -0.01698080417481483,
      "body___will::it::take": 0.05834880357927012,
      "body___long::will": 0.0818060174986067991,
      "is_registered___true": -0.002215130480164454,

Currently the JSON-serialized models are stored in a git repository and deployed to web servers via Deployinator (see the Code as Craft post about it here). This architecture was chosen so that we could use our established infrastructure for deploying files from git to the web servers to distribute our models. We can quickly prototype and iterate on our work, and can revert to previous versions of a model at will. Our intention is to move to a system of automated nightly updates, rather than continue with the manually controlled Deployinator process.

Deployinator broadcasts models to all the servers running code that could reference Conjecture models, including the web hosts, and our cluster of gearman workers that perform asynchronous tasks, as well as utility boxes which are used to run cron jobs and ad hoc jobs. Having the models local to the code that’s referencing them avoids network overhead associated with storing models in databases; the process of reading and deserializing models, then making predictions is extremely fast.

Conclusions and Future Work

The initial release of Conjecture shares some of Etsy’s tooling for building classification and regression models at massive scale using Hadoop. This infrastructure is well-tested and practical, and we’d like to get it into the hands of the community as soon as possible. However, this release represents only a fraction of the machine learning tooling that we use to power many features across the site.  Future releases of Conjecture will include tools for building cluster models and infrastructure for building recommender systems on implicit feedback data in Scalding.  Finally, we will release “web code” written in PHP and other languages that can consume Conjecture models and make predictions efficiently in a live production environment.


Introducing nagios-herald

Posted on June 6, 2014

Alert Design

Alert design is not a solved problem. And it interests me greatly.

What makes for a good alert? Which information is most relevant when a host or service is unavailable? While the answer to those, and other, questions depends on a number of factors (including what the check is monitoring, which systems and services are deemed critical, what defines good performance, etc.), at a minimum, alerts should contain some amount of appropriate context to aid an on-call engineer in diagnosing and resolving an alerting event.

When writing Nagios checks, I ask the following questions to help suss out what may be appropriate context:

On the last point, about automating work, I believe that computers can, and should, do as much work as possible for us before they have to wake us up. To that end, I’m excited to release nagios-herald today!

nagios-herald: Rub Some Context on It

nagios-herald was created from a desire to supplement an on-call engineer’s awareness of conditions surrounding a notifying event. In other words, if a computer is going to page me at 3AM, I expect it to do some work for me to help me understand what’s failing. At its core, nagios-herald is a Nagios notification script. The power, however, lies in its ability to add context to Nagios alerts via formatters.

One of the best examples of nagios-herald in action is comparing the difference between disk space alerts with and without context.

Disk Space Alert


I’ve got a vague idea of which volume is problematic but I’d love to know more. For example, did disk space suddenly increase? Or did it grow gradually, only tipping the threshold as my head hit the pillow?

Disk Space Alert, Now *with* Context!


In the example alert above, a stack bar clearly illustrates which volume the alert has fired on. It includes a Ganglia graph showing the gradual increase in disk storage over the last 24 hours. And the output of the df command is highlighted, helping me understand which threshold this check exceeded.

For more examples of nagios-herald adding context, see the example alerts page in the GitHub repo.

“I Have Great Ideas for Formatters!”

I’m willing to bet that at some point, you looked at a Nagios alert and thought to yourself, “Gee, I bet this would be more useful if it had a little more information in it…”  Guess what? Now it can! Clone the nagios-herald repo, write your own custom formatters, and configure nagios-herald to load them.

I look forward to feedback from the community and pull requests!

Ryan tweets at @Ryan_Frantz.


Q1 2014 Site Performance Report

Posted on May 15, 2014

May flowers are blooming, and we’re bringing you the Q1 2014 Site Performance Report. There are two significant changes in this report: the synthetic numbers are from Catchpoint instead of WebPagetest, and we’re going to start labeling our reports by quarter instead of by month going forward.

The backend numbers for this report follow the trend from December 2013 – load times are slightly up across the board. The front-end numbers are slightly up as well, primarily due to experiments and redesigns. Let’s dive into the data!

Server Side Performance

Here are the median and 95th percentile load times for signed in users on our core pages on Wednesday, April 23rd:

Server Side Performance

There was a small increase in both median and 95th percentile load times over the last three months across the board, with a larger jump on the homepage. We are currently running a few experiments on the homepage, one of which is significantly slower than other variants, which is bringing up the 95th percentile. While we understand that this may skew test results, we want to get preliminary results from the experiment before we spend engineering effort on optimizing this variant.

As for the small increases everywhere else, this has been a pattern over the last six months, and is largely due to new features adding a few milliseconds here and there, increased usage from other countries (translating the site has a performance cost), and overall added load on our infrastructure.  We expect to see a slow increase in load time for some period of time, followed by a significant dip as we upgrade or revamp pieces of our infrastructure that are suffering. As long as the increases aren’t massive this is a healthy oscillation, and optimizes for time spent on engineering tasks.

Synthetic Front-end Performance

Because of some implementation details with our private WebPagetest instance, the data we have for Q1 isn’t consistent and clean enough to provide a true comparison between the last report and this one.  The good news is that we also use Catchpoint to collect synthetic data, and we have data going back to well before the last report.  This enabled us to pull the data from mid-December and compare it to data from April, on the same days that we pulled the server side and RUM data.

Our Catchpoint tests are run with IE9 only, and they run from New York, London, Chicago, Seattle, and Miami every two hours.  The “Webpage Response” metric is defined as the time it took from the request being issued to receiving the last byte of the final element on the page.  Here is that data:

Synthetic Performance - Catchpoint

The increase on the homepage is somewhat expected due to the experiments we are running and the increase in the backend time. The search page also saw a large increase in both Start Render and Webpage Response, but we are currently testing a completely revamped search results page, so this is also expected.  The listing page also had a modest jump in start render time, and again this is due to differences in the experiments that were running in December vs. April.

Real User Front-end Performance

As always, these numbers come from mPulse, and are measured via JavaScript running in real users’ browsers:

Real User Monitoring

No big surprises here: we see the same bump on the homepage and the search results page that we did in the server side and synthetic numbers. Everything else is essentially neutral, and isn’t particularly exciting. In future reports we are going to consider breaking this data out by region, by mobile vs. desktop, or perhaps providing other percentiles besides the median (which is the 50th percentile).


We are definitely in a stage of increasing backend load time and front-end volatility due to experiments and general feature development. The performance team has been spending the past few months focusing on some internal tools that we hope to open source soon, as well as running a number of experiments ourselves to try to find some large perf wins. You will be hearing more about these efforts in the coming months, and hopefully some of them will influence future performance reports!

Finally, our whole team will be at Velocity Santa Clara this coming June, and Lara, Seth, and I are all giving talks.  Feel free to stop us in the hallways and say hi!


Web Experimentation with New Visitors

Posted on April 3, 2014

We strive to build Etsy with science, and therefore love how web experimentation and A/B testing help us drive our product development process. Several months ago we started a series of web experiments in order to improve Etsy’s homepage experience for first-time visitors. Testing against a specific population, like first-time visitors, allowed us to find issues and improve our variants without raising concerns in our community. This is how the page used to look for new visitors:


We established both qualitative and quantitative goals to measure improvements for the redesign. On the qualitative side, our main goal was to successfully communicate to new buyers that Etsy is a global marketplace made by people. On the quantitative side, we primarily cared about three metrics: bounce rate, conversion rate, and retention over time. Our aim was to reduce bounce rate (the percentage of visits that leave the site after viewing the homepage) without affecting conversion rate (the proportion of visits that result in a purchase) or visit frequency. After conducting user surveys and usability tests and analyzing our target web metrics, we finally reached those goals and launched a better homepage for new visitors. Here’s what the new homepage looks like:


Bucketing New Visitors

This series of web experiments marked the first time at Etsy that we tried to consistently run an experiment only for first-time visitors over a period of time. While identifying a new visitor is relatively straightforward, the logic to present that user with the same experience on subsequent visits is less trivial.

Bucketing a Visitor

At Etsy we use our open source Feature API for A/B testing. Every visitor is assigned a unique ID when they arrive at the website for the first time. To determine which bucket of a test a visitor belongs to, we generate a deterministic hash using the visitor’s unique ID and the experiment identifier. The main advantage of using this hash for bucketing is that we don’t have to create or manage multiple cookies every time we bucket a visitor into an experiment.
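
To illustrate the idea (a sketch only; the Feature API’s actual hashing scheme may differ):

import hashlib

def bucket(visitor_id, experiment_id, enabled_pct=50):
    """Deterministically assign a visitor to an experiment variant."""
    digest = hashlib.md5((visitor_id + ":" + experiment_id).encode()).hexdigest()
    return "on" if int(digest, 16) % 100 < enabled_pct else "off"

# The same visitor always lands in the same bucket on every visit, with no
# per-experiment cookies to create or manage:
#   bucket("visitor-abc123", "new_homepage")  ->  stable "on" or "off"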

Identifying New Visitors

One simple way to identify a new visitor is by the absence of cookies in the browser. On our first set of experiments we checked for the existence of the __utma cookie from Google Analytics, which we also used to define visits in our internal analytics stack.

Returning New Visitors

Before we define a returning new visitor, we first need to describe the concept of a visit. We use the Google Analytics visit definition, where a visit is a group of user interactions on our website within a given time frame. One visitor can produce multiple visits on the same day, or over the following days, weeks, or months. In a web experiment, the difference between a returning visitor and a returning new visitor is the relationship between the experiment start time and the visitor’s first landing time on the website. To put it simply, every visitor who landed on the website for the first time after the experiment start date is treated as a new visitor, and will consistently see the same test variant on their first and subsequent visits.

As I mentioned before, we used the __utma cookie to identify visitors. One advantage of this cookie is that it tracks the first time a visitor landed on the website. Since we have access to the first visit start time and the experiment start time, we can determine if a visitor is eligible to see an experiment variant. In the following diagram we show two visitors and their relation with the experiment start time.


Feature API

We added the logic to compare a visitor’s first landing time against an experiment start time as part of our internal Feature API. This way it’s really simple to set up web experiments targeting new visitors. Here is an example of how we set up an experiment configuration and an API entry point.

Configuration Set-up:

$server_config['new_homepage'] = [
    'enabled' => 50,
    'eligibility' => [
        'first_visit' => [
            'after' => TIMESTAMP,
        ],
    ],
];

API Entry Point:

if (Feature::isEnabled('new_homepage')) {
    $controller = new Homepage_Controller();
}

Unforeseen Events

When we first started analyzing the test results, we found that more than 10% of the visitors in the experiment had first-visit landing times prior to our experiment start day. This suggested that old, seasoned Etsy users were being bucketed into this experiment. After investigating, we were able to correlate those visits to a specific browser: Safari 4+. The visits were a result of the browser making requests to generate thumbnail images for the Top Sites feature. These types of requests are generated any time a user has the browser open, even without visiting Etsy. On the web analytics side, this created a visit with a homepage view followed by an exit event. Fortunately, Safari provides a way to identify these requests using the additional HTTP header “X-Purpose: preview”. Finally, after filtering these requests, we were able to correct this anomaly in our data. Below you can see the experiment’s bounce rates significantly decreased after getting rid of these automated visits.
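
Filtering these requests then comes down to a header check; sketched in Python (the shape of the request records is hypothetical):

def is_preview_fetch(headers):
    """True for Safari Top Sites thumbnail fetches, which carry the
    "X-Purpose: preview" header."""
    return headers.get("X-Purpose") == "preview"

# Usage sketch over hypothetical log records:
#   real_visits = [r for r in requests if not is_preview_fetch(r["headers"])]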


Although verifying the existence of cookies to determine whether a visitor is new may seem trivial, it is hard to be completely certain from this signal alone that a visitor has never been to your website before. One person can use multiple browsers and devices to view the same website: mobile, tablet, a work or personal computer, or even a device borrowed from a friend. This is where deeper analysis can come in handy, such as filtering visits using attributes like user registration and signed-in events.


We are confident that web experimentation with new visitors is a good way to collect unbiased results and to reduce product development concerns such as disrupting existing users’ experiences with experimental features. Overall, this approach allows us to drive change. Going forward, we will use what we learned from these experiments as we develop new iterations of the homepage for other subsets of our members. Now that all the preparatory work is done, we can ramp up this experiment, for instance, to all signed-out visitors.

You can follow Diego on Twitter at @gofordiego


Responsive emails that really work

Posted on March 13, 2014

If you’ve ever written an HTML email, you’ll know that the state of the art is like coding for the web 15 years ago: tables and inline styles are the go-to techniques, CSS support is laughably incomplete, and your options for layout have none of the flexibility that you get on the “real web”.

Just like everywhere online, more and more people are using mobile devices to read their email.  At Etsy, more than half of our email opens happen on a mobile device!  Our desktop-oriented, fixed-width designs are beautiful, but mobile readers aren’t getting the best experience.

We want our emails to provide a great experience for everyone, so we’re experimenting with new designs that work on every client: iPhone, iPad, Android, Gmail, Outlook, Yahoo Mail, and more.  But given the sorry state of CSS and HTML for email, how can we make an email look great in all those places?

Thanks to one well-informed blog commenter and tons of testing across many devices we’ve found a way to make HTML emails that work everywhere.  You get a mobile-oriented design on phones, a desktop layout on Gmail, and a responsive, fluid design for tablets and desktop clients.  It’s the best of all worlds—even on clients that don’t support media queries.

A New Scaffolding

I’m going to walk you through creating a simple design that showcases this new way of designing HTML emails.  It’s a two-column layout that wraps to a single column on mobile:


For modern browsers, this would be an easy layout to implement—frameworks like Bootstrap provide layouts like this right out of the box.  But the limitations of HTML email make even this simple layout a challenge.

Client Limitations

What limitations are we up against?

On to the Code

Let’s start with a simple HTML structure.

   <table cellpadding=0 cellspacing=0><tr><td>
     <table cellpadding=0 cellspacing=0><tr><td>
       <div>
         <h1>Header</h1>
       </div>
       <div>
         <h2>Main Content</h2>
         <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec gravida sem dictum, iaculis magna ornare, dignissim elit.</p>
       </div>
       <div>
         <h2>Sidebar</h2>
         <p>Donec tincidunt tincidunt nunc, eget pulvinar risus sodales eu.</p>
       </div>
       <div>Footer</div>
     </td></tr></table>
   </td></tr></table>

It’s straightforward: a header and footer with two content areas between, main content and a sidebar.  No fancy tags, just divs and tables and paragraphs—we’re still partying like it’s 1999.  (As we apply styling, we’ll see why both wrapping tables are required.)

Initial Styling

Android is the least common denominator of CSS support, allowing only inline CSS in style attributes and ignoring all other styles.  So let’s add inline CSS that gives us a mobile-friendly layout of a fluid single column:

 <body style="margin: 0; padding: 0; background: #ccc;">
   <table cellpadding=0 cellspacing=0 style="width: 100%;"><tr><td style="padding: 12px 2%;">
     <table cellpadding=0 cellspacing=0 style="margin: 0 auto; background: #fff; width: 96%;"><tr><td style="padding: 12px 2%;">
       <h2 style="margin-top: 0;">Main Content</h2>
       <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec gravida sem dictum, iaculis magna ornare, dignissim elit.</p>
       <h2 style="margin-top: 0;">Sidebar</h2>
       <p>Donec tincidunt tincidunt nunc, eget pulvinar risus sodales eu.</p>
     <div style="border-top: solid 1px #ccc;">

It honestly doesn’t look that different from the unstyled HTML (but the groundwork is there for your beautiful ideas!).  The table-within-a-table wrapping all the content lets us have our content area in a colored background, with a small (2%) gutter on each side.  Don’t forget the cellspacing and cellpadding attributes, too, or you’ll get extra spacing that can’t be removed with CSS!

Dealing with Gmail

This design is certainly adequate for both mobile and desktop clients, but it’s not the best we can do.  Desktop clients and large tablets have a lot of screen real estate that we’re wasting.

Our main target here is Gmail—desktop and laptop screens keep getting bigger, and we want Gmail users to get a full-width experience.  But Gmail doesn’t support media queries, the go-to way of showing different layouts on different-sized clients.  What can we do?

I mentioned earlier that Gmail supports a small subset of CSS inside <style> tags.  This is not a widely known feature of Gmail—most resources you’ll find tell you that Gmail only supports inline styles.  Only a handful of blog comments and forum posts mention this support.  I don’t know when Gmail’s CSS support was quietly improved, but I was certainly pleased to learn about this new way of styling my emails.

The catch with the subset of CSS that Gmail supports is that you are limited to tag name selectors only—no classes or IDs.  Coupled with Gmail’s restricted whitelist of HTML elements, your flexibility in styling different parts of your email differently is severely limited.  Plus, the <style> tag must be in the <head> of your email, not in the <body>.

The trick is to make judicious use of CSS’s structural selectors: the descendant, adjacent, and child selectors.  By carefully structuring your HTML and mixing and matching these selectors, you can pinpoint elements for providing styles.  Here are the styles I’ve applied to show a two-column layout in Gmail:

          <style type="text/css">
/*  1 */    table table {
/*  2 */      width: 600px !important;
/*  3 */    }
/*  4 */    table div + div { /* main content */
/*  5 */      width: 65%;
/*  6 */      float: left;
/*  7 */    }
/*  8 */    table div + div + div { /* sidebar */
/*  9 */      width: 33%;
/* 10 */      float: right;
/* 11 */    }
/* 12 */    table div + div + div + div { /* footer */
/* 13 */      width: 100%;
/* 14 */      float: none;
/* 15 */      clear: both;
/* 16 */    }
          </style>

In the absence of classes and IDs to tell you what elements are being styled, comments are your friend!  Let’s walk through each of these selectors.

Lines 1-3 lock our layout to a fixed width.  Remember, this is our style for Gmail on the desktop, where a fluid design isn’t our goal.  We apply this to the inner wrapping table so that padding on the outer one remains, around our 600-pixel-wide content.  Without having both tables, we’d lose the padding that keeps our content from running into the client’s UI.

Next, we style the main content.  The selector on line 4, reading right to left, finds a div immediately following another div, inside a table.  That actually matches the main content, sidebar, and footer divs, but that’s OK for now.  We style it to take up the left two thirds of the content area (minus a small gutter between the columns).

The selector on line 8 styles the sidebar, by finding a div following a div following a div, inside a table. This selects both the footer and the sidebar, but not the main content, and overrides the preceding styles, placing the sidebar in the right-hand third of the content.

Finally, we select the footer on line 12—the only div that follows three others—and make it full-width.  Since the preceding selectors and styles also applied to this footer div, we need to reset the float style back to none (on line 14).

With that, we have a two-column fixed layout for Gmail, without breaking the one-column view for Android:

The styles we applied to the outer wrapping table keep our content centered, and the other inline styles that we didn’t override (such as the line above the footer) are still rendered.

For Modern Browsers

Finally, let’s consider on iOS and Mac OS X.  I’m lumping them together because they have similar rendering engines and CSS support—the media queries and full CSS selectors you know and love all work.  The styles we applied for Gmail will be also applied on iPhones, giving a mobile-unfriendly fixed-width layout.  We want Android’s single-column fluid layout instead.  We can target modern, small-screen clients (like iPhones) with a media query:

/* Inside <style> in the <head> */
@media (max-width: 630px) {
  table table {
    width: 96% !important;
  }
  table div {
    float: none !important;
    width: 100% !important;
  }
}

These styles override the earlier ones to restore the single-column layout, but only for devices under 630 pixels wide—the point at which our fixed 600-pixel layout would begin to scroll horizontally.  Don’t forget the !important flag, which makes these styles win. Gmail and Android will both ignore this media query.  iPads and desktop clients, which are wider than 630 pixels, will also show the desktop style.

This is admittedly not the prettiest approach. With multiple levels of overriding selectors, you need to think carefully about the impact of any change to your styles.  As your design grows in complexity, you need to keep a handle on which elements will be selected where, particularly with the tag-only selectors for Gmail.  But it’s nearly the holy grail of HTML email: responsive, flexible layouts even in clients with limited support for CSS.

The biggest caveat of this approach (besides the complexity of the code) is the layout on Android tablets: they will display the same single-column layout as Android phones.  For us (and probably for you, too), Android tablets are a vanishingly small percentage of our users.  In any case, the layout isn’t unusable, it’s just not optimal, with wide columns and needlessly large images.

Bringing it All Together

You can find the complete code for this example in this gist:

You can extend this approach to build all kinds of complex layouts.  Just keep in mind the three places where every item might be styled: inline CSS, tag-only selectors in a <style> tag, and one or more media query blocks.  Apply your styles carefully, and don’t forget to test your layout in every client you can get your hands on!

I hope that in the future, Gmail on the web and on Android will enter the 21st century and add support for the niceties that CSS has added in recent years.  Until then, creating an HTML email that looks great on every client will continue to be a challenge.  But with a few tricks like these up your sleeve, you can make a beautiful email that gives a great experience for everyone on every client.

You can follow me on Twitter at @kevingessner

Want to help us make Etsy better, from email to accounting? We’re hiring!


Etsy’s Journey to Continuous Integration for Mobile Apps

Posted on February 28, 2014

Positive app reviews can greatly help user conversion and the image of a brand. On the other hand, bad reviews can have dramatic consequences; as Andy Budd puts it: “Mobile apps live and die by their ratings in an App Store”.


Etsy iOS App Store reviews

The above reviews are actual reviews of the Etsy iOS App. As an Etsy developer, it is sad to read them, but it’s a fact: bugs sometimes sneak through our releases. On the web stack, we use our not-so-secret weapon of Continuous Delivery as a safety net to quickly address bugs that make it to production. However, releasing mobile apps requires a third party’s approval (the app store), which takes five days on average; once an app is approved, users decide when to upgrade – so they may be stuck with older versions. Based on our analytics data, we currently have 5 iOS and 10 Android versions in use by the public.

Through Continuous Integration (CI), we can detect and fix major defects in the development and validation phase of the project, before they negatively impact the user experience. This post explores Etsy’s journey to implementing a CI pipeline for our Android and iOS applications.

“Every commit should build the mainline on an integration machine”

This fundamental CI principle is the first step to detecting defects as soon as they are introduced: failing to compile. Building your app in an IDE does not count as Continuous Integration. Thankfully, both iOS and Android are command line friendly: building a release of the iOS app is as simple as running:

xcodebuild -scheme "Etsy" archive

Provisioning integration machines

Integration machines are separate from developer machines – they provide a stable, controlled, reproducible environment for builds and tests. Ensuring that all the integration machines are identical is critical – using a provisioning framework to manage all the dependencies is a good solution to ensure uniformity and scalability.

At Etsy, we are pretty fond of Chef to manage our infrastructure – we naturally turned to it to provision our growing Mac Mini fleet. Equipped with the homebrew cookbook for installing packages and rbenv cookbook for managing the ruby environment in a relatively sane way, our sysops wizard Jon Cowie sprinkled a few hdiutil incantations (to manage disk images) and our cookbooks were ready. We are now able to programmatically install 95% of Xcode (some steps are still manual), Git, and all the Android packages required to build and run the tests for our apps.

Lastly, if you’ve ever had to deal with iOS provisioning profiles, you can relate to how annoying they are to manage and keep up to date; having a centralized system that manages all our profiles saves a lot of time and frustration for our engineers.

Building on push and providing daily deploys

With our CI machines hooked up to our Jenkins server, setting up a plan to build the app on every git push is trivial. This simple step helps us detect missing files from commits or compilation issues multiple times a week – developers are notified in IRC or by email and build issues are addressed minutes after being detected. Besides building the app on push, we provide a daily build that any Etsy employee can install on their mobile device – the quintessence of dogfooding. An easy way to encourage our coworkers to install pre-release builds is to nag them when they use the app store version of the app.



iOS devices come in many flavors, with seven different iPads, five iPhones and a few iPods; when it comes to Android, the plethora of devices becomes overwhelming. Even when focusing on the top tier of devices, the goal of CI is to detect defects as soon as they are introduced: we can’t expect our QA team to validate the same features over and over on every push!

Our web stack boasts a pretty extensive collection of test suites, and the test-driven development culture is palpable. Ultimately, our mobile apps leverage a lot of our web code base to deliver content: data is retrieved from the API and many screens are web views. Most of the core logic of our apps lives in the UI layer – which can be tested with functional tests. As such, our first approach was to focus on functional tests, given that the API was already tested on the web stack (with unit tests and smoke tests).

Functional tests for mobile apps are not new and the landscape of options is pretty extensive; in our case, we settled on Calabash and Cucumber. The friendly format and predefined steps of Cucumber + Calabash allow our QA team to write tests themselves without any assistance from our mobile apps engineers.

To date, our functional tests run on iPad/iPhone iOS 6 and 7 and Android, and cover our Tier 1 features, including:

Because functional tests mimic the steps of an actual user, the tests require that certain assumed resources exist. In the case of the Checkout test, these are the following:

Our checkout test then consists of:

  1. signing in to the app with our test buyer account
  2. searching for an item (in the seller test account shop)
  3. adding it to the cart
  4. paying for the item using the prepaid credit card

Once the test is over, an ad-hoc mechanism in our backend triggers an order cancellation and the credit card is refunded.

A great example of our functional tests catching bugs is highlighted in the following screenshot from our iPad app:


Our registration test navigates to this view and fills out all the visible fields. Additionally, the test cycles through the “Female”, “Male” and “Rather Not Say” options; in this case, the test failed, since the “Male” option was missing.

By running our test suite every time an engineer pushes code, we not only detect bugs as soon as they are introduced, we also catch app crashes. Our developers usually test their work on the latest OS version, but Jenkins has their back: our tests run simultaneously across different combinations of devices and OS versions.

Testing on physical devices

While our developers enjoy our pretty extensive device lab for manual testing, maintaining a set of devices and constantly running automated tests on them is a logistical nightmare and a full-time job. After multiple attempts at developing an in-house solution, we decided to use AppThwack to run our tests on physical devices. We run our tests on a set of dedicated devices for every push, and run a nightly regression on a broader range of devices by tapping into AppThwack’s cloud of devices. This integration is still very recent, and we’re still working out some kinks related to testing on physical devices and the challenges of aggregating and reporting test status from over 200 devices.

Reporting: put a dashboard on it

With more than 15 Jenkins jobs to build and run the tests, it can be challenging to quickly surface critical information to developers. A simple home-grown dashboard goes a long way toward communicating the current test status across all configurations:

Mobile apps dashboard
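
The backend of a dashboard like this can be as small as a script that polls each job’s JSON API for its latest result; the Jenkins host and job names below are placeholders:

    #!/usr/bin/env ruby
    # Collect the latest build result for each Jenkins job feeding the dashboard.
    require 'net/http'
    require 'json'
    require 'uri'

    JENKINS = 'https://jenkins.example.com'
    JOBS    = %w[ios-build ios-functional-ios6 ios-functional-ios7
                 android-build android-functional]

    JOBS.each do |job|
      data   = JSON.parse(Net::HTTP.get(URI("#{JENKINS}/job/#{job}/lastBuild/api/json")))
      result = data['result'] || 'RUNNING'  # Jenkins reports nil while a build is in flight
      puts format('%-25s %s', job, result)
    end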

Static analysis and code reviews

Automated tests cannot catch all bugs and potential crashes; as on the web stack, developers rely heavily on code reviews before pushing their code. Like all code at Etsy, the apps are stored in GitHub Enterprise repositories, and a code review consists of a pull request and the issue associated with it. Using the GitHub pull request builder Jenkins plugin, we systematically trigger a build, run some static analysis (see our post on static analysis with OCLint), and post the results back to the GitHub issue:

pull request with lint results
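
Conceptually, the pull request build step boils down to running the linter and posting a summary comment through the GitHub API. The repository, token, and exact OCLint invocation below are stand-ins, not our production job:

    #!/usr/bin/env ruby
    # Hypothetical PR build step: lint, then comment on the pull request.
    require 'net/http'
    require 'json'
    require 'uri'

    pr = ENV.fetch('ghprbPullId')  # exported by the GitHub pull request builder plugin

    # Run OCLint against a previously generated compile_commands.json
    system('oclint-json-compilation-database > oclint-report.txt') or abort 'oclint failed'
    report = File.read('oclint-report.txt')

    uri = URI("https://github.example.com/api/v3/repos/mobile/etsy-ios/issues/#{pr}/comments")
    req = Net::HTTP::Post.new(uri)
    req['Authorization'] = "token #{ENV.fetch('GITHUB_TOKEN')}"
    req['Content-Type']  = 'application/json'
    req.body = JSON.dump('body' => "OCLint results for this push:\n\n" + report[0, 6000])
    Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(req) }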

Infrastructure overview

All in all, our current infrastructure looks like the following:

Mobile Apps Infrastructure overview

Challenges and next steps

Building our continuous integration infrastructure was strenuous, and challenges kept appearing one after another, such as the inability to automate the installation of some software dependencies. Even once the system was stable, we constantly had to keep up with new releases (iOS 7, Mavericks), which tend to break the tests and the test harness. Furthermore, functional tests are flaky by nature, requiring constant care and optimization.

We are currently at a point where our tests and infrastructure are reliable enough to detect app crashes and Tier 1 bugs on a regular basis. Our next step, from an infrastructure point of view, is to expand our testing to physical devices via our test provider AppThwack. The integration has just started but already raises some issues: how can we concurrently run the same checkout test (add an item to the cart, buy it using a gift card) across 200 devices – will we create 200 test accounts, one per device? We will post again on our status six months from now, hopefully with more lessons learned and success stories – stay tuned!

You can follow Nassim on Twitter at @kepioo 


Reducing Domain Sharding

Posted by on February 19, 2014 / 3 Comments

This post originally appeared on the Perf Planet Performance Calendar on December 7th, 2013.

Domain sharding has long been considered a best practice for pages with lots of images.  The number of domains that you should shard across depends on how many HTTP requests the page makes, how many connections the client makes to each domain, and the available bandwidth.  Since it can be challenging to change this dynamically (and can cause browser caching issues), people typically settle on a fixed number of shards – usually two.

An article published earlier this year by Chromium contributor William Chan outlined the risks of sharding across too many domains, and Etsy was called out as an example of a site that was doing this wrong.  To quote the article: “Etsy’s sharding causes so much congestion related spurious retransmissions that it dramatically impacts page load time.”  At Etsy we’re pretty open with our performance work, and we’re always happy to serve as an example.  That said, getting publicly shamed in this manner definitely motivated us to bump the priority of reinvestigating our sharding strategy.

Making The Change

The code changes to support fewer domains were fairly simple, since we had already abstracted away the process that adds a hostname to an image path in our codebase.  Additionally, we had the foresight to exclude the hostname from the cache key at our CDNs, so there was no risk of a massive cache purge as we switched which domain our images were served from.  We were aware that this would expire the cache in browsers, since they do include the hostname in their cache key, but this was not a blocker for us given the improved end result.  To make sure that we ended up with the right final number, we created variants for two, three, and four domains; synthetic tests had already allowed us to rule out removing domain sharding entirely.  We activated the experiment in June using our A/B framework, and ran it for about a month.
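
The abstraction in question is essentially a pure function from image path to hostname. A simplified sketch of the idea, in Ruby rather than our actual web code and with invented hostnames:

    require 'zlib'

    NUM_SHARDS = 2  # the variant that won the experiment

    # Hash the path so a given image always maps to the same shard;
    # a stable mapping keeps browser and CDN caches from being split.
    def sharded_image_url(path)
      shard = Zlib.crc32(path) % NUM_SHARDS
      "https://img#{shard}.example-cdn.com#{path}"
    end

    sharded_image_url('/listings/123/il_570xN.jpg')
    # => "https://img1.example-cdn.com/listings/123/il_570xN.jpg" (for example)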


After looking at all of the data, the variant that sharded across two domains was the clear winner.  Given how easy this change was to make, the results were impressive.

As it turns out, William’s article was spot on – we were sharding across too many domains, and network congestion was hurting page load times.  The new CloudShark graph supported this conclusion as well, showing a peak throughput improvement of 33% and radically reduced spurious retransmissions:

Before – Four Shards


After – Two Shards


Lessons Learned

This story had a happy ending, even though it was a little embarrassing in the beginning.  We had a few takeaways from the experience.

Until SPDY/HTTP 2.0 comes along, domain sharding can still be a win for your site, as long as you test and optimize the number of domains to shard across for your content.


December 2013 Site Performance Report

Posted by on January 23, 2014 / 4 Comments

It’s a new year, and we want to kick things off by filling you in on site performance for Q4 2013. Over the last three months, front-end performance has been pretty stable, while backend load times have increased slightly across the board.

Server Side Performance

Here are the median and 95th percentile load times for signed-in users on our core pages, measured on Wednesday, December 18th:

Server Side Performance December 2013

There was an across-the-board increase in both median and 95th percentile load times over the last three months, with a larger jump on our search results page. Two main factors contributed to this increase: higher traffic during the holiday season, and an increase in international traffic, which is slower to serve because of translations. On the search page specifically, browsing in US English is significantly faster than browsing in any other language. This isn’t sustainable over the long term as our international traffic grows, so we will be devoting significant effort to improving it over the next quarter.

Synthetic Front-end Performance

As usual, we are using our private instance of WebPagetest to get synthetic measurements of front-end load time. We use a DSL connection and test with IE8, IE9, Firefox, and Chrome. The main difference in this report is that we have switched from measuring Document Complete to measuring Speed Index, since we believe it provides a better representation of user-perceived performance. To make sure we are comparing against historical data, we pulled Speed Index data from October for the “old” numbers. Here is the data; all of the numbers are medians over a 24 hour period:

Synthetic Front-End Performance December 2013

Start render didn’t really change at all, and Speed Index was up on some pages and down on others. Our search results page, which had the biggest increase on the backend, actually saw a 0.2 second decrease in Speed Index. Since this is a new metric for us, we aren’t sure how stable it will be over time, but we believe it paints a more accurate picture of what our visitors are really experiencing.
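
For readers unfamiliar with the metric: Speed Index is the area above the visual-progress curve, i.e. it adds up how incomplete the page looks over time, so content that renders above the fold early is rewarded even if onload is unchanged. A back-of-the-envelope version, using invented filmstrip samples:

    # Toy Speed Index calculation from [time_ms, fraction_visually_complete] samples.
    samples = [[0, 0.0], [500, 0.1], [1000, 0.6], [2000, 0.9], [3000, 1.0]]

    speed_index = samples.each_cons(2).reduce(0) do |area, ((t0, c0), (t1, _))|
      area + (t1 - t0) * (1.0 - c0)  # time spent visually incomplete
    end
    puts speed_index  # => 1450.0 (lower is better)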

One downside of our current wpt-script setup is that we don’t save waterfalls for old tests – we only save the raw numbers. So when we see something like a 0.5 second jump in Speed Index for the shop page, it can be difficult to figure out why it occurred. Luckily, we are Catchpoint customers as well, so we can turn to that data for granular information about which assets were on the page in October vs. December. The data there shows that all of the traditional metrics (render start, document complete, total bytes) went down over the same period. This suggests that the jump in Speed Index is due to loading order, or perhaps a change in what’s being shown above the fold. Our inability to reconcile these numbers illustrates the need for visual diffs, or some other mechanism to track why Speed Index is changing. Saving the full WebPagetest results would accomplish this, but it would require rebuilding our EC2 infrastructure with more storage – something we may end up needing to do.

Overall we are happy with the switch to Speed Index for our synthetic front-end load time numbers, but it exposed a need for better tooling.

Real User Front-end Performance

These numbers come from mPulse, and are measured via JavaScript running in real users’ browsers:

Real User Front-end Performance December 2013

There aren’t any major changes here, just slight movement that is largely within rounding error. The one outlier is search, especially since our synthetic numbers showed it getting faster. This illustrates the difference between measuring onload, which mPulse does, and measuring Speed Index, which is currently only available in WebPagetest. It is one of the downsides of Real User Monitoring: since you want the overhead of measurement to be low, the data you can capture is limited. RUM excels at measuring things like redirects, DNS lookup times, and time to first byte, but it doesn’t do a great job of capturing how long the full page took to render from the customer’s point of view.


We have a backend regression to investigate and front-end tooling to improve, but overall there weren’t any huge surprises. Etsy’s performance is still pretty good relative to the industry as a whole, and relative to where we were a few years ago. The challenge going forward will center on providing a great experience on mobile devices and for international users, as the site grows and becomes more complex.