Culture of Quality: Measuring Code Coverage at Etsy

Posted on February 15, 2018

In the summer of 2017, Etsy created the Test Engineering Strategy Team to define, measure, and develop practices that would help product engineering teams ship quality code. One of our first goals was to find a way to establish a quantifiable quality “baseline” and track our progress against this baseline over time. We settled on code coverage because it provides both a single percentage number as well as line-by-line detailed information for engineers to act on. With over 30,000 unit tests in our test suite, this became a difficult challenge.

Code coverage measures how much of your code is executed by your test suite. This blog post will walk you through the process of implementing coverage measurement across our various platforms; we are not going to weigh in on the debate over the value of code coverage itself.

We approached measuring code coverage with 5 requirements:

  1. Generate code coverage metrics by file for PHP, Android, and iOS code bases
  2. Where possible, use existing tools to minimize impact on existing test and deploy infrastructure
  3. Create an automated and reliable hands-off process for measurement and reporting
  4. Surface an accurate measurement as a KPI for code owners
  5. Promote wide dissemination of the results to increase awareness

Setup and Setbacks

Our first step was collecting information on which tools were already being utilized in our deployment pipeline. A surprise discovery was that Android code coverage had already been implemented in our CI pipeline using JaCoCo, but had been disabled earlier in the year because it was impacting deploy times. We re-enabled the Jenkins job and set it to run on a weekly schedule instead of each deploy to the master trunk. Impact on deployment would be a recurring theme throughout coverage implementation.

There was some early concern about the accuracy of the coverage report generated by JaCoCo. We noticed discrepancies in the reports when viewed through the Jenkins JaCoCo plugin and the code coverage window in Android Studio. We already have a weekly “Testing Workshop” scheduled where these kinds of problems are discussed. Our solution was to sit down with engineers during the workshops and review the coverage reports for differences. A thorough review found no flaws in the reports, so we moved forward with relying on JaCoCo as our coverage generation tool.

For iOS, xcodebuild has built-in code coverage measurement. Setting up xcodebuild’s code coverage is a one-step process of enabling coverage in the target scheme using Xcode. This provides an immediate benefit to our engineers, as Xcode supplies instant feedback in the editor on which code is covered by tests. We measured the performance hit that our unit test build times would take from enabling coverage in CI and determined that it would be negligible. Build times for these jobs already vary widely, swinging around the mean (μ) by about 40 seconds, and adding jobs with coverage enabled to the sample had no noticeable effect on the build time trend. We continually monitor the trend and will adjust if the impact becomes detrimental to our deployment process.

Our PHP unit tests were being executed by PHPUnit, an industry-standard test runner that includes an option to run tests with coverage. This was a great start. Running the tests with coverage required us to enable Xdebug, which can severely affect the performance of PHP. The solution was to employ an existing set of containers created by our Test Deployment Infrastructure team for smaller tasks (updating dashboards, running build metric crons, etc.). By enabling Xdebug on this small subset of containers, we could run coverage without affecting the entire deployment suite. After this setup was completed, we attempted to run our unit tests with coverage on a sample of tests and quickly found that executing tests with coverage consumed a great deal of RAM, causing the jobs to fail. Our initial attempt at a fix was to raise the “memory_limit” directive in php.ini; even allowing the process up to 1024MB of RAM was unsuccessful. Our eventual solution was to prepend our shell execution with php -d "memory_limit=-1", which removes PHP’s memory limit entirely and lets the process use all the memory available to it.

Even after giving the process all of the memory available to the container, we were still running into job failures. Checking coverage for ~30,000 tests in a single execution was too problematic. We needed a way to break up our coverage tests into multiple executions.

Our PHP unit tests are organized by directory (cart, checkout, etc.). In order to break up the coverage job, we wrote a simple shell script that iterates over each directory and creates a coverage report.

for dir in */ ; do
    if [[ -d "$dir" && ! -L "$dir" ]]; then
        # run the PHPUnit-with-coverage command for this directory
        coverage_command "$dir"
    fi
done

Once we had the individual coverage reports, we needed a way to merge them. Thankfully, there is a companion tool, phpcov, that can combine reports generated by PHPUnit. A job of this size can take upwards of four hours in total, so checking coverage on every deploy was out of the question. We set the job to run on a cron alongside our Android and iOS coverage jobs, establishing a pattern of checking code coverage on Sunday nights.

Make It Accessible

After we started gathering code coverage metrics, we needed to find a way to share this data with the owners of the code. For PHP, the combined report generated by phpcov is a hefty 100MB XML file in the Clover format. We use a script written in PHP to parse the XML into an array and then output that array as a CSV file. With that done, we copy the data to our Vertica data warehouse for easy consumption by our engineering teams.
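Our parser is written in PHP, but the idea is straightforward; a minimal sketch in Python, run against a trimmed-down, made-up Clover report (the file names and structure here are illustrative, not our real report), might look like:

```python
import csv
import io
import xml.etree.ElementTree as ET

# A heavily simplified Clover-style report for illustration only.
CLOVER = """<coverage><project>
  <file name="cart/Cart.php">
    <metrics statements="100" coveredstatements="80"/>
  </file>
  <file name="checkout/Checkout.php">
    <metrics statements="50" coveredstatements="10"/>
  </file>
</project></coverage>"""

def clover_to_rows(xml_text):
    # Pull per-file statement coverage out of the Clover XML.
    root = ET.fromstring(xml_text)
    for f in root.iter("file"):
        m = f.find("metrics")
        total = int(m.get("statements"))
        covered = int(m.get("coveredstatements"))
        pct = 100.0 * covered / total if total else 0.0
        yield (f.get("name"), total, covered, round(pct, 1))

# Write the rows out as CSV, ready to load into a warehouse.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["file", "statements", "covered", "percent"])
writer.writerows(clover_to_rows(CLOVER))
```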

For iOS, our first solution was to use a Ruby gem called Slather to generate HTML reports. We followed this path for many weeks and implemented automated Slather reporting using Jenkins. Initially this seemed like a viable solution, but when our engineers reviewed it we found discrepancies between the coverage report generated by Slather and the coverage report generated by Xcode. We had to go back to the drawing board and find another solution. We found another Ruby gem, xcov, that directly analyzes the .xccoverage file generated by xcodebuild when running unit tests. We parse the results of this coverage report into a JSON data object, convert it into CSV, and upload it to Vertica.

For Android, the JaCoCo Gradle plugin is supposed to be able to generate reports in multiple formats, including CSV. However, we could not find a working solution for generating reports using the Gradle plugin. We spent a considerable amount of time debugging this problem before realizing that we were yak shaving. We discarded our base assumptions and looked for other solutions. Eventually we decided to use the built-in REST API provided by Jenkins: we created a downstream job, passed it the build number for the JaCoCo report, then used a simple wget command to retrieve and save the JSON response. This was converted into a CSV file and (once again) uploaded to Vertica.

Once we had the coverage data flowing into Vertica we wanted to get it into the hands of our engineering teams. We used Superbit, our in-house business intelligence reporting tool, to make template queries and dashboards that provided examples of how to surface relevant information for each team. We also began sending a weekly email newsletter to the engineering and project organizations highlighting notable changes to our coverage reports.

To be continued…

Measuring code coverage is just one of many methods the Test Engineering Strategy Team is using to improve the culture of quality in Etsy Engineering. In a future blog post we will be discussing the other ways in which we measure our code base and how we use this reporting to gain confidence in the code that we ship.


Selecting a Cloud Provider

Posted on January 4, 2018

Our site and most of our related services have been hosted in self-managed data centers since the first Etsy site was launched in 2005. Earlier this year, we decided to evaluate migrating everything to a cloud hosting solution. The decision to run our own hardware in data centers was the right decision at the time, but infrastructure as a service (IaaS) and platform as a service (PaaS) offerings have changed dramatically in the intervening years. It was time to reevaluate our decisions. We recently announced that we have selected Google Cloud Platform (GCP) as our cloud provider and are incredibly excited about this decision. This marks a shift for Etsy from infrastructure self-reliance to a best-in-class service provider. This shift allows us to spend less time maintaining our own infrastructure and more time on strategic features and services that make the Etsy marketplace great.

Although we use the term ‘vendor’ when referring to a cloud provider, we viewed this as much more than a simple vendor selection process. We are entering into a partnership and a long-term relationship. The provider that we have chosen is going to be a major part of our successful initial migration, as well as a partner in the long-term scalability and availability of our site and services. This was not a decision that we wanted to enter into without careful consideration and deliberate analysis. This article will walk you through the process by which we vetted and ultimately selected a partner. We are not going to cover why we chose to migrate to a cloud hosting provider nor are we going to cover the business goals that we have established to measure the success of this project.

From One, Many

While the migration to a cloud hosting provider can be thought of as a single project, it really is one very large project made up of many smaller projects. In order to properly evaluate each cloud provider accurately, we needed to identify all of the sub-projects, determine the specific requirements of each sub-project, and then use these requirements to evaluate the various providers. Also, to scope the entire project, we needed to determine the sequence, effort, dependencies, priority, and timing of each sub-project.

We started by identifying eight major projects, including the production render path for the site, the site’s search services, the production support systems such as logging, and the Tier 1 business systems like Jira. We then divided these projects further into their component projects—MySQL and Memcached as part of the production render path, for example. By the end of this exercise, we had identified over 30 sub-projects. To determine the requirements for each of these sub-projects, we needed to gather expertise from around the organization. No one person or project team could accurately, or in a timely enough manner, gather all of these requirements. For example, we not only needed to know the latency tolerance of our MySQL databases but also our data warehousing requirement for an API to create and delete data. To help gather all of these requirements, we used a RACI model to identify subproject ownership.


The RACI model identifies the responsible, accountable, consulted, and informed people for each sub-project.

Each Responsible person owned the gathering of requirements and the mapping of dependencies for their sub-project. The accountable person ensured the responsible person had the time, resources, and information they needed to complete the project and ultimately signed off that it was done.

Architectural Review

Etsy has long used an architectural review process, whereby any significant change in our environment, whether a technology, architecture, or design, undergoes a peer review. Because these reviews require a significant contribution of time from senior engineers, preparation for them is not taken lightly. Experts across the organization collaborated to solicit diverse viewpoints and frequently produced 30+ page documents for architectural review.

We determined that properly evaluating various cloud providers required an understanding of the desired end state of various parts of our system. For example, our current provisioning is done in our colocation centers using a custom toolset to automate building bare-metal servers and virtual machines (VMs), with Chef roles and recipes applied on top. We identified a few key goals for choosing a tool for infrastructure creation in the cloud, including greater flexibility, accountability, security, and centralized access control. Our provisioning team evaluated several tools against these goals, discussed them, and then proposed new workflows in an architecture review. The team concluded by proposing we use Terraform in combination with Packer to build the base OS images.

Ultimately, we held 25 architectural reviews for major components of our system and environments. We also held eight additional workshops for certain components we felt required more in-depth review. In particular, we reviewed the backend systems involved in generating pages of the site (a.k.a. the production render path) with a greater focus on latency constraints and failure modes. These architectural reviews and workshops resulted in a set of requirements that we could use to evaluate the different cloud providers.

How it Fits Together

Once we had a firm-ish set of requirements for the major components of the system, we began outlining the order of migration. In order to do this we needed to determine how the components were all interrelated. This required us to graph dependencies, which involved teams of engineers huddled around whiteboards discussing how systems and subsystems interact.

The dependency graphing exercises helped us identify and document all of the supporting parts of each major component, such as the scheduling and monitoring tools, caching pools, and streaming services. These sessions ultimately resulted in high-level estimates of project effort and timing, mapped in Gantt-style project plans.


Earlier in the year, we ran some of our Hadoop jobs on one of the cloud providers’ services, which gave us a very good understanding of the effort required to migrate and the challenges that we would face in doing so at scale. For this initial experiment, however, we did not use GCP, so we didn’t have the same level of understanding of the cloud provider we ultimately chose.

We therefore undertook an experiment to enable batch jobs to run on GCP utilizing Dataproc and Dataflow. We learned a number of lessons from this exercise including that some of the GCP services were still in alpha release and not suitable for the workloads and SLAs that we required. This was the first of many similar decisions we’ll need to make: use a cloud service or build our own tool. In this case, we opted to implement Airflow on GCP VMs. To help us make these decisions we evaluated the priority of various criteria, such as vendor support of the service, vendor independence, and impact of this decision on other teams.

There is no right answer to these questions. We believe that these questions and criteria need to be identified for each team to consider them and make the best decision possible. We are also not averse to revisiting these decisions in the future when we have more information or alpha and beta projects roll out into general availability (GA).


Over the course of five months, we met with the Google team multiple times. Each of these meetings had clearly defined goals, ranging from short general introductions of team members to full day deep dives on various topics such as using container services in the cloud. In addition to providing key information, these meetings also reinforced a shared engineering culture between Etsy and Google.

We also met with reference customers to discuss what they learned migrating to the cloud and what to look out for on our journey. The time spent with these companies was unbelievably worthwhile and demonstrated the amazing open culture that is pervasive among technology companies in NY.

We also met with key stakeholders within Etsy to keep them informed of our progress and to invite their input on key directional decisions such as how to balance the mitigation of financial risk with the time-discounted value of cost commitments. Doing so provided decision makers with a shared familiarity and comfort with the process and eliminated the possibility of surprise at the final decision point.

The Decision

By this point we had thousands of data points from stakeholders, vendors, and engineering teams. We leveraged a tool called a decision matrix that is used to evaluate multiple-criteria decision problems. This tool helped organize and prioritize this data into something that could be used to impartially evaluate each vendor’s offering and proposal. Our decision matrix contained over 200 factors prioritized by 1,400 weights and evaluated more than 400 scores.

This process began with identifying the overarching functional requirements. We identified relationship, cost, ease of use, value-added services, security, locations, and connectivity as the seven top-level functional requirement areas. We then listed every one of the 200+ customer requirements (customers referring to engineering teams and stakeholders) and weighted them by how well each supported the overall functional requirements. We used a 0, 1, 3, 9 scale to indicate the level of support. For example, the customer requirement of “autoscaling support” was weighted as a 9 for cost (as it would help reduce cost by dynamically scaling our compute clusters up and down), a 9 for ease of use (as it would keep us from having to manually spin up and down VMs), a 3 for value-added services (as it is an additional service offered beyond just basic compute and storage but it isn’t terribly sophisticated), and a 0 for supporting other functional requirements. Multiple engineering teams performed an in-depth evaluation and weighting of these factors. Clear priorities began to emerge as a result of the nonlinear weighting scale, which forces overly conservative scorers to make tough decisions on what really matters.  

We then used these weighted requirements to rank each vendor’s offering. We again used a 0, 1, 3, 9 scale for how well each cloud vendor met the requirement. Continuing with our “autoscaling support” example, we scored each cloud vendor a 9 for meeting this requirement completely in that all the vendors we evaluated provided support for autoscaling compute resources as native functionality. The total scores for each vendor reached over 50,000 points with GCP exceeding the others by more than 10%.
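The mechanics of the matrix can be sketched in a few lines. In this toy version, the requirement names and most numbers are invented; only the “autoscaling support” weights (9 for cost, 9 for ease of use, 3 for value-added services) come from the example above:

```python
# Toy decision matrix: each customer requirement is weighted 0/1/3/9
# against each functional area, and each vendor is scored 0/1/3/9 on
# how well it meets each requirement. Names and numbers are illustrative.
functional_weights = {
    "autoscaling support": {"cost": 9, "ease_of_use": 9, "value_added": 3},
    "managed MySQL":       {"cost": 3, "ease_of_use": 9, "value_added": 9},
}

vendor_scores = {
    "vendor_a": {"autoscaling support": 9, "managed MySQL": 3},
    "vendor_b": {"autoscaling support": 9, "managed MySQL": 9},
}

def total_score(vendor):
    # A requirement's priority is the sum of its functional-area weights;
    # the vendor's total is priority times how well it meets each requirement.
    return sum(
        sum(areas.values()) * vendor_scores[vendor][req]
        for req, areas in functional_weights.items()
    )
```

The nonlinear 0/1/3/9 scale means a single highly weighted requirement can dominate the totals, which is exactly the forcing function described above.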

As is likely obvious from the context, it should be noted that Etsy’s decision matrix (filled with Etsy’s requirements, ranked with Etsy’s weights by Etsy’s engineers) is applicable only to Etsy. We are not endorsing this decision as right for you or your organization, but rather we are attempting to provide you with insight into how we approach decisions such as these that are strategic and important to us.

Just the beginning

Now, the fun and work begin. This process took us the better part of five months with a full-time technical project manager, dozens of engineers and engineering managers, as well as several lawyers, finance personnel, and sourcing experts working on this part-time or, in some cases, full-time. Needless to say, this was not an insignificant effort, but really just the beginning of a multi-year project to migrate to GCP. We have an aggressive timeline to achieve our migration in about two years. We will do this while continuing to focus on innovative product features and minimizing risk during the transition period. We look forward to the opportunities that moving to GCP provides and are particularly excited that this transformational change will allow us to focus more on core, strategic services for the Etsy marketplace by partnering with a best-in-class service provider.


VIPER on iOS at Etsy

Posted on December 11, 2017


Consistent design patterns help Etsy move fast and deliver better features to our sellers and buyers. Historically, we built features in a variety of ways, but we have found great benefit in the consistency VIPER brings.

Inconsistent development practices can make jumping between projects and features fairly difficult. When starting a new project, developers would have to understand the way a feature was implemented with very little shared base knowledge.

Just before we began working on the new Shop Settings in the Sell on Etsy app, VIPER was gaining steam amongst the iOS community and given the complexity of Shop Settings we decided to give this new pattern a try.

While the learning curve was steep and the overhead significant (more on that later), we appreciated the flexibility it had over other architectures, allowing it to be used more consistently throughout our codebase. We decided to apply the pattern elsewhere in our apps while investigating solutions to some of VIPER’s drawbacks.

What is VIPER?

VIPER is a Clean Architecture that aims to separate out the concerns of a complicated view controller into a set of objects with distinct responsibilities.

VIPER stands for View, Interactor, Presenter, Entity, and Router, each with its own responsibility:

View: What is displayed

Interactor: Handles the interactions for each use case

Presenter: Sets up the logic for what is displayed in the view

Entity: Model objects that are consumed by the Interactor

Router: Handles the navigation and data passing from one screen to the next

Example Implementation

As a contrived example, let’s say we want a view that displays a list of people. Our VIPER implementation could be set up as follows:

View: A simple UITableView set up to display UITableViewCells

Interactor: The logic for fetching Person objects

Presenter: Takes a list of Person objects and returns a list of only the names for consumption by the view

Entity: Person objects

Router: Presents a detail screen about a given Person

The interaction would look something like this:

  1. The main view controller would tell the Interactor to fetch the Person entities
  2. The Interactor would fetch the Person entities and tell the Presenter to present these entities
  3. The Presenter would consume these entities and generate a list of strings
  4. The Presenter would then tell the UITableView to reload with the given set of strings

Additionally, if the user were to tap on a given Person, the Interactor would tell the Router to present the detail screen.

Overhead & the issues with VIPER

VIPER is not without issues. As we began to develop new features we ran into a few stumbling blocks.

The boilerplate for setting up the interactions between all the classes can be quite tedious:

// The Router needs a reference to the interactor for 
// any possible delegation that needs to be set up.
// A good example would be after a save on an edit 
// screen the interactor should be told to reload the data.
router.interactor = interactor

// This is needed for the presentation logic
router.controller = navigationController

// If the view has any UI interaction the presenter 
// should be responsible for passing this to the interactor
presenter.interactor = interactor

// Some actions that the Interactor may do may require 
// presentation/navigation logic
interactor.router = router

// The interactor will need to update the presenter when 
// a UI update is needed
interactor.presenter = presenter

As a result of all the classes involved and the interactions between them, developing a mental model of how VIPER works is difficult. Also, the numerous references required between classes can easily lead to retain cycles if weak references are not used in the appropriate places.

VIPERBuilder and beyond

In an attempt to reduce the amount of boilerplate and lower the barrier for using VIPER we developed VIPERBuilder.

VIPERBuilder consists of a simple set of base classes for the Interactor, Presenter and Router (IPR) as well as a Builder class.

By creating an instance of the VIPERBuilder object with your IPR subclasses specified via Swift generics (as needed) the necessary connections are made to allow for consistent VIPER implementation.

Example within a UIViewController:

lazy var viperBuilder: VIPERBuilder<NewInteractor, NewPresenter, NewRouter> = {
    return VIPERBuilder(controller: self)
}()

The VIPERBuilder classes intended to be subclassed are written in Objective-C to allow subclassing in both Swift and Objective-C. There is also an alternative builder object (VIPERBuilderObjC) that uses Objective-C generics.

Since the boilerplate is fairly limited, the only thing left to do is write your functionality and view code. The one part of VIPERBuilder that we chose not to architect around is what code should actually go in the IPR subclasses. Instead, we broke down our ideal use cases for IPR subclasses in the Readme.

Additionally, we have an example project in the repo to give a better idea of how the separation of concerns works with VIPERBuilder.

The iOS team at Etsy has built many features with VIPER and VIPERBuilder over the past year: shop stats, multi-shop checkout, search, and the new seller dashboard, just to name a few. Following these guidelines in our codebase has lessened the impact of context switching between features implemented in VIPER. We are super excited to see how the community uses VIPERBuilder and how it changes over time.


How Etsy caches: hashing, Ketama, and cache smearing

Posted on November 30, 2017

At Etsy, we rely heavily on memcached and Varnish as caching tiers to improve performance and reduce load. Database and search index query results, expensive calculations, and more are stored in memcached as a look-aside cache read from PHP and Java. Internal HTTP requests between API servers are sent through and cached in Varnish. As traffic has grown, so have our caching pools. Today, our caching clusters store terabytes of cached data and handle millions of lookups per second.

At this scale, a single cache host can’t handle the volume of reads and writes that we require.  Cache lookups come from thousands of hosts, and the total volume of lookups would saturate the network and CPU of even the beefiest hardware—not to mention the expense of a server with terabytes of RAM.  Instead, we scale our caches horizontally, by building a pool of servers, each running memcached or Varnish.

In memcached, the cache key can be (almost) any string, which will be looked up in the memcached server’s memory.  As an HTTP cache, Varnish is a bit more complex: requests for cacheable resources are sent directly to Varnish, which constructs a cache key from the HTTP request.  It uses the URL, query parameters, and certain headers as the key.  The headers to use in the key are specified with the Vary header; for instance, we include the user’s locale header in the cache key to make sure cached results have the correct language and currency.  If Varnish can’t find the matching value in its memory, it forwards the request to a downstream server, then caches and returns the response.
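As a rough sketch of the idea (this is not Varnish’s actual key construction, and the function and header names are illustrative), building a lookup key from the URL, query parameters, and the headers named by Vary might look like:

```python
import hashlib
from urllib.parse import urlencode

def cache_key(url, query, headers, vary=("Accept-Language",)):
    # Join the URL, the sorted query parameters, and the values of the
    # headers named by Vary, then hash them into a fixed-size key.
    parts = [url, urlencode(sorted(query.items()))]
    parts += [f"{h}={headers.get(h, '')}" for h in vary]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()
```

Two requests that differ only in their locale header produce different keys, so each locale gets its own cached response.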

With large demands for both memcached and Varnish, we want to make efficient use of the cache pool’s space.  A simple way to distribute your data across the cache pool is to hash the key to a number, then take it modulo the size of the cache pool to index to a specific server.  This makes efficient use of your cache space, because each cached value is stored on a single host—the same host is always calculated for each key.

Consistent hashing for scalability

A major drawback of modulo hashing is that the size of the cache pool needs to be stable over time.  Changing the size of the cache pool will cause most cache keys to hash to a new server.  Even though the values are still in the cache, if the key is distributed to a different server, the lookup will be a miss.  That makes changing the size of the cache pool—to make it larger or for maintenance—an expensive and inefficient operation, as performance will suffer under tons of spurious cache misses.

For instance, if you have a pool of 4 hosts, a key that hashes to 500 will be stored on pool member 500 % 4 == 0, while a key that hashes to 1299 will be stored on pool member 1299 % 4 == 3.  If you grow your cache by adding a fifth host, the pool member calculated for each key may change. The key that hashed to 500 will still be found on pool member 500 % 5 == 0, but the key that hashed to 1299 will be on pool member 1299 % 5 == 4. Until the new pool member is warmed up, your cache hit rate will suffer, as the cache data will suddenly be on the ‘wrong’ host. In some cases, pool changes can cause more than half of your cached data to be assigned to a different host, slashing the efficiency of the cache temporarily. In the case of going from 4 to 5 hosts, only 20% of cache keys will be on the same host as before!

[Illustration of key movement with modulo hashing during pool changes]
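That 20% figure is easy to verify by brute force: a key survives the pool change only when its hash has the same remainder mod 4 and mod 5, which happens for exactly 4 of every 20 hash values.

```python
# Simulate growing a 4-host pool to 5 under modulo hashing: for each
# possible hash value, check whether the key lands on the same host
# before and after the change.
total = 100_000
same = sum(1 for h in range(total) if h % 4 == h % 5)
fraction_same = same / total  # exactly 0.2: only 20% of keys stay put
```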

A fixed-size pool, with changes causing the hit rate to suffer, isn’t acceptable. At Etsy, we make our memcached pools more flexible with consistent hashing, implemented with Ketama. Instead of using the modulo operation to map the hash output to the cache pool, Ketama divides the entire hash space into large contiguous buckets, one (or more) per cache pool host. Cache keys that hash to each range are looked up on the matching host.  This consistent hashing maps keys to hosts in such a way that changing the pool size only moves a small number of keys—“consistently” over time and pool changes.

As a simplified example, with 4 hosts and hash values between 0 and 2^32-1, Ketama might map all values in [0, 2^30) to pool member 0, [2^30, 2^31) to pool member 1, and so on. Like modulo, this distributes cache keys evenly and consistently among the cache hosts. But when the pool size changes, Ketama results in fewer keys moving to a different host.

In the case of going from four hosts to five, Ketama would now divide the hash outputs into 5 groups. 50% of the cache keys will now be on the same host as previously. This leads to fewer cache misses and more efficiency during pool changes.

[Illustration of key movement with Ketama hashing during pool changes]

In reality, Ketama maps each host to dozens or hundreds of buckets, not just one, which increases the stability and evenness of hashing. Ketama hashing is the default in PHP’s memcached library, and has served Etsy well for years as our cache has grown. The increased efficiency of consistently finding keys across pool changes avoids expensive load spikes.
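Ketama’s actual point placement uses MD5 in a specific way; the following is a simplified sketch of the general idea (hash each host to many points on a ring, then look a key up on the next point after its hash), not Etsy’s or libketama’s implementation:

```python
import bisect
import hashlib

def ring_hash(s: str) -> int:
    # 32-bit value derived from MD5; stands in for Ketama's hash.
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big")

class ConsistentHashRing:
    def __init__(self, hosts, points_per_host=100):
        # Map each host to many points on the ring; more points per host
        # gives a more even key distribution.
        self.ring = sorted(
            (ring_hash(f"{host}-{i}"), host)
            for host in hosts
            for i in range(points_per_host)
        )
        self.points = [p for p, _ in self.ring]

    def lookup(self, key: str) -> str:
        # Find the first ring point at or after the key's hash,
        # wrapping around past the end of the ring.
        idx = bisect.bisect(self.points, ring_hash(key)) % len(self.points)
        return self.ring[idx][1]
```

Growing a pool built this way from four hosts to five moves only the keys whose hashes fall into the new host’s buckets, a minority of all keys, instead of the 80% that modulo hashing would move.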

Hot keys and cache smearing

While consistent hashing is efficient, there is a drawback to having each key exist on only one server. Different keys are used for different purposes, and certain cache items will be read and written more than others. In our experience, cache key read rates follow a power law distribution: a handful of keys are used for a majority of reads, while the majority of keys are read a small number of times. With a large number of keys and a large pool, a good hash function will distribute the “hot keys”—the most active keys—among many servers, smoothing out the load.

However, it is possible for a key to be so hot—read so many times—that it overloads its cache server. At Etsy, we’ve seen keys that are hit often enough, and store a large enough value, that they saturate the network interface of their cache host. Our large number of clients collectively generate a higher read rate than a single cache server can provide.  We’ve seen this problem many times over the years, using mctop, memkeys, and our distributed tracing tools to track down hot keys.

Further horizontal scaling by adding more cache hosts doesn’t help in this case, because it only changes the distribution of keys to hosts—at best, it would only move the problem key and saturate a different host. Faster network cards in the cache pool also help, but can have a substantial hardware cost. In general, we try to use large numbers of lower-cost hardware, instead of relying on a small number of super-powered hosts.

We use a technique I call cache smearing to help with hot keys. We continue to use consistent hashing, but add a small amount of entropy to certain hot cache keys for all reads and writes. This effectively turns one key into several, each storing the same data and letting the hash function distribute the read and write volume for the new keys among several hosts.

Let’s take an example of a hot key popular_user_data. This key is read often (since the user is popular) and is hashed to pool member 3. Cache smearing appends a random number in a small range (say, [0, 8)) to the key before each read or write. For instance, successive reads might look up popular_user_data3, popular_user_data1, and popular_user_data6. Because the keys are different, they will be hashed to different hosts. One or more may still be on pool member 3, but not all of them will be, sharing the load among the pool.
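A minimal sketch of the idea, with hypothetical helper names (a real deployment would front a memcached pool rather than a dict, and each smeared variant would typically be filled on its own cache miss):

```python
import random

SMEAR_RANGE = 8  # a few bits of entropy; choosing this range is a tradeoff

def smear(key: str) -> str:
    """Pick one of SMEAR_RANGE variants of a hot key at random."""
    return f"{key}{random.randrange(SMEAR_RANGE)}"

def write_smeared(cache: dict, key: str, value) -> None:
    # One possible write strategy: populate every variant up front so
    # any read finds the data. (Filling each variant lazily on a cache
    # miss is another option.)
    for i in range(SMEAR_RANGE):
        cache[f"{key}{i}"] = value

def read_smeared(cache: dict, key: str):
    # Each read targets a random variant, so the hash function spreads
    # the read volume for this one hot key across several hosts.
    return cache.get(smear(key))

cache = {}  # stand-in for a memcached pool
write_smeared(cache, "popular_user_data", {"name": "etsy"})
value = read_smeared(cache, "popular_user_data")  # e.g. popular_user_data3
```

Because each variant hashes independently, the smeared keys land on different pool members, sharing the hot key's load.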

Cache smearing (named after the technique of smearing leap seconds over time) trades efficiency for throughput, making a classic space-vs-time tradeoff. The same value is stored on multiple hosts, instead of once per pool, which is not maximally efficient. But it allows multiple hosts to serve requests for that key, preventing any one host from being overloaded and increasing the overall read rate of the cache pool.

We manually add cache smearing to only our hottest keys, leaving keys that are less read-heavy efficiently stored on a single host. We ‘smear’ keys with just a few bits of entropy, that is, with a random number in a small range like 0 to 8 or 0 to 16. Choosing the size of that range is its own tradeoff. A too-large range duplicates the cached data more than necessary to reduce load, and can lead to data inconsistencies if a single user sees smeared values that are out-of-sync between multiple requests. A too-small range doesn’t spread the load effectively among the cache pool (and could even result in all the smeared keys still being hashed to the same pool member).

We’ve successfully smeared both memcached keys (by appending entropy directly to the key) and Varnish keys (with a new query parameter for the random value). While we could improve the manual process to track down and smear hot keys, the work we’ve done so far has been integral to scaling our cache pools to their current sizes and beyond.

The combination of consistent hashing and cache smearing has given us a great balance of efficiency and practicality, using our cache space well while ameliorating the problems caused by hot keys. Keep these approaches in mind as you scale your caches!

You can follow me on Twitter at @kevingessner


Reducing Image File Size at Etsy

Posted by on May 30, 2017 / 11 Comments

Serving images is a critical part of Etsy’s infrastructure. Our sellers depend on displaying clean, visually appealing pictures to showcase their products. Buyers can learn about an item by reading its description, but there’s a tangibility to seeing a picture of that custom necklace, perfectly-styled shirt, or one-of-a-kind handbag you’ve found.

Our image infrastructure at its core is simple. Users upload images, which we convert to jpegs on dedicated processing machines (ImgWriters) and store in large datastores (Mediastores). Upon request, specialized hosts (ImgCaches) fetch the images from the Mediastores and send them to our users.

A simplified view of our Photos architecture

Of course, there are many smaller pieces that make up our Photos infrastructure. For example, there’s load balancing involved with our ImgWriters to spread requests across the cluster, and our ImgCaches have a level of local in-memory caching for images that are frequently requested, coupled with a hashing scheme to reduce duplication of work. In front of that, there’s also caching on our CDN side. Behind the writers and caches, there’s some redundancy in our mediastores both for the sake of avoiding single point of failure scenarios and for scale.

Throughout this architecture, though, one of our biggest constraints is the size of the data we pass around. Our CDNs will only be able to cache so much before they begin to evict images, at which point requests are routed back to our origin. Larger images will require heavier use of our CDNs and drive up costs. For our users, as the size of images increases, so too will the download time for a page, impacting performance and resulting in a suboptimal user experience. Likewise, any post-processing of images internally is more bandwidth we’re using on our networks and increases our storage needs. Smaller images help us scale further and faster.

As we’re using jpegs as our format, an easy way to reduce the data size of our images is to lower the jpeg quality value, compressing the contents of the image. Of course, as we compress further we’ll continue to reduce the visual quality of an image—edges will grow blurry and artifacts will appear. Below is an example. The image on the left has a jpeg quality of 92, while the one on the right has a quality of 25. In this extreme case, the picture on the right shows clear signs of compression: the background becomes “blocky” and artifacts appear around edges, such as around the bird.


The left with jpeg quality 92, the right jpeg quality 25.

For most of Etsy’s history of processing images, we’ve held to the common rule of thumb that a jpeg quality of 85 strikes a “good balance”. Any higher and we probably won’t see enough of a difference to warrant all those extra bytes. But could we get away with lowering it to 80? Perhaps the golden number is 75? This is all very dependent upon the image a user uploads. A sunset with various gradients may begin to show the negative effects of lowering the jpeg quality before a picture of a ring on a white background does. What we’d really like is to determine a good jpeg quality for each individual image.

To do so, we added an image comparison tool called dssim to our stack. Dssim computes how dissimilar two images are using the SSIM algorithm for estimating human vision. Given two (or more) images, it will compare the structural similarity and produce a score starting at 0 (identical) with no upper bound. Note that while the README states that it only compares two pngs, recent updates to the library have also allowed comparison between jpegs.

With the user’s uploaded image as a base, we can do a binary search on the jpeg quality to find a balance between the visual quality (how good it looks) and the jpeg quality (the value that helps determine compression size). If we assume that 85 is a good ceiling for images, we can try lowering the quality to 75. Comparing the image at 85 and at 75, if the value returned is higher than a known threshold of what is considered a “good” dssim score, we raise the jpeg quality back up to halfway between, 80. If it’s below the threshold, we can try pushing the quality a little lower, to say 70. Either way, we continue comparing until we reach a happy medium.
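The search described above might look something like this sketch, with the dssim comparison stubbed out as a callback and illustrative threshold values (the names here are not from Etsy's code):

```python
def find_quality(dissimilarity, lo=70, hi=85, threshold=0.02):
    """Return the lowest jpeg quality whose dssim score stays acceptable.

    `dissimilarity(q)` should return the dssim score of the image
    re-encoded at quality q versus the original (0 means identical).
    """
    best = hi
    while lo <= hi:
        mid = (lo + hi) // 2
        if dissimilarity(mid) <= threshold:
            best = mid          # acceptable: try compressing further
            hi = mid - 1
        else:
            lo = mid + 1        # too lossy: back off toward higher quality
    return best

# Toy model of a dssim scorer: pretend dissimilarity rises steadily
# once quality drops below 76.
score = lambda q: max(0.0, (76 - q) * 0.01)
q = find_quality(score)
```

The `lo` and `hi` bounds mirror the 70–85 range the post settles on; in production, each `dissimilarity(mid)` call would re-encode the image and shell out to (or bind) dssim.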

This algorithm is adapted from Tobias Baldauf’s work in compressing jpegs. We first tested compression using this bash script before running into some issues with speed. While this would do well for offline asynchronous optimizations of images, we wanted to keep ours synchronous so that users could see the same image on upload as they would when it was displayed to users. We were breaking 15-20 seconds on larger images by shelling out from PHP and so dug in a bit to look for some performance wins.

Part of the slowdown was that cjpeg-dssim was converting from jpeg to png to compare with dssim. Since the original bash script was written, though, dssim had been updated to compare jpegs directly, and we cut half the time or more by skipping the conversions each way. We also ported the code to PHP, as we already had the image blob in memory and could avoid some of the calls reading and writing the image to disk each time. Finally, we made the search a bit more conservative, limiting it to jpeg qualities between 70 and 85: we start at a quality of 78 and break out if we go above 85 or below 70, short-circuiting the comparisons. As we cut numerous sizes for an image, we could in theory perform this search for each resized image, but we rarely saw the resulting jpeg quality differ, so we use the quality calculated for one size on all cropped and resized images of a given upload.

As you’d expect, we’re adding work to our image processing and so had to take that into account. We believed that the additional upfront cost in time during the upload would pay for itself many times over. The speed savings in downloading images means an improvement in the time to render the important content on a page. We were adding 500ms to our processing time on average and about 800ms for the upper 90th percentile, which seemed within acceptable limits.

Time to calculate quality

This will vary, though, on the dimensions of the image you’re processing. We’re choosing an image we display often, which is 570px in width and a variable height (capped at 1500px). Larger images will take longer to process.

As we strive to Graph All the Things™, we tracked how the jpeg quality changed: from the constant 85, it now varies from image to image, settling at roughly 73 on average.

The corresponding graphs for bytes stored, as expected, dropped as well. We saw reductions often ranging from 25% to 30% in the size of images stored, comparing the file sizes of images in April of 2016 with those in April of 2017 on the same day:

Total file size of images in bytes in 1 week in April, 2016 vs 2017

Page load performance

So how does that stack up to page load on Etsy? As an example, we examined the listing page on Etsy, which displays items in several different sizes: a medium size when the page first loads, a large size when zoomed in, and several tiny thumbnails in the carousel as well as up to 8 on the right that show other items from the shop. Here is a breakdown of content downloaded (with some basic variations depending upon the listing fetched):

MIME Type      Bytes
js         6,119,491
image      1,022,820
font         129,496
css          122,731
html          86,744
other          3,838
Total      7,485,120

At Etsy, we frontload our static assets (js, fonts, css) and share them throughout the site, so they are often cached by the time a user reaches the listings page. Removing those three, which total 6,371,718 bytes, we’re left with 1,113,402 bytes, of which roughly 92% is image data. This means once a user has loaded the site’s assets, the vast majority of their download is the images we serve on a given page as they navigate to each listing. If we apply our estimate of 23% savings to images, reducing image data to 787,571 bytes and the page data minus assets to 878,153 bytes, then on a 2G mobile device (roughly 14.4 KB/s) it’s the difference between downloading a page in 77.3 seconds and in 60.8 seconds. For areas without 4G service, that’s a big difference! This of course is just one example page, and the ratio of image size to page size will vary depending upon the listing.
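The arithmetic behind those figures, assuming a 2G throughput of roughly 14.4 KB/s (14,400 bytes per second):

```python
page_bytes_before = 1_113_402   # listing page minus cached static assets
image_bytes = 1_022_820
savings = 0.23                  # estimated reduction in image bytes
page_bytes_after = page_bytes_before - image_bytes * savings  # 878,153

throughput = 14_400  # bytes per second, ~2G speeds
seconds_before = page_bytes_before / throughput   # roughly 77.3 s
seconds_after = page_bytes_after / throughput     # roughly 61 s
```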

Images that suffer more

Not all images are created equal. What works for a handful of images is not going to be universal. We saw that some images in particular have a greater drop in jpeg quality while also suffering in visual quality in our testing. Investigating, there are a few cases where this system could use improvement. In particular, jpeg quality typically suffers more when images contain line art or thin lines, text, and gradients, as the lossy compression algorithms for jpegs generally don’t play well in these contexts. Likewise, images that have a large background of a solid color and a small focus on the product suffer, as the compression runs heavily on large portions of empty space in the image and then over-compresses the focus of the image. This helped inform our lower threshold of 70 for jpeg quality, below which we decided further compression wasn’t worth the loss in visual quality.

The first uncompressed, the second compressed to jpeg quality 40.


Other wins

Page load speed isn’t the only win from minifying images; our infrastructure also benefits from reducing image file sizes. CDNs are a significant piece of our infrastructure, allowing for points of presence in various places in the world to serve static assets closer to the location of request and to offload some of the hits to our origin servers. Smaller images mean we can cache more files before running into eviction limits and reduce the amount we’re spending on CDNs. Our internal network infrastructure also scales further as we’re not pushing as many bytes around the line.


Availability of the tool and future work

We’re in the process of abstracting the code behind this into an open source tool to plug into your PHP code (or external service). One goal of writing this tool was to make sure that there was flexibility in determining what “good” is, as that can be very subjective to users and their needs. Perhaps you’re working with very large images and are willing to handle more compression. The minimum and maximum jpeg quality values allow you to fine tune the amount of compression that you’re willing to accept. On top of that, the threshold constants we use to narrow in when finding the “sweet spot” with dssim are configurable, giving you leverage to adjust in another direction. In the meantime, following the above guidelines on using a binary search approach combined with dssim should get you started.



Not all of the challenges behind these adjustments are strictly technical. Setting the jpeg quality of an image with most software applications is straightforward, as it is often an additional parameter to include when converting images. Deploying such a change, though, requires extensive testing over many different kinds of images. Knowing your use cases and the spectrum of images that make up your site is critical to such a change. With such a shift, knowing how to roll back, even if it requires a backfill, can also be useful as a safety net. Lastly, there’s merit in knowing that tweaking images is rarely “finished”. Some users like heavy sharpening on their images, while others might want some soft blending when resizing. Keeping this in mind was important in figuring out how we can continue to improve the experience our users have, both in terms of overall page speed and the display of their products.


How Etsy Ships Apps

Posted by on May 15, 2017 / 7 Comments

In which Etsy transforms its app release process by aligning it with its philosophy for web deploys

illustration of a boat on a wave. There are several people in the boat, and the boat has a flag with an "E" on it.

Anchors Aweigh

Deploying code should be easy. It should happen often, and it should involve its engineers. For Etsyweb, this looks like continuous deployment.

A group of engineers (which we call a push train) and a designated driver all shepherd their changes to a staging environment, and then to production. At each checkpoint along that journey, the members of the push train are responsible for testing their changes, sharing that they’re ready to ship, and making sure nothing broke. Everyone in that train must work together for the safe completion of their deployment. And this happens very frequently: up to 50 times a day.

                                               TOPIC: clear
 mittens> .join                                TOPIC: mittens
   sasha> .join with mittens                   TOPIC: mittens + sasha
 pushbot> mittens, sasha: You're up            TOPIC: mittens + sasha
   sasha> .good                                TOPIC: mittens + sasha*
 mittens> .good                                TOPIC: mittens* + sasha*
 pushbot> mittens, sasha: Everyone is ready    TOPIC: mittens* + sasha*
  nassim> .join                                TOPIC: mittens* + sasha* | nassim
 mittens> .at preprod                          TOPIC: <preprod> mittens + sasha | nassim
 mittens> .good                                TOPIC: <preprod> mittens* + sasha | nassim
   sasha> .good                                TOPIC: <preprod> mittens* + sasha* | nassim
 pushbot> mittens, sasha: Everyone is ready    TOPIC: <preprod> mittens* + sasha* | nassim
 mittens> .at prod                             TOPIC: <prod> mittens + sasha | nassim
 mittens> .good                                TOPIC: <prod> mittens* + sasha | nassim
     asm> .join                                TOPIC: <prod> mittens* + sasha | nassim + asm
   sasha> .good                                TOPIC: <prod> mittens* + sasha* | nassim + asm
     asm> .nm                                  TOPIC: <prod> mittens* + sasha* | nassim
 pushbot> mittens, sasha: Everyone is ready    TOPIC: <prod> mittens* + sasha* | nassim
 mittens> .done                                TOPIC: nassim
 pushbot> nassim: You're up                    TOPIC: nassim
    lily> .join                                TOPIC: nassim | lily

This strategy has been successful for a lot of reasons, but especially because each deploy is handled by the people most familiar with the changes that are shipping. Those that wrote the code are in the best position to recognize it breaking, and then fix it. Because of that, developers should be empowered to deploy code as needed, and remain close to its rollout.

App releases are a different beast. They don’t easily adapt to that philosophy of deploying code. For one, they have versions and need to be compiled. And since they’re distributed via app stores, those versions can take time to reach end users. Traditionally, these traits have led to strategies involving release branches and release managers. Our app releases started out this way, but we learned quickly that they didn’t feel very Etsy. And so we set out to change them.

Jen and Sasha

photo of Sasha Friedenberg and Jen Macchiarelli

We were the release managers. Jen managed the Sell on Etsy apps, and I managed the Etsy apps. We were responsible for all release stage transitions, maintaining the schedule, and managing all the communications around releases. We were also responsible for resolving conflicts and coordinating cross-team resources in cases of bugs and urgent blockers to release.

Ready to Ship

A key part of our job was making sure everyone knew what they were supposed to do and when they were supposed to do it. The biggest such checkpoint is when a release branches—this is when we create a dedicated branch for the release off master, and master becomes the next release. This is scheduled, and it determines what changes make it into production for a given release. It’s very important to make sure that those changes are expected, and that they have been tested.

For Jen and me, it would’ve been impossible to keep track of the many changes in a release ourselves, and so it was our job to coordinate with the engineers that made the actual changes and make sure those changes were expected and tested. In practice, this meant sending emails or messaging folks when approaching certain checkpoints like branching. And likewise, if there were any storm warnings (such as show-stopping bugs), it was our responsibility to raise the flag to notify others.

list of words indicating various responsibilities for release managers

Then Jen left Etsy for another opportunity, and I became a single-point-of-failure and a gatekeeper. Every release decision was funneled through me, and I was the only person able to make and execute those decisions.

I was overwhelmed. Frustrated. I was worried I’d be stuck navigating iTunes Connect and Google Play, and sending emails. And frankly, I didn’t want to be doing those things. I wanted those things to be automated. Give me a button to upload to iTunes Connect, and another to begin staged rollout on Google Play. Thinking about the ease of deploying on web just filled me with envy.

This time wasn’t easy for engineers either. Even back when we had two release managers, from an engineer’s perspective app releases weren’t transparent. It was difficult to know what phase of a release we were in. A large number of emails were sent, but few of them were targeted to those that actually needed them. We would generically send emails to one big list that included all four of our apps, and all kinds of emails would get sent there: things that were FYI-only, and also things that required urgent attention. We were on the path to alert fatigue.

All of this meant that engineers felt more like they were in the cargo hold, rather than in the cockpit. But that just didn’t fit with how we do things for web. It didn’t fit with our philosophy for deployment. We didn’t like it. We wanted something better, something that placed engineers in front of the tiller.


screenshot showing Etsy's Ship's main page

So we built a vessel that coordinates the status, schedule, communications, and deploy tools for app releases. Here’s how Ship helps:

It’s hard to imagine all of that abstractly, so here’s an example:

Captain’s Log







Before Ship, all of these components above would’ve been performed manually. But you’ll notice that release managers are missing from the above script; have we replaced release managers with all the automations in Ship?

photo of Etsy listing featuring a flag used by nautical ships to communicate: the flag says "I require a pilot"

Release Drivers

Partially. Ship has a feature where each release is assigned a driver.

screenshot showing Etsy's Ship's driver section. The driver in this case is Hannah Mittelstaedt

This driver is responsible for a bunch of things that we couldn’t or shouldn’t automate. Here’s what they’re responsible for:

Everything else? That’s automated. Branching, release candidate generation, submission to iTunes Connect — even staged rollout on Google Play! But, we’ve learned from automation going awry before. By default, some things are set to manual. There are others for which Ship explicitly does not allow automation, such as continuing staged rollout on Google Play. Things like this should involve and require human interaction. For everything else that is automated, we added a failsafe: at any time, a driver can disable all the crons and take over driving from autopilot:

screenshot of Etsy's Ship showing the various automation options that can be set on a per-releases basis.

When a driver wants to do something manually, they don’t need access to iTunes Connect or Google Play, as each of these things is made accessible as a button. A really nice side effect of this is that we don’t have to worry about provisioning folks for either app store, and we have a clear log of every release-related action taken by drivers.

Drivers are assigned once a release moves onto master, and are semi-randomly selected based on previous drivers and engineers that have committed to previous releases. Once assigned, we send them an onboarding email letting them know what their responsibilities are:

screenshot of Etsy's Ship's driver onboarding email

Ready to Ship Again

The driver can remain mostly dormant until the day of branching. A couple hours before we branch, it’s the driver’s responsibility to make sure that all the impacting engineers are ready to ship, and to orchestrate efforts when they’re not. After we’re ready, the driver’s responsibility is to remain available as a point-of-contact while final testing takes place. If an issue comes up, the driver may be consulted for steps to resolve.

And then, assuming all goes well, comes release day. The driver can opt to manually release, or let the cron do this for them — they’ll get notified if something goes wrong, either way. Then a day after we release, the driver looks at all of our dashboards, logs, and graphs to confirm the health of the release.


But not all releases are planned. Things fail, and that’s expected. It’s naïve to assume some serious bug won’t ship with an app release. There are plenty of things that can and will be the subject of a post-mortem. When one of those things happens, any engineer can spawn a bugfix release off the most-recently-released mainline release.

diagram showing how a bugfix branch comes off of an existing release branch and not master

The engineer that requests this bugfix gets assigned as the driver for that release. Once they branch the release, they make the necessary bugfixes (others can join in to add bugfixes too, if they coordinate with the driver) in the release’s branch, build a release candidate, test it, and get it ready for production. The driver can then release it at will.

State Machine

Releases are actually quite complicated.

diagram indicating the state machine that powers Etsy's Ship

It starts off as an abstract thing that will occur in the future. Then it becomes a concrete thing actively collecting changes via commits on master in git. After this period of collecting commits, the release is considered complete and moves into its own dedicated branch. The release candidate is then built from this dedicated branch, thoroughly tested, and moved into production. The release itself then concludes as an unmerged branch.

Once a release branches, the next future release moves onto master. Each release is its own state machine, where the development and branching states overlap between successive releases.
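The lifecycle above can be modeled as a simple forward-only state machine. The state names here are paraphrased from the description, not taken from Ship's actual code:

```python
RELEASE_STATES = [
    "future",        # abstract, scheduled for later
    "on_master",     # actively collecting commits
    "branched",      # moved to its own dedicated branch
    "testing",       # release candidate built and under test
    "in_prod",       # released to the app stores
    "concluded",     # lives on as an unmerged branch
]

class Release:
    def __init__(self, version: str):
        self.version = version
        self._idx = 0

    @property
    def state(self) -> str:
        return RELEASE_STATES[self._idx]

    def advance(self) -> str:
        # Releases only move forward, one state at a time.
        if self._idx < len(RELEASE_STATES) - 1:
            self._idx += 1
        return self.state

release = Release("4.12.0")   # hypothetical version number
release.advance()             # on_master: collecting commits
release.advance()             # branched: the next release takes master
```

In Ship, two such machines overlap: as one release leaves the collecting state by branching, the next one enters it on master.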

Notifications: Slack and Email

screenshot showing a notification sent via Slack

Plugged into the output of Ship are notifications. Because there are so many points of interest en route to production, it’s really important that the right people are notified at the right times. So we use the state machine of Ship to send out notifications to engineers (and other subscribers) based on how much they asked to know, and how they impacted the release. We also allow anyone to sign up for notifications around a release. This is used by product managers, designers, support teams, engineering managers, and more. Our communications are very targeted to those that need or want them.

In terms of what they asked to know, we made it very simple to get detailed emails about state changes to a release:

screenshot of Etsy's Ship showing the subscription options for engineers and non-engineers on a per-release basis

In terms of how they impacted the release, we need to get that data from somewhere else.


We mentioned data Ship receives from outside sources. At Etsy, we use GitHub for our source control. Our apps have repos per-platform (Android and iOS). In order to keep Ship’s knowledge of releases up-to-date, we set up GitHub Webhooks to notify Ship whenever changes are pushed to the repo. We listen for two changes in particular: pushes to master, and pushes to any release branch.

When Ship gets notified, it iterates through the commits and uses the author, changed paths, and commit message to determine which app (buyer or seller) the commit affects, and which release we should attribute this change to. Ship then takes all of that and combines it into a state that represents every engineer’s impact on a given release. Is that engineer “user-impacting” or “dark” (our term for changes that aren’t live)? Ship then uses this state to determine who is a member of what release, and who should get notified about what events.
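A sketch of how such a webhook handler might attribute commits. The payload shape follows GitHub's push-event format, but the repo layout and the `[dark]` commit-message convention are made up for illustration:

```python
def classify_commit(commit: dict) -> dict:
    """Map one push-event commit to an app and an engineer impact state."""
    paths = commit.get("added", []) + commit.get("modified", [])
    # Which app does this touch? (hypothetical repo layout)
    app = "seller" if any(p.startswith("seller/") for p in paths) else "buyer"
    # Engineers can mark changes that aren't live as "dark" via the message.
    impact = "dark" if "[dark]" in commit["message"].lower() else "user-impacting"
    return {"author": commit["author"]["username"], "app": app, "impact": impact}

# Example push-event payload, trimmed to the fields used above:
payload = {
    "ref": "refs/heads/master",
    "commits": [{
        "author": {"username": "mittens"},
        "message": "Fix checkout flow [dark]",
        "modified": ["seller/checkout.php"],
        "added": [],
    }],
}
statuses = [classify_commit(c) for c in payload["commits"]]
```

Ship would combine states like these across a release to decide who belongs to it and who gets notified; an engineer can always override their own status afterward.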

Additionally, at any point during a release, an engineer can change their status. They may want to do this if they want to receive more information about a release, or if Ship misunderstood one of their commits as being impacting to the release.


Everything up until now has explained how Ship keeps track of things. But we haven’t yet explained how the automated actions affecting the app repo, or things outside Etsy, occur.

We have a home-grown tool for managing deploys called Deployinator, and we added app support. It can now perform mutating interactions with the app repos, as well as all the deploy actions related to Google Play and iTunes Connect. This is where we build the testing candidates, release candidate, branch the release, submit to iTunes Connect, and much more.

We opted to use Deployinator for a number of reasons:

In our custom stack, we have crons. This is how we branch on Tuesday evening (assuming everyone is ready), and it is where we interface with Google Play and iTunes Connect. We make use of Google Play’s official API via a custom Python module we wrote, and for iTunes Connect we use Spaceship to interface with the unofficial API.


The end result of Ship is that we’ve distributed release management. Etsy no longer has any dedicated release managers. But it does have an engineer who used to be one — and I even get to drive a release every now and then.

People cannot be fully automated away. That applies to our web deploys, and is equally true for app releases. Our new process works within that reality. It’s unique because it pushes the limit of what we thought could be automated. Yet, at the same time, it empowers our app engineers more than ever before. Engineers control when a release goes to prod. Engineers decide if we’re ready to branch. Engineers hit the buttons.

And that’s what Ship is really about. It empowers our engineers to deliver the best apps for our users. Ship puts engineers at the helm.


Modeling Spelling Correction for Search at Etsy

Posted by on May 1, 2017 / 4 Comments


When a user searches for an item on Etsy, they don’t always type what they mean. Sometimes they type the query jewlery when they’re looking for jewelry; sometimes they just accidentally hit an extra key and type dresss instead of dress. To make their online shopping experience successful, we need to identify and fix these mistakes, and display search results for the canonical spelling of the actual intended query. With this motivation in mind, and through the efforts of the Data Science, Linguistic Tools, and the Search Infrastructure teams, we overhauled the way we do spelling correction at Etsy late last year.

The older service was based on a static mapping of misspelled words to their respective corrections, which was updated only infrequently. It would split a search query into its tokens, and replace any token that was identified as a misspelling with its correction. Although the service allowed a very fast way of retrieving a correction, it was not an ideal solution for a few reasons:

  1. It could only correct misspellings that had already been identified and added to the map; it would leave previously unseen misspellings uncorrected.
  2. There was no systematic process in place to update the misspelling-correction pairs in the map.
  3. There was no mechanism for inferring a word from the context of surrounding words. For instance, if someone intended to type butter dish, but instead typed buttor dish, the older system would correct that to button dish, since buttor is a more commonly observed misspelling of button at Etsy.
  4. Additionally, the older system would not allow one-to-many mappings from misspellings to corrections. Storing both buttor -> button and buttor -> butter mappings simultaneously would not be possible without modifying the underlying data structure.

We therefore decided to upgrade the spelling correction service to one based on a statistical model. This model learns from historical user data on Etsy’s website and uses the context of surrounding words to offer the most probable correction.

Although this blog post outlines some aspects of the infrastructure components of the service, its main focus is on describing the statistical model used by it.


We use a model that is based upon the Noisy Channel Model, which was historically used to infer telegraph messages that got distorted over the line. In the context of a user typing an incorrectly spelled word on Etsy, the “distortion” could be from accidental typos or a result of the user not knowing the correct spelling. In terms of probabilities, our goal is to determine the probability of the correct word, conditional on the word typed by the user. That probability, as per Bayes’ rule, is proportional to the product of two probabilities:
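Writing c for the intended (correct) word and t for the word the user actually typed, Bayes’ rule gives:

    P(c | t) ∝ P(t | c) × P(c)

The first factor on the right is how likely the typo is given the intended word (estimated by our error model), and the second is how likely the intended word is in the first place (estimated by our language model).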

We are able to estimate the probabilities on the right-hand side of the relation above, using historical searches performed by our users. We describe this in more detail below.

To extend this idea to multi-word phrases, we use a simple Hidden Markov Model (HMM) which determines the most probable sequence of tokens to suggest as a spelling correction. Markov refers to being able to predict the future state of the system based solely on its present state. Hidden refers to the system having hidden states that we are trying to discover. In our case, the hidden states are the correct spellings of the words the user actually typed.

The main components of an HMM are explained via the figure below. Assume, for the purpose of this post, that the user searched for Flower Girl Baske when they intended to search for Flower Girl Basket. The main components of an HMM are then as follows:

  1. Observed states: the words explicitly typed by the user, Flower Girl Baske (represented as circles).
  2. Hidden states: the correct spellings of the words the user intended to type, Flower Girl Basket (represented as squares).
  3. Emission probability: the conditional probability of observing a state given the hidden state.
  4. Transition probability: the probability of observing a state conditional on the immediately previous observed state. For instance, in the figure below, the probability of transitioning from Girl to Baske is P(Baske|Girl). The fact that this probability does not depend on Flower, the state that precedes Girl, illustrates the Markov property of the model.

Once the spelling correction service receives the query, it first splits the query into three tokens, and then suggests possible corrections for each of the tokens as shown in the figure below.

In each column, one of the correction possibilities represents the true hidden state of the token that was typed (emitted) by the user. The most probable sequence can be thought of as identifying the true hidden state for each token given the context of the previous token. To accomplish this, we need to know probabilities from two distributions: first, the probability of typing a misspelling given the correct spelling of the word, the emission probability, and, second, the probability of transitioning from one observed word to another, the transition probability.

Once we are able to calculate all the emission probabilities, and the transition probabilities needed, as described in the sections below, we can determine the most probable sequence using a common dynamic programming algorithm known as the Viterbi Algorithm.

Error Model

The emission probability is estimated through an Error Model, a statistical model built from historical corrections implicitly provided by our own users while making searches on Etsy. The heuristic we used is based on two conditions: a user’s search being followed immediately by another, similar search, and that second search then leading to a listing click. We align the tokens of the two queries to each other in a way that minimizes the number of edits needed to transform the misspelled first query into the corrected second query.

We then make aligned splits on the characters, and calculate counts and associated conditional probabilities. We generate character-level probabilities for four types of operations: substitution, insertion, deletion, and transposition. Of these, insertion and deletion also require us to keep track of the context of neighboring letters: the probability of adding a letter after another, or removing a letter appearing after another, respectively.

To continue with the earlier example, when the user typed baske, the emitted token, they were probably trying to type basket, the hidden state for baske, which corresponds to the probability P(Baske|Basket). Error probabilities for tokens, such as these, are calculated by multiplying the character-level probabilities which are assumed to be independent. For instance, the probability of correcting the letter e by appending the letter t to it is given by:
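In terms of the aligned correction counts mined from historical searches:

    P(et | e) = count(e → et) / count(e → e followed by any letter)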

Here, the numerator is the number of times in our dataset where we see an e being corrected to an et, and the denominator is the number of corrections where any letter was appended after e. Since we assume that the probability associated with a character remaining unchanged is 1, in our case, the probability P(Baske|Basket) is solely dependent on the insertion probability P(et|e).
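A toy sketch of that computation (the counts here are made up for illustration; the real ones come from the aligned historical corrections):

```python
from collections import Counter

# Hypothetical character-level insertion counts mined from aligned
# (misspelling, correction) query pairs: ("e", "et") means the letter e
# was corrected by appending a t after it.
insertion_counts = Counter({("e", "et"): 80, ("e", "er"): 15, ("e", "es"): 5})

def insertion_prob(orig: str, corrected: str) -> float:
    """P(corrected | orig) among all insertions made after `orig`."""
    appended_after_orig = sum(
        count for (o, _), count in insertion_counts.items() if o == orig
    )
    return insertion_counts[(orig, corrected)] / appended_after_orig

insertion_prob("e", "et")  # 80 / (80 + 15 + 5) = 0.8
```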

Language Model

Transition probabilities are estimated through a language model which is determined by calculating the unigram and bigram token frequencies seen in our historical search queries.

A specific instance of that, from the chosen example, would be the probability of going from one token in the Flower (first) column, say Flow, to a token, say Girl, in the Girl (second) column, which is represented as P(Girl|Flow). We are able to determine that probability from the following ratio of bigram token counts to unigram token counts:
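With count(·) taken over our historical search queries:

    P(Girl | Flow) = count("Flow Girl") / count("Flow")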

The error models and language models are generated through Scalding-powered jobs that run on our production Hadoop cluster.

Viterbi Algorithm Heuristic

To generate inferences using these models, we employ the Viterbi Algorithm to determine the optimal query, i.e., sequence of tokens to serve as a spelling correction. The main idea is to present the most probable sequence as the spelling correction. The iterative process goes, as per the figure above, from the first column of tokens to the last. At the nth iteration, we have the sequence with the maximum probability available from the previous iteration, and we choose the token from the nth column which increases the maximum probability from the previous iteration the most.

Let’s explain this more concretely by describing the third iteration for our main example: assume that we already have Flower girl as the most probable phrase at the second iteration. Now we pick the token from the third column that corresponds to the maximum of the following transition probability and emission probability products:
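That is, for each candidate w in the third column (Basket, Baske, and so on), we evaluate the product

    P(w | Girl) × P(Baske | w)

and keep the w that maximizes it.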

We can see from the figure above that the transition probability from Girl to Basket is high enough to overcome the lower emission probability of someone typing Baske when they mean to type Basket, and that, consequently, Basket is the word we want to add to the existing Flower Girl phrase.
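A minimal Python sketch of this procedure (a simplified Viterbi that keeps the best partial sequence ending in each candidate; the probability functions below are toy stand-ins, not our production models):

```python
import math

def viterbi(observed, candidates, emission_p, transition_p):
    """Most probable sequence of intended tokens given the typed tokens.

    observed:     tokens the user typed, e.g. ["flower", "girl", "baske"]
    candidates:   per-token lists of correction suggestions
    emission_p:   emission_p(obs, hidden)    ~ P(typed obs | intended hidden)
    transition_p: transition_p(prev, hidden) ~ P(hidden | previous token)
    """
    # best[h] = (log-prob of best partial sequence ending in h, that sequence)
    best = {h: (math.log(emission_p(observed[0], h)), [h])
            for h in candidates[0]}
    for obs, cands in zip(observed[1:], candidates[1:]):
        new_best = {}
        for h in cands:
            # pick the predecessor maximizing path prob * transition prob
            scores = {p: best[p][0] + math.log(transition_p(p, h))
                      for p in best}
            prev = max(scores, key=scores.get)
            new_best[h] = (scores[prev] + math.log(emission_p(obs, h)),
                           best[prev][1] + [h])
        best = new_best
    return max(best.values())[1]

# Toy probabilities for the "flower girl baske" example: the strong
# Girl -> Basket transition outweighs Baske's higher emission probability.
emission = {("flower", "flower"): 1.0, ("girl", "girl"): 1.0,
            ("baske", "basket"): 0.1, ("baske", "baske"): 0.3}
transition = {("flower", "girl"): 1.0,
              ("girl", "basket"): 0.5, ("girl", "baske"): 0.01}
best_seq = viterbi(["flower", "girl", "baske"],
                   [["flower"], ["girl"], ["basket", "baske"]],
                   lambda o, h: emission[(o, h)],
                   lambda p, h: transition[(p, h)])
# best_seq == ["flower", "girl", "basket"]
```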

We now know that Flower Girl Basket is the most probable correction as predicted by our models. The decision about whether we suggest this correction to the user, and how we suggest it, is made on the basis of its confidence score. We describe that process a little later in the post.

Hyperparameters and Model Training

We added a few hyperparameters to the bare bones model we have described so far. They are as follows:

  1. We added an exponent to the emission probability because we want the model to have some flexibility in weighting the error model component in relation to the language model component.
  2. Our threshold for offering a correction is a linear function of the number of tokens in the query. We, therefore, have a slope and the intercept of the threshold as a pair of hyperparameters.
  3. Finally, we also have a parameter corresponding to the probability of words our language model does not know about.

Our training data set consists of two types of examples: the first kind are of the form misspelling -> correct spelling, like jewlery -> jewelry, while the second kind are of the form correct spelling -> _, like dress -> _. The second type corresponds to queries that are already correct, and therefore don’t need a correction. This setup enables our model to distinguish between the populations of both correct and incorrect spellings, and to offer corrections for the latter.

We tune our hyperparameters via 10-fold cross-validation using a hill-climbing algorithm that maximizes the f-score on our training data set. The f-score is the harmonic mean of the precision and the recall. Precision measures how many of the corrections offered by the system are correct and, for our data set, is defined as:
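In the standard form (the exact counting rules are specific to our labeled data set):

    Precision = (# corrections offered that are correct) / (# corrections offered)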

Recall is a measure of coverage — it tries to answer the question, “how many of the spelling corrections that we should be offering are we offering?”. It is defined as:
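In the same standard form:

    Recall = (# correct corrections offered) / (# corrections that should have been offered)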

Optimizing on the f-score allows us to increase coverage of the spelling corrections we offer while keeping the number of invalid corrections offered in check.

Serving the Correction

We serve three types of corrections at Etsy based on the confidence we have in the correction. We use the following odds-ratio as our confidence score:
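While the exact formula is not reproduced here, a natural odds-ratio given the models above compares the model’s score for the corrected query against its score for the query as typed:

    confidence = P(corrected query) / P(query as typed)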

If the confidence is greater than a certain threshold, we display search results corresponding to the correction, along with a “Search instead of” link to see the results for the original query instead. If the confidence is below a certain threshold, we offer results of the original query, with a “Did you mean” option to the suggested correction. If we don’t have any results for the original query, but do have results for the corrected query, we display results for the correction independent of the confidence score.

Spelling Correction Service

With the new service in place, when a search is made on Etsy, we fire off requests to both the new Java-based spelling service and the existing search service. The response from the spelling service includes both the correction and its confidence score, and how we display the correction, based on the confidence score, is described in the previous section. For high-confidence corrections, we make an additional request for search results with the corrected query, and display those results if their count is greater than a certain threshold.

When the spelling service receives a query, it first splits it into its constituent tokens. It then makes independent suggestions for each of the tokens, and, finally, strings together from those suggestions a sequence of the most probable tokens. This corrected sequence is sent back to the web stack as a response to the spelling service request.

The correction suggestions for each token are generated by Lucene’s DirectSpellChecker class which queries an index generated from tokens that are generated through a Hadoop job. The Hadoop job counts tokens from historical queries, rejecting any tokens that appear too infrequently or those that appear only in search queries with very few search results. We have a daily cron that ships the tokens, along with the generated Error and Language model files, from HDFS to the boxes that host the spelling correction service. A periodic rolling restart of the service across all boxes ensures that the freshest models and index tokens are picked up by the service.

Follow-up Work

The original implementation of the model was limited in the sense that it could not suggest corrections that were, at the token level, more than two edits away from the original token. We have since implemented an auxiliary framework for suggesting correction tokens that fixes this issue.

Another limitation was related to splitting and compounding of tokens in the original query to make suggestions. For instance, we were not able to suggest earring as a suggestion for ear ring. Some of our coworkers are working on modifying the service to accommodate corrections of this type.

Although we do use supervised learning to train the hyperparameters of our models, since launching the service we have gained additional user input with which we can improve them. Specifically, users clicking the “Did you mean” link on our low-confidence corrections results page provide explicit positive feedback, while clicks on the “Search instead for” link on our high-confidence corrections results page provide negative feedback. The next major evolution of the model will be to explicitly use this feedback to improve corrections.

(This project was a collaboration between Melanie Gin and Benjamin Russell on the Linguistic Tools team who built out the infrastructure and front-end; Caitlin Cellier and Mohit Nayyar on the Data Science team who worked on implementing the statistical model; and Zhi-Da Zhong on the Search Infrastructure team who consulted on infrastructure design. A special thanks to Caitlin whose presentation some of these figures are copied from.)


How Etsy Manages HTTPS and SSL Certificates for Custom Domains on Pattern

Posted by and on January 31, 2017 / 31 Comments

In April of 2016 Etsy launched Pattern, a new product that gives Etsy sellers the ability to create their own hosted e-commerce website. With an easy-setup experience, modern and stylish themes, and guest checkout, sellers can merchandise and manage their brand identity outside of the retail marketplace while leveraging all of Etsy’s e-commerce tools.

The ability to point a custom domain to a Pattern site is an especially popular feature; many Pattern sites use their own domain name, either registered directly on the Pattern dashboard, or linked to Pattern from a third-party registrar.

At launch, Pattern shops with custom domains were served over HTTP, while checkouts and other secure actions happened over secure connections. This model isn’t ideal, though; Google ranks pages served over SSL slightly higher, and plans to increase the bump it gives to sites with SSL. That’s a big plus for our sellers. Plus, it’s the year 2017 – we don’t want to be serving pages over boring old HTTP if we can help it … and we really, really like little green lock icons.

Securify that address bar!

In this post we’ll be looking at some interesting challenges you run into when building a system designed to serve HTTPS traffic for hundreds of thousands of domains.

How to HTTPS

First, let’s talk about how HTTPS works, and how we might set it up for a single domain.

Let’s not talk that much about how HTTPS works though, because that would take a long time. Very briefly: by HTTPS we mean HTTP over SSL, and by SSL we mean TLS. All of this is a way for your website to communicate securely with clients. When a client first begins communicating with your site, you provide them with a certificate, and the client verifies that your certificate comes from a trusted certificate authority. (This is a gross over-simplification; here’s a very interesting post with more detail.)

Suffice to say, if you want HTTPS, you need a certificate, and if you want a certificate you need to get one from a certificate authority. Until fairly recently, this was the point where you had to open up your wallet. There were a handful of certificate authorities, each with an array of confusing options and pricey up-sells.


What's the _deal_ with this cert pricing?!


You pick whichever feels right to you, you get out your credit card, and you wind up with a certificate that’s valid for one to three years. Some CAs offer APIs, but three years is a relatively long time, so chances are you do this manually. Plus, who even knows which cell in which SSL pricing matrix gets you API access?

From here, if you’re managing a limited number of domains, typically what you do is upload your certificate and private key to your load balancer (or CDN) and it handles TLS termination for you. Clients communicate with your load balancer over HTTPS using your domain’s public certificate, and your load balancer makes internal requests to your application.




And that’s it! Your clients have little green padlocks in their address bars, and your application is serving payloads just like it used to.

More certificates, more problems

Manually issuing certificates works well enough if your application is served off a handful of top-level domains. If you need certificates for lots and lots of domains, it’s not going to work. Have we mentioned that Pattern is available for the very reasonable price of $15 per month? A real baseline requirement for us is that we can’t pay more for a certificate than someone pays for a year of Pattern. And we certainly don’t have time to issue all those certificates by hand.

Until fairly recently, our only options at this point would have been 1) do a deal with one of the big certificate authorities that offers API access, or 2) become our own certificate authority. These are both expensive options (we don’t like expensive options).

Let’s Encrypt to the rescue

Luckily for us, in April of 2016,  the Internet Security Research Group (ISRG), which includes founders from the Mozilla Foundation and the Electronic Frontier Foundation, launched a new certificate authority named Let’s Encrypt.

The ISRG has also designed a communication protocol, Automated Certificate Management Environment (ACME), that covers the process of issuing and renewing certificates for TLS termination. Let’s Encrypt provides its service for free, and exclusively via an API that implements the ACME protocol. (You can read about it in more detail here.)

This is great for our project: we found a certificate authority we feel really good about, and since it implements a freely published protocol, there are great open source libraries out there for interacting with it.

Building out certificate issuance

Let’s re-examine our problem now that we have a good way to get certificates:

  1. We have a large quantity of custom domains attached to existing Pattern sites; we need to generate certificates for all of those.
  2. We have a constant stream of new domain registrations for new Pattern sites; we’ll need to get new certificates on a rolling basis.
  3. Let’s Encrypt certificates last 90 days; we need to renew our SSL certificates fairly frequently.

All of these problems are solvable now! We chose an open source ACME client library for PHP called AcmePHP. Another exciting consequence of Let’s Encrypt using an open protocol is that there are open source server implementations as well as client implementations. So we were able to spin up an internal equivalent of Let’s Encrypt called Boulder and do our development and testing inside our private network, without worrying about rate limits.




The service we built out to handle this ACME logic, store the certificates we receive, and keep track of which domains we have certificates for, we named CertService. It’s a service that deals with certificates, you see. CertService communicates with Let’s Encrypt, and also exposes an API for other internal Etsy services. We’ll go into more detail on that later in this post.

TLS termination for lots and lots of domains

Now that we’ve built out a service that can have certificates issued, we need a place to put them, and we need to use them to do TLS termination for HTTPS requests to Pattern custom domains.

We handle this for * by putting our certificates directly onto our load balancers and giving them to our CDNs. In planning this project, we thought about hundreds of thousands of certificates, each with a 90-day lifetime. If we did a good job spacing out issuing and renewing our certificates, that would add up to thousands of daily write operations to our load balancers and CDNs. That’s not a rate we were comfortable with; load balancer and CDN configuration changes are relatively high-risk operations.

What we did instead is create a pool of proxy servers, and use our load balancer to distribute HTTPS traffic to them. The proxy hosts handle the client SSL termination, and proxy internal requests to our web servers, much like the load balancer does for




Our proxy hosts run Apache and we leverage mod_ssl and mod_proxy to do TLS termination and proxy requests. To preserve client IP addresses, we make use of the PROXY protocol on our load balancer and mod_proxy_protocol on the proxy hosts. We’re also using mod_macro to avoid ever having to write out hundreds of thousands of virtual hosts declarations for hundreds of thousands of domains. All put together, it looks something like this:

<Macro VHost $domain>
<VirtualHost *:443>
    ProxyProtocol On
    ServerName $domain

    SSLEngine on
    SSLCertificateFile      $domain.crt
    SSLCertificateKeyFile   $domain.key
    SSLCertificateChainFile lets-encrypt-cross-signed.pem

    ProxyPass               /  https://internal-web-vip/
    ProxyPassReverse        /  https://internal-web-vip/
</VirtualHost>
</Macro>

Use VHost
Use VHost

To connect all this together, our proxy hosts periodically query CertService for a list of recently modified custom domains. Each host then 1) fetches the new certificates from CertService, 2) writes them to disk, 3) regenerates a config like the one above, and 4) does a graceful restart of Apache. These restarts are staggered across our proxy pool so all but one of the hosts is available and receiving requests from the load balancer at any given time (fingers crossed).

How do we securely store lots of certificates?

Now that we’ve figured out how to programmatically request and renew SSL certificates via LetsEncrypt, we need to store these certificates in a secure way. To do this, there are some guarantees we need to make:

  1. Private keys are stored in a database segmented from other types of data
  2. Private keys are encrypted at rest and never leave CertService in plaintext
  3. SSL key pair generation and LetsEncrypt communications take place only on trusted hosts
  4. Private keys can be retrieved only by the SSL terminating hosts

Guarantee #1 is a no-brainer. If an attacker were to compromise a datastore containing thousands of SSL private keys in plaintext, they would be able to intercept critical data being sent to thousands of custom domains. Since security is about raising costs for an attacker – that is, making it harder for an attacker to succeed – we employ a number of techniques to secure our keys. Our first layer of defense is at an infrastructure level: private keys are stored in a MySQL database segmented away from the rest of the network. We use iptables to limit who can connect to the MySQL server, and given that CertService is the only client that needs access, the scope is really narrow. This vastly reduces the attack surface, especially in cases where an attacker is looking to pivot from another compromised server on the network. Iptables is also then used to lock down who can communicate with the CertService API; adding constraints on connectivity on top of a secure authentication scheme makes retrieving certificates that much more difficult. That addresses Guarantee #4.

Now that we’ve locked down access to the database, we need to make sure they’re stored encrypted. For this, we make use of a concept known as a hybrid cryptosystem. Hybrid cryptosystems, in a nutshell, combine asymmetric (public-key crypto) and symmetric cryptosystems. If you’re familiar with SSL, much of how we handle crypto here is analogous to Session Keys.

At the start of this process, we have two pieces of data: the SSL private key and its corresponding public key – the certificate. We don’t particularly care about the certificate since that is public by definition. We start by generating a domain specific AES-256 key and encrypt the SSL private key. This only technically addresses the issue of not having plaintext on disk; the encrypted SSL private key is stored right next to the AES key, which can be used to both encrypt and decrypt. An attacker who could steal the encrypted keys could also steal the AES key. To address this, we encrypt the AES key with a CertService public key. Now we have an encrypted SSL private key (encrypted with the AES-256 key) and an encrypted AES key (encrypted with CertService’s RSA-2048 public key). Now not only are keys truly stored encrypted on disk, they also cannot be decrypted at all on CertService. This means if an attacker were to break CertService’s authentication scheme, the most they would receive is an encrypted SSL private key; they would still need the CertService private key – available only on the SSL terminating hosts – to decrypt it. Now we’ve fully taken care of Guarantee #2.
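To make the scheme concrete, here is an illustrative sketch (not Etsy’s actual code) of that hybrid seal/unseal flow, using Python’s `cryptography` package: a fresh AES-256-GCM key encrypts the SSL private key, and an RSA public key wraps the AES key.

```python
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

OAEP = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# CertService's RSA key pair; in our scheme the private half would live
# only on the TLS-terminating hosts, never on CertService itself.
cs_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
cs_public = cs_private.public_key()

def seal(ssl_private_key_pem: bytes):
    """Encrypt an SSL private key for storage at rest."""
    aes_key = AESGCM.generate_key(bit_length=256)   # per-domain AES-256 key
    nonce = os.urandom(12)
    ciphertext = AESGCM(aes_key).encrypt(nonce, ssl_private_key_pem, None)
    wrapped_key = cs_public.encrypt(aes_key, OAEP)  # wrap AES key with RSA
    return wrapped_key, nonce, ciphertext           # all safe to store

def unseal(wrapped_key, nonce, ciphertext):
    """Recover the SSL private key; only a holder of cs_private can."""
    aes_key = cs_private.decrypt(wrapped_key, OAEP)
    return AESGCM(aes_key).decrypt(nonce, ciphertext, None)
```

Stealing the stored rows now yields only ciphertext; decrypting it requires the RSA private key held elsewhere.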




The only Guarantee that remains is #3. If key generation were compromised, an attacker would be able to grab private keys before they were encrypted and stored. If LetsEncrypt communication were compromised, an attacker could use our keys to generate certificates for domains we’ve already authorized (they could technically authorize new ones, but that would be significantly more difficult) or even revoke certificates. Both of these cases would render the entire system untrustworthy. To prevent this, we limit this functionality to CertService and expose it as an API; that way, if the web server handling Pattern requests were broken into, the attacker would not be able to affect critical LetsEncrypt flows.




One of our stretch goals is to look into deploying HSMs. If there are bugs in the underlying software, the integrity of the entire system could be compromised, voiding any guarantees we try to keep. While bugs are inevitable, moving critical cryptographic functions into secure hardware would mitigate their impact.

No cryptosystem is perfect, but we’ve reached our goal of significantly increasing attacker cost. On top of that, we’ve supplemented the cryptosystem with our usual host based alerting and monitoring. So not only will an attacker have to jump through several hoops to get those SSL key pairs, they will also have to do it without being detected.

After the build

With all of that wrapped up, we had a system to issue large numbers of certificates, securely store this data, and terminate TLS requests with it. At this point Etsy brought in a third-party security firm to do a round of penetration testing. This process finished up without finding any substantive security weaknesses, which gives us an added level of confidence in our system.

Once we’ve gained enough confidence, we will enable HSTS. This should be the final goal of any SSL rollout as it forces browsers to use encryption for all future communication. Without it, downgrade attacks could be used to intercept traffic and hijack session cookies.

Every Pattern site and linked domain now has a valid certificate stored in our system and ready to facilitate secure requests. This feature is rolled out across Pattern and all Pattern traffic is now pushed over HTTPS!

(This project was a collaboration between Ram Nadella, Andy Yaco-Mink and Nick Steele on the Pattern team; Omar and Ken Lee from Security; and Will Gallego from Operations. Big thanks to Dennis Olvany, Keyur Govande and Mike Adler for all their help.)


Optimizing Meta Descriptions, H1s and Title Tags: Lessons from Multivariate SEO Testing at Etsy

Posted by , and on January 25, 2017 / 20 Comments

We recently ran a successful split-testing title tag experiment to improve our search engine optimization (SEO), the results and methodology of which we shared in a previous Code as Craft post. In this post, we wanted to share some of our further learnings from our SEO testing. We decided to double down on the success of our previous experiment by running a series of further SEO experiments, one of which included changes to our meta descriptions, H1s, and title tags.


We found three surprising results, which we describe in the takeaways below.

Some notes on our methodology

For full details on the setup and methodology of our SEO split testing, please see the previous Code as Craft post referenced above.

For this particular test, we used six unique treatments and two control groups. The two control groups remained consistent with each other before and after the experiment.


To estimate the exact causal changes in visits, we most often used Causal Impact modeling, as implemented in Google’s CausalImpact package, to standardize the test buckets against the control buckets. In some cases, the difference-in-differences method was used instead because it was more reliable in estimating effect sizes when working with strong seasonality swings.

This experiment was also impacted by strong seasonality and external event effects related to the holidays, sports events and the US elections. The statistical modeling in our final experiment analysis was adjusted for these effects to ensure accurate measurement of the causal effects of the test variants.


Takeaway #1: Short Title Tags Win Again

The results of this experiment aligned with the findings from our previous SEO title tag experiments, where shorter title tags appeared to drive more visits. We have now validated our hypothesis that shorter title tags perform better in multiple separate experiments spanning many different variations, and feel quite confident that shorter title tags perform better (as measured by organic traffic) for Etsy. We hypothesize that this effect could be taking place through a number of different causal mechanisms:

  1.  Lower Levenshtein distance and/or higher percentage match to target search queries rewarded by Google’s search algorithm and thereby improving Etsy’s rankings in Google search results. Per Wikipedia: “the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.”
  2.  Shorter title tags, consisting only of the target search keyword, appear more relevant/enticing to search users
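For reference, Levenshtein distance is cheap to compute with dynamic programming; a small sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

levenshtein("jewlery", "jewelry")  # -> 2
```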

Takeaway #2: Meta Descriptions Matter

We found that changes in the meta description of a page can lead to statistically significant changes in visits. It appeared that longer, descriptive meta descriptions performed better and, conversely, that shorter, terse meta descriptions performed worse. We hypothesize that longer meta descriptions might perform better via two possible causal mechanisms:

  1.  Longer meta descriptions take up more real estate in a search results page, improving CTRs
  2.  Longer meta descriptions give an appearance of more authority or more content, improving CTRs

Takeaway #3: H1s matter

We found that an H1 change can have a statistically significant impact on organic search traffic. However, changes in the H1 section of a page appear to interact with changes in title tags in hard to predict ways. For example, in this experiment, a title tag change in a certain variant increased visits. However, when the title tag change was combined with an H1 change, the positive effect of the title tag change was dulled, even though an H1 change by itself in a different variant led to slight increases in visits. This highlights the importance of SEO testing before rolling out even seemingly minor changes to Etsy pages.

Accounting For Unexpected Events

The Donald Trump Effect

One example of an event we had to control for (among a number of others) was the “Donald Trump effect”. We observed a large bucket skew on November 9 and 10, 2016 in one of our test groups. Upon investigation, it was found that the skew was due to large increases (+2000% to +5180%) in daily visits to pages related to “Donald Trump” the day after the US Presidential elections. Although the spikes in traffic to these pages were short lived, lasting only several days, they nevertheless did have the potential to unduly bias or reduce the statistical significance of the results of our experiment. These pages were therefore removed and controlled for when conducting the causal impact analyses for this experiment.



Our SEO experiments illustrate the importance of running SEO tests before making changes to a web page and continue to offer surprising results. However, it is important to note that the results of our experiments are only true for Etsy and do not necessarily reflect best practices for sites generally. We would therefore encourage everyone to discover the strategy that works best for their website through rigorous SEO testing.


Etsy’s Debriefing Facilitation Guide for Blameless Postmortems

Posted by on November 17, 2016 / 15 Comments

In 2012, I wrote a post for the Code As Craft blog about how we approach learning from accidents and mistakes at Etsy. I wrote about the perspectives and concepts behind what is known (in the world of Systems Safety and Human Factors) as the New View on “human error.” I also wrote about what it means for an organization to take a different approach, philosophically, to learn from accidents, and that Etsy was such an organization.

That post’s purpose was to conceptually point in a new direction, and was, necessarily, void of pragmatic guidance, advice, or suggestions on how to operationalize this perspective. Since then, we at Etsy have continued to explore and evolve our understanding of what taking this approach means, practically. For many organizations engaged in software engineering, the group “post-mortem” debriefing meeting (and accompanying documentation) is where the rubber meets the road.

Many responded to that original post with a question:

“Ok, you’ve convinced me that we have to take a different view on mistakes and accidents. Now: how do you do that?”

As a first step to answer that question, we’ve developed a new Debriefing Facilitation Guide which we are open-sourcing and publishing.

We wrote this document for two reasons.

The first is to state emphatically that we believe a “post-mortem” debriefing should be considered first and foremost a learning opportunity, not a fixing one. All too often, when teams get together to discuss an event, they walk into the room with a story they’ve already rationalized about what happened. This urge to point to a “root” cause is seductive — it allows us to believe that we’ve understood the event well enough, and can move on towards fixing things.

We believe this view is insufficient at best, and harmful at worst, and that gathering multiple diverse perspectives provides valuable context, without which you are only creating an illusion of understanding. Systems safety researcher Nancy Leveson, in her excellent book Engineering A Safer World, has this to say on the topic:

“An accident model should encourage a broad view of accident mechanisms that expands the investigation beyond the proximate events: A narrow focus on operator actions, physical component failures, and technology may lead to ignoring some of the most important factors in terms of preventing future accidents. The whole concept of ‘root cause’ needs to be reconsidered.”

In other words, unless we treat a debriefing as a true exploration with open minds, paying attention to where and how we “look” to understand an event, we can easily miss out on truly valuable understanding. How and where we direct that attention begins with the debriefing facilitator.

The second reason is to help develop debriefing facilitation skills in our field. We wanted to provide practical guidance for debriefing facilitators as they think about preparing for, conducting, and navigating a post-event debriefing. We believe that organizational learning can only happen when objective data about the event (the type of data that you might put into a template or form) is placed into context with the subjective data that can only be gleaned by the skillful constructing of dialogue with the multiple, diverse perspectives in the room.

The Questions Are As Important As The Answers

In his book “Pre-Accident Investigations: Better Questions,” Todd Conklin sheds light on the idea that the focus on understanding complex work is not in the answers we assert, but in the questions we ask:

“The skill is not in knowing the right answer. Right answers are pretty easy. The skill is in asking the right question. The question is everything.”

What we learn from an event depends on the questions we ask as facilitators, not just the objective data we gather and put into a document. It is very easy to assume that the narrative of an accident can be drawn up from one person’s singular perspective, and that the challenge is what to do about it moving forward.

We do not believe that to be true. Here’s a narrative taken from real debriefing notes, generalized for this post:

“I don’t know,” the engineer said, when asked what happened. “I just wasn’t paying attention, I guess. This is on me. I’m sorry, everyone.” The outage had lasted only 9 minutes, but to the engineer it felt like a lifetime. The group felt a strange but comforting relief, ready to fill in the incident report with ‘engineer error’ and continue on with their morning work.

The facilitator was not ready to call it a ‘closed case.’

“Take us back to before you deployed, Phil…what do you remember? This looks like the change you prepped to deploy…” The facilitator displayed the code diff on the big monitor in the conference room.

Phil looked closely at the red and green lines on the screen and replied “Yep, that’s it. I asked Steve and Lisa for a code review, and they both said it looked good.” Steve and Lisa nodded their heads, sheepishly.

“So after you got the thumbs-up from Steve and Lisa…what happened next?” the facilitator continued.

“Well, I checked it in, like I always do,” Phil replied. “The tests automatically run, so I waited for them to finish.” He paused for a moment. “I looked on the page that shows the test results, like this…” Phil brought up the page in his browser, on the large screen.

“Is that a new dashboard?” Lisa asked from the back of the room.

“Yeah, after we upgraded the Jenkins install, we reverted the default test results page to the previous colors because the new ones were hard to read,” replied Sam, the team lead for the automated testing team.

“The page says eight tests failed,” Lisa replied. Everyone in the room squinted.

“No, it says zero tests failed, see…?” Phil said, moving closer to the monitor.

Phil hit control-+ on his laptop, increasing the size of the text on the screen. “Oh my. I swear that it said zero tests failed when I deployed.”

The facilitator looked at the rest of the group in the conference room. “How many folks in the room saw this number as a zero when Phil first put it up on the screen?” Most of the group’s hands went up. Lisa smiled.

“It looked like a zero to me too,” the facilitator said.

“Huh. I think because this small font puts slashes in its zeros and it’s in italics, an eight looks a lot like a zero,” Sam said, taking notes. “We should change that.”

As a facilitator, it would be easy to stop asking questions at Phil’s mea culpa. Without asking him to describe how he normally does his work, bringing us back to what he was doing at the time, what he was focused on, and what led him to believe the deploy would happen without issue, we might never have considered that the automated test results page could use some design changes to make the number of failed tests visually clearer.

In another case, an outage involved a very complicated set of non-obvious interactions between multiple components. During the debriefing, the facilitator asked each of the engineers who designed the systems and were familiar with the architecture to draw on a whiteboard the picture that they had in their minds when they think about how it all is expected to work.

When seen together, each of the diagrams from the different engineers painted a fuller picture of how the components worked together than if there was only one engineer attempting to draw out a “comprehensive” and “canonical” diagram.

The process of drawing this diagram together also brought the engineers to say aloud what they were and were not sure about, and that enabled others in the room to repair those uncertainties and misunderstandings.

Both of these cases support the perspective we take at Etsy, which is that debriefings can (and should!) serve multiple purposes, not just a simplistic hunt for remediation items.

By placing the focus explicitly on learning first, a well-run debriefing has the potential to educate the organization on not just what didn’t work well on the day of the event in question, but a whole lot more.

The key, then, to having well-run debriefings is to start treating facilitation as a skill in which to invest time and effort.

Influencing the Industry

We believe that publishing this guide will help other organizations think about the way that they train and nurture the skill of debriefing facilitation. Moreover, we also hope that it can continue the greater dialogue in the industry about learning from accidents in software-rich environments.

In 2014, someone who knew how critical I believe this topic to be alerted me to the United States Digital Service Play Books repo on Github, and to an issue opened there about incident review and post-mortems. I commented there that this was relevant to our interests at Etsy, and that at some point I would reach back out with the work we’ve done in this area.

It took two years, but now we’re following up and hoping to reopen the dialogue.

Our Debriefing Facilitation Guide repo is here, and we hope that it is useful for other organizations.