Deploying to Google Kubernetes Engine

Posted on June 5, 2018

Late last year, Etsy announced that we’ll be migrating our services out of self-managed data centers and into the cloud. We selected Google Cloud Platform (GCP) as our cloud provider and have been working diligently to migrate our services. Safely and securely migrating services to the cloud requires them to live in two places at once (on-premises and in the cloud) for some period of time.

In this article, I’ll describe our strategy specifically for deploying to a pair of Kubernetes clusters: one running in the Google Kubernetes Engine (GKE) and the other on-premises in our data center. We’ll see how Etsy uses Jenkins to do secure Kubernetes deploys using authentication tokens and GCP service accounts. We’ll learn about the challenge of granting fine-grained GKE access to your service accounts and how Etsy solves this problem using Terraform and Helm.

Deploying to On-Premises Kubernetes

Etsy, while new to the Google Cloud Platform, is no stranger to Kubernetes. We have been running our own Kubernetes cluster inside our data center for well over a year now, so we already have a partial solution for deploying to GKE, given that we have a system for deploying to our on-premises Kubernetes.

Our existing deployment system is quite simple from the perspective of the developer trying to deploy: just open up Deployinator and press a series of buttons! Each button is labeled with its associated deploy action, such as “build and test” or “deploy to staging environment.”

Under the hood, each button is performing some action, such as calling out to a bash script or kicking off a Jenkins integration test, or some combination of several such actions.

For example, the Kubernetes portion of a Search deploy calls out to a Jenkins pipeline, which subsequently calls out to a bash script to perform a series of “docker build”, “docker tag”, “docker push”, and “kubectl apply” steps.

Why Jenkins, then? Couldn’t we perform the docker/kubectl actions directly from Deployinator?

The key is in… the keys! In order to deploy to our on-premises Kubernetes cluster, we need a secret access token. We load the token into Jenkins as a “credential” such that it is stored securely (not visible to Jenkins users), but we can easily access it from inside Jenkins code.

Now, deploying to Kubernetes is a simple matter of looking up our secret token via Jenkins credentials and overriding the “kubectl” command to always use the token.

Our Jenkinsfile for deploying search services looks something like this:
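A minimal scripted-pipeline sketch of that shape (the credential ID, API server address, and script paths are illustrative assumptions; withKubernetesEnvs and deploy.sh are the names referenced below):

def withKubernetesEnvs(Closure body) {
    // Look up the secret cluster token and override kubectl for everything inside.
    withCredentials([string(credentialsId: 'onprem-k8s-deploy-token', variable: 'K8S_TOKEN')]) {
        withEnv(["KUBECTL=kubectl --server=https://k8s.onprem.example.com --token=${env.K8S_TOKEN}"]) {
            body()
        }
    }
}

node {
    checkout scm
    withKubernetesEnvs {
        stage('Build and push images') {
            sh './docker-build-push.sh search'
        }
        stage('Deploy search') {
            sh './deploy.sh search'
        }
    }
}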

All of the deploy.sh scripts above use environment variable $KUBECTL in place of standard calls to kubectl, and so by wrapping everything in our withKubernetesEnvs closure, we have ensured that all kubectl actions are using our secret token to authenticate with Kubernetes.

Declarative Infrastructure via Terraform

Deploying to GKE is a little different from deploying to our on-premises Kubernetes cluster, and one of the major reasons is our requirement that everything in GCP be provisioned via Terraform. We want to be able to declare each GCP project and all its resources in one place so that it is automatable and reproducible. We want it to be easy—almost trivial—to recreate our entire GCP setup again from scratch. Terraform allows us to do just that.

We use Terraform to declare every possible aspect of our GCP infrastructure. Keyword: possible. While Terraform can create our GKE clusters for us, it cannot (currently) create certain types of resources inside of those clusters. This includes Kubernetes resources which might be considered fundamental parts of the cluster’s infrastructure, such as roles and rolebindings.

Access Control via Service Accounts

Among the objects that are currently Terraformable: GCP service accounts! A service account is a special type of Google account which can be granted permissions like any other user, but is not mapped to an actual user. We typically use these “robot accounts” to grant permissions to a service so that it doesn’t have to run as any particular user (or as root!).

At Etsy, we already have “robot deployer” user accounts for building and deploying services to our data center. Now we need a GCP service account which can act in the same capacity.

Unfortunately, GCP service accounts only (currently) provide us with the ability to grant complete read/write access to all GKE clusters within the same project. We’d like to avoid that! We want to grant our deployer only the permissions that it needs to perform the deploy to a single cluster. For example, a deployer doesn’t need the ability to delete Kubernetes services—only to create or update them.

Kubernetes provides the ability to grant more fine-grained permissions via role-based access control (RBAC). But how do we grant that kind of permission to a GCP service account?

We start by giving the service account very minimal read-only access to the cluster. The service account section of the Terraform configuration for the search cluster looks like this:
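A sketch of what that might look like (the project name, account names, and the roles/container.viewer grant are illustrative assumptions, not Etsy's actual configuration):

resource "google_service_account" "search_deployer" {
  project      = "etsy-search-prod"
  account_id   = "search-deployer"
  display_name = "Search GKE deployer"
}

# Read-only visibility into the cluster; anything more is granted later
# through Kubernetes RBAC rather than broad project-level IAM roles.
resource "google_project_iam_member" "search_deployer_viewer" {
  project = "etsy-search-prod"
  role    = "roles/container.viewer"
  member  = "serviceAccount:${google_service_account.search_deployer.email}"
}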

We have now created a service account with read-only access to the GKE cluster. Now how do we associate it with the more advanced RBAC inside GKE? We need some way to grant additional permissions to our deployer by using a RoleBinding to associate the service account with a specific Role or ClusterRole.

Solving RBAC with Helm

While Terraform can’t (yet) create the RBAC Kubernetes objects inside our GKE cluster, it can be configured to call a script (either locally or remotely) after a resource is created.

Problem solved! We can have Terraform create our GKE cluster and the minimal deployer service account, then simply call a bash script which creates all the Namespaces, ClusterRoles, and RoleBindings we need inside that cluster. We can bind a role using the service account’s email address, thus mapping the service account to the desired GKE RBAC role.
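In Terraform terms, that hook can be a local-exec provisioner on the cluster resource. A sketch, with the script name and cluster details as assumptions:

resource "google_container_cluster" "search" {
  name               = "search-prod"
  zone               = "us-central1-a"
  initial_node_count = 3
  # ... the rest of the cluster configuration ...

  provisioner "local-exec" {
    command = "./bootstrap-gke-rbac.sh ${self.name}"
  }
}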

However, as Etsy has multiple GKE clusters which all require very similar sets of objects to be created, I think we can do better. In particular, each cluster will require service accounts with various types of roles, such as “cluster admin” or “deployer”. If we want to add or remove a permission from the deployer accounts across all clusters, we’d prefer to do so by making the change in one place, rather than modifying multiple scripts for each cluster.

Good news: there is already a powerful open source tool for templating Kubernetes objects! Helm is a project designed to manage configured packages of Kubernetes resources called “charts”.

We created a Helm chart and wrote templates for all of the common resources that we need inside GKE. For each GKE cluster, we have a yaml file which declares the specific configuration for that cluster using the Helm chart’s templates.

For example, here is the yaml configuration file for our production search cluster:
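A trimmed sketch of what such a values file might contain (the cluster name, namespaces, and emails are placeholders, not Etsy's actual configuration):

clusterName: search-prod
namespaces:
  - search
deployers:
  - name: search-deployer
    serviceAccountEmail: search-deployer@etsy-search-prod.iam.gserviceaccount.com
    namespaces:
      - search
clusterAdmins:
  - admin-user@example.com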

And here are the templates for some of the resources used by the search cluster, as declared in the yaml file above (or by nested references inside other templates)…
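For instance, a RoleBinding template tying each deployer service account to a "deployer" ClusterRole might look roughly like this (the ClusterRole itself would be defined in another template; all names are illustrative):

# templates/deployer-rolebinding.yaml
{{- range $deployer := .Values.deployers }}
{{- range $ns := $deployer.namespaces }}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: {{ $deployer.name }}
  namespace: {{ $ns }}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: deployer
subjects:
  # GKE maps a GCP service account into RBAC as a User identified by its email.
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: {{ $deployer.serviceAccountEmail }}
{{- end }}
{{- end }}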

When we are ready to apply a change to the Helm chart—or Terraform is applying the chart to an updated GKE cluster—the script which applies the configuration to the GKE cluster does a simple “helm upgrade” to apply the new template values (and only the changed values; Helm skips anything it detects as unchanged).
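That apply step is essentially a one-liner (release name, chart path, and values file are assumptions):

helm upgrade --install gke-cluster-core ./charts/gke-cluster-core -f clusters/search-prod.yaml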

Integrating our New System into the Pipeline

Now that we have created a service account which has exactly the permissions we require to deploy to GKE, we only have to make a few simple changes to our Jenkinsfile in order to put our new system to use.

Recall that we had previously wrapped all our on-premises Kubernetes deployment scripts in a closure which ensured that all kubectl commands use our on-premises cluster token. For GKE, we use the same closure-wrapping style, but instead of overriding kubectl to use a token, we give it a special kube config which has been authenticated with the GKE cluster using our new deployer service account. As with our secret on-premises cluster token, we can store our GCP service account key in Jenkins as a credential and then access it using Jenkins’ withCredentials function.

Here is our modified Jenkinsfile for deploying search services:
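A sketch of the GKE half, building on the withKubernetesEnvs helper sketched earlier (the credential ID, kubeconfig path, cluster name, zone, and project are illustrative assumptions):

def withGKEEnvs(Closure body) {
    // Authenticate as the deployer service account and point kubectl at a
    // dedicated kube config for the GKE cluster.
    withCredentials([file(credentialsId: 'gke-search-deployer-key', variable: 'GCP_KEY_FILE')]) {
        withEnv(["KUBECONFIG=${env.WORKSPACE}/.kube/gke-search", "KUBECTL=kubectl"]) {
            sh 'gcloud auth activate-service-account --key-file="$GCP_KEY_FILE"'
            sh 'gcloud container clusters get-credentials search-prod --zone us-central1-a --project etsy-search-prod'
            body()
        }
    }
}

node {
    checkout scm
    stage('Deploy search (on-premises)') {
        withKubernetesEnvs { sh './deploy.sh search' }
    }
    stage('Deploy search (GKE)') {
        withGKEEnvs { sh './deploy.sh search' }
    }
}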

And there you have it, folks! A Jenkins deployment pipeline which can simultaneously deploy services to our on-premises Kubernetes cluster and to our new GKE cluster by associating a GCP service account with GKE RBAC roles.

Migrating a service from on-premises Kubernetes to GKE is now (in simple cases) as easy as shuffling a few lines in the Jenkinsfile. Typically we would deploy the service to both clusters for a period of time and send a percentage of traffic to the new GKE version of the service under an A/B test. After concluding that the new service is good and stable, we can stop deploying it on-premises, although it’s trivial to switch back in an emergency.

Best of all: absolutely nothing has changed from the perspective of the average developer looking to deploy their code. The new logic for deploying to GKE remains hidden behind the Deployinator UI and they press the same series of buttons as always.

Thanks to Ben Burry, Jim Gedarovich, and Mike Adler who formulated and developed the Helm-RBAC solution with me.


The EventHorizon Saga

Posted on May 29, 2018

This is an epic tale of EventHorizon, and how we finally got it to a happy place.

EventHorizon is a tool we use to watch events streaming into our system. Events (also known as beacons) are basically clickstream data—a record of actions visitors take on our site, what content they saw, what experiments they were bucketed into, etc.

Events are sent primarily from our web & API servers (backend events) and web browsers (frontend events), and logged in Kafka. EventHorizon is primarily used as a debugging tool, to make sure that a new event you’ve added is working as expected, but also serves to monitor the health of our event system.

[Figure: EventHorizon UI, a screenshot of the web page with events streaming in]

EventHorizon is pretty simple; it’s only around 200 lines of Go code. It consumes messages from our main event topic on Kafka (“beacon-main”) and forwards them via WebSockets to any connected browsers. Ideally, the time between when an event is fired on the web site and when it appears in the EventHorizon UI is on the order of milliseconds.

EventHorizon has been around for years, and in the early days, everything was hunky-dory. But then, the lagging started.

Nobody Likes a Laggard

As with most services at Etsy, we have lots of metrics we monitor for EventHorizon. One of the critical ones is consumer lag, which is the age of the last message we’ve read from the beacon-main Kafka topic. This would normally be milliseconds, but occasionally it would start lagging into minutes or even hours.

[Figure: EventHorizon consumer lag climbing from zero to over 1.7 hours, then recovering]

Sometimes it would recover on its own; if not, restarting EventHorizon would often fix the problem, but only temporarily. Within anywhere from a few hours to a few weeks we’d notice lag time beginning to grow again.

We spent a lot of time poring over EventHorizon’s code, thinking we had a bug somewhere. It makes use of Go’s channels—perhaps there was a subtle concurrency bug that was causing a deadlock? We fiddled with that, but it didn’t help.

We noticed that we could sometimes trigger the lag if two users connected to EventHorizon at the same time. This clue led us to think that there was a bug somewhere in the code that sent events to the browsers. Something with Websockets? We considered rewriting it to use Server-sent Events, but never got around to that.

We also wondered if the sheer quantity of events we were sending to browsers was causing the problem. We updated EventHorizon to only send events from Etsy employees to browsers in production. Alas, the lag didn’t go away, although it seemed to have gotten a little better.

We eventually moved on to other things. We set up a Nagios alert for when EventHorizon started lagging, and just restarted it when it got bad. Since it would often be fine for 2-3 weeks before lagging, spending more time trying to fix it wasn’t a top priority.

Orloj Joins the Fray

In September 2017 EventHorizon lag had gotten really bad. We would restart it and it would just start lagging again immediately. At some point we even turned off the Nagios alert.

However, another system, Orloj (pronounced “OR-loy”, named after the Prague astronomical clock), had started experiencing lag as well. Orloj is another Kafka consumer, responsible for updating the Recently Viewed Listings that are shown on the homepage when you are signed in. As Orloj is a production system, figuring out what was happening became much more urgent.

Orloj’s lag was a little different: lag would spike once an hour, whenever the Hadoop job that pulls beacons down from Kafka into HDFS ran, and at certain times of the day it would be quite significant.

[Figure: Orloj lagging periodically]

It turned out that due to a misconfiguration, KafkaPullJob, which was only supposed to launch one mapper per Kafka partition (of which we have 144 for beacon-main), was actually launching 400 mappers, which was swamping the network. We fixed this, and Orloj was happy again.

For about a week.

Trouble with NICs

Orloj continued to have issues with lag. While digging into this, I realized that the machines in the Kafka clusters only had 1G network interfaces (NICs), whereas 10G NICs were standard in most of our infrastructure. I talked to our networking operations team to ask about upgrading the cluster and one of the engineers asked what was going on with one particular machine, kafkautil01. The network graph showed that its bandwidth was pegged at 100%, and had been for a while. kafkautil01 also had a 1G NIC. And that’s where EventHorizon ran.

A light bulb exploded over my head.

When I relayed this info to Kevin Gessner, the engineer who wrote Orloj, he said, “Oh yeah, consuming beacon-main requires at least 1.5 Gbps.” Suddenly it all made sense.

Beacon traffic fluctuates in proportion to Etsy site traffic, which is cyclical. Parts of the day were under 1 Gbps, parts over, and when it went over, EventHorizon couldn’t keep up and would start lagging. And we were going over more and more often as Etsy grew.

And remember the bit about two browsers connecting at once triggering lag? With EventHorizon forwarding the firehose of events to each browser, that was also a good way to push the network bandwidth over 1 Gbps, triggering lag.

We upgraded the Kafka clusters and the KafkaUtil boxes to 10G NICs and everything was fixed. No more lag!

Ha, just kidding.

Exploding Events

We did think it was fixed for a while, but EventHorizon and Orloj would both occasionally lag a bit, and it seemed to be happening more frequently.

While digging into the continuing lag, we discovered that the size of events had grown considerably. Looking at a graph of event sizes, there was a noticeable uptick around the end of August.

[Figure: average event beacon size climbing from around 5 KB to 7 KB over a couple of months]

This tied into problems we were having with capacity in our Hadoop cluster—larger events mean longer processing times for nearly every job.

Inspecting event sizes showed some clear standouts. Four search events were responsible for a significant portion of all event bandwidth. The events were on the order of 50KB each, about 10x the size of a “normal” event. The culprit was some debugging information that had been added to the events.

The problem was compounded by something that has been part of our event pipeline since the beginning: we generate complementary frontend events for each backend primary event (a “primary event” is akin to a page view) to capture browser-specific data that is only available on the frontend, and we do it by first making a copy of the entire event and then adding the frontend attributes. Later, when we added events for tracking page performance metrics, we did the same thing. These complementary events don’t need all the custom attributes of the original, so this is a lot of wasted bandwidth. So we stopped doing that.

Between slimming down the search events, not copying attributes unnecessarily, and finding a few more events that could be trimmed down, we managed to bring down the average event size, as well as event volume, considerably.

[Figure: event beacon message size dropping from 7 KB to 3.5 KB in the space of a week]

Nevertheless, the lag persisted.

The Mysterious Partition 20

Orloj was still having problems, but this time it was a little different. The lag seemed to be happening only on a single partition, 20. We looked to see if the broker that was the leader for that partition was having any problems, and couldn’t see anything. We did see that it was serving a bit more traffic than other brokers, though.

The first thing that came to mind was a hot key. Beacons are partitioned by a randomly-generated string called a “browser_id” that is unique to a client (browser, native device, etc.) hitting our site. If there’s no browser_id, as is the case with internal API calls, it gets assigned to a random partition.

I used a command-line Kafka consumer to try to diagnose. It has an option for only reading from a single partition. Here I sampled 100,000 events from partitions 20 and 19:

Partition 20

$ go run cmds/consumer/consumer.go -ini-files config.ini,config-prod.ini -topic beacon-main -partition 20 -value-only -max 100000 | jq -r '[.browser_id[0:6],.user_agent] | @tsv' | sort | uniq -c | sort -rn | head -5
    558 orcIq5  Dalvik/2.1.0 (Linux; U; Android 7.0; SAMSUNG-SM-G935A Build/NRD90M) Mobile/1 EtsyInc/4.77.0 Android/1
    540 null    Api_Client_V3/Bespoke_Member_Neu_Orchestrator
    400 ArDkKf  Dalvik/2.1.0 (Linux; U; Android 8.0.0; Pixel XL Build/OPR3.170623.008) Mobile/1 EtsyInc/4.78.1 Android/1
    367 hK8GHc  Dalvik/2.1.0 (Linux; U; Android 7.0; SM-G950U Build/NRD90M) Mobile/1 EtsyInc/4.75.0 Android/1
    366 EYuogd  Dalvik/2.1.0 (Linux; U; Android 7.0; SM-G930V Build/NRD90M) Mobile/1 EtsyInc/4.77.0 Android/1

Partition 19

$ go run cmds/consumer/consumer.go -ini-files config.ini,config-prod.ini -topic beacon-main -partition 19 -value-only -max 100000 | jq -r '[.browser_id[0:6],.user_agent] | @tsv' | sort | uniq -c | sort -rn | head -5
    570 null    Api_Client_V3/Bespoke_Member_Neu_Orchestrator
    506 SkHj7N  Dalvik/2.1.0 (Linux; U; Android 7.0; LG-LS993 Build/NRD90U) Mobile/1 EtsyInc/4.78.1 Android/1
    421 Jc36zw  Dalvik/2.1.0 (Linux; U; Android 7.0; SM-G930V Build/NRD90M) Mobile/1 EtsyInc/4.78.1 Android/1
    390 A586SI  Dalvik/2.1.0 (Linux; U; Android 8.0.0; Pixel Build/OPR3.170623.008) Mobile/1 EtsyInc/4.78.1 Android/1
    385 _rD1Uj  Dalvik/2.1.0 (Linux; U; Android 7.0; SM-G935P Build/NRD90M) Mobile/1 EtsyInc/4.77.0 Android/1

I couldn’t see any pattern, but did notice we were getting a lot of events from the API with a null browser_id. These appeared to be distributed evenly across partitions, though.

We were seeing odd drops and spikes in the number of events going to partition 20, so I thought I’d see if I could just dump events from around that time, and started digging into our beacon consumer command-line tool to do so. In the process, I came across the big discovery: the -partition flag I had been relying on wasn’t actually hooked up to anything. I was never consuming from a particular partition, but from all partitions. Once I fixed this, the problem was obvious:

Partition 20

$ go run cmds/consumer/consumer.go -ini-files config.ini,config-prod.ini -topic beacon-main -q -value-only -max 10000 -partition 20 | jq -r '[.browser_id[0:6],.user_agent] | @tsv' | sort | uniq -c | sort -nr | head -5
   8268 null    Api_Client_V3/Bespoke_Member_Neu_Orchestrator
    335 null    Api_Client_V3/BespokeEtsyApps_Public_Listings_Offerings_FindByVariations
    137 B70AD9  Mozilla/5.0 (iPhone; CPU iPhone OS 11_1_1 like Mac OS X) AppleWebKit/604.3.5 (KHTML, like Gecko) Mobile/15B150 EtsyInc/4.78 rv:47800.37.0
     95 C23BB0  Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_3 like Mac OS X) AppleWebKit/603.3.8 (KHTML, like Gecko) Mobile/14G60 EtsyInc/4.78 rv:47800.37.0
     83 null    Api_Client_V3/Member_Carts_ApplyBestPromotions

and another partition for comparison:

$ go run cmds/consumer/consumer.go -ini-files config.ini,config-prod.ini -topic beacon-main -q -value-only -max 10000 -partition 0 | jq -r '[.browser_id[0:6],.user_agent] | @tsv' | sort | uniq -c | sort -nr | head -5
   1074 dtdTyz  Dalvik/2.1.0 (Linux; U; Android 7.0; VS987 Build/NRD90U) Mobile/1 EtsyInc/4.78.1 Android/1
    858 gFXUcb  Dalvik/2.1.0 (Linux; U; Android 7.0; XT1585 Build/NCK25.118-10.2) Mobile/1 EtsyInc/4.78.1 Android/1
    281 C380E3  Mozilla/5.0 (iPhone; CPU iPhone OS 11_0_3 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Mobile/15A432 EtsyInc/4.77 rv:47700.64.0
    245 E0464A  Mozilla/5.0 (iPhone; CPU iPhone OS 11_0_3 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Mobile/15A432 EtsyInc/4.78 rv:47800.37.0
    235 BAA599  Mozilla/5.0 (iPhone; CPU iPhone OS 11_0_3 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Mobile/15A432 EtsyInc/4.78 rv:47800.37.0

All the null browser_ids were going to partition 20. But how could this be? They’re supposed to be random.

I bet a number of you are slapping your foreheads right now, just like I did. I wrote this test of the hashing algorithm: https://play.golang.org/p/MJpmoPATvO
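The gist of that test, assuming a Sarama-style FNV-1a hash partitioner (the exact client and sign handling are assumptions; the playground link above has the original):

package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor mimics a hash partitioner: FNV-1a over the key, modulo the
// partition count. A constant key always maps to the same partition.
func partitionFor(key string, numPartitions int32) int32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	p := int32(h.Sum32()) % numPartitions
	if p < 0 {
		p = -p
	}
	return p
}

func main() {
	fmt.Println(partitionFor("null", 144))   // the literal string "null": same partition every time
	fmt.Println(partitionFor("orcIq5", 144)) // real browser_ids spread across partitions
}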

Yes, the browser_ids I thought were null were actually the string “null”, which is what was getting sent in the JSON event. I put in a fix for this, and:

[Figure: Orloj Partition 20 Baumgartner: events per second by partition, with partition 20 dropping significantly after the fix]

Lessons

I’m not going to attempt to draw any deep conclusions from this sordid affair, but I’ll close with some advice for when you find yourself with a persistent confounding problem like this one.

Graph everything. Then add some more graphs. Think about what new metrics might be helpful in diagnosing your issue. Kevin added the above graph showing Orloj events/sec by partition, which was critical to realizing there was a hot key issue.

If something makes no sense, think about what assumptions you’ve been making in diagnosing it, and check those assumptions. Kevin’s graph didn’t line up with what I was seeing, so I dug deeper into the command-line consumer and found the problem with the -partition flag.

Talk to people. Almost every small victory along the way came after talking with someone about the problem, and getting some crucial insight and perspective I was missing.

Keep at it. As much as it can seem otherwise at times, computers are deterministic, and with persistence, smart colleagues, and maybe a bit of luck, you’ll figure it out.

Epilogue

On November 13 Cloudflare published a blog post on tuning garbage collection in Go. EventHorizon and Orloj both spent a considerable percentage of time (nearly 10%) doing garbage collection. By upping the GC threshold for both, we saw a massive performance improvement:

[Figures: GC pause time per second dropping from 75+ ms to 592 µs; message processing time dropping significantly; EventHorizon consumer lag falling from 25-125 ms to near zero]

Except for a couple brief spikes during our last Kafka upgrade, EventHorizon hasn’t lagged for more than a second since that change, and the current average lag is 2 ms.
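The change itself is tiny; a sketch of the idea (the value 400 is an illustrative assumption, not necessarily the setting we landed on):

package main

import "runtime/debug"

func main() {
	// Raise the GC target from the default of 100, the same effect as running
	// with GOGC=400: collect less often, trading a larger heap for less CPU
	// spent in garbage collection.
	debug.SetGCPercent(400)

	// ... start the Kafka consumer and WebSocket handlers as usual ...
}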

Thanks to Doug Hudson, Kevin Gessner, Patrick Cousins, Rebecca Sliter, Russ Taylor, and Sarah Marx for their feedback on this post. You can follow me on Twitter at @bgreenlee.


Sealed classes opened my mind

Posted on April 12, 2018

How we use Kotlin to tame state at Etsy

Bubbly the Baby Seal by SmartappleCreations

Etsy’s apps have a lot of state.  Listings, tags, attributes, materials, variations, production partners, shipping profiles, payment methods, conversations, messages, snippets, lions, tigers, and even stuffed bears.  

All of these data points form a matrix of possibilities that we need to track.  If we get things wrong, then our buyers might fail to find a cat bed they wanted for Whiskers.  And that’s a world I don’t want to live in.

There are numerous techniques out there that help manage state.  Some are quite good, but none of them fundamentally changed our world quite like Kotlin’s sealed classes.  As a bonus, most any state management architecture can leverage the safety of Kotlin’s sealed classes.

What are sealed classes?

The simplest definition of sealed classes is that they are “like enums with actual types.” Inspired by Scala’s sealed types, Kotlin’s official documentation describes them thus:

Sealed classes are used for representing restricted class hierarchies, when a value can have one of the types from a limited set, but cannot have any other type. They are, in a sense, an extension of enum classes: the set of values for an enum type is also restricted, but each enum constant exists only as a single instance, whereas a subclass of a sealed class can have multiple instances which can contain state.

A precise, if a bit wordy, definition.  Most importantly: they are restricted hierarchies, they represent a set, they are types, and they can contain values.  

Let’s say we have a class Result to represent the value returned from a network call to an API.  In our simplified example, we have two possibilities: a list of strings, or an error.
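A minimal sketch of that type (the field names are illustrative):

sealed class Result {
    data class Success(val items: List<String>) : Result()
    data class Error(val throwable: Throwable) : Result()
}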

 

Now we can use Kotlin’s when expression to handle the result. The when expression is a bit like pattern matching. While not as powerful as pattern matching in some other languages, Kotlin’s when expression is one of the language’s most important features.  

Let’s try parsing the result based on whether it was a success or an error.
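Something like this, with showItems standing in for whatever renders the list:

fun handle(result: Result) {
    when (result) {
        is Result.Success -> showItems(result.items)          // smart-cast to Result.Success
        is Result.Error -> result.throwable.printStackTrace() // smart-cast to Result.Error
    }
}

fun showItems(items: List<String>) {
    // render the list in the UI
}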

 

We’ve done a lot here in just a few lines, so let’s go over each aspect. First, we were able to check the type of result using Kotlin’s is operator, which is equivalent to Java’s instanceof operator. By checking the type, Kotlin was able to smart-cast the value of result for us for each case. So if result is a success, we can access the value as if it is typed Result.Success. Now we can pass items without any type casting to showItems(items: List<String>). If the result was an error, we just print the stack trace to the console.

The next thing we did with the when expression is to exhaust all the possibilities for the Result sealed class type.  Typically a when expression must have an else clause.  However, in the above example, there are no other possible types for Result, so the compiler, and IDE, know we don’t need an else clause.  This is important as it is exactly the kind of safety we’ve been longing for.  Look at what happens when we add a Cancelled type to the sealed class:
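Extending the sketch from above:

sealed class Result {
    data class Success(val items: List<String>) : Result()
    data class Error(val throwable: Throwable) : Result()
    object Cancelled : Result()
}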

 

Now the IDE would show an error on our when statement because we haven’t handled the Cancelled branch of Result.
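A sketch of the offending expression (the val name and strings are illustrative); it won’t compile until an is Result.Cancelled branch, or an else, is added:

val message = when (result) {
    is Result.Success -> "loaded ${result.items.size} items"
    is Result.Error -> "something went wrong"
}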

‘when’ expression must be exhaustive, add necessary ‘is Cancelled’ branch or ‘else’ branch instead.

 

The IDE knows we didn’t cover all our bases here.  It even knows which possible types result could be based on the Result sealed class.  Helpfully, the IDE offers us a quickfix to add the missing branches.  

There is one subtle difference here though.  Notice we assigned the when expression to a val. The requirement that a when expression must be exhaustive only kicks in if you use when as a value, a return type, or call a function on it.  This is important to watch out for as it can be a bit of a gotcha.  If you want to utilize the power of sealed classes and when, be sure to use the when expression in some way.

So that’s the basic power of sealed classes, but let’s dig a little deeper and see what else they can do.  

Weaving state

Let’s take a new example of a sealed class Yarn.
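A sketch; the second subtype (Acrylic) is an assumption, just to give Yarn two branches:

sealed class Yarn {
    open class Wool : Yarn()
    class Acrylic : Yarn()
}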

 

Then let’s create a new class called Merino and have it extend from Wool:
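Since Wool is open, Merino can be declared outside the sealed hierarchy:

class Merino : Yarn.Wool()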

 

What would be the effect of this on our when expression if we were branching on Yarn?  Have we actually created a new possibility that we must have a branch to accommodate?  In this case we have not, because a Merino is still considered part of Wool.  There are still only two branches for Yarn:
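A sketch of that when (the branch bodies are placeholders):

fun describe(yarn: Yarn): String = when (yarn) {
    is Yarn.Wool -> "wool"       // a Merino instance matches here too
    is Yarn.Acrylic -> "acrylic"
}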

 

So that didn’t really help us represent the hierarchy of wool properly.  But there is something else we can do instead. Let’s expand the example of Wool to include Merino and Alpaca types of wool and make Wool into a sealed class.
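Keeping the assumed Acrylic branch around, that looks like:

sealed class Yarn {
    sealed class Wool : Yarn() {
        class Merino : Wool()
        class Alpaca : Wool()
    }
    class Acrylic : Yarn()
}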

 

Now our when expression will see each Wool subclass as unique branches that must be exhausted.
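A sketch of the expanded when, again with placeholder bodies:

fun describe(yarn: Yarn): String = when (yarn) {
    is Yarn.Wool.Merino -> "merino wool"
    is Yarn.Wool.Alpaca -> "alpaca wool"
    is Yarn.Acrylic -> "acrylic"
}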

 

Only wool socks please

As our state grows in complexity however, there are times in which we may only be concerned about a subset of that state.  In Android it is common to have a custom view that perhaps only cares about the loading state. What can we do about that? If we use the when expression on Yarn, don’t we need to handle all cases?  Luckily Kotlin’s stdlib comes to the rescue with a helpful extension function.  

Say we are processing a sequence of events. Let’s create a fake sequence of all our possible states.
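For instance, reusing the Yarn sketch from above:

val events: List<Yarn> = listOf(
    Yarn.Wool.Merino(),
    Yarn.Acrylic(),
    Yarn.Wool.Alpaca(),
    Yarn.Wool.Merino()
)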

 

Now let’s say that we’re getting ready to knit a really warm sweater. Maybe our KnitView is only interested in the Wool.Merino and Wool.Alpaca states. How do we handle only those branches in our when expression?  Kotlin’s Iterable extension function filterIsInstance to the rescue!  Thankfully we can filter the tree with a simple single line of code.
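Using the events list sketched above:

val woolEvents: List<Yarn.Wool> = events.filterIsInstance<Yarn.Wool>()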

 

And now, like magic, our when expression only needs to handle Wool states.  So now if we want to iterate through the sequence we can simply write the following:
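Something along these lines (the knitting calls are placeholders):

woolEvents.forEach { wool ->
    when (wool) {
        is Yarn.Wool.Merino -> println("knit with merino")
        is Yarn.Wool.Alpaca -> println("knit with alpaca")
    }
}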

 

Meanwhile at Etsy

Like a lot of apps these days, we support letting our buyers and sellers log in via Facebook or Google in addition to email and password. To help protect the security of our buyers and sellers, we also offer the option of Two Factor Authentication. Adding additional complexity, a lot of these steps have to pause and wait for user input. The flow, roughly: a social sign-in request, then the sign-in itself, then two-factor authentication for accounts that have it enabled.

So how do we model this with sealed classes?  There are 3 places where we make a call out to the network/API and await a result.  This is a logical place to start. We can model the responses as sealed classes.

First, we reach out to the server for the social sign in request itself:
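A sketch of the shape (the branch names and payloads are illustrative assumptions, not our actual states):

sealed class SocialSignInResult {
    data class Success(val socialToken: String) : SocialSignInResult()
    data class Error(val throwable: Throwable) : SocialSignInResult()
}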

 

Next is the request to actually sign in:
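Again a sketch with assumed branches:

sealed class SignInResult {
    data class Success(val userId: Long) : SignInResult()
    object TwoFactorRequired : SignInResult()
    data class Error(val throwable: Throwable) : SignInResult()
}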

 

Finally, we have the two factor authentication for those that have enabled it:
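One more sketch, with the same caveat about assumed branches:

sealed class TwoFactorResult {
    data class Success(val userId: Long) : TwoFactorResult()
    data class Error(val throwable: Throwable) : TwoFactorResult()
}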

 

This is a good start, but let’s look at what we have here. The first thing we notice is that some of the states are the same. This is when we must resist our urges and instincts to combine them. While they look the same, they represent discrete states that a user can be in at different times. It’s safer, and more correct, to think of each state as different. We also want to prefer duplication over the wrong abstraction.

The next thing we notice here is that these states are actually all connected. In some ways we’re starting to approach a Finite State Machine (FSM). While an FSM here might make sense, what we’ve really done is define a very readable, safe way of modeling the state, and that model could be used with any type of architecture we want.

The above sealed classes are logical and match our 3 distinct API requests; ultimately, however, this is an imperfect representation. In reality, there should be multiple instances of SignInResult for each branch of SocialSignInResult, and multiple instances of TwoFactorResult for each branch of SignInResult.

For us, three steps proved to be the right level of abstraction for the refactoring effort we were undertaking at the time. In the future we may very well connect each and every branch in a single sealed class hierarchy. Instead of three separate classes, we’d use a single class that represented the complete set of possible states.

 

Let’s take a look at what that would have looked like:

Note: to keep things simple, I omitted the parameters and used only regular subclasses.
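A sketch of that combined hierarchy (the state names are illustrative, and as noted, parameters are omitted):

sealed class SignInState {
    sealed class SocialSignIn : SignInState() {
        class Success : SocialSignIn()
        class Error : SocialSignIn()
    }
    sealed class SignIn : SignInState() {
        class Success : SignIn()
        class TwoFactorRequired : SignIn()
        class Error : SignIn()
    }
    sealed class TwoFactor : SignInState() {
        class Success : TwoFactor()
        class Error : TwoFactor()
    }
}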

 

Our sealed class is now beginning to look a bit busier, but we have successfully modeled all the possible states for an Etsy buyer or seller during login. And now we have a discrete set of types to represent each state, and a type-safe way to filter them using Iterable.filterIsInstance<Type>(iterable).

This is pretty powerful stuff.  As you can imagine we could connect a custom view to a stream of events that filtered only on 2FA states.  Or we could connect a progress “spinner” that would hide itself on some other subset of the states.

These ways of representing state opened up a clean way to react with business logic or changes on the UI.  Moreover, we’ve done it in a type-safe, expressive, and fluent way.

So with a little luck, now you can login and get that cat bed for Whiskers!

Special thanks to Cameron Ketcham who wrote most of this code!


Culture of Quality: Measuring Code Coverage at Etsy

Posted on February 15, 2018

In the summer of 2017, Etsy created the Test Engineering Strategy Team to define, measure, and develop practices that would help product engineering teams ship quality code. One of our first goals was to find a way to establish a quantifiable quality “baseline” and track our progress against this baseline over time. We settled on code coverage because it provides both a single percentage number as well as line-by-line detailed information for engineers to act on. With over 30,000 unit tests in our test suite, this became a difficult challenge.

Code coverage is the measure of how much code is being executed by your test suite. This blog post will walk you through the process of implementing coverage across our various platforms. We are not going to weigh in on the debate over the value of code coverage in this blog post.

We approached measuring code coverage with 5 requirements:

  1. Generate code coverage metrics by file for PHP, Android, and iOS code bases
  2. Where possible, use existing tools to minimize impact on existing test and deploy infrastructure
  3. Create an automated and reliable hands-off process for measurement and reporting
  4. Surface an accurate measurement as a KPI for code owners
  5. Promote wide dissemination of the results to increase awareness

Setup and Setbacks

Our first step was collecting information on which tools were already being utilized in our deployment pipeline. A surprise discovery was that Android code coverage had already been implemented in our CI pipeline using JaCoCo, but had been disabled earlier in the year because it was impacting deploy times. We re-enabled the Jenkins job and set it to run on a weekly schedule instead of each deploy to the master trunk. Impact on deployment would be a recurring theme throughout coverage implementation.

There was some early concern about the accuracy of the coverage report generated by JaCoCo. We noticed discrepancies in the reports when viewed through the Jenkins JaCoCo plugin and the code coverage window in Android Studio. We already have a weekly “Testing Workshop” scheduled where these kinds of problems are discussed. Our solution was to sit down with engineers during the workshops and review the coverage reports for differences. A thorough review found no flaws in the reports, so we moved forward with relying on JaCoCo as our coverage generation tool.

For iOS, xcodebuild has built-in code coverage measurement. Setting up xcodebuild’s code coverage is a one-step process of enabling coverage in the target scheme using Xcode. This provides an immediate benefit to our engineers, as Xcode supplies instant feedback in the editor on which code is covered by tests. We measured the performance hit that our unit test build times would take from enabling coverage in CI and determined that it would be negligible. The jobs have a high variance build time with a mean(μ) swing of about 40 seconds. When jobs with coverage enabled were added to the sample, it did not have a noticeable effect on the build time trend. The trend is continually monitored and will be adjusted if the impact becomes detrimental to our deployment process.

Our PHP unit tests were being executed by PHPUnit, an industry standard test runner, which includes an option to run tests with coverage. This was a great start. Running the tests with coverage required us to enable Xdebug, which can severely affect the performance of PHP. The solution was to employ an existing set of containers created by our Test Deployment Infrastructure team for smaller tasks (updating dashboards, running build metric crons, etc.). By enabling Xdebug on this small subset of containers we could run coverage without affecting the entire deployment suite. After this setup was completed, we attempted to run our unit tests with coverage on a sample of tests. We quickly found that executing tests with coverage would utilize a lot of RAM, which would cause the jobs to fail. Our initial attempt to solve the problem was to modify the “memory_limit” directive in the PHP.ini file to grant more RAM to the process. We were allowing up to 1024mb of RAM for the process, but this was still unsuccessful. Our eventual solution was to prepend our shell execution with php -d "memory_limit=-1" to free up all available memory for the process.

Even after giving the process all of the memory available to the container, we were still running into job failures. Checking coverage for ~30,000 tests in a single execution was too problematic. We needed a way to break up our coverage tests into multiple executions.

Our PHP unit tests are organized by directory (cart, checkout, etc.). In order to break up the coverage job, we wrote a simple shell script that iterates over each directory and creates a coverage report.

for dir in */ ; do
  if [[ -d "$dir" && ! -L "$dir" ]]; then
    Coverage Command $dir;
  fi;
done

Once we had the individual coverage reports, we needed a way to merge them. Thankfully, there is a second program called PHPCov that will combine reports generated by PHPUnit. A job of this size can take upwards of four hours in total, so checking coverage on every deploy was out of the question. We set the job to run on a cron alongside our Android and iOS coverage jobs, establishing a pattern of checking code coverage on Sunday nights.
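A sketch of those two steps (binary paths and options are assumptions; the per-directory run is roughly what the “Coverage Command” placeholder in the loop above stands in for):

# Per-directory run, producing one PHP-serialized coverage file each:
php -d "memory_limit=-1" vendor/bin/phpunit --coverage-php coverage/"${dir%/}".cov "$dir"

# Merge the per-directory files into a single Clover XML report:
php -d "memory_limit=-1" vendor/bin/phpcov merge --clover coverage/clover.xml coverage/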

Make It Accessible

After we started gathering code coverage metrics, we needed to find a way to share this data with the owners of the code. For PHP, the combined report generated by PHPCov is a hefty 100mb XML file in the Clover format. We use a script written in PHP to parse the XML into an array and then output that array as a CSV file. With that done, we copy the data to our Vertica data warehouse for easy consumption by our engineering teams.

For iOS, our first solution was to use a ruby gem called Slather to generate HTML reports. We followed this path for many weeks and implemented automated Slather reporting using Jenkins. Initially this seemed like a viable solution, but when reviewed by our engineers we found some discrepancies between the coverage report generated by Slather and the coverage report generated by Xcode. We had to go back to the drawing board and find another solution. We found another ruby gem, xcov, that directly analyzed the .xccoverage file generated by xcodebuild when running unit tests. We could parse the results of this coverage report into a JSON data object, then convert it into CSV and upload it to Vertica.

For Android, the JaCoCo gradle plugin is supposed to be able to generate reports in multiple formats, including CSV. However, we could not find a working solution for generating reports using the gradle plugin. We spent a considerable amount of time debugging this problem and eventually realized that we were yak shaving. We discarded base assumptions and looked for other solutions. Eventually we decided to use the built-in REST api provided by Jenkins. We created a downstream job, passed it the build number for the JaCoCo report, then used a simple wget command to retrieve and save the JSON response. This was converted into a CSV file and (once again) uploaded to Vertica.
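The retrieval itself boils down to hitting the plugin’s JSON endpoint for the upstream build (the host, job name, and variable name here are placeholders):

wget -O jacoco-coverage.json \
  "https://jenkins.example.com/job/android-coverage/${UPSTREAM_BUILD_NUMBER}/jacoco/api/json"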

Once we had the coverage data flowing into Vertica we wanted to get it into the hands of our engineering teams. We used Superbit, our in-house business intelligence reporting tool, to make template queries and dashboards that provided examples of how to surface relevant information for each team. We also began sending a weekly email newsletter to the engineering and project organizations highlighting notable changes to our coverage reports.

To be continued…

Measuring code coverage is just one of many methods the Test Engineering Strategy Team is using to improve the culture of quality in Etsy Engineering. In a future blog post we will be discussing the other ways in which we measure our code base and how we use this reporting to gain confidence in the code that we ship.


Selecting a Cloud Provider

Posted on January 4, 2018

Etsy.com and most of our related services have been hosted in self-managed data centers since the first Etsy site was launched in 2005. Earlier this year, we decided to evaluate migrating everything to a cloud hosting solution. The decision to run our own hardware in data centers was the right decision at the time, but infrastructure as a service (IaaS) and platform as a service (PaaS) offerings have changed dramatically in the intervening years. It was time to reevaluate our decisions. We recently announced that we have selected Google Cloud Platform (GCP) as our cloud provider and are incredibly excited about this decision. This marks a shift for Etsy from infrastructure self-reliance to a best-in-class service provider. This shift allows us to spend less time maintaining our own infrastructure and more time on strategic features and services that make the Etsy marketplace great.

Although we use the term ‘vendor’ when referring to a cloud provider, we viewed this as much more than a simple vendor selection process. We are entering into a partnership and a long-term relationship. The provider that we have chosen is going to be a major part of our successful initial migration, as well as a partner in the long-term scalability and availability of our site and services. This was not a decision that we wanted to enter into without careful consideration and deliberate analysis. This article will walk you through the process by which we vetted and ultimately selected a partner. We are not going to cover why we chose to migrate to a cloud hosting provider nor are we going to cover the business goals that we have established to measure the success of this project.

From One, Many

While the migration to a cloud hosting provider can be thought of as a single project, it really is one very large project made up of many smaller projects. In order to properly evaluate each cloud provider accurately, we needed to identify all of the sub-projects, determine the specific requirements of each sub-project, and then use these requirements to evaluate the various providers. Also, to scope the entire project, we needed to determine the sequence, effort, dependencies, priority, and timing of each sub-project.

We started by identifying eight major projects, including the production render path for the site, the site’s search services, the production support systems such as logging, and the Tier 1 business systems like Jira. We then divided these projects further into their component projects—MySQL and Memcached as part of the production render path, for example. By the end of this exercise, we had identified over 30 sub-projects. To determine the requirements for each of these sub-projects, we needed to gather expertise from around the organization. No one person or project team could accurately, or in a timely enough manner, gather all of these requirements. For example, we not only needed to know the latency tolerance of our MySQL databases but also our data warehousing requirement for an API to create and delete data. To help gather all of these requirements, we used a RACI model to identify subproject ownership.

RACI

The RACI model is used to identify the responsible, accountable, consulted, and informed people for each sub-project.

Each Responsible person owned the gathering of requirements and the mapping of dependencies for their sub-project. The accountable person ensured the responsible person had the time, resources, and information they needed to complete the project and ultimately signed off that it was done.

Architectural Review

Etsy has long used an architectural review process, whereby any significant change in our environment, whether a technology, architecture, or design, undergoes a peer review. As these reviews require a significant contribution of time from senior engineers, the preparation for them is not taken lightly. Experts across the organization collaborated to solicit diverse viewpoints and frequently produced 30+ page documents for architectural review.

We determined that properly evaluating various cloud providers required an understanding of the desired end state of various parts of our system. For example, our current provisioning is done in our colocation centers using a custom toolset as automation to build bare-metal servers and virtual machines (VMs). We also use Chef roles and recipes applied on top of provisioned bare-metal servers and VMs. We identified a few key goals for choosing a tool for infrastructure creation in the cloud, including greater flexibility, accountability, security, and centralized access control. Our provisioning team evaluated several tools against these goals, discussed them, and then proposed new workflows in an architecture review. The team concluded by proposing we use Terraform in combination with Packer to build the base OS images.

Ultimately, we held 25 architectural reviews for major components of our system and environments. We also held eight additional workshops for certain components we felt required more in-depth review. In particular, we reviewed the backend systems involved in generating pages of etsy.com (a.k.a. the production render path) with a greater focus on latency constraints and failure modes. These architectural reviews and workshops resulted in a set of requirements that we could use to evaluate the different cloud providers.

How it Fits Together

Once we had a firm-ish set of requirements for the major components of the system, we began outlining the order of migration. In order to do this we needed to determine how the components were all interrelated. This required us to graph dependencies, which involved teams of engineers huddled around whiteboards discussing how systems and subsystems interact.

The dependency graphing exercises helped us identify and document all of the supporting parts of each major component, such as the scheduling and monitoring tools, caching pools, and streaming services. These sessions ultimately resulted in high-level estimates of project effort and timing, mapped in Gantt-style project plans.

Experimentation

Earlier in the year, we ran some of our Hadoop jobs on one of the cloud providers’ services, which gave us a very good understanding of the effort required to migrate and the challenges that we would face in doing so at scale. For this initial experiment, however, we did not use GCP, so we didn’t have the same level of understanding of the cloud provider we ultimately chose.

We therefore undertook an experiment to enable batch jobs to run on GCP utilizing Dataproc and Dataflow. We learned a number of lessons from this exercise including that some of the GCP services were still in alpha release and not suitable for the workloads and SLAs that we required. This was the first of many similar decisions we’ll need to make: use a cloud service or build our own tool. In this case, we opted to implement Airflow on GCP VMs. To help us make these decisions we evaluated the priority of various criteria, such as vendor support of the service, vendor independence, and impact of this decision on other teams.

There is no right answer to these questions. We believe that these questions and criteria need to be identified for each team to consider them and make the best decision possible. We are also not averse to revisiting these decisions in the future when we have more information or alpha and beta projects roll out into general availability (GA).

Meetings

Over the course of five months, we met with the Google team multiple times. Each of these meetings had clearly defined goals, ranging from short general introductions of team members to full day deep dives on various topics such as using container services in the cloud. In addition to providing key information, these meetings also reinforced a shared engineering culture between Etsy and Google.

We also met with reference customers to discuss what they learned migrating to the cloud and what to look out for on our journey. The time spent with these companies was unbelievably worthwhile and demonstrated the amazing open culture that is pervasive among technology companies in NY.

We also met with key stakeholders within Etsy to keep them informed of our progress and to invite their input on key directional decisions such as how to balance the mitigation of financial risk with the time-discounted value of cost commitments. Doing so provided decision makers with a shared familiarity and comfort with the process and eliminated the possibility of surprise at the final decision point.

The Decision

By this point we had thousands of data points from stakeholders, vendors, and engineering teams. We leveraged a tool called a decision matrix that is used to evaluate multiple-criteria decision problems. This tool helped organize and prioritize this data into something that could be used to impartially evaluate each vendor’s offering and proposal. Our decision matrix contained over 200 factors prioritized by 1,400 weights and evaluated more than 400 scores.

This process began with identifying the overarching functional requirements. We identified relationship, cost, ease of use, value-added services, security, locations, and connectivity as the seven top-level functional requirement areas. We then listed every one of the 200+ customer requirements (customers referring to engineering teams and stakeholders) and weighted them by how well each supported the overall functional requirements. We used a 0, 1, 3, 9 scale to indicate the level of support. For example, the customer requirement of “autoscaling support” was weighted as a 9 for cost (as it would help reduce cost by dynamically scaling our compute clusters up and down), a 9 for ease of use (as it would keep us from having to manually spin up and down VMs), a 3 for value-added services (as it is an additional service offered beyond just basic compute and storage but it isn’t terribly sophisticated), and a 0 for supporting other functional requirements. Multiple engineering teams performed an in-depth evaluation and weighting of these factors. Clear priorities began to emerge as a result of the nonlinear weighting scale, which forces overly conservative scorers to make tough decisions on what really matters.  

We then used these weighted requirements to rank each vendor’s offering. We again used a 0, 1, 3, 9 scale for how well each cloud vendor met the requirement. Continuing with our “autoscaling support” example, we scored each cloud vendor a 9 for meeting this requirement completely in that all the vendors we evaluated provided support for autoscaling compute resources as native functionality. The total scores for each vendor reached over 50,000 points with GCP exceeding the others by more than 10%.

As is likely obvious from the context, it should be noted that Etsy’s decision matrix (filled with Etsy’s requirements, ranked with Etsy’s weights by Etsy’s engineers) is applicable only to Etsy. We are not endorsing this decision as right for you or your organization, but rather we are attempting to provide you with insight into how we approach decisions such as these that are strategic and important to us.

Just the beginning

Now, the fun and work begin. This process took us the better part of five months with a full-time technical project manager, dozens of engineers and engineering managers, as well as several lawyers, finance personnel, and sourcing experts working on this part-time or, in some cases, full time. Needless to say, this was not an insignificant effort but really just the beginning of a multi-year project to migrate to GCP. We have an aggressive timeline to achieve our migration in about two years. We will do this while continuing to focus on innovative product features and minimizing risk during the transition period. We look forward to the opportunities that moving to GCP provides us and are particularly excited about this transformational change allowing us to focus more on core, strategic services for the Etsy marketplace by partnering with a best-in-class service provider.


VIPER on iOS at Etsy

Posted on December 11, 2017

Background

Consistent design patterns help Etsy move fast and deliver better features to our sellers and buyers. Historically we built features in a variety of ways, but we have found great benefits from the consistency VIPER brings.

Inconsistent development practices can make jumping between projects and features fairly difficult. When starting a new project, developers would have to understand the way a feature was implemented with very little shared base knowledge.

Just before we began working on the new Shop Settings in the Sell on Etsy app, VIPER was gaining steam amongst the iOS community, and given the complexity of Shop Settings, we decided to give this new pattern a try.

While the learning curve was steep and the overhead was significant (more on that later), we appreciated the flexibility it had over other architectures, allowing it to be more consistently used throughout our codebase. We decided to apply the pattern elsewhere in our apps while investigating solutions to some of VIPER’s drawbacks.

What is VIPER?

VIPER is a Clean Architecture that aims to separate out the concerns of a complicated view controller into a set of objects with distinct responsibilities.

VIPER stands for View, Interactor, Presenter, Entity, and Router, each with its own responsibility:

View: What is displayed

Interactor: Handles the interactions for each use case.

Presenter: Sets up the logic for what is displayed in the view

Entity: Model objects that are consumed by the Interactor

Router: Handles the navigation/data passing from one screen to the next

Example Implementation

As a contrived example let’s say we want a view that displays a list of people. Our VIPER implementation could be setup as follows

View: A simple UITableView set up to display UITableViewCells

Interactor: The logic for fetching Person objects

Presenter: Takes a list of Person objects and returns a list of only the names for consumption by the view

Entity: Person objects

Router: Presents a detail screen about a given Person

The interaction would look something like this:

  1. The main view controller would tell the Interactor to fetch the Person entities
  2. The Interactor would fetch the Person entities and tell the Presenter to present these entities
  3. The Presenter would consume these entities and generate a list of strings
  4. The Presenter would then tell the UITableView to reload with the given set of strings

Additionally if the user were to tap on a given Person object the Interactor would call to the Router to present the detail screen.
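A compressed Swift sketch of those pieces (the class and protocol names are illustrative, not Etsy’s actual implementation):

import UIKit

struct Person {                  // Entity
    let name: String
}

protocol PersonListView: AnyObject {
    func show(names: [String])   // View: a table view that renders the names
}

final class PersonPresenter {
    weak var view: PersonListView?

    func present(people: [Person]) {
        // Reduce entities to exactly what the view needs: a list of names.
        view?.show(names: people.map { $0.name })
    }
}

final class PersonRouter {
    weak var controller: UINavigationController?

    func showDetail(for person: Person) {
        // Push a hypothetical detail screen for the selected person.
    }
}

final class PersonInteractor {
    var presenter: PersonPresenter?
    var router: PersonRouter?

    func fetchPeople() {
        let people = [Person(name: "Ada"), Person(name: "Grace")]  // stand-in for a real fetch
        presenter?.present(people: people)
    }

    func didSelect(_ person: Person) {
        router?.showDetail(for: person)
    }
}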

Overhead & the issues with VIPER

VIPER is not without issues. As we began to develop new features we ran into a few stumbling blocks.

The boilerplate for setting up the interactions between all the classes can be quite tedious:


// The Router needs a reference to the Interactor for
// any delegation that needs to be set up. For example,
// after a save on an edit screen, the Interactor should
// be told to reload the data.
router.interactor = interactor

// The Router needs the controller for its presentation logic.
router.controller = navigationController

// If the view has any UI interaction, the Presenter is
// responsible for passing it along to the Interactor.
presenter.interactor = interactor

// Some actions the Interactor performs may require
// presentation/navigation logic.
interactor.router = router

// The Interactor will need to update the Presenter when
// a UI update is needed.
interactor.presenter = presenter

As a result of all the classes involved and the interactions between them, developing a mental model of how VIPER works is difficult. Also, the numerous references required between classes can easily lead to retain cycles if weak references are not used in the appropriate places.

VIPERBuilder and beyond

In an attempt to reduce the amount of boilerplate and lower the barrier for using VIPER we developed VIPERBuilder.

VIPERBuilder consists of a simple set of base classes for the Interactor, Presenter and Router (IPR) as well as a Builder class.

By creating an instance of the VIPERBuilder object with your IPR subclasses specified via Swift generics (as needed), the necessary connections are made to allow for a consistent VIPER implementation.

Example within a UIViewController:


lazy var viperBuilder: VIPERBuilder<NewInteractor, NewPresenter, NewRouter> = {
    return VIPERBuilder(controller: self)
}()

The VIPERBuilder classes intended to be subclassed are written in Objective-C to allow subclassing in both Swift and Objective-C. There is also an alternative builder object (VIPERBuilderObjC) that uses Objective-C generics.

Since the boilerplate is fairly limited, the only thing left to do is write your functionality and view code. The one part of VIPERBuilder that we chose not to architect around is what code should actually go in the IPR subclasses. Instead, we broke down our ideal use cases for IPR subclasses in the README.

Additionally, we have an example project in the repo to give a better idea of how the separation of concerns works with VIPERBuilder.

The iOS team at Etsy has built many features with VIPER & VIPERBuilder over the past year: shop stats, multi-shop checkout, search, and the new seller dashboard, just to name a few. Following these guidelines in our codebase has lessened the impact of context switching between features implemented in VIPER. We are super excited to see how the community uses VIPERBuilder and how it changes over time.


How Etsy caches: hashing, Ketama, and cache smearing

Posted by on November 30, 2017

At Etsy, we rely heavily on memcached and Varnish as caching tiers to improve performance and reduce load. Database and search index query results, expensive calculations, and more are stored in memcached as a look-aside cache read from PHP and Java. Internal HTTP requests between API servers are sent through and cached in Varnish. As traffic has grown, so have our caching pools. Today, our caching clusters store terabytes of cached data and handle millions of lookups per second.

At this scale, a single cache host can’t handle the volume of reads and writes that we require.  Cache lookups come from thousands of hosts, and the total volume of lookups would saturate the network and CPU of even the beefiest hardware—not to mention the expense of a server with terabytes of RAM.  Instead, we scale our caches horizontally, by building a pool of servers, each running memcached or Varnish.

In memcached, the cache key can be (almost) any string, which will be looked up in the memcached server’s memory.  As an HTTP cache, Varnish is a bit more complex: requests for cacheable resources are sent directly to Varnish, which constructs a cache key from the HTTP request.  It uses the URL, query parameters, and certain headers as the key.  The headers to use in the key are specified with the Vary header; for instance, we include the user’s locale header in the cache key to make sure cached results have the correct language and currency.  If Varnish can’t find the matching value in its memory, it forwards the request to a downstream server, then caches and returns the response.

With large demands for both memcached and Varnish, we want to make efficient use of the cache pool’s space.  A simple way to distribute your data across the cache pool is to hash the key to a number, then take it modulo the size of the cache pool to index to a specific server.  This makes efficient use of your cache space, because each cached value is stored on a single host—the same host is always calculated for each key.

Consistent hashing for scalability

A major drawback of modulo hashing is that the size of the cache pool needs to be stable over time.  Changing the size of the cache pool will cause most cache keys to hash to a new server.  Even though the values are still in the cache, if the key is distributed to a different server, the lookup will be a miss.  That makes changing the size of the cache pool—to make it larger or for maintenance—an expensive and inefficient operation, as performance will suffer under tons of spurious cache misses.

For instance, if you have a pool of 4 hosts, a key that hashes to 500 will be stored on pool member 500 % 4 == 0, while a key that hashes to 1299 will be stored on pool member 1299 % 4 == 3.  If you grow your cache by adding a fifth host, the pool member calculated for each key may change. The key that hashed to 500 will still be found on pool member 500 % 5 == 0, but the key that hashed to 1299 will now be on pool member 1299 % 5 == 4. Until the new pool member is warmed up, your cache hit rate will suffer, as the cache data will suddenly be on the ‘wrong’ host. In some cases, pool changes can cause more than half of your cached data to be assigned to a different host, slashing the efficiency of the cache temporarily. In the case of going from 4 to 5 hosts, only 20% of cache keys will be on the same host as before!

Illustration of key movement with modulo hashing during pool changes
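
To make that key movement concrete, here is a minimal Python sketch (not our production code) that measures how many keys stay put when a modulo-hashed pool grows from four to five hosts:

import hashlib

def pool_index(key, pool_size):
    # Hash the key to a number, then take it modulo the pool size.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % pool_size

keys = [f"listing:{i}" for i in range(100_000)]
unchanged = sum(1 for k in keys if pool_index(k, 4) == pool_index(k, 5))
print(f"{unchanged / len(keys):.0%} of keys stay on the same host")
# Prints roughly 20%, matching the example above.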

A fixed-size pool, with changes causing the hit rate to suffer, isn’t acceptable. At Etsy, we make our memcached pools more flexible with consistent hashing, implemented with Ketama. Instead of using the modulo operation to map the hash output to the cache pool, Ketama divides the entire hash space into large contiguous buckets, one (or more) per cache pool host. Cache keys that hash to each range are looked up on the matching host.  This consistent hashing maps keys to hosts in such a way that changing the pool size only moves a small number of keys—“consistently” over time and pool changes.

As a simplified example, with 4 hosts and hash values between 0 and 2^32-1, Ketama might map all values in [0, 2^30) to pool member 0, [2^30, 2^31) to pool member 1, and so on. Like modulo, this distributes cache keys evenly and consistently among the cache hosts. But when the pool size changes, Ketama results in fewer keys moving to a different host.

In the case of going from four hosts to five, Ketama would now divide the hash outputs into 5 groups. 50% of the cache keys will now be on the same host as previously. This leads to fewer cache misses and more efficiency during pool changes.

Illustration of key movement with Ketama hashing during pool changes

In reality, Ketama maps each host to dozens or hundreds of buckets, not just one, which increases the stability and evenness of hashing. Ketama hashing is the default in PHP’s memcached library, and has served Etsy well for years as our cache has grown. The increased efficiency of consistently finding keys across pool changes avoids expensive load spikes.
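
As a rough illustration of the idea (the real Ketama implementation handles server weights, addresses, and more), a consistent hash ring can be sketched in a few lines of Python:

import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, hosts, points_per_host=100):
        # Map each host to many points ("buckets") on the ring for evenness.
        self.ring = sorted(
            (_hash(f"{host}-{i}"), host)
            for host in hosts
            for i in range(points_per_host)
        )
        self.points = [point for point, _ in self.ring]

    def host_for(self, key: str) -> str:
        # Walk clockwise to the first point at or after the key's hash.
        index = bisect.bisect(self.points, _hash(key)) % len(self.ring)
        return self.ring[index][1]

ring = ConsistentHashRing(["cache01", "cache02", "cache03", "cache04"])
print(ring.host_for("popular_user_data"))

Adding a fifth host to a ring like this only claims the key ranges adjacent to its points, which is why most keys stay on their original hosts across a pool change.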

Hot keys and cache smearing

While consistent hashing is efficient, there is a drawback to having each key exist on only one server. Different keys are used for different purposes, and certain cache items will be read and written more than others. In our experience, cache key read rates follow a power law distribution: a handful of keys are used for a majority of reads, while the majority of keys are read a small number of times. With a large number of keys and a large pool, a good hash function will distribute the “hot keys”—the most active keys—among many servers, smoothing out the load.

However, it is possible for a key to be so hot—read so many times—that it overloads its cache server. At Etsy, we’ve seen keys that are hit often enough, and store a large enough value, that they saturate the network interface of their cache host. The large number of clients are collectively generating a higher read rate than the cache server can provide.  We’ve seen this problem many times over the years, using mctop, memkeys, and our distributed tracing tools to track down hot keys.

Further horizontal scaling by adding more cache hosts doesn’t help in this case, because it only changes the distribution of keys to hosts—at best, it would only move the problem key and saturate a different host. Faster network cards in the cache pool also help, but can have a substantial hardware cost. In general, we try to use large numbers of lower-cost hardware, instead of relying on a small number of super-powered hosts.

We use a technique I call cache smearing to help with hot keys. We continue to use consistent hashing, but add a small amount of entropy to certain hot cache keys for all reads and writes. This effectively turns one key into several, each storing the same data and letting the hash function distribute the read and write volume for the new keys among several hosts.

Let’s take an example of a hot key popular_user_data. This key is read often (since the user is popular) and is hashed to pool member 3. Cache smearing appends a random number in a small range (say, [0, 8)) to the key before each read or write. For instance, successive reads might look up popular_user_data3, popular_user_data1, and popular_user_data6. Because the keys are different, they will be hashed to different hosts. One or more may still be on pool member 3, but not all of them will be, sharing the load among the pool.
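
A minimal sketch of the technique, assuming a memcached-style client exposing get and set (the key name and range here are illustrative, not our actual configuration):

import random

SMEAR_RANGE = 8  # a few bits of entropy; tuned per hot key

def smeared(key: str) -> str:
    # Turn one hot key into several; the hash function then spreads
    # the read and write volume for the variants across the pool.
    return f"{key}{random.randrange(SMEAR_RANGE)}"

# Look-aside usage: a miss on one variant simply repopulates that variant.
def get_smeared(cache, key, compute_value):
    k = smeared(key)
    value = cache.get(k)
    if value is None:
        value = compute_value()
        cache.set(k, value)
    return value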

Cache smearing (named after the technique of smearing leap seconds over time) trades efficiency for throughput, making a classic space-vs-time tradeoff. The same value is stored on multiple hosts, instead of once per pool, which is not maximally efficient. But it allows multiple hosts to serve requests for that key, preventing any one host from being overloaded and increasing the overall read rate of the cache pool.

We manually add cache smearing to only our hottest keys, leaving keys that are less read-heavy efficiently stored on a single host. We ‘smear’ keys with just a few bits of entropy, that is, with a random number in a small range like 0 to 8 or 0 to 16. Choosing the size of that range is its own tradeoff. A too-large range duplicates the cached data more than necessary to reduce load, and can lead to data inconsistencies if a single user sees smeared values that are out-of-sync between multiple requests. A too-small range doesn’t spread the load effectively among the cache pool (and could even result in all the smeared keys still being hashed to the same pool member).

We’ve successfully smeared both memcached keys (by appending entropy directly to the key) and Varnish keys (with a new query parameter for the random value). While we could improve the manual process to track down and smear hot keys, the work we’ve done so far has been integral to scaling our cache pools to their current sizes and beyond.

The combination of consistent hashing and cache smearing has been a great combination of efficiency and practicality, using our cache space well while ameliorating the problems caused by hot keys. Keep these approaches in mind as you scale your caches!

You can follow me on Twitter at @kevingessner


Reducing Image File Size at Etsy

Posted by on May 30, 2017 / 11 Comments

Serving images is a critical part of Etsy’s infrastructure. Our sellers depend on displaying clean, visually appealing pictures to showcase their products. Buyers can learn about an item by reading its description, but there’s a tangibility to seeing a picture of that custom necklace, perfectly-styled shirt, or one-of-a-kind handbag you’ve found.

Our image infrastructure is simple at its core. Users upload images, which we convert to jpegs on dedicated processing machines (ImgWriters) and store in large datastores (Mediastores). Upon request, specialized hosts (ImgCaches) fetch the images from the Mediastores and send them to our users.

A simplified view of our Photos architecture

Of course, there are many smaller pieces that make up our Photos infrastructure. For example, there’s load balancing involved with our ImgWriters to spread requests across the cluster, and our ImgCaches have a level of local in-memory caching for images that are frequently requested, coupled with a hashing scheme to reduce duplication of work. In front of that, there’s also caching on our CDN side. Behind the writers and caches, there’s some redundancy in our mediastores both for the sake of avoiding single point of failure scenarios and for scale.

Throughout this architecture, though, one of our biggest constraints is the size of the data we pass around. Our CDNs will only be able to cache so much before they begin to evict images, at which point requests are routed back to our origin. Larger images will require heavier use of our CDNs and drive up costs. For our users, as the size of images increases so too will the download time for a page, impacting performance and resulting in a suboptimal user experience. Likewise, any post-processing of images internally is more bandwidth we’re using on our networks and increases our storage needs. Smaller images help us scale further and faster.

As we’re using jpegs as our format, an easy way to reduce the data size of our images is to lower the jpeg quality value, compressing the contents of the image. Of course, as we compress further we continue to reduce the visual quality of the image—edges grow blurry and artifacts appear. Below is an example. The image on the left has a jpeg quality of 92 while the one on the right has a quality of 25. In this extreme case, the picture of the bird on the right shows clear signs of compression: the background becomes “blocky” and “artifacts” appear around edges, such as around the bird.

  

The left with jpeg quality 92, the right jpeg quality 25.

For most of Etsy’s history processing images, we’ve held to the generally accepted rule of thumb that setting the quality of a jpeg image to 85 is a “good balance”. Any larger and we probably won’t see enough of a difference to warrant all those extra bytes. But could we get away with lowering it to 80? Perhaps the golden number is 75? This is all very dependent upon the image a user uploads. A sunset with various gradients may begin to show the negative effects of lowering the jpeg quality before a picture of a ring on a white background does. What we’d really like is to be able to determine a good jpeg quality to set per image.

To do so, we added an image comparison tool called dssim to our stack. Dssim computes how dissimilar two images are using the SSIM algorithm for estimating human vision. Given two (or more) images, it compares their structural similarity and produces a value starting at 0 (exactly the same) with no upper bound. Note that while the README states that it only compares two pngs, recent updates to the library also allow comparison between jpegs.

With the user’s uploaded image as a base, we can do a binary search on the jpeg quality to find a balance between the visual quality (how good it looks) and the jpeg quality (the value that helps determine compression size). If we’ve assumed that 85 is a good ceiling for images, we can try lowering the quality to 75. Comparing the image at 85 and 75, if the value returned is higher than a known threshold of what is considered “good” for a dssim score, we can raise the jpeg quality back up to halfway between, 80. If it’s below the threshold, we can try pushing the quality a little lower, to say 70. Either way, we continue to compare until we reach a happy medium.
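
Here is a rough Python sketch of that search; our implementation is in PHP, and the threshold, bounds, and use of the dssim command-line tool below are illustrative assumptions rather than our exact values:

import subprocess
from PIL import Image

DSSIM_THRESHOLD = 0.005  # illustrative; "good enough" is subjective
MIN_Q, MAX_Q = 70, 85

def encode(image, quality, path):
    image.save(path, "JPEG", quality=quality)

def dssim_score(reference, candidate):
    # dssim prints "<score>\t<file>" for each compared file.
    out = subprocess.run(
        ["dssim", reference, candidate],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.split()[0])

def find_quality(original_path):
    image = Image.open(original_path).convert("RGB")
    encode(image, MAX_Q, "/tmp/reference.jpg")
    low, high, best = MIN_Q, MAX_Q, MAX_Q
    while low <= high:
        quality = (low + high) // 2
        encode(image, quality, "/tmp/candidate.jpg")
        if dssim_score("/tmp/reference.jpg", "/tmp/candidate.jpg") <= DSSIM_THRESHOLD:
            best = quality          # visually close enough; try lower
            high = quality - 1
        else:
            low = quality + 1       # too much visible loss; back off
    return best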

This algorithm is adapted from Tobias Baldauf’s work in compressing jpegs. We first tested compression using this bash script before running into some issues with speed. While this would do well for offline asynchronous optimizations of images, we wanted to keep ours synchronous so that users could see the same image on upload as they would when it was displayed to users. We were breaking 15-20 seconds on larger images by shelling out from PHP and so dug in a bit to look for some performance wins.

Part of the slowdown was that cjpeg-dssim was converting from jpeg to png to compare with dssim. Since the time the original bash script was written, though, dssim had been updated to compare jpegs directly, and we cut the time in half or more by not having to do the conversions each way. We also ported the code to PHP, as we already had the image blob in memory and could avoid some of the calls reading and writing the image to disk each time. Finally, we became a bit more conservative with the search, limiting it to jpeg qualities between 70 and 85: we start at a quality of 78 and break out if we go above 85 or below 70, short-circuiting the comparisons. As we cut numerous sizes for an image, we could in theory perform this search for each resized image, but we rarely saw the calculated jpeg quality differ, so we use the value calculated for one size on all cropped and resized images of a given upload.

As you’d expect, we’re adding work to our image processing and so had to take that into account. We believed that the additional upfront cost in time during the upload would pay for itself many times over. The speed savings in downloading images means an improvement in the time to render the important content on a page. We were adding 500ms to our processing time on average and about 800ms for the upper 90th percentile, which seemed within acceptable limits.

Time to calculate quality

This will vary, though, on the dimensions of the image you’re processing. We’re choosing an image we display often, which is 570px in width and a variable height (capped at 1500px). Larger images will take longer to process.

As we strive to Graph All the Things™, we tracked how the jpeg quality changed from the standard 85: it now varies from image to image, settling at roughly 73 on average.

The corresponding graphs for bytes stored, as expected, dropped as well. We saw reductions often ranging from 25% to 30% in the size of images stored, comparing the file sizes of images in April of 2016 with those in April of 2017 on the same day:

Total file size of images in bytes in 1 week in April, 2016 vs 2017

Page load performance

So how does that stack up to page load on Etsy? As an example, we examined the listing page on Etsy, which displays items in several different sizes: a medium size when the page first loads, a large size when zoomed in, and several tiny thumbnails in the carousel as well as up to 8 on the right that show other items from the shop. Here is a breakdown of content downloaded (with some basic variations depending upon the listing fetched):

MIME Type Bytes
js 6,119,491
image 1,022,820
font 129,496
css 122,731
html 86,744
other 3,838
Total 7,485,120

At Etsy, we frontload our static assets (js, fonts, css) and share it throughout the site, though, so that is often cached when a user reaches the listings page. Removing those three, which total 6,371,718 bytes, we’re left with 1,113,402 bytes, of which roughly 92% is comprised of image data. This means once a user has loaded the site’s assets, the vast majority of their download is then the images we serve on a given page as they navigate to each listing. If we apply our estimate of 23% savings to images, reducing image data to 787,571 and the page data minus assets to 878,153, on a 2G mobile device (roughly 14.4 kbps) it’s the difference in downloading a page in 77.3 seconds vs 60.8 seconds. For areas without 4G service, that’s a big difference! This of course is just one example page and the ratio of image size to page size will vary depending upon the listing.

Images that suffer more

Not all images are created equal, and what works for a handful of images is not universal. In our testing, we saw that some images in particular took a greater drop in jpeg quality while also suffering more visually. Investigating, we found a few cases where this system could use improvement. In particular, jpeg quality typically suffers more when images contain line art or thin lines, text, or gradients, as the lossy compression algorithms for jpegs generally don’t play well in these contexts. Likewise, images that have a large background of a solid color and a small product in focus would suffer, as the compression runs heavily on the large empty portions of the image and then over-compresses the focal point. This helped inform our lower threshold of 70 for jpeg quality, below which we decided further compression wasn’t worth the loss in visual quality.

The first uncompressed, the second compressed to jpeg quality 40.

 

Other wins

Page load speed isn’t the only win for minifying images, as our infrastructure also sees benefits from reducing the file size of images. CDNs are a significant piece of our infrastructure, allowing for points of presence in various places in the world to serve static assets closer to the location of request and to offload some of the hits to our origin servers. Smaller images mean we can cache more files before running into eviction limits and reduce the amount we’re spending on CDNs. Our internal network infrastructure also scales further as we’re not pushing as many bytes around the line.

 

Availability of the tool and future work

We’re in the process of abstracting the code behind this into an open source tool to plug into your PHP code (or external service). One goal of writing this tool was to make sure that there was flexibility in determining what “good” is, as that can be very subjective to users and their needs. Perhaps you’re working with very large images and are willing to handle more compression. The minimum and maximum jpeg quality values allow you to fine tune the amount of compression that you’re willing to accept. On top of that, the threshold constants we use to narrow in when finding the “sweet spot” with dssim are configurable, giving you leverage to adjust in another direction. In the meantime, following the above guidelines on using a binary search approach combined with dssim should get you started.

 

Closing

Not all of the challenges behind these adjustments are specifically technical. Setting the jpeg quality of an image with most software applications is straightforward, as it is often an additional parameter to include when converting images. Deploying such a change, though, requires extensive testing over many different kinds of images. Knowing your use cases and the spectrum of images that make up your site is critical to such a change. With such a shift, knowing how to roll back, even if it requires a backfill, can also be useful as a safety net. Lastly, there’s merit in recognizing that tweaking images is rarely “finished”. Some users like heavy sharpening, while others might want some soft blending when resizing. Keeping this in mind was important for us as we continue to improve the experience our users have, both in overall page speed and in the display of their products.


How Etsy Ships Apps

Posted by on May 15, 2017 / 6 Comments

In which Etsy transforms its app release process by aligning it with its philosophy for web deploys

illustration of a boat on a wave. There are several people in the boat, and the boat has a flag with an "E" on it.

Anchors Aweigh

Deploying code should be easy. It should happen often, and it should involve its engineers. For Etsyweb, this looks like continuous deployment.

A group of engineers (which we call a push train) and a designated driver all shepherd their changes to a staging environment, and then to production. At each checkpoint along that journey, the members of the push train are responsible for testing their changes, sharing that they’re ready to ship, and making sure nothing broke. Everyone in that train must work together for the safe completion of their deployment. And this happens very frequently: up to 50 times a day.

                                               TOPIC: clear
 mittens> .join                                TOPIC: mittens
   sasha> .join with mittens                   TOPIC: mittens + sasha
 pushbot> mittens, sasha: You're up            TOPIC: mittens + sasha
   sasha> .good                                TOPIC: mittens + sasha*
 mittens> .good                                TOPIC: mittens* + sasha*
 pushbot> mittens, sasha: Everyone is ready    TOPIC: mittens* + sasha*
  nassim> .join                                TOPIC: mittens* + sasha* | nassim
 mittens> .at preprod                          TOPIC: <preprod> mittens + sasha | nassim
 mittens> .good                                TOPIC: <preprod> mittens* + sasha | nassim
   sasha> .good                                TOPIC: <preprod> mittens* + sasha* | nassim
 pushbot> mittens, sasha: Everyone is ready    TOPIC: <preprod> mittens* + sasha* | nassim
 mittens> .at prod                             TOPIC: <prod> mittens + sasha | nassim
 mittens> .good                                TOPIC: <prod> mittens* + sasha | nassim
     asm> .join                                TOPIC: <prod> mittens* + sasha | nassim + asm
   sasha> .good                                TOPIC: <prod> mittens* + sasha* | nassim + asm
     asm> .nm                                  TOPIC: <prod> mittens* + sasha* | nassim
 pushbot> mittens, sasha: Everyone is ready    TOPIC: <prod> mittens* + sasha* | nassim
 mittens> .done                                TOPIC: nassim
 pushbot> nassim: You're up                    TOPIC: nassim
    lily> .join                                TOPIC: nassim | lily

This strategy has been successful for a lot of reasons, but especially because each deploy is handled by the people most familiar with the changes that are shipping. Those that wrote the code are in the best position to recognize it breaking, and then fix it. Because of that, developers should be empowered to deploy code as needed, and remain close to its rollout.

App releases are a different beast. They don’t easily adapt to that philosophy of deploying code. For one, they have versions and need to be compiled. And since they’re distributed via app stores, those versions can take time to reach end users. Traditionally, these traits have led to strategies involving release branches and release managers. Our app releases started out this way, but we learned quickly that they didn’t feel very Etsy. And so we set out to change them.

Jen and Sasha

photo of Sasha Friedenberg and Jen Macchiarelli

We were the release managers. Jen managed the Sell on Etsy apps, and I managed the Etsy apps. We were responsible for all release stage transitions, maintaining the schedule, and managing all the communications around releases. We were also responsible for resolving conflicts and coordinating cross-team resources in cases of bugs and urgent blockers to release.

Ready to Ship

A key part of our job was making sure everyone knew what they’re supposed to do and when they’re supposed to do it. The biggest such checkpoint is when a release branches — this is when we create a dedicated branch for the release off master, and master becomes the next release. This is scheduled and determines what changes make it into production for a given release. It’s very important to make sure that those changes are expected, and that they have been tested.

For Jen and me, it would’ve been impossible to keep track of the many changes in a release ourselves, and so it was our job to coordinate with the engineers that made the actual changes and make sure those changes were expected and tested. In practice, this meant sending emails or messaging folks when approaching certain checkpoints like branching. And likewise, if there were any storm warnings (such as show-stopping bugs), it was our responsibility to raise the flag to notify others.

list of words indicating various responsibilities for release managers

Then Jen left Etsy for another opportunity, and I became a single point of failure and a gatekeeper. Every release decision was funneled through me, and I was the only person able to make and execute those decisions.

I was overwhelmed. Frustrated. I was worried I’d be stuck navigating iTunes Connect and Google Play, and sending emails. And frankly, I didn’t want to be doing those things. I wanted those things to be automated. Give me a button to upload to iTunes Connect, and another to begin staged rollout on Google Play. Thinking about the ease of deploying on web just filled me with envy.

This time wasn’t easy for engineers either. Even back when we had two release managers, from an engineer’s perspective, this period of app releases wasn’t transparent. It was difficult to know what phase of release we were in. A large number of emails was sent, but few of them were targeted to those that actually needed them. We would generically send emails to one big list that included all four of our apps. And all kinds of emails would get sent there. Things that were FYI-only, and also things that required urgent attention. We were on the path to alert-fatigue.

All of this meant that engineers felt more like they were in the cargo hold, rather than in the cockpit. But that just didn’t fit with how we do things for web. It didn’t fit with our philosophy for deployment. We didn’t like it. We wanted something better, something that placed engineers in front of the tiller.

Ship

screenshot showing Etsy's Ship's main page

So we built a vessel that coordinates the status, schedule, communications, and deploy tools for app releases. Here’s how Ship helps:

It’s hard to imagine all of that abstractly, so here’s an example:

Captain’s Log

Monday

Tuesday

Wednesday

Friday

Tuesday

Wednesday

Before Ship, all of the steps above would’ve been performed manually. But you’ll notice that release managers are missing from the above script; have we replaced release managers with all the automation in Ship?

photo of Etsy listing featuring a flag used by nautical ships to communicate: the flag says "I require a pilot"

Release Drivers

Partially. Ship has a feature where each release is assigned a driver.

screenshot showing Etsy's Ship's driver section. The driver in this case is Hannah Mittelstaedt

This driver is responsible for a bunch of things that we couldn’t or shouldn’t automate. Here’s what they’re responsible for:

Everything else? That’s automated. Branching, release candidate generation, submission to iTunes Connect — even staged rollout on Google Play! But, we’ve learned from automation going awry before. By default, some things are set to manual. There are others for which Ship explicitly does not allow automation, such as continuing staged rollout on Google Play. Things like this should involve and require human interaction. For everything else that is automated, we added a failsafe: at any time, a driver can disable all the crons and take over driving from autopilot:

screenshot of Etsy's Ship showing the various automation options that can be set on a per-releases basis.

When a driver wants to do something manually, they don’t need access to iTunes Connect or Google Play, as each of these things is made accessible as a button. A really nice side effect of this is that we don’t have to worry about provisioning folks for either app store, and we have a clear log of every release-related action taken by drivers.

Drivers are assigned once a release moves onto master, and are semi-randomly selected based on previous drivers and engineers that have committed to previous releases. Once assigned, we send them an onboarding email letting them know what their responsibilities are:

screenshot of Etsy's Ship's driver onboarding email

Ready to Ship Again

The driver can remain mostly dormant until the day of branching. A couple hours before we branch, it’s the driver’s responsibility to make sure that all the impacting engineers are ready to ship, and to orchestrate efforts when they’re not. After we’re ready, the driver’s responsibility is to remain available as a point-of-contact while final testing takes place. If an issue comes up, the driver may be consulted for steps to resolve.

And then, assuming all goes well, comes release day. The driver can opt to manually release, or let the cron do this for them — they’ll get notified if something goes wrong, either way. Then a day after we release, the driver looks at all of our dashboards, logs, and graphs to confirm the health of the release.

Bugfixes

But not all releases are planned. Things fail, and that’s expected. It’s naïve to assume some serious bug won’t ship with an app release. There are plenty of things that can and will be the subject of a post-mortem. When one of those things happens, any engineer can spawn a bugfix release off the most-recently-released mainline release.

diagram showing how a bugfix branch comes off of an existing release branch and not master

The engineer that requests this bugfix gets assigned as the driver for that release. Once they branch the release, they make the necessary bugfixes (others can join in to add bugfixes too, if they coordinate with the driver) in the release’s branch, build a release candidate, test it, and get it ready for production. The driver can then release it at will.

State Machine

Releases are actually quite complicated.

diagram indicating the state machine that powers Etsy's Ship

It starts off as an abstract thing that will occur in the future. Then it becomes a concrete thing actively collecting changes via commits on master in git. After this period of collecting commits, the release is considered complete and moves into its own dedicated branch. The release candidate is then built from this dedicated branch, which then gets thoroughly tested and moved into production. The release itself then concludes as an unmerged branch.

Once a release branches, the next future release moves onto master. Each release is its own state machine, where the development and branching states overlap between successive releases.
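
As a simplified illustration (the state names below are paraphrased from this description, not Ship’s actual schema), each release can be modeled as a small state machine that only permits forward transitions:

from enum import Enum, auto

class ReleaseState(Enum):
    PLANNED = auto()        # an abstract thing that will occur in the future
    ON_MASTER = auto()      # actively collecting commits on master
    BRANCHED = auto()       # moved into its own dedicated release branch
    TESTING = auto()        # release candidate built and under test
    IN_PRODUCTION = auto()
    CONCLUDED = auto()      # lives on as an unmerged branch

ALLOWED = {
    ReleaseState.PLANNED: {ReleaseState.ON_MASTER},
    ReleaseState.ON_MASTER: {ReleaseState.BRANCHED},
    ReleaseState.BRANCHED: {ReleaseState.TESTING},
    ReleaseState.TESTING: {ReleaseState.IN_PRODUCTION},
    ReleaseState.IN_PRODUCTION: {ReleaseState.CONCLUDED},
    ReleaseState.CONCLUDED: set(),
}

def advance(current, target):
    # Only forward transitions are allowed; anything else is a bug.
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target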

Notifications: Slack and Email

screenshot showing a notification sent via Slack

Plugged into the output of Ship are notifications. Because there are so many points of interest en route to production, it’s really important that the right people are notified at the right times. So we use the state machine of Ship to send out notifications to engineers (and other subscribers) based on how much they asked to know, and how they impacted the release. We also allow anyone to sign up for notifications around a release. This is used by product managers, designers, support teams, engineering managers, and more. Our communications are very targeted to those that need or want them.

In terms of what they asked to know, we made it very simple to get detailed emails about state changes to a release:

screenshot of Etsy's Ship showing the subscription options for engineers and non-engineers on a per-release basis

In terms of how they impacted the release, we need to get that data from somewhere else.

Git

We mentioned data Ship receives from outside sources. At Etsy, we use GitHub for our source control. Our apps have repos per-platform (Android and iOS). In order to keep Ship’s knowledge of releases up-to-date, we set up GitHub Webhooks to notify Ship whenever changes are pushed to the repo. We listen for two changes in particular: pushes to master, and pushes to any release branch.

When Ship gets notified, it iterates through the commits and uses the author, changed paths, and commit message to determine which app (buyer or seller) the commit affects, and which release we should attribute this change to. Ship then takes all of that and combines it into a state that represents every engineer’s impact on a given release. Is that engineer “user-impacting” or “dark” (our term for changes that aren’t live)? Ship then uses this state to determine who is a member of what release, and who should get notified about what events.
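
A hypothetical, minimal sketch of that webhook handling is below; the route, path rules, and commit-message convention are invented for illustration and are not Ship’s actual logic:

from flask import Flask, request

app = Flask(__name__)

def classify(commit):
    """Use the author, changed paths, and message to attribute the commit."""
    paths = commit.get("added", []) + commit.get("modified", []) + commit.get("removed", [])
    app_name = "seller" if any(p.startswith("Seller/") for p in paths) else "buyer"
    impact = "dark" if "[dark]" in commit["message"].lower() else "user-impacting"
    return {"author": commit["author"]["email"], "app": app_name, "impact": impact}

@app.route("/webhooks/github", methods=["POST"])
def on_push():
    payload = request.get_json()
    branch = payload["ref"].removeprefix("refs/heads/")  # "master" or a release branch
    for commit in payload["commits"]:
        record = classify(commit)
        # ... persist the record against the release currently tied to `branch` ...
    return "", 204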

Additionally, at any point during a release, an engineer can change their status. They may want to do this if they want to receive more information about a release, or if Ship misunderstood one of their commits as being impacting to the release.

Deployinator

Everything up until now has explained how Ship keeps track of things. But we haven’t explained how the automated actions that affect the app repo, or systems outside Etsy, actually happen.

We have a home-grown tool for managing deploys called Deployinator, and we added app support. It can now perform mutating interactions with the app repos, as well as all the deploy actions related to Google Play and iTunes Connect. This is where we build the testing candidates, release candidate, branch the release, submit to iTunes Connect, and much more.

We opted to use Deployinator for a number of reasons:

In our custom stack, we have crons. This is how we branch on Tuesday evening (assuming everyone is ready). This is where we interface with Google Play and iTunes Connect. We make use of Google Play’s official API in a custom Python module we wrote, and for iTunes Connect we use Spaceship to interface with the unofficial API.

Seaworthy

The end result of Ship is that we’ve distributed release management. Etsy no longer has any dedicated release managers. But it does have an engineer who used to be one — and I even get to drive a release every now and then.

People cannot be fully automated away. That applies to our web deploys, and is equally true for app releases. Our new process works within that reality. It’s unique because it pushes the limit of what we thought could be automated. Yet, at the same time, it empowers our app engineers more than ever before. Engineers control when a release goes to prod. Engineers decide if we’re ready to branch. Engineers hit the buttons.

And that’s what Ship is really about. It empowers our engineers to deliver the best apps for our users. Ship puts engineers at the helm.


Modeling Spelling Correction for Search at Etsy

Posted by on May 1, 2017 / 4 Comments

Introduction

When a user searches for an item on Etsy, they don’t always type what they mean. Sometimes they type the query jewlery when they’re looking for jewelry; sometimes they just accidentally hit an extra key and type dresss instead of dress. To make their online shopping experience successful, we need to identify and fix these mistakes, and display search results for the canonical spelling of the actual intended query. With this motivation in mind, and through the efforts of the Data Science, Linguistic Tools, and the Search Infrastructure teams, we overhauled the way we do spelling correction at Etsy late last year.

The older service was based on a static mapping of misspelled words to their respective corrections, which was updated only infrequently. It would split a search query into its tokens, and replace any token that was identified as a misspelling with its correction. Although the service allowed a very fast way of retrieving a correction, it was not an ideal solution for a few reasons:

  1. It could only correct misspellings that had already been identified and added to the map; it would leave previously unseen misspellings uncorrected.
  2. There was no systematic process in place to update the misspelling-correction pairs in the map.
  3. There was no mechanism of inferring the word from the context of surrounding words. For instance, if someone intended to type butter dish, but instead typed buttor dish, the older system would correct that to button dish since buttor is a more commonly observed misspelling for button at Etsy.
  4. Additionally, the older system would not allow one-to-many mappings from misspellings to corrections. Storing both buttor ->  button and buttor -> butter mappings simultaneously would not be possible without modifying the underlying data structure.

We, therefore, decided to upgrade the spelling correction service to one based on a statistical model. This model learns from historical user data on Etsy’s website and uses the context of surrounding words to offer the most probable correction.

Although this blog post outlines some aspects of the infrastructure components of the service, its main focus is on describing the statistical model used by it.

Model

We use a model that is based upon the Noisy Channel Model, which was historically used to infer telegraph messages that got distorted over the line. In the context of a user typing an incorrectly spelled word on Etsy, the “distortion” could be from accidental typos or a result of the user not knowing the correct spelling. In terms of probabilities, our goal is to determine the probability of the correct word, conditional on the word typed by the user. That probability, as per Bayes’ rule, is proportional to the product of two probabilities:
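
(The relation is shown as an image in the original post; reconstructed from the surrounding description, with w the intended word and x the word the user actually typed, it is:)

P(w \mid x) \;\propto\; P(x \mid w) \times P(w)

The first factor on the right is the probability of the observed misspelling given the intended word (the error model described below), and the second is the prior probability of the intended word itself (the language model).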

We are able to estimate the probabilities on the right-hand side of the relation above, using historical searches performed by our users. We describe this in more detail below.

To extend this idea to multi-word phrases, we use a simple Hidden Markov Model (HMM) which determines the most probable sequence of tokens to suggest as a spelling correction. Markov refers to being able to predict the future state of the system based solely on its present state. Hidden refers to the system having hidden states that we are trying to discover. In our case, the hidden states are the correct spellings of the words the user actually typed.

The main components of an HMM are explained via the figure below. Assume, for the purpose of this post, that the user searched for Flower Girl Baske when they intended to search for Flower Girl Basket. The main components of an HMM are then as follows:

  1. Observed states: the words explicitly typed by the user, Flower Girl Baske (represented as circles).
  2. Hidden states: the correct spellings of the words the user intended to type, Flower Girl Basket (represented as squares).
  3. Emission probability: the conditional probability of observing a state given the hidden state.
  4. Transition probability: the probability of observing a state conditional on the immediately previous observed state; for instance, in the figure below, the probability of transitioning from Girl to Baske, i.e., P(Baske|Girl). The fact that this probability does not depend on the probability of having observed Flower, the state that precedes Girl, illustrates the Markov property of the model.

Once the spelling correction service receives the query, it first splits the query into three tokens, and then suggests possible corrections for each of the tokens as shown in the figure below.

In each column, one of the correction possibilities represents the true hidden state of the token that was typed (emitted) by the user. The most probable sequence can be thought of as identifying the true hidden state for each token given the context of the previous token. To accomplish this, we need to know probabilities from two distributions: first, the probability of typing a misspelling given the correct spelling of the word, the emission probability, and, second, the probability of transitioning from one observed word to another, the transition probability.

Once we are able to calculate all the emission probabilities, and the transition probabilities needed, as described in the sections below, we can determine the most probable sequence using a common dynamic programming algorithm known as the Viterbi Algorithm.

Error Model

The emission probability is estimated through an Error Model, which is a statistical model created by inferring historical corrections provided by our own users while making searches on Etsy. The heuristic we used is based on two conditions: a user’s search being followed immediately by another, similar search, with the second search then leading to a listing click. We then align the tokens of the two queries in a way that minimizes the number of edits needed to transform the misspelled query into the corrected second query.

We then make aligned splits on the characters, and calculate counts and associated conditional probabilities. We generate character level probabilities for four types of operations: substitution, insertion, deletion, and transposition. Of these, insertion and deletion also require us to keep track of the context of neighboring letters — the probability of adding a letter after another, or removing a letter appearing after another, respectively.

To continue with the earlier example, when the user typed baske, the emitted token, they were probably trying to type basket, the hidden state for baske, which corresponds to the probability P(Baske|Basket). Error probabilities for tokens, such as these, are calculated by multiplying the character-level probabilities which are assumed to be independent. For instance, the probability of correcting the letter e by appending the letter t to it is given by:
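
(The formula is an image in the original post; based on the sentence that follows, it has the form:)

P(et \mid e) \;=\; \frac{\mathrm{count}(e \to et)}{\sum_{c} \mathrm{count}(e \to ec)}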

Here, the numerator is the number of times in our dataset where we see an e being corrected to an et, and the denominator is the number of corrections where any letter was appended after e. Since we assume that the probability associated with a character remaining unchanged is 1, in our case, the probability P(Baske|Basket) is solely dependent on the insertion probability P(et|e).

Language Model

Transition probabilities are estimated through a language model which is determined by calculating the unigram and bigram token frequencies seen in our historical search queries.

A specific instance of that, from the chosen example, would be the probability of going from one token in the Flower (first) column, say Flow, to a token, say Girl, in the Girl (second) column, which is represented as P(Girl|Flow). We are able to determine that probability from the following ratio of bigram token counts to unigram token counts:
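
(Again reconstructing the original image from the description, the ratio is:)

P(\mathrm{Girl} \mid \mathrm{Flow}) \;=\; \frac{\mathrm{count}(\text{“Flow Girl”})}{\mathrm{count}(\text{“Flow”})}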

The error models and language models are generated through Scalding-powered jobs that run on our production Hadoop cluster.

Viterbi Algorithm Heuristic

To generate inferences using these models, we employ the Viterbi Algorithm to determine the optimal query, i.e., sequence of tokens to serve as a spelling correction. The main idea is to present the most probable sequence as the spelling correction. The iterative process goes, as per the figure above, from the first column of tokens to the last. At the nth iteration, we have the sequence with the maximum probability available from the previous iteration, and we choose the token from the nth column which increases the maximum probability from the previous iteration the most.

Let’s explain this more concretely by describing the third iteration for our main example: assume that we already have Flower girl as the most probable phrase at the second iteration. Now we pick the token from the third column that corresponds to the maximum of the following transition probability and emission probability products:
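
(The product being maximized, shown as an image in the original post, is, for each candidate correction c of Baske:)

\max_{c} \; P(c \mid \mathrm{Girl}) \times P(\mathrm{Baske} \mid c)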

We can see from the figure above that the transition probability from Girl to Basket is high enough to overcome the lower emission probability of someone typing Baske when they mean to type Basket, and that, consequently, Basket is the word we want to add to the existing Flower Girl phrase.

We now know that Flower Girl Basket is the most probable correction as predicted by our models. The decision about whether we suggest this correction to the user, and how we suggest it, is made on the basis of its confidence score. We describe that process a little later in the post.
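
Putting these pieces together, the per-token search described above can be sketched as a greedy, beam-width-one pass; candidate generation and the probability tables here stand in for the real Lucene-backed service and model files:

def correct_query(tokens, candidates, emission, transition):
    """
    tokens:     observed tokens, e.g. ["flower", "girl", "baske"]
    candidates: function mapping a token to its possible corrections (including itself)
    emission:   emission(observed, candidate) -> P(observed | candidate), from the error model
    transition: transition(candidate, previous) -> P(candidate | previous), from the language model
    """
    best_sequence = []
    best_probability = 1.0
    previous = None
    for token in tokens:
        scored = [
            (
                (transition(candidate, previous) if previous else 1.0)
                * emission(token, candidate),
                candidate,
            )
            for candidate in candidates(token)
        ]
        probability, choice = max(scored)
        best_probability *= probability
        best_sequence.append(choice)
        previous = choice
    return best_sequence, best_probability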

Hyperparameters and Model Training

We added a few hyperparameters to the bare bones model we have described so far. They are as follows:

  1. We added an exponent to the emission probability because we want the model to have some flexibility in weighting the error model component in relation to the language model component.
  2. Our threshold for offering a correction is a linear function of the number of tokens in the query. We, therefore, have a slope and the intercept of the threshold as a pair of hyperparameters.
  3. Finally, we also have a parameter corresponding to the probability of words our language model does not know about.

Our training data set consists of two types of examples: the first kind is of the form misspelling -> correct spelling, like jewlery -> jewelry, while the second kind is of the form correct spelling -> _, like dress -> _. The second type corresponds to queries that are already correct, and therefore don’t need a correction. This setup enables our model to distinguish between the populations of both correct and incorrect spellings, and to offer corrections for the latter.

We tune our hyperparameters via 10-fold cross-validation using a hill-climbing algorithm that maximizes the f-score on our training data set. The f-score is the harmonic mean of the precision and the recall. Precision measures how many of the corrections offered by the system are correct and, for our data set, is defined as:
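
(Reconstructed from the description; the original shows the formula as an image:)

\mathrm{precision} \;=\; \frac{\text{corrections offered that match the expected correction}}{\text{all corrections offered}}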

Recall is a measure of coverage — it tries to answer the question, “how many of the spelling corrections that we should be offering are we offering?”. It is defined as:
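
(Likewise reconstructed:)

\mathrm{recall} \;=\; \frac{\text{corrections offered that match the expected correction}}{\text{all corrections that should have been offered}}

The f-score is then their harmonic mean, F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.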

Optimizing on the f-score allows us to increase coverage of the spelling corrections we offer while keeping the number of invalid corrections offered in check.

Serving the Correction

We serve three types of corrections at Etsy based on the confidence we have in the correction. We use the following odds-ratio as our confidence score:
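
(The exact formula is an image in the original post; given that it is described as an odds-ratio between the suggested correction and the original query, a plausible reconstruction, offered here only as an assumption, is:)

\mathrm{confidence} \;=\; \frac{P(\text{corrected query})}{P(\text{original query})}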

If the confidence is greater than a certain threshold, we display search results corresponding to the correction, along with a “Search instead of” link to see the results for the original query instead. If the confidence is below a certain threshold, we offer results of the original query, with a “Did you mean” option to the suggested correction. If we don’t have any results for the original query, but do have results for the corrected query, we display results for the correction independent of the confidence score.

Spelling Correction Service

With the new service in place, when a search is made on Etsy, we fire off requests to both the new Java-based spelling service and the existing search service. The response from the spelling service includes both the correction and its confidence score, and how we display the correction, based on the confidence score, is described in the previous section. For high-confidence corrections, we make an additional request for search results with the corrected query, and display those results if their count is greater than a certain threshold.

When the spelling service receives a query, it first splits it into its constituent tokens. It then makes independent suggestions for each of the tokens, and, finally, strings together from those suggestions a sequence of the most probable tokens. This corrected sequence is sent back to the web stack as a response to the spelling service request.

The correction suggestions for each token are generated by Lucene’s DirectSpellChecker class which queries an index generated from tokens that are generated through a Hadoop job. The Hadoop job counts tokens from historical queries, rejecting any tokens that appear too infrequently or those that appear only in search queries with very few search results. We have a daily cron that ships the tokens, along with the generated Error and Language model files, from HDFS to the boxes that host the spelling correction service. A periodic rolling restart of the service across all boxes ensures that the freshest models and index tokens are picked up by the service.

Follow-up Work

The original implementation of the model was limited in the sense that it could not suggest corrections that were, at the token level, more than two edits away from the original token. We have subsequently implemented an auxiliary framework for suggesting correction tokens that fixes this issue.

Another limitation was related to splitting and compounding of tokens in the original query to make suggestions. For instance, we were not able to suggest earring as a suggestion for ear ring. Some of our coworkers are working on modifying the service to accommodate corrections of this type.

Although we use supervised learning to train the hyperparameters of our models, since launching the service we have additional inputs from our users that we can use to improve them. Specifically, users clicking the “Did you mean” link on our low-confidence corrections results page provide us with explicit positive feedback, while clicks on the “Search instead for” link on our high-confidence corrections results page provide us with negative feedback. The next major evolution of the model would be to explicitly use this feedback to improve corrections.

(This project was a collaboration between Melanie Gin and Benjamin Russell on the Linguistic Tools team who built out the infrastructure and front-end; Caitlin Cellier and Mohit Nayyar on the Data Science team who worked on implementing the statistical model; and Zhi-Da Zhong on the Search Infrastructure team who consulted on infrastructure design. A special thanks to Caitlin whose presentation some of these figures are copied from.)
