Serving images is a critical part of Etsy’s infrastructure. Our sellers depend on displaying clean, visually appealing pictures to showcase their products. Buyers can learn about an item by reading its description, but there’s a tangibility to seeing a picture of that custom necklace, perfectly-styled shirt, or one-of-a-kind handbag you’ve found.
Our image infrastructure at its core is simple. Users will upload images, we convert them to jpegs on dedicated processing machines (ImgWriters) and store them in large datastores (Mediastores). Upon request, we have specialized hosts (ImgCaches) that fetch the images from the Mediastores and send them to our users.
Of course, there are many smaller pieces that make up our Photos infrastructure. For example, there’s load balancing involved with our ImgWriters to spread requests across the cluster, and our ImgCaches have a level of local in-memory caching for images that are frequently requested, coupled with a hashing scheme to reduce duplication of work. In front of that, there’s also caching on our CDN side. Behind the writers and caches, there’s some redundancy in our mediastores both for the sake of avoiding single point of failure scenarios and for scale.
Throughout this architecture, though, one of our biggest constraints is the size of the data we pass around. Our CDN’s will only be able to cache so much before they begin to evict images, at which point requests are routed back to our origin. Larger images will require heavier use of our CDN’s and drive up costs. For our users, as the size of images increases so too will the download time for a page, impacting performance and resulting in a suboptimal user experience. Likewise, any post-processing of images internally is more bandwidth we’re using on our networks and increases our storage needs. Smaller images help us scale further and faster.
As we’re using jpegs as our format, an easy way to reduce the data size of our images is to lower the jpeg quality value, compressing the contents of the image. Of course, as we compress further we’ll continue to reduce the visual quality of an image—edges will grow blurry and artifacts will appear. Below is an example of such. The image on the left has a jpeg quality of 92 while that on the right has one of 25. In this extreme case, we can see the picture of a bird on the right begins to show signs of compression – the background becomes “blocky” and there are “artifacts” around edges, such as around the bird and the blockiness of the background behind.
The left with jpeg quality 92, the right jpeg quality 25.
For most of Etsy’s history processing images, we’ve kept to a generally held constant that setting the quality of a jpeg image to 85 is assumed to be a “good balance”. Any larger and we probably won’t see enough of a difference to warrant all those extra bytes. But could we get away with lowering it to 80? Perhaps the golden number is 75? This is all very dependent upon the image a user uploads. A sunset with various gradients may begin to show the negative effects of lowering the jpeg quality before a picture of a ring on a white background. What we’d really like is to be able to determine what is a good jpeg quality to set per image.
To do so, we added an image comparison tool called dssim to our stack. Dssim computes how dissimilar two images are using the SSIM algorithm for estimating human vision. Given two (or more) images, it will compare the structural similarity and produce a value from 0 (exactly the same) to unbounded. Note that while the README states that it only compares two pngs, recent updates to the library have also allowed the use of comparison between jpegs.
With the user’s uploaded image as a base, we can do a binary search on the jpeg quality to find a balance between the visual quality (how good it looks) to the jpeg quality (the value that can help determine compression size). If we’ve assumed that 85 is a good ceiling for images, we can try lowering the quality to 75. Comparing the image at 85 and 75, if the value returned is higher than a known threshold of what is considered “good” for a dssim score, we can raise the jpeg quality back up to half way between, 80. If it’s below a threshold, maybe we can try push the quality a little lower at say 70. Either way, we continue to compare until we get to a happy medium.
This algorithm is adapted from Tobias Baldauf’s work in compressing jpegs. We first tested compression using this bash script before running into some issues with speed. While this would do well for offline asynchronous optimizations of images, we wanted to keep ours synchronous so that users could see the same image on upload as they would when it was displayed to users. We were breaking 15-20 seconds on larger images by shelling out from PHP and so dug in a bit to look for some performance wins.
Part of the slowdown was that cjpeg-dssim was converting from jpeg to png to compare with dssim. Since the time the original bash script was written, though, there had been an update to allow dssim to compare jpegs. We cut off half the time or more by not having to do the conversions each way. We also ported the code to PHP, as we already had the image blob in memory and could avoid some of the calls reading and writing the image to disk each time. We also adjusted to be a bit more conservative with some of the resizing, limiting it to a range of jpeg quality between 85 and 70 to short circuit the comparisons by starting at an image quality of 78 and breaking out if we went above 85 or below 70. As we cut numerous sizes for an image, we could in theory perform this for each resized image, but we rarely saw the jpeg quality value differ and so use the calculated jpeg quality for one size on all cropped and resized images of a given upload.
As you’d expect, we’re adding work to our image processing and so had to take that into account. We believed that the additional upfront cost in time during the upload would pay for itself many times over. The speed savings in downloading images means an improvement in the time to render the important content on a page. We were adding 500ms to our processing time on average and about 800ms for the upper 90th percentile, which seemed within acceptable limits.
This will vary, though, on the dimensions of the image you’re processing. We’re choosing an image we display often, which is 570px in width and a variable height (capped at 1500px). Larger images will take longer to process.
As we strive to Graph All the Things™, we kept track of how the jpeg quality changed from the standard 85 to vary from image to image, settling at roughly 73 on average.
The corresponding graphs for bytes stored, as expected, dropped as well. We saw reductions often ranging from 25% to 30% in the size of images stored, comparing the file sizes of images in April of 2016 with those in April of 2017 on the same day:
Total file size of images in bytes in 1 week in April, 2016 vs 2017
Page load performance
So how does that stack up to page load on Etsy? As an example, we examined the listing page on Etsy, which displays items in several different sizes: a medium size when the page first loads, a large size when zoomed in, and several tiny thumbnails in the carousel as well as up to 8 on the right that show other items from the shop. Here is a breakdown of content downloaded (with some basic variations depending upon the listing fetched):
At Etsy, we frontload our static assets (js, fonts, css) and share it throughout the site, though, so that is often cached when a user reaches the listings page. Removing those three, which total 6,371,718 bytes, we’re left with 1,113,402 bytes, of which roughly 92% is comprised of image data. This means once a user has loaded the site’s assets, the vast majority of their download is then the images we serve on a given page as they navigate to each listing. If we apply our estimate of 23% savings to images, reducing image data to 787,571 and the page data minus assets to 878,153, on a 2G mobile device (roughly 14.4 kbps) it’s the difference in downloading a page in 77.3 seconds vs 60.8 seconds. For areas without 4G service, that’s a big difference! This of course is just one example page and the ratio of image size to page size will vary depending upon the listing.
Images that suffer more
Not all images are created equal. What might work for a handful of images is not going to be universal. We saw that some images in particular have a greater drop in jpeg quality while also suffering visual quality in our testing. Investigating, there are a few cases where this system could use improvements. In particular, jpeg quality typically suffers more when images contain line art or thin lines, text, and gradients, as the lossy compression algorithms for jpegs general don’t play well in these contexts. Likewise, images that have a large background area of a solid color and a small focus of the product would suffer, as the compression would run heavily on large portions of empty space in the image and then over compress the focus of the image. This helped inform our lower threshold of 70 for jpeg quality, at which we decided further compression wasn’t worth the loss in visual quality.
The first uncompressed, the second compressed to jpeg quality 40.
Page load speed isn’t the only win for minifying images, as our infrastructure also sees benefits from reducing the file size of images. CDN’s are a significant piece of our infrastructure, allowing for points of presence in various places in the world to serve static assets closer to the location of request and to offload some of the hits to our origin servers. Small images means we can cache more files before running into eviction limits and reduce the amount we’re spending on CDN’s. Our internal network infrastructure also scales further as we’re not pushing as many bytes around the line.
Availability of the tool and future work
We’re in the process of abstracting the code behind this into an open source tool to plug into your PHP code (or external service). One goal of writing this tool was to make sure that there was flexibility in determining what “good” is, as that can be very subjective to users and their needs. Perhaps you’re working with very large images and are willing to handle more compression. The minimum and maximum jpeg quality values allow you to fine tune the amount of compression that you’re willing to accept. On top of that, the threshold constants we use to narrow in when finding the “sweet spot” with dssim are configurable, giving you leverage to adjust in another direction. In the meantime, following the above guidelines on using a binary search approach combined with dssim should get you started.
Not all of the challenges behind these adjustments are specifically technical. Setting the jpeg quality of an image with most software applications is straightforward, as it is often an additional parameter to include when converting images. Deploying such a change, though, requires extensive testing over many different kinds of images. Knowing your use cases and the spectrum of images that make up your site is critical to such a change. With such a shift, knowing how to roll back, even if it requires a backfill, can also be useful as a safety net. Lastly, there’s merit in knowing that tweaking images is rarely “finished”. Some users like an image that heavily relies on sharpening images, while another might want some soft blending when resizing. Keeping this in mind was important for us to find out how we can continue to improve upon the experience our users have, both in terms of overall page speed and display of their products.
In which Etsy transforms its app release process by aligning it with its philosophy for web deploys
Deploying code should be easy. It should happen often, and it should involve its engineers. For Etsyweb, this looks like continuous deployment.
A group of engineers (which we call a push train) and a designated driver all shepherd their changes to a staging environment, and then to production. At each checkpoint along that journey, the members of the push train are responsible for testing their changes, sharing that they’re ready to ship, and making sure nothing broke. Everyone in that train must work together for the safe completion of their deployment. And this happens very frequently: up to 50 times a day.
mittens> .join TOPIC: mittens
sasha> .join with mittens TOPIC: mittens + sasha
pushbot> mittens, sasha: You're up TOPIC: mittens + sasha
sasha> .good TOPIC: mittens + sasha*
mittens> .good TOPIC: mittens* + sasha*
pushbot> mittens, sasha: Everyone is ready TOPIC: mittens* + sasha*
nassim> .join TOPIC: mittens* + sasha* | nassim
mittens> .at preprod TOPIC: <preprod> mittens + sasha | nassim
mittens> .good TOPIC: <preprod> mittens* + sasha | nassim
sasha> .good TOPIC: <preprod> mittens* + sasha* | nassim
pushbot> mittens, sasha: Everyone is ready TOPIC: <preprod> mittens* + sasha* | nassim
mittens> .at prod TOPIC: <prod> mittens + sasha | nassim
mittens> .good TOPIC: <prod> mittens* + sasha | nassim
asm> .join TOPIC: <prod> mittens* + sasha | nassim + asm
sasha> .good TOPIC: <prod> mittens* + sasha* | nassim + asm
asm> .nm TOPIC: <prod> mittens* + sasha* | nassim
pushbot> mittens, sasha: Everyone is ready TOPIC: <prod> mittens* + sasha* | nassim
mittens> .done TOPIC: nassim
pushbot> nassim: You're up TOPIC: nassim
lily> .join TOPIC: nassim | lily
This strategy has been successful for a lot of reasons, but especially because each deploy is handled by the people most familiar with the changes that are shipping. Those that wrote the code are in the best position to recognize it breaking, and then fix it. Because of that, developers should be empowered to deploy code as needed, and remain close to its rollout.
App releases are a different beast. They don’t easily adapt to that philosophy of deploying code. For one, they have versions and need to be compiled. And since they’re distributed via app stores, those versions can take time to reach end users. Traditionally, these traits have led to strategies involving release branches and release managers. Our app releases started out this way, but we learned quickly that they didn’t feel very Etsy. And so we set out to change them.
Jen and Sasha
We were the release managers. Jen managed the Sell on Etsy apps, and I managed the Etsy apps. We were responsible for all release stage transitions, maintaining the schedule, and managing all the communications around releases. We were also responsible for resolving conflicts and coordinating cross-team resources in cases of bugs and urgent blockers to release.
Ready to Ship
A key part of our job was making sure everyone knew what they’re supposed to do and when they’re supposed to do it. The biggest such checkpoint is when a release branches — this is when we create a dedicated branch for the release off master, and master becomes the next release. This is scheduled and determines what changes make it into production for a given release. It’s very important to make sure that those changes are expected, and that they have been tested.
For Jen and me, it would’ve been impossible to keep track of the many changes in a release ourselves, and so it was our job to coordinate with the engineers that made the actual changes and make sure those changes were expected and tested. In practice, this meant sending emails or messaging folks when approaching certain checkpoints like branching. And likewise, if there were any storm warnings (such as show-stopping bugs), it was our responsibility to raise the flag to notify others.
Then Jen left Etsy for another opportunity, and I became a single-point-of-failure and a gatekeeper. Every release decision was funneled through me, and I was the only person able to make and execute those decisions.
I was overwhelmed. Frustrated. I was worried I’d be stuck navigating iTunes Connect and Google Play, and sending emails. And frankly, I didn’t want to be doing those things. I wanted those things to be automated. Give me a button to upload to iTunes Connect, and another to begin staged rollout on Google Play. Thinking about the ease of deploying on web just filled me with envy.
This time wasn’t easy for engineers either. Even back when we had two release managers, from an engineer’s perspective, this period of app releases wasn’t transparent. It was difficult to know what phase of release we were in. A large number of emails was sent, but few of them were targeted to those that actually needed them. We would generically send emails to one big list that included all four of our apps. And all kinds of emails would get sent there. Things that were FYI-only, and also things that required urgent attention. We were on the path to alert-fatigue.
All of this meant that engineers felt more like they were in the cargo hold, rather than in the cockpit. But that just didn’t fit with how we do things for web. It didn’t fit with our philosophy for deployment. We didn’t like it. We wanted something better, something that placed engineers in front of the tiller.
So we built a vessel that coordinates the status, schedule, communications, and deploy tools for app releases. Here’s how Ship helps:
- Keeps track of who committed changes to a release
- Sends Slack messages and emails to the right people about the relevant events
- Manages the state and schedule of all releases
It’s hard to imagine all of that abstractly, so here’s an example:
- A cron moves the release into “Testing” and generates testing build v18.104.22.168.
- Ship is notified of this and sends an email to Alicia with the build.
- Alicia installs the build, verifies her changes, and tells Ship she’s ready.
- The final testing finds no show-stopping issues
- A cron submits v4.64.0 to iTunes Connect for review.
- A cron checks iTunes Connect for the review status of this release, and updates Ship that it’s been approved.
- Ship emails Alicia and others letting them know the release is approved.
- A cron releases v4.64.0.
(Had Alicia committed to our Android app, a cron would instead begin staged rollout on Google Play.)
- Ship emails Alicia and others letting them know the release is out in production.
- Ship emails a report of top crashes to all the engineers in the release (including Alicia)
Before Ship, all of these components above would’ve been performed manually. But you’ll notice that release managers are missing from the above script; have we replaced release managers with all the automations in Ship?
Partially. Ship has a feature where each release is assigned a driver.
This driver is responsible for a bunch of things that we couldn’t or shouldn’t automate. Here’s what they’re responsible for:
- Schedule changes
- Shepherding ‘ready to ships’ from other engineers
- Investigating showstopping bugs before release
Everything else? That’s automated. Branching, release candidate generation, submission to iTunes Connect — even staged rollout on Google Play! But, we’ve learned from automation going awry before. By default, some things are set to manual. There are others for which Ship explicitly does not allow automation, such as continuing staged rollout on Google Play. Things like this should involve and require human interaction. For everything else that is automated, we added a failsafe: at any time, a driver can disable all the crons and take over driving from autopilot:
When a driver wants to do something manually, they don’t need access to iTunes Connect or Google Play, as each of these things is made accessible as a button. A really nice side effect of this is that we don’t have to worry about provisioning folks for either app store, and we have a clear log of every release-related action taken by drivers.
Drivers are assigned once a release moves onto master, and are semi-randomly selected based on previous drivers and engineers that have committed to previous releases. Once assigned, we send them an onboarding email letting them know what their responsibilities are:
Ready to Ship Again
The driver can remain mostly dormant until the day of branching. A couple hours before we branch, it’s the driver’s responsibility to make sure that all the impacting engineers are ready to ship, and to orchestrate efforts when they’re not. After we’re ready, the driver’s responsibility is to remain available as a point-of-contact while final testing takes place. If an issue comes up, the driver may be consulted for steps to resolve.
And then, assuming all goes well, comes release day. The driver can opt to manually release, or let the cron do this for them — they’ll get notified if something goes wrong, either way. Then a day after we release, the driver looks at all of our dashboards, logs, and graphs to confirm the health of the release.
But not all releases are planned. Things fail, and that’s expected. It’s naïve to assume some serious bug won’t ship with an app release. There’s plenty of things that can and will be the subject of a post-mortem. When one of those things happens, any engineer can spawn a bugfix release off the most-recently-released mainline release.
The engineer that requests this bugfix gets assigned as the driver for that release. Once they branch the release, they make the necessary bugfixes (others can join in to add bugfixes too, if they coordinate with the driver) in the release’s branch, build a release candidate, test it, and get it ready for production. The driver can then release it at will.
Releases are actually quite complicated.
It starts off as an abstract thing that will occur in the future. Then becomes a concrete thing actively collecting changes via commits on master in git. After this period of collecting commits, the release is considered complete and moves into its own dedicated branch. The release candidate is then built from this dedicated branch, which then gets thoroughly tested, and moved into production. The release itself then concludes as an unmerged branch.
Once a release branches, the next future release moves onto master. Each release is its own state machine, where the development and branching states overlap between successive releases.
Notifications: Slack and Email
Plugged into the output of Ship are notifications. Because there are so many points of interest en route to production, it’s really important that the right people are notified at the right times. So we use the state machine of Ship to send out notifications to engineers (and other subscribers) based on how much they asked to know, and how they impacted the release. We also allow anyone to sign up for notifications around a release. This is used by product managers, designers, support teams, engineering managers, and more. Our communications are very targeted to those that need or want them.
In terms of what they asked to know, we made it very simple to get detailed emails about state changes to a release:
In terms of how they impacted the release, we need to get that data from somewhere else.
We mentioned data Ship receives from outside sources. At Etsy, we use GitHub for our source control. Our apps have repos per-platform (Android and iOS). In order to keep Ship’s knowledge of releases up-to-date, we set up GitHub Webhooks to notify Ship whenever changes are pushed to the repo. We listen for two changes in particular: pushes to master, and pushes to any release branch.
When Ship gets notified, it iterates through the commits and uses the author, changed paths, and commit message to determine which app (buyer or seller) the commit affects, and which release we should attribute this change to. Ship then takes all of that and combines it into a state that represents every engineer’s impact on a given release. Is that engineer “user-impacting” or “dark” (our term for changes that aren’t live)? Ship then uses this state to determine who is a member of what release, and who should get notified about what events.
Additionally, at any point during a release, an engineer can change their status. They may want to do this if they want to receive more information about a release, or if Ship misunderstood one of their commits as being impacting to the release.
Everything up until has explained how Ship keeps track of things. But there’s been no explanation for how some of the automated actions affecting the app repo or things outside Etsy occur.
We have a home-grown tool for managing deploys called Deployinator, and we added app support. It can now perform mutating interactions with the app repos, as well as all the deploy actions related to Google Play and iTunes Connect. This is where we build the testing candidates, release candidate, branch the release, submit to iTunes Connect, and much more.
We opted to use Deployinator for a number of reasons:
- Etsy engineers are already familiar with it
- It’s our go-to environment for wrapping up a build process into a button
- Good for things that need individual run logs, and clear failures
In our custom stack, we have crons. This is how we branch on Tuesday evening (assuming everyone is ready). This is where we interface with Google Play and iTunes Connect. We make use of Google Play’s official API in a custom python module we wrote, and for iTunes Connect we use Spaceship to interface with the unofficial API.
The end result of Ship is that we’ve distributed release management. Etsy no longer has any dedicated release managers. But it does have an engineer who used to be one — and I even get to drive a release every now and then.
People cannot be fully automated away. That applies to our web deploys, and is equally true for app releases. Our new process works within that reality. It’s unique because it pushes the limit of what we thought could be automated. Yet, at the same time, it empowers our app engineers more than ever before. Engineers control when a release goes to prod. Engineers decide if we’re ready to branch. Engineers hit the buttons.
And that’s what Ship is really about. It empowers our engineers to deliver the best apps for our users. Ship puts engineers at the helm.
When a user searches for an item on Etsy, they don’t always type what they mean. Sometimes they type the query jewlery when they’re looking for jewelry; sometimes they just accidentally hit an extra key and type dresss instead of dress. To make their online shopping experience successful, we need to identify and fix these mistakes, and display search results for the canonical spelling of the actual intended query. With this motivation in mind, and through the efforts of the Data Science, Linguistic Tools, and the Search Infrastructure teams, we overhauled the way we do spelling correction at Etsy late last year.
The older service was based on a static mapping of misspelled words to their respective corrections, which was updated only infrequently. It would split a search query into its tokens, and replace any token that was identified as a misspelling with its correction. Although the service allowed a very fast way of retrieving a correction, it was not an ideal solution for a few reasons:
- It could only correct misspellings that had already been identified and added to the map; it would leave previously unseen misspellings uncorrected.
- There was no systematic process in place to update the misspelling-correction pairs in the map.
- There was no mechanism of inferring the word from the context of surrounding words. For instance, if someone intended to type butter dish, but instead typed buttor dish, the older system would correct that to button dish since buttor is a more commonly observed misspelling for button at Etsy.
- Additionally, the older system would not allow one-to-many mappings from misspellings to corrections. Storing both buttor -> button and buttor -> butter mappings simultaneously would not be possible without modifying the underlying data structure.
We, therefore, decided to upgrade the spelling correction service to one based on a statistical model. This model learns from historical user data on Etsy’s website and uses the context of surrounding words to offer the most probable correction.
Although this blog post outlines some aspects of the infrastructure components of the service, its main focus is on describing the statistical model used by it.
We use a model that is based upon the Noisy Channel Model, which was historically used to infer telegraph messages that got distorted over the line. In the context of a user typing an incorrectly spelled word on Etsy, the “distortion” could be from accidental typos or a result of the user not knowing the correct spelling. In terms of probabilities, our goal is to determine the probability of the correct word, conditional on the word typed by the user. That probability, as per Bayes’ rule, is proportional to the product of two probabilities:
We are able to estimate the probabilities on the right-hand side of the relation above, using historical searches performed by our users. We describe this in more detail below.
To extend this idea to multi-word phrases, we use a simple Hidden Markov Model (HMM) which determines the most probable sequence of tokens to suggest as a spelling correction. Markov refers to being able to predict the future state of the system based solely on its present state. Hidden refers to the system having hidden states that we are trying to discover. In our case, the hidden states are the correct spellings of the words the user actually typed.
The main components of an HMM are explained via the figure below. Assume, for the purpose of this post, that the user searched for Flower Girl Baske when they intended to search for Flower Girl Basket. The main components of an HMM are then as follows:
- Observed states: the words explicitly typed by the user, Flower Girl Baske (represented as circles).
- Hidden states: the correct spellings of the words the user intended to type, Flower Girl Basket (represented as squares).
- Emission probability: the conditional probability of observing a state given the hidden state.
- Transition probability: the probability of observing a state conditional upon the probability of the immediately previous observed state, like, for instance, in the figure below, the probability of transitioning from Girl to Baske, i.e., P(Baske|Girl). The fact that this probability does not depend on the probability of having observed Flower, the state that precedes Girl, illustrates the Markov property of the model.
Once the spelling correction service receives the query, it first splits the query into three tokens, and then suggests possible corrections for each of the tokens as shown in the figure below.
In each column, one of the correction possibilities represents the true hidden state of the token that was typed (emitted) by the user. The most probable sequence can be thought of as identifying the true hidden state for each token given the context of the previous token. To accomplish this, we need to know probabilities from two distributions: first, the probability of typing a misspelling given the correct spelling of the word, the emission probability, and, second, the probability of transitioning from one observed word to another, the transition probability.
Once we are able to calculate all the emission probabilities, and the transition probabilities needed, as described in the sections below, we can determine the most probable sequence using a common dynamic programming algorithm known as the Viterbi Algorithm.
The emission probability is estimated through an Error Model, which is a statistical model created by inferring historical corrections provided by our own users while making searches on Etsy. The heuristic we used is based on two conditions: a user’s search being followed immediately by another search similar to the first, and with the second search then leading to a listing click. We align these tokens to each other in a way that minimizes the number of edits needed on the misspelled query to transform it into the corrected second query.
We then make aligned splits on the characters, and calculate counts and associated conditional probabilities. We generate character level probabilities for four types of operations: substitution, insertion, deletion, and transposition. Of these, insertion and deletion also require us to also keep track of the context of neighboring letters — the probability of adding a letter after another, or removing a letter appearing after another, respectively.
To continue with the earlier example, when the user typed baske, the emitted token, they were probably trying to type basket, the hidden state for baske, which corresponds to the probability P(Baske|Basket). Error probabilities for tokens, such as these, are calculated by multiplying the character-level probabilities which are assumed to be independent. For instance, the probability of correcting the letter e by appending the letter t to it is given by:
Here, the numerator is the number of times in our dataset where we see an e being corrected to an et, and the denominator is the number of corrections where any letter was appended after e. Since we assume that the probability associated with a character remaining unchanged is 1, in our case, the probability P(Baske|Basket) is solely dependent on the insertion probability P(et|e).
Transition probabilities are estimated through a language model which is determined by calculating the unigram and bigram token frequencies seen in our historical search queries.
A specific instance of that, from the chosen example, would be the probability of going from one token in the Flower (first) column, say Flow, to a token, say Girl, in the Girl (second) column, which is represented as P(Girl|Flow). We are able to determine that probability from the following ratio of bigram token counts to unigram token counts:
The error models and language models are generated through Scalding-powered jobs that run on our production hadoop cluster.
Viterbi Algorithm Heuristic
To generate inferences using these models, we employ the Viterbi Algorithm to determine the optimal query, i.e., sequence of tokens to serve as a spelling correction. The main idea is to present the most probable sequence as the spelling correction. The iterative process goes, as per the figure above, from the first column of tokens to the last. At the nth iteration, we have the sequence with the maximum probability available from the previous iteration, and we choose the token from the nth column which increases the maximum probability from the previous iteration the most.
Let’s explain this more concretely by describing the third iteration for our main example: assume that we already have Flower girl as the most probable phrase at the second iteration. Now we pick the token from the third column that corresponds to the maximum of the following transition probability and emission probability products:
We can see from the figure above that the transition probability from Girl to Basket is high enough to overcome the lower emission probability of someone typing Baske when they mean to type Basket, and that, consequently, Basket is the word we want to add to the existing Flower Girl phrase.
We now know that Flower Girl Basket is the most probable correction as predicted by our models. The decision about whether we suggest this correction to the user, and how we suggest it, is made on the basis of its confidence score. We describe that process a little later in the post.
Hyperparameters and Model Training
We added a few hyperparameters to the bare bones model we have described so far. They are as follows:
- We added an exponent to the emission probability because we want the model to have some flexibility in weighting the error model component in relation to the language model component.
- Our threshold for offering a correction is a linear function of the number of tokens in the query. We, therefore, have a slope and the intercept of the threshold as a pair of hyperparameters.
- Finally, we also have a parameter corresponding to the probability of words our language model does not know about.
Our training data set consists of two types of examples: the first kind are of the form misspelling -> correct spelling, like jewlery -> jewelry, while the second kind are of the form correct spelling -> _, like dress -> _. The second type corresponds to queries that are already correct, and therefore don’t need a correction. This setup enables our model to distinguish between the populations of both correct and incorrect spellings, and to offer corrections for the latter.
We tune our hyperparameters via 10-fold cross-validation using a hill-climbing algorithm that maximizes the f-score on our training data set. The f-score is the harmonic mean of the precision and the recall. Precision measures how many of the corrections offered by the system are correct and, for our data set, is defined as:
Recall is a measure of coverage — it tries to answer the question, “how many of the spelling corrections that we should be offering are we offering?”. It is defined as:
Optimizing on the f-score is allows us to increase coverage of the spelling corrections we offer while keeping the number of invalid corrections offered in check.
Serving the Correction
We serve three types of corrections at Etsy based on the confidence we have in the correction. We use the following odds-ratio as our confidence score:
If the confidence is greater than a certain threshold, we display search results corresponding to the correction, along with a “Search instead of” link to see the results for the original query instead. If the confidence is below a certain threshold, we offer results of the original query, with a “Did you mean” option to the suggested correction. If we don’t have any results for the original query, but do have results for the corrected query, we display results for the correction independent of the confidence score.
Spelling Correction Service
With the new service in place, when a search is made on Etsy, we fire off requests to both the new Java-based spelling service and the existing search service. The response from the spelling service includes both the correction and its confidence score, and how we display the correction, based on the confidence score, is described in the previous section. For high-confidence corrections, we make an additional request for search results with the corrected query, and display those results if their count is greater than a certain threshold.
When the spelling service receives a query, it first splits it into its constituent tokens. It then makes independent suggestions for each of the tokens, and, finally, strings together from those suggestions a sequence of the most probable tokens. This corrected sequence is sent back to the web stack as a response to the spelling service request.
The correction suggestions for each token are generated by Lucene’s DirectSpellChecker class which queries an index generated from tokens that are generated through a Hadoop job. The Hadoop job counts tokens from historical queries, rejecting any tokens that appear too infrequently or those that appear only in search queries with very few search results. We have a daily cron that ships the tokens, along with the generated Error and Language model files, from HDFS to the boxes that host the spelling correction service. A periodic rolling restart of the service across all boxes ensures that the freshest models and index tokens are picked up by the service.
The original implementation of the model was limited, in the sense, that it could not suggest corrections that were, at the token level, more than two edits away from the original token. We have subsequently implemented an auxiliary framework of suggesting correction tokens that fixes this issue.
Another limitation was related to splitting and compounding of tokens in the original query to make suggestions. For instance, we were not able to suggest earring as a suggestion for ear ring. Some of our coworkers are working on modifying the service to accommodate corrections of this type.
Although we do use supervised learning to train the hyperparameters of our models, since launching the service we have additional inputs from our users which we can improve our models with. Specifically, users clicking the “Did you mean” link on our low-confidence corrections results page provides us with explicit positive feedback, while clicks on the “Search instead for” link on our high-confidence corrections results page provides us with negative feedback. The next major evolution of the model would be to explicitly use this feedback to improve corrections.
(This project was a collaboration between Melanie Gin and Benjamin Russell on the Linguistic Tools team who built out the infrastructure and front-end; Caitlin Cellier and Mohit Nayyar on the Data Science team who worked on implementing the statistical model; and Zhi-Da Zhong on the Search Infrastructure team who consulted on infrastructure design. A special thanks to Caitlin whose presentation some of these figures are copied from.)
In April of 2016 Etsy launched Pattern, a new product that gives Etsy sellers the ability to create their own hosted e-commerce website. With an easy-setup experience, modern and stylish themes, and guest checkout, sellers can merchandise and manage their brand identity outside of the Etsy.com retail marketplace while leveraging all of Etsy’s e-commerce tools.
The ability to point a custom domain to a Pattern site is an especially popular feature; many Pattern sites use their own domain name, either registered directly on the Pattern dashboard, or linked to Pattern from a third-party registrar.
At launch, Pattern shops with custom domains were served over HTTP, while checkouts and other secure actions happened over secure connections with Etsy.com. This model isn’t ideal though; Google ranks pages with SSL slightly higher, and plans to increase the bump it gives to sites with SSL. That’s a big plus for our sellers. Plus, it’s the year 2017 – we don’t want to be serving pages over boring old HTTP if we can help it … and we really, really like little green lock icons.
In this post we’ll be looking at some interesting challenges you run into when building a system designed to serve HTTPS traffic for hundreds of thousands of domains.
How to HTTPS
First, let’s talk about how HTTPS works, and how we might set it up for a single domain.
Let’s not talk that much about how HTTPS works though, because that would take a long time. Very briefly: by HTTPS we mean HTTP over SSL, and by SSL we mean TLS. All of this is a way for your website to communicate securely with clients. When a client first begins communicating with your site, you provide them with a certificate, and the client verifies that your certificate comes from a trusted certificate authority. (This is a gross over-simplification, here’s a very interesting post with more detail.)
Suffice to say, if you want HTTPS, you need a certificate, and if you want a certificate you need to get one from a certificate authority. Until fairly recently, this is the point where you had to open up your wallet. There were a handful of certificate authorities, each with an array of confusing options and pricey up-sells.
You pick whichever feels right to you, you get out your credit card, and you wind up with a certificate that’s valid for one to three years. Some CA’s offer API’s, but three years is a relatively long time, so chances are you do this manually. Plus, who even knows which cell in which SSL pricing matrix gets you API access?
From here, if you’re managing a limited number of domains, typically what you do is upload your certificate and private key to your load balancer (or CDN) and it handles TLS termination for you. Clients communicate with your load balancer over HTTPS using your domain’s public certificate, and your load balancer makes internal requests to your application.
And that’s it! Your clients have little green padlocks in their address bars, and your application is serving payloads just like it used to.
More certificates, more problems
Manually issuing certificates works well enough if your application is served off a handful of top-level domains. If you need certificates for lots and lots of TLDs, it’s not going to work. Have we mentioned that Pattern is available for the very reasonable price of $15 per month? A real baseline requirement for us is we can’t pay more for a certificate than someone pays for a year of Pattern. And we certainly don’t have time to issue all those certificates by hand.
Until fairly recently, our only options at this point would have been 1) do a deal with one of the big certificate authorities that offers API access, or 2) become our own certificate authority. These are both expensive options (we don’t like expensive options).
Let’s Encrypt to the rescue
Luckily for us, in April of 2016, the Internet Security Research Group (ISRG), which includes founders from the Mozilla Foundation and the Electronic Frontier Foundation, launched a new certificate authority named Let’s Encrypt.
The ISRG has also designed a communication protocol, Automated Certificate Management Environment (ACME), that covers the process of issuing and renewing certificates for TLS termination. Let’s Encrypt provides its service for free, and exclusively via an API that implements the ACME protocol. (You can read about it in more detail here.)
This is great for our project: we found a certificate authority we feel really good about, and since it implements a freely published protocol, there are great open source libraries out there for interacting with it.
Building out certificate issuance
Let’s re-examine our problem now that we have a good way to get certificates:
- We have a large quantity of custom domains attached to existing Pattern sites; we need to generate certificates for all of those.
- We have a constant stream of new domain registrations for new Pattern sites; we’ll need to get new certificates on a rolling basis.
- Let’s Encrypt certificates last 90 days; we need to renew our SSL certificates fairly frequently.
All of these problems are solvable now! We chose an open source ACME client library for PHP called AcmePHP. Another exciting thing about Let’s Encrypt using an open protocol, there are open source server implementations as well as client implementations. So we were able to spin up an internal equivalent of Let’s Encrypt called boulder and do our development and testing inside our private network, without worrying about rate limits.
The service we built out to handle this ACME logic, store the certificates we receive, and keep track of which domains we have certificates for we named CertService. It’s a service that deals with certificates, you see. CertService communicates with Let’s Encrypt, and also exposes an API for other internal Etsy services. We’ll go into more detail on that later in this post.
TLS termination for lots and lots of domains
Now that we’ve build out a service that can have certificates issued, we need a place to put them, and we need to use them to do TLS termination for HTTPS requests to Pattern custom domains.
We handle this for *.etsy.com by putting our certificates directly onto our load balancers and giving them to our CDN’s. In planning this project, we thought about hundreds of thousands of certificates, each with a 90 day lifetimes. If we did a good job spacing out issuing and renewing our certificates, that would add up to thousands of daily write operations to our load balancers and CDN’s. That’s not a rate we were comfortable with; load balancer and CDN configuration changes are relatively high risk operations.
What we did instead, is create a pool of proxy servers, and use our load balancer to distribute HTTPS traffic to them. The proxy hosts handle the client SSL termination, and proxy internal requests to our web servers, much like the load balancer does for www.etsy.com.
Our proxy hosts run Apache and we leverage mod_ssl and mod_proxy to do TLS termination and proxy requests. To preserve client IP addresses, we make use of the PROXY protocol on our load balancer and mod_proxy_protocol on the proxy hosts. We’re also using mod_macro to avoid ever having to write out hundreds of thousands of virtual hosts declarations for hundreds of thousands of domains. All put together, it looks something like this:
ProxyPass / https://internal-web-vip/
ProxyPassReverse / https://internal-web-vip/
<Macro VHost $domain>
Use VHost custom-domain-1.com
Use VHost custom-domain-n.com
To connect all this together, our proxy hosts periodically query CertService for a list of recently modified custom domains. Each host then 1) fetches the new certificates from CertService, 2) writes them to disk, 3) regenerates a config like the one above, and 4) does a graceful restart of Apache. These restarts are staggered across our proxy pool so all but one of the hosts is available and receiving requests from the load balancer at any give time (fingers crossed).
How do we securely store lots of certificates?
Now that we’ve figured out how to programmatically request and renew SSL certificates via LetsEncrypt, we need to store these certificates in a secure way. To do this, there are some guarantees we need to make:
- Private keys are stored in a database segmented from other types of data
- Private keys encrypted at rest and never leave CertService in plaintext
- SSL key pair generation and LetsEncrypt communications take place only on trusted hosts
- Private keys can be retrieved only by the SSL terminating hosts
Guarantee #1 is a no-brainer. If an attacker were to compromise a datastore containing thousands of SSL private keys in plaintext, they would be be able to intercept critical data being sent to thousands of custom domains. Since security is about raising costs for an attacker – that is, making it harder for an attacker to succeed – we employ a number of techniques to secure our keys. Our first layer of defense is at an infrastructure level: private keys are stored in a MySQL database away from the rest of the network. We use iptables to limit who can connect to the MySQL server, and given that CertService is the only client that needs access, the scope is really narrow. This vastly reduces attack surface, especially in cases where an attacker is looking to pivot from another compromised server on the network. Iptables is also then used to lock down who can communicate to the CertService API; adding constraints to connectivity on top of a secure authentication scheme makes retrieving certificates that much more difficult. That addresses Guarantee #4.
Now that we’ve locked down access to the database, we need to make sure they’re stored encrypted. For this, we make use of a concept known as a hybrid cryptosystem. Hybrid cryptosystems, in a nutshell, combine asymmetric (public-key crypto) and symmetric cryptosystems. If you’re familiar with SSL, much of how we handle crypto here is analogous to Session Keys.
At the start of this process, we have two pieces of data: the SSL private key and its corresponding public key – the certificate. We don’t particularly care about the certificate since that is public by definition. We start by generating a domain specific AES-256 key and encrypt the SSL private key. This only technically addresses the issue of not having plaintext on disk; the encrypted SSL private key is stored right next to the AES key, which can be used to both encrypt and decrypt. An attacker who could steal the encrypted keys could also steal the AES key. To address this, we encrypt the AES key with a CertService public key. Now we have an encrypted SSL private key (encrypted with the AES-256 key) and an encrypted AES key (encrypted with CertService’s RSA-2048 public key). Now not only are keys truly stored encrypted on disk, they also cannot be decrypted at all on CertService. This means if an attacker were to break CertService’s authentication scheme, the most they would receive is an encrypted SSL private key; they would still need the CertService private key – available only on the SSL terminating hosts – to decrypt it. Now we’ve fully taken care of Guarantee #2.
The only Guarantee that remains is #3. If key generation were compromised, an attacker would be able to grab private keys before they were encrypted and stored. If LetsEncrypt communication were compromised, an attacker could use our keys to generate certificates for domains we’ve already authorized (they could technically authorize new ones, but that would be significantly more difficult) or even revoke certificates. Both of these cases would render the entire system untrustworthy. Instead, we limit this functionality to CertService and expose it as an API; that way, if the web server handling Pattern requests were broken into, the attacker would not be able to affect critical LetsEncrypt flows.
One of our stretch goals is to look into deploying HSMs. If there are bugs in the underlying software, the integrity of the entire system could be compromised thus voiding any guarantees we try to keep. While bugs are inevitable, moving critical cryptographic functions into secure hardware will mitigate their impact.
No cryptosystem is perfect, but we’ve reached our goal of significantly increasing attacker cost. On top of that, we’ve supplemented the cryptosystem with our usual host based alerting and monitoring. So not only will an attacker have to jump through several hoops to get those SSL key pairs, they will also have to do it without being detected.
After the build
With all of that wrapped up, we had a system to issue large numbers of certificates, securely store this data, and terminate TLS requests with it. At this point Etsy brought in a third-party security firm to do a round of penetration testing. This process finished up without finding any substantive security weaknesses, which gives us an added level of confidence in our system.
Once we’ve gained enough confidence, we will enable HSTS. This should be the final goal of any SSL rollout as it forces browsers to use encryption for all future communication. Without it, downgrade attacks could be used to intercept traffic and hijack session cookies.
Every Pattern site and linked domain now has a valid certificate stored in our system and ready to facilitate secure requests. This feature is rolled out across Pattern and all Pattern traffic is now pushed over HTTPS!
(This project was a collaboration between Ram Nadella, Andy Yaco-Mink and Nick Steele on the Pattern team; Omar Ahmed and Ken Lee from Security; and Will Gallego from Operations. Big thanks to Dennis Olvany, Keyur Govande and Mike Adler for all their help.)
We recently ran a successful split-testing title tag experiment to improve our search engine optimization (SEO), the results and methodology of which we shared in a previous Code as Craft post. In this post, we wanted to share some of our further learnings from our SEO testing. We decided to double down on the success of our previous experiment by running a series of further SEO experiments, one of which included changes to our:
- Title tags
- Meta descriptions
We found three surprising results:
- Shortening our title tags showed improved performance in terms of visits (and other key metrics)
- Meta descriptions had a statistically significant impact on organic search traffic
- H1s had a statistically significant impact on organic search traffic
Some notes on our methodology
For full details on the setup and methodology of our SEO split testing methodology, please see the previous Code as Craft post referenced above.
For this particular test, we used six unique treatments and two control groups. The two control groups remained consistent with each other before and after the experiment.
To derive the exact estimated causal changes effected in visits, we most often used Causal Impact modeling as implemented in the CausalImpact package by Google to standardize the test buckets against the control buckets. In some cases, the difference in differences method was used because it was more reliable in estimating effect sizes when working with strong seasonality swings.
This experiment was also impacted by strong seasonality and external event effects related to the holidays, sports events and the US elections. The statistical modeling in our final experiment analysis was adjusted for these effects to ensure accurate measurement of the causal effects of the test variants.
Takeaway #1: Short Title Tags Win Again
The results of this experiment aligned with the findings from our previous SEO title tag experiments, where it appeared shorter title tags drove more visits. We have so far validated our hypothesis that shorter title tags perform better in title tags in multiple separate experiments including many different variations and now feel quite confident that shorter title tags perform better (as measured by organic traffic) for Etsy. We hypothesize that this effect could be taking place through a number of different causal mechanisms:
- Lower Levenshtein distance and/or higher percentage match to target search queries rewarded by Google’s search algorithm and thereby improving Etsy’s rankings in Google search results. Per Wikipedia: “the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.”
- Shorter title tags, consisting only of the target search keyword, appear more relevant/enticing to search users
Takeaway #2: Meta Descriptions Matter
We found that changes in the meta description of a page can lead to statistically significant changes to visits. It appeared that longer and descriptive meta descriptions performed better and that conversely, shorter, terse meta descriptions performed worse. We hypothesize that longer meta descriptions might perform better via two possible causal mechanisms:
- Longer meta descriptions take up more real estate in a search results page, improving CTRs
- Longer meta descriptions give an appearance of more authority or more content, improving CTRs
Takeaway #3: H1s matter
We found that an H1 change can have a statistically significant impact on organic search traffic. However, changes in the H1 section of a page appear to interact with changes in title tags in hard to predict ways. For example, in this experiment, a title tag change in a certain variant increased visits. However, when the title tag change was combined with an H1 change, the positive effect of the title tag change was dulled, even though an H1 change by itself in a different variant led to slight increases in visits. This highlights the importance of SEO testing before rolling out even seemingly minor changes to Etsy pages.
Accounting For Unexpected Events
The Donald Trump Effect
One example of an event we had to control for (among a number of others) was the “Donald Trump effect”. We observed a large bucket skew on November 9 and 10, 2016 in one of our test groups. Upon investigation, it was found that the skew was due to large increases (+2000% to +5180%) in daily visits to pages related to “Donald Trump” the day after the US Presidential elections. Although the spikes in traffic to these pages were short lived, lasting only several days, they nevertheless did have the potential to unduly bias or reduce the statistical significance of the results of our experiment. These pages were therefore removed and controlled for when conducting the causal impact analyses for this experiment.
Our SEO experiments illustrate the importance of running SEO tests before making changes to a web page and continue to offer surprising results. However, it is important to note that the results of our experiments are only true for Etsy and do not necessarily reflect best practices for sites generally. We would therefore encourage everyone to discover the strategy that works best for their website through rigorous SEO testing.
In 2012, I wrote a post for the Code As Craft blog about how we approach learning from accidents and mistakes at Etsy. I wrote about the perspectives and concepts behind what is known (in the world of Systems Safety and Human Factors) as the New View on “human error.” I also wrote about what it means for an organization to take a different approach, philosophically, to learn from accidents, and that Etsy was such an organization.
That post’s purpose was to conceptually point in a new direction, and was, necessarily, void of pragmatic guidance, advice, or suggestions on how to operationalize this perspective. Since then, we at Etsy have continued to explore and evolve our understanding of what taking this approach means, practically. For many organizations engaged in software engineering, the group “post-mortem” debriefing meeting (and accompanying documentation) is where the rubber meets the road.
Many responded to that original post with a question:
“Ok, you’ve convinced me that we have to take a different view on mistakes and accidents. Now: how do you do that?”
As a first step to answer that question, we’ve developed a new Debriefing Facilitation Guide which we are open-sourcing and publishing.
We wrote this document for two reasons.
The first is to state emphatically that we believe a “post-mortem” debriefing should be considered first and foremost a learning opportunity, not a fixing one. All too often, when teams get together to discuss an event, they walk into the room with a story they’ve already rationalized about what happened. This urge to point to a “root” cause is seductive — it allows us to believe that we’ve understood the event well enough, and can move on towards fixing things.
We believe this view is insufficient at best, and harmful at worst, and that gathering multiple diverse perspectives provides valuable context, without which you are only creating an illusion of understanding. Systems safety researcher Nancy Leveson, in her excellent book Engineering A Safer World, has this to say on the topic:
“An accident model should encourage a broad view of accident mechanisms that expands the investigation beyond the proximate events: A narrow focus on operator actions, physical component failures, and technology may lead to ignoring some of the most important factors in terms of preventing future accidents. The whole concept of ‘root cause’ needs to be reconsidered.”
In other words, if we don’t pay attention to where and how we “look” to understand an event by considering a debriefing a true exploration with open minds, we can easily miss out on truly valuable understanding. How and where we pay attention to this learning opportunity begins with the debriefing facilitator.
The second reason is to help develop debriefing facilitation skills in our field. We wanted to provide practical guidance for debriefing facilitators as they think about preparing for, conducting, and navigating a post-event debriefing. We believe that organizational learning can only happen when objective data about the event (the type of data that you might put into a template or form) is placed into context with the subjective data that can only be gleaned by the skillful constructing of dialogue with the multiple, diverse perspectives in the room.
The Questions Are As Important As The Answers
In his book “Pre-Accident Investigations: Better Questions,” Todd Conklin sheds light on the idea that the focus on understanding complex work is not in the answers we assert, but in the questions we ask:
“The skill is not in knowing the right answer. Right answers are pretty easy. The skill is in asking the right question. The question is everything.”
What we learn from an event depends on the questions we ask as facilitators, not just the objective data you gather and put into a document. It is very easy to assume that the narrative of an accident can be drawn up from one person’s singular perspective, and that the challenge is what to do about it moving forward.
We do not believe that to be true. Here’s a narrative taken from real debriefing notes, generalized for this post:
“I don’t know,” the engineer said, when asked what happened. “I just wasn’t paying attention, I guess. This is on me. I’m sorry, everyone.” The outage had lasted only 9 minutes, but to the engineer it felt like a lifetime. The group felt a strange but comforting relief, ready to fill in the incident report with ‘engineer error’ and continue on with their morning work.
The facilitator was not ready to call it a ‘closed case.’
“Take us back to before you deployed, Phil…what do you remember? This looks like the change you prepped to deploy…” The facilitator displayed the code diff on the big monitor in the conference room.
Phil looked closely to the red and green lines on the screen and replied “Yep, that’s it. I asked Steve and Lisa for a code review, and they both said it looked good.” Steve and Lisa nod their heads, sheepishly.
“So after you got the thumbs-up from Steve and Lisa…what happened next?” the facilitator continued.
“Well, I checked it in, like I always do,” Phil replied. “The tests automatically run, so I waited for them to finish.” He paused for a moment. “I looked on the page that shows the test results, like this…” Phil brought up the page in his browser, on the large screen.
“Is that a new dashboard?” Lisa asked from the back of the room.
“Yeah, after we upgraded the Jenkins install, we redesigned the default test results page to the previous colors because the new one was hard to read,” replied Sam, the team lead for the automated testing team.
“The page says eight tests failed.” Lisa replied. Everyone in the room squinted.
“No, it says zero tests failed, see…?” Phil said, moving closer to the monitor.
Phil hit control-+ on his laptop, increasing the size of the text on the screen. “Oh my. I swear that it said zero tests failed when I deployed.”
The facilitator looked at the rest of the group in the conference room. “How many folks in the room saw this number as a zero when Phil first put it up on the screen?” Most of the group’s hands went up. Lisa smiled.
“It looked like a zero to me too,” the facilitator said.
“Huh. I think because this small font puts slashes in its zeros and it’s in italics, an eight looks a lot like a zero,” Sam said, taking notes. “We should change that.”
As a facilitator, it would be easy to stop asking questions at the mea culpa given by Phil. Without asking him to describe how he normally does his work, by bringing us back to what he was doing at the time, what he was focused on, what led him to believe that deploying was going to happen without issue, we might never have considered that the automated test results page could use some design changes to make the number of failed tests clearer, visually.
In another case, an outage involved a very complicated set of non-obvious interactions between multiple components. During the debriefing, the facilitator asked each of the engineers who designed the systems and were familiar with the architecture to draw on a whiteboard the picture that they had in their minds when they think about how it all is expected to work.
When seen together, each of the diagrams from the different engineers painted a fuller picture of how the components worked together than if there was only one engineer attempting to draw out a “comprehensive” and “canonical” diagram.
The process of drawing this diagram together also brought the engineers to say aloud what they were and were not sure about, and that enabled others in the room to repair those uncertainties and misunderstandings.
Both of these cases support the perspective we take at Etsy, which is that debriefings can (and should!) serve multiple purposes, not just a simplistic hunt for remediation items.
By placing the focus explicitly on learning first, a well-run debriefing has the potential to educate the organization on not just what didn’t work well on the day of the event in question, but a whole lot more.
The key, then, to having well-run debriefings is to start treating facilitation as a skill in which to invest time and effort.
Influencing the Industry
We believe that publishing this guide will help other organizations think about the way that they train and nurture the skill of debriefing facilitation. Moreover, we also hope that it can continue the greater dialogue in the industry about learning from accidents in software-rich environments.
In 2014, knowing how critical I believe this topic to be, someone alerted me to the United States Digital Service Play Books repo on Github, and an issue opened about incident review and post-mortems. I commented there that this was relevant to our interests at Etsy, and that at some point I would reach back out with work we’ve done in this area.
It took two years, but now we’re following up and hoping to reopen the dialogue.
Our Debriefing Facilitation Guide repo is here, and we hope that it is useful for other organizations.
Note: This article was adapted from an internal Etsy newsletter earlier this year. As the holidays roll around, it seemed like a timely opportunity to share what we do with a larger audience.
As the calendar year draws to a close, people’s thoughts often turn to fun activities like spending time with family and friends and enjoying pumpkin or peppermint flavored treats. But for retailers, the holiday season is an intense and critically important period for the business.
The months of November and December compose nearly a fifth of all US retail sales and pretty much every retailer needs to undertake special measures during the holidays, from big sales promotions to stocking up on popular items, to hiring additional staff to stock inventory and reduce wait times at checkout.
A lot of these measures apply as well to digital retailers, with the added risk of the entire site running slowly or not at all. In 2015, Neiman Marcus experienced an extended outage on Black Friday and Target and PayPal were intermittently down on Cyber Monday.
Etsy is no stranger to this holiday intensity. This is the biggest shopping season of the year for us and we typically receive more site visits than at other times, which translate into more orders. Over the years, our product, engineering, and member-facing organizations have developed practices and approaches to support our community during the intensity of the holidays.
How Etsy Handles the Holidays
The increase in site traffic and transactions impacts many areas of the business. Inbound support emails and Non-Delivery Cases reach a peak in December, and the Trust and Safety team ramps down outside project work and hiring efforts to focus on providing exceptional support.
“Emotions tend to be more heightened around the holidays” says Corinne Haxton Pavlovic, head of the trust and safety team at Etsy, “There’s a lot on the line for everyone this time of year – highs can feel higher and lows can feel lower. We have to really dig into our EQ and make sure that we’re staying neutral and empathetic.”
For our sellers, the holidays can be an exciting but scary time: “the Etsy sales equivalent of Laird Hamilton surfing a 70-ft wave off Oahu’s North Shore” says Joseph Stanek, seller account manager. Stanek works with a portfolio of top Etsy sellers to advise and support their business growth. He’s found that many sellers spend an enormous amount of effort on holiday sales promotion, and are then hit with a record number of orders. They’re pushed to increase their shipping and fulfillment capabilities, which can serve “as a kind of graduation” as they level up to a new tier of business.
With huge numbers of buyers browsing for the perfect gift, and sellers working hard to manage orders, it’s critically important for Etsy’s platform to be as clear and reliable as possible. That’s why the period between mid-October through December 31st is a time to be exceptionally careful and more conservative than usual about how we make changes to the site. We affectionately refer to this period as “Slush.”
The Origins of Slush
The actual term “Slush” is a play on the phrase “code freeze,” which is when a piece of software is deemed finished and thus no new changes are being made to the code base. “Code freezes help to ensure that the system will continue to operate without disruptions,” says Robert Tekiela at CTO Insights. It’s a way to prevent new bugs from being created and and are “commonly used in the retail industry during holiday shopping season when systems load is at a peak.”
Since at Etsy, we still push code during the holidays, just more carefully, it’s not a true code freeze, but a cold, melty mixture of water and ice. Hence, Slush.
According to Jason Wong, Slush at Etsy got started sometime after current CEO Chad Dickerson became CTO in the fall of 2008. As an engineering director of infrastructure, Jason has been a key part of Etsy’s platform stability since joining in 2010. Back then, Etsy’s infrastructure was less robust and the team was still figuring out how to effectively support the already high levels of traffic on Etsy.com. There was not yet a process for managing code deployments during the holidays and the site experienced more crashes than it does today. .
Said Wong, “the question was: during the holiday season, a high traffic, high visibility time when we made a significant portion of our [gross merchandise sales], how do we stabilize the site? That’s where Slush got started.”
Here’s the slightly redacted email from Chad that kicked off the idea of Slush.
From: Chad Dickerson
Date: Fri, Oct 31, 2008 at 4:08 PM
Subject: holiday “slush” (i.e. freeze) — need your input
To: adminName and adminName
Cc: adminName, adminName, and adminName
adminName / adminName,
adminName, adminName, and I met recently to discuss a holiday “freeze” beginning at end of day on November 14. We’re calling it a “slush” because there are certain types of projects that we can still do without making critical changes to the database or introducing bugs. The goal with setting this freeze is to eliminate the distractions of any projects not on the must-do list so we can focus on the most important projects exclusively.
adminName, adminName, adminName, and I met earlier this week to mutually agree on what projects we need to complete before the freeze/slush beginning at the end of the day on 11/14. We came up with a list of “must do” projects that need to get done by 11/14, and “nice to have” projects if we complete the must-dos:
[ Link to Document Detailing Slush Plan ]
There are couple of projects that we’ve discussed already on the list, like Super Etsy Mini and BuyHandmade blog. I wanted to make sure that any “must do” projects in your worlds are reflected and prioritized. On Monday, we’re going to start doing daily standups at 11:30am to track progress against the agreed-upon list of projects leading up to 11/14. Since there will be 9.5 working days to execute, we need to freeze the list itself by Monday. Can you review the list and let us know if you have additions that we are missing that we should discuss? Thanks.
Learning to Build Safely
In the early days, Slush was far more strict, in part because Etsy’s infrastructure was not as robust and mature as it is today. We operated off of a federated database model, which in theory was meant to prevent one database crash from affecting another, but in practice, it was hard to keep clusters from affecting one another and site stability suffered as a result. This technical approach also made it hard to understand what went wrong and how the team could fix it.
Engineers went from deploying five times a day to once a day. Feature flags were tested thoroughly so that major features like convos or add to cart could be turned off without shutting down the site.
Over the past few years, a major effort was made to get Etsy’s one really big box called master onto a sharded database model. With a sharded system, data is distributed across a series of smaller active-active pairs so that if single database goes down, there is an exact replica with the data intact. This system is faster, more scalable, and resilient compared to the prior method of simply storing all the data in one really big box. In 2016, we successfully migrated all of our key databases, including receipts, transactions, and many others, “to the shards” and decommissioned the old master database.
Developing continuous deployment was also a major feat which allowed Etsy to develop A/B testing and feature flagging. These technical efforts, in conjunction with our culture of examining failures through blameless postmortems, have allowed allowed Etsy to get better at building safely. Today our engineering staff and systems are studied by organizations around the world.
Slush Today and Tomorrow
Within the engineering organization, there’s often a senior staff member who helps organize Slush. The role is an informal one, meant to share best practices and encourage the product org to be mindful of the higher stakes of the holidays. Tim Falzone, an engineering manager on Infrastructure, took on this role in 2015 and presented a few slides at the September Engineering All-Hands which highlight the way we handle Slush today.
Today, Slush means that major buyer and seller facing feature changes are put on hold, or pushed “dark,” where they are hidden behind config flags and not shown publicly. Additionally, engineers get more people to review their pull requests of code changes. These extra precautions are taken to ensure that the site runs quickly and with minimal errors or downtime even with the increased traffic.
Falzone says that now, Slush is less about not breaking the site and more about preventing disruptions for members. “You could easily make something that works flawlessly but tanks conversion or otherwise sends buyers away, or is really slow,” he explained. For sellers, managing a huge wave of orders means relying on muscle memory of how Etsy works, which means that the holidays is a bad time to change the workflow or otherwise add friction for our sellers, who often become profitable on their business for the year during this time.
As Etsy grows, Slush will continue to evolve. A more powerful platform also means more points of integration. More traffic means more pressure on parts of the platform.
Even as we work to secure more headroom for our infrastructure and develop tooling to stress-test our systems, we will always be challenged in new ways. Though we’ve come a long way, Slush will continue to be a helpful reminder to move safely during a critical time of the year for our members and our organization.
This is the third post in a series of three about Etsy’s API, the abstract interface to our logic and data.In the last posts we covered how we built a new API framework, and we clearly identified the gains in terms of performance and shared abstraction layer between languages and devices. But how did we make an entire engineering organization switch to the new framework?
How did we achieve the cultural transformation to API first? How do we avoid this being the new thing that everyone knows about, but no one has the time to try?
How did we “sell” it?
In our case, we had multiple strategies that worked together.
The first one was communication: It was simple to write new endpoints, and the gains were very clear: Both the performance gain through concurrent data fetch, and the possibility to share an endpoint with the apps, were huge selling points.
We partnered with the Product organisation to make clear that they needed to include a standard question in their project templates “Is this being built on APIv3? If not, why not?”, which enforced the company strategy to adopt Mobile First development.
We had pilot groups that tried out the new API framework for new product features, we partnered closely with the Activity Feed team to help them adopt APIv3. This resulted in early converts that were strong advocates for wider adoption. The benefits were clear to communicate but also so compelling that we didn’t need to do much to sell teams, they could see what the activity feed had accomplished.
We had evolving documentation about the architecture and how to use the framework, in the API handbook. We had a codelab, which engineers could use for self study. In the lab you learn to build endpoints by building a little library app.
We had workshops in which people did the code lab together with experienced API developers.
We had tooling to learn about the system, such as the distributed tracing tool CrossStitch, the fanout manager informing about the complexity of the fetch, and datafindr for general information about API endpoints and example calls and results.
Once it’s going, too fast to keep up
After the word was spreading, so many people were making new endpoints that it was almost impossible for us to keep up with it. Initially, we alerted on each new endpoint, but had to switch this off due to the rate at which new endpoints were being built.
The biggest motivator for adoption was that we communicated the gains. Everyone became interested in the performance gains and the cross-device-sharing through abstraction, which was a motivating incentive to switch. Also, there was a potential speedup through caching of endpoints.
White space gets filled with code
Immediately after adoption started, we saw some misunderstandings in the code that we did not foresee. Such as magic hiding in traits, inheritance between endpoints, while they contain just declarative static functions, or really complex code in bespoke endpoints, which should be library code. The minimalist code required to write an endpoint didn’t make explicit what is missing by accidental omission vs. deliberate design decisions.
This also lead to confusion about which building blocks of endpoints are mandatory, about how to opt into a service, naming conventions, the location in the file system, and the complete route of an endpoint.
This was a caused some cleanup work for the API team, but in fact it was a good thing, caused by fast and wide adoption of a new system, and is probably inevitable in that case.
Documentation is critical
We addressed the problems by improving our documentation, and creating an interactive endpoint generator that creates a file in the right place, with stub functions, and outputs the route.
Also, we added format checks for endpoints in Jenkins. It was helpful to have the API code lab as a narrative introduction to gain practical experience fast, while also developing the API handbook as a reference manual. Two goals, two documents.
Pockets of Patterns
When we internally announced the project in which we unified the format for how an API endpoint is written, we started getting emails about what developers wanted from the API. And also: Mails about patterns that had emerged within the API framework without our knowledge. People had found and implemented solutions for their specific use cases. Our discussions had opened up a space to evaluate those patterns, and share them with all API endpoint developers!
We’ve learned that we need to stay on top of new patterns as they emerge, and keep talking to developers about their needs. A subtle, but powerful paradigm shift:
We, as the API framework developers, listen to endpoint developers and address everyone’s needs by evolving the framework. Instead of having developers using our API framework as a service, we are serving them.
Also, we learned to trust our fellow developers even more! We underestimated their curiosity and willingness to try out new things, we underestimated the adoption rate, and we also underestimated how creative they would be with finding solutions for their specific use cases that we did not plan for. What an awesome surprise. 😀
Types too late + types too loose D:
Also, we underestimated everyone’s willingness to do the work of typing their endpoint result. In the beginning, we made specifying a resultType optional, because we feared making it mandatory would slow the adoption. And if a type was specified, we only sampled the type errors, to not make correct typing a hurdle during API endpoint development, but rather a “nice to have” hint for when things go wrong. Not a guarantee. In retrospect, we could have saved ourselves a lot of extra work if we had made the resultType mandatory, and if we had made the type errors more prominent from the beginning.
Etsy’s developers are generally happy about result types, they helped to implement a coarse grained gradual typing, and actively pushed us towards making the result types mandatory, to rely on them as a guarantee.
Work in progress
There remain four open problems we all touched upon:
- Active Cache Invalidation – it’s hard.
- Caching the API at different geographic locations (Edge)
- opening up v3 to 3rd party developers
A question that comes with the developer adoption question is: What are we thinking in terms of preparing for third parties? This is a valid question, and we did not fully answer it yet, because we switched to generated client code in the meantime.
Even though we started announcing the new API v3 on our developer mailing lists, all third party apps are still on version 2. The platform for v3 is ready to open up to 3rd parties, as soon as Etsy as a whole decides to make the switch.
What did we learn? How were we transformed?
This story is a case study of how the API first approach transformed the architecture, and also transformed how we work with and think about our API at Etsy. It covers a lot of ground and is influenced by our infrastructure and history. The story of architectural decisions and adoption is transferable to other systems though. So what did we learn from the decisions? Or from the decisions not being made yet? Were there surprises?
We learned that we can seamlessly grow a system from a hack day project to a live system. Over time, a domain specific language of endpoints evolved, and the system grew according to the endpoint developers needs.
Also, we learned a great deal about cURL or the according php extension! Not only does it allow to make parallel requests, but it also let’s us check on, control and modify the in-flight requests in a non-blocking way.
Another realization is that we should have thought about caching early on. We added it in a hurry and are still working on active cache invalidation. So far, we can only use timeouts, which limits the class of endpoints that can be cached. Also, we have to think some more about variations due to different locales, which might only vary for parts of the response.
A huge positive surprise was the HHVM experiment. By teeing traffic and trying out a completely new system, we solved our performance problems to this day.
The textbook approach says: design APIs contract first. As we have seen, we circumvent that part by using automated client generation. Components help us with abstraction, but even the bespoke layer is very malleable. This is an interesting trick.
Another surprising learning is that we should trust our developers not only to adopt and adapt to the new API framework, but also to go all the way and make typing of endpoints mandatory from the start.
Also, we should keep the conversation ongoing to find out how the API framework can serve developers and their needs, instead of just designing it as a rigid service. We even found solutions that they had created themselves, and the framework could officially adopt them for everyone’s benefit.
Did it work? How did it end? Did we succeed?
Let’s assess this with a quote from Etsy’s Andrew Morrison, from before all this work started:
“We desperately need to figure out a scheme for allowing concurrency or else we’re going to have performance problems forever.”
YUP! We solved this, and a few unexpected things along the way. Despite initial problems with the extra layer, we figured out unconventional solutions by experimenting and organically growing the system towards the developers needs.
What we are discussing now is how to shift some of the complexity back towards the client. Maybe GraphQL is an interesting approach? Right now, the clients don’t know how the queries will be executed. This makes sense if you have services and teams and clear cut boundaries and interfaces, and if you have a contract in form of an API. Our approach is currently not structured like that.
But could we compile an alternative, more knowledgeable PHP client, that lifts the composability from the HTTP layer into the API consumer code, in cases where we are our own consumer and create a website of the same tree structure? It’s clear in this case how many endpoints we are calling, and as long as it’s us, it’s safe to lift this control beyond the client, into the view.
This is the third post in a series of three about Etsy’s API, the abstract interface to our logic and data.
‘Crashcan’ (think trashcan, but for crashes) is Etsy’s internal application for mobile crash analytics. We use an external partner to collect and store crash and exception data from our mobile apps, which we then ingest via their API. Crashcan is a single-page web app that refines and better exposes the crash data we receive from our external partner.
Crashcan gives us extra analysis of our crashes on top of what our partner offers. We can make less cautious assumptions about our stack traces internally, which allows us to group more of them together. It connects crashes to our user data and internal tooling. It allows us to search for crashes in a range of versions. It’s provided a good balance between building our own solution from scratch and completely relying on an external partner. We get the ability to customize our mobile crash reporting without having to maintain an entire crash reporting infrastructure.
Error Reporting – The Web vs. Mobile
Unfortunately, collecting mobile crashes and exceptions is (of necessity) quite different from most error reporting on the web, in several ways:
- On the web (especially the desktop web), we can be fairly confident that a user is online – they’re less prone to flaky or slow connections. Plus, users don’t expect to be able to access the web when they don’t have a connection.
- ‘Crashes’ are very different on the web, so many exceptions and errors are less severe. Sure, users may need to refresh a page, but it’s rare that a web page will crash their browser.
- We can watch graphs and logs live as we deploy on the web (hooray, continuous deployment!) – and it’s clear if hundreds or thousands of exceptions start pouring in. With our mobile apps, however, we have to wait for users to install new versions after a release makes it to the App Store. Only then do we get to see exceptions.
- With mobile, when a crash occurs, we normally can’t send the crash until the app is launched again (with a data connection) – in some cases, this can be days or weeks.
- App crashes are costly to the user, as the app crashing loses the user’s state. On the web, even if a page breaks for some reason, the user keeps their browser history.
- With the web, there’s one version of Etsy at any point in time. It’s updated continuously, and every user is running the latest version, always. With the apps, we have to deal with multiple versions at once, and we have to wait for users to update.
With these differences, it’s been important to approach the analysis of crashes and exceptions differently than we do on the web.
A Crash Analytics Wish List
Many of the issues mentioned above were handled by our external partner. But while this external partner provides a good overview of our mobile crashes, there were still some bits of functionality that would make analyzing crashes significantly easier for us. Some of the functionality we really wanted was:
- An easy way to filter broad ranges of app versions – like being able to specify 4.0-4.4 to find all crashes for versions between 22.214.171.124 and 4.4.999.999.
- Links between users’ accounts and specific crashes – like “This user reported they experienced this crash… Let’s find it.” This coupling with our user database allows us to better determine who is experiencing a crash – is it just sellers? Just buyers?
- Better crash de-duplication, specifically handling different versions of our apps and different versions of mobile operating systems. For example, crash traces may be almost identical, but with different line numbers or method names depending on the app version. But if they originate in the same place, we want to group them all together.
- Crash categorization – such as NullPointerExceptions versus OutOfMemory errors on Android – because some types of crashes are fairly easy to fix, while others (like OutOfMemory errors) are often systemic and unactionable.
- Custom alerting with various criteria – like when we experience a new crash with this keyword, or when an old version of our app suddenly experiences new API errors.
It seemed like it’d be fairly straightforward to build our own application, using the data we were already collecting, to implement this functionality. We wanted to augment the data we receive from our external partner with data and couple it with our own internal tooling. We also wanted to provide any interested Etsy employees with a simple way to view the overall health of our apps. So that’s exactly what we chose to do.
Crashcan’s structure was a pretty wide-open space. All it really needed to do was provide crash ingestion from an API, process the crash a bit, and expose it via a simple user interface (it sounds a lot like many technologies, actually). So while the options for technologies and methodologies were open, we ultimately decided to keep it simple.
By using PHP, Etsy’s primary development language, we keep the barrier to entry for developers at Etsy low . We used as much modern PHP as possible, with Composer handling our dependency management. MySQL handles the data storage, with Doctrine ORM providing the interface to the database.
Ingesting the data was the first hurdle. Before handling anything else, we needed to make sure that we could actually (1) get the data we wanted and (2) keep up with the number of crashes that we wanted to ingest, without breaking down our system. After all, if you can’t get the data you want and you can’t do so reliably, there’s really no point.
After analyzing the API endpoints we had at our fingertips (yay, documentation!), we determined that we could get all the data we wanted. The architecture needed to allow us to:
- Determine whether we already have a crash (regardless of whether it has been deduplicated on our end)
- Keep track of deduplicated crashes, and link them to the originating crash from the external provider
- Run complex queries to combine data
- Analyze whether crashes are meeting specific thresholds, like whether a new crash has occurred at least n times
- Count crashes by category
- Filter everything by version range and other criteria
In the end, we developed a schema that allowed us to fulfill all those needs while remaining quick in response to queries:
To actually ingest the data from our external provider, we run a cron job every minute that checks for new crashes. This cron runs a simple loop – it loads new crashes from a paginated endpoint, looping through each page and each crash in turn. Each crash is added to a queue so that we can update it asynchronously.
We run a series of workers that run continuously, monitoring the queue for incoming crashes. As these workers run, they each pick a crash off the queue and processes it. This includes several steps, first checking whether we have the crash already, then updating it if we have it or creating a new crash if we don’t. We also go through each crash’s occurrences to make sure that we’re recording each one and tying it to an existing user if one exists. The flowchart below demonstrates how these workers process crashes.
Monitoring & Alerting
After building Crashcan’s initial design and getting crashes ingesting correctly, we quickly realized that we needed utilities to monitor the data ingestion and to alert us when something went wrong. Initially, we had to manually compare crash counts in Crashcan with those that our external provider offered in their user interface. Obviously, this was neither convenient nor sustainable, so we began integrating StatsD and Nagios. To check that we were still ingesting all our crashes, we also wrote a script to perform a sort of ‘spot-check’ of our data against our external provider’s – which fails if our data differs too much from theirs.
We created a simple dashboard, linked to StatsD, that allows us to see at-a-glance if the ingestion is going well – or if we’re encountering errors, slowness, or hitting our API limit. While we plan to improve our alerting infrastructure over time, this has been serving us well for now – though before we got our monitoring in a good state, we hit some big snags that kept us from being able to use Crashcan for weeks at a time. There’s an important lesson there: plan for monitoring and alerting from the beginning.
When deciding on Crashcan’s structure, we decided to focus first on building a stable, straightforward API. This would enable us to expose our crash data to both users and other applications – with one interface for accessing the data. This meant that it was simple for us to build Crashcan’s user interface as a Single Page Application. Very few of the disadvantages of single page applications applied in Crashcan’s case, since our users would only be other Etsy employees. Building a robust API also enabled us to share the data easily with other applications inside Etsy – most especially with our internal app release platform.
When an Etsy engineer accesses Crashcan, we aim to present them with the most broadly applicable information first – the overall health of an app. This is presented through an overview of frequent crashes, common crash categories, and new crashes, along with a graph showing all crashes for the app split out by version. This makes it much easier to spot big new crashes or problematic releases. The engineer then has the option to narrow the scope of their search and view the details of specific crashes.
While we’ve finished Crashcan v1 with much of the core functionality and gotten it in a stable enough state that we can depend on its data, there’s still quite a bit that we’d like to improve. For example, we haven’t even begun to implement a couple of the items we mentioned in our wish list, like custom alerting. Second, the user interface could do with some bugfixes and refinement. Right now, it’s in a mostly-usable state that other Etsy engineers can at least tolerate, but it’s not stable or refined enough that we’d be comfortable releasing it to a wider audience.
Additionally, our crash deduplication is still rudimentary. It only performs simple (and expensive) string comparisons to find similar crashes. We’d like to implement more advanced and more efficient crash deduplication using crash signature generation. This would give us a much more reliable way of determining when crashes are related, therefore providing a more accurate picture of how common each crash is.
Most of the pain points in Crashcan’s development weren’t new or especially unexpected, but they serve as a valuable reminder of some important considerations when building new applications.
- Build with monitoring and alerting in mind from the beginning. We could’ve avoided a several-week-long lapse in functionality had we focused on building in monitoring from the beginning.
- Don’t be afraid to consult with others on structural or technical decisions, and then just make a decision. It’s something that I’ve always struggled with – but getting blocked on making decisions or digging too deep into the minutiae of every decision is a great way to waste time.
- Document your assumptions – especially when dealing with APIs – as small assumptions can turn into big deals later on. This is what led to our biggest failure – we mistakenly assumed that crash timestamps were accurate. When a crash said it had occurred 4 days in the future, our app stopped updating, because it was checking only crashes occurring after that crash.
External search engines like Google and Bing are a major source of traffic for Etsy, especially for our longer-tail, harder to find items, and thus Search Engine Optimization (SEO) is important in driving efficient listing discovery on our platform.
We want to make sure that our SEO strategy is data-driven and that we can be highly confident that whatever changes we implement will bring about positive results. At Etsy, we constantly run experiments to optimize the user experience and discovery across our platform, and we therefore naturally turned to experimentation for improving our SEO performance. While it is relatively simple to set up an experiment on-site on our own pages and apps, running experiments with SEO required changing how Etsy’s pages appeared in search engine results, over which we did not have direct control.
To overcome this limitation, we designed a slightly modified experimental design framework that allows us to effectively test how changes to our pages affect our SEO performance. This post explains the methodology behind our SEO testing, the challenges we have come across, and how we have resolved them.
For one of our experiments, we hypothesized that changing the titles our pages displayed in search results (a.k.a. ‘title tags’) could increase their clickthrough rate. Etsy has millions of pages generated off of user generated content that were suitable for a test. Many of these pages also receive the majority of their traffic through SEO.
Below is an example of a template we used when setting up a recent SEO title tag experiment.
We were inspired by SEO tests at Pinterest and Thumbtack and decided to set up a similar experiment where we randomly assigned our pages into different groups and applied different title tag phrasings shown above. We would measure the success of each test group by how much traffic it drove relative to the control groups. In this experiment, we also set up two control groups to have a higher degree of confidence in our results and to be able to quality check our randomized sampling once the experiment began.
We took a small sample of pages of a similar type while ensuring that our sample was large enough to allow us to reach statistical significance within a reasonable amount of time.
Because visits to individual pages are highly volatile, with many outliers and fluctuations from day to day, we had to create relatively large groups of 1000 pages each to expect to reach significance quickly. Furthermore, because of the high degree of variance across our pages, simple random sampling of our pages into test groups was creating test groups different from each other in a statistically significant way even before the experiment began.
To ensure our test groups were more comparable to each other, we used stratified sampling, where we first ranked the the pages to be a part of the test by visits, broke them down into ntile groups and then randomly assigned the pages from each ntile group into one of the test groups, ensuring to take a page from each ntile group. This ensured that our test groups were consistently representative of the overall sample and more reliably similar to each other.
We then looked at the statistical metrics for each test group over the preceding time period, calculating the mean and standard deviation values by month and running t-tests to ensure the groups were not different from each other in a statistically significant way. All test groups passed this test.
Estimating Causal Impact
Although the test groups in our experiment were not different from each other at a statistically significant level before the experiment, there were small differences that prevented the estimation of the exact causal impact post treatment. For example, test group XYZ might see an increase relative to control B, but if Control B was slightly better than test groups XYZ even before the experiment began, simply taking the difference between of the two groups would not be the best estimate of the difference the treatment had effected.
One common approach to resolve this problem is to calculate the difference of differences between the test and control groups pre- and post-treatment.
While this approach would have worked well, it might have created two different estimated treatment effect sizes when comparing the test groups against the two different control groups. We decided that, instead, using Bayesian structural time series analysis to create a synthetic control group incorporating information from both the control groups would provide a cleaner analysis of the results.
In this approach, a machine learning model is trained using pre-treatment data to predict the performance of each test group based on its covariance relative to its predictors — in our case, the two control groups. Once the model is trained, it is used to generate the counterfactual, synthetic control groups for each of the test groups, simulating what would have happened had the treatment not been applied.
The causal impact analysis in this experiment was implemented using the CausalImpact package by Google.
We started seeing the effects of our test treatments as soon as a few days after the experiment start date. Even seemingly very subtle title tag changes resulted in large and statistically significant changes in traffic to our pages.
In some test groups, we saw significant gains in traffic.
While in others, we saw no change.
And in some others, we even saw a strong negative change in traffic.
The two control groups in this test showed no statistically significant difference compared to each other after the experiment. Although a slight change was detected, the effect did not reach significance.
Post-experiment rollout validation
Once we identified the best performing title tag, the treatment was rolled out across all test groups. The other groups experienced similar lifts in traffic and the variance across buckets disappeared, further validating our results.
The fact that our two control groups saw no change when compared to each other, and also the fact that the other buckets experienced the same improvement in performance once the best performing treatment was applied to them gave us strong basis for confidence in the validity of our results.
It appeared in our results that shorter title tags performed better than longer ones. This might be because for shorter, better targeted title tags, there is a higher probability of a percentage match (that could be calculated using a metric like the Levenshtein Distance between the search query and the title tag) against any given user’s search query on Google.
In a similar hypothesis, it might be that using well-targeted title tags that are more textually similar to common search terms helps to increase percentage match to Google search terms and therefore improves ranking.
However, it is likely that different strategies work well for different websites, and we would recommend rigorous testing to uncover the best SEO strategy tailored for each individual case.
- Have two control groups for A-A testing. This allowed us to have much greater confidence in our results.
- The CausalImpact package can be used to easily account for small differences in test vs. control groups and estimate the differences of treatments more accurately.
- For title tags, it is most likely a best practice to use phrasing and wording that would maximize the probability of a low Levenshtein distance match from popular target search queries on Google
Visualization of Stratified Sampling