In which Etsy transforms its app release process by aligning it with its philosophy for web deploys
Deploying code should be easy. It should happen often, and it should involve its engineers. For Etsyweb, this looks like continuous deployment.
A group of engineers (which we call a push train) and a designated driver all shepherd their changes to a staging environment, and then to production. At each checkpoint along that journey, the members of the push train are responsible for testing their changes, sharing that they’re ready to ship, and making sure nothing broke. Everyone in that train must work together for the safe completion of their deployment. And this happens very frequently: up to 50 times a day.
mittens> .join TOPIC: mittens
sasha> .join with mittens TOPIC: mittens + sasha
pushbot> mittens, sasha: You're up TOPIC: mittens + sasha
sasha> .good TOPIC: mittens + sasha*
mittens> .good TOPIC: mittens* + sasha*
pushbot> mittens, sasha: Everyone is ready TOPIC: mittens* + sasha*
nassim> .join TOPIC: mittens* + sasha* | nassim
mittens> .at preprod TOPIC: <preprod> mittens + sasha | nassim
mittens> .good TOPIC: <preprod> mittens* + sasha | nassim
sasha> .good TOPIC: <preprod> mittens* + sasha* | nassim
pushbot> mittens, sasha: Everyone is ready TOPIC: <preprod> mittens* + sasha* | nassim
mittens> .at prod TOPIC: <prod> mittens + sasha | nassim
mittens> .good TOPIC: <prod> mittens* + sasha | nassim
asm> .join TOPIC: <prod> mittens* + sasha | nassim + asm
sasha> .good TOPIC: <prod> mittens* + sasha* | nassim + asm
asm> .nm TOPIC: <prod> mittens* + sasha* | nassim
pushbot> mittens, sasha: Everyone is ready TOPIC: <prod> mittens* + sasha* | nassim
mittens> .done TOPIC: nassim
pushbot> nassim: You're up TOPIC: nassim
lily> .join TOPIC: nassim | lily
This strategy has been successful for a lot of reasons, but especially because each deploy is handled by the people most familiar with the changes that are shipping. Those that wrote the code are in the best position to recognize it breaking, and then fix it. Because of that, developers should be empowered to deploy code as needed, and remain close to its rollout.
App releases are a different beast. They don’t easily adapt to that philosophy of deploying code. For one, they have versions and need to be compiled. And since they’re distributed via app stores, those versions can take time to reach end users. Traditionally, these traits have led to strategies involving release branches and release managers. Our app releases started out this way, but we learned quickly that they didn’t feel very Etsy. And so we set out to change them.
Jen and Sasha
We were the release managers. Jen managed the Sell on Etsy apps, and I managed the Etsy apps. We were responsible for all release stage transitions, maintaining the schedule, and managing all the communications around releases. We were also responsible for resolving conflicts and coordinating cross-team resources in cases of bugs and urgent blockers to release.
Ready to Ship
A key part of our job was making sure everyone knew what they were supposed to do and when they were supposed to do it. The biggest such checkpoint is when a release branches: this is when we create a dedicated branch for the release off master, and master becomes the next release. Branching is scheduled, and it determines what changes make it into production for a given release. It’s very important to make sure that those changes are expected, and that they have been tested.
For Jen and me, it would’ve been impossible to keep track of the many changes in a release ourselves, and so it was our job to coordinate with the engineers that made the actual changes and make sure those changes were expected and tested. In practice, this meant sending emails or messaging folks when approaching certain checkpoints like branching. And likewise, if there were any storm warnings (such as show-stopping bugs), it was our responsibility to raise the flag to notify others.
Then Jen left Etsy for another opportunity, and I became a single-point-of-failure and a gatekeeper. Every release decision was funneled through me, and I was the only person able to make and execute those decisions.
I was overwhelmed. Frustrated. I was worried I’d be stuck navigating iTunes Connect and Google Play, and sending emails. And frankly, I didn’t want to be doing those things. I wanted those things to be automated. Give me a button to upload to iTunes Connect, and another to begin staged rollout on Google Play. Thinking about the ease of deploying on web just filled me with envy.
This time wasn’t easy for engineers either. Even back when we had two release managers, from an engineer’s perspective, this period of app releases wasn’t transparent. It was difficult to know what phase of release we were in. A large number of emails was sent, but few of them were targeted to those that actually needed them. We would generically send emails to one big list that included all four of our apps. And all kinds of emails would get sent there. Things that were FYI-only, and also things that required urgent attention. We were on the path to alert-fatigue.
All of this meant that engineers felt more like they were in the cargo hold, rather than in the cockpit. But that just didn’t fit with how we do things for web. It didn’t fit with our philosophy for deployment. We didn’t like it. We wanted something better, something that placed engineers in front of the tiller.
So we built a vessel that coordinates the status, schedule, communications, and deploy tools for app releases. Here’s how Ship helps:
- Keeps track of who committed changes to a release
- Sends Slack messages and emails to the right people about the relevant events
- Manages the state and schedule of all releases
It’s hard to imagine all of that abstractly, so here’s an example:
- A cron moves the release into “Testing” and generates testing build v184.108.40.206.
- Ship is notified of this and sends an email to Alicia with the build.
- Alicia installs the build, verifies her changes, and tells Ship she’s ready.
- The final testing finds no show-stopping issues
- A cron submits v4.64.0 to iTunes Connect for review.
- A cron checks iTunes Connect for the review status of this release, and updates Ship that it’s been approved.
- Ship emails Alicia and others letting them know the release is approved.
- A cron releases v4.64.0.
(Had Alicia committed to our Android app, a cron would instead begin staged rollout on Google Play.)
- Ship emails Alicia and others letting them know the release is out in production.
- Ship emails a report of top crashes to all the engineers in the release (including Alicia)
Before Ship, all of these components above would’ve been performed manually. But you’ll notice that release managers are missing from the above script; have we replaced release managers with all the automations in Ship?
Partially. Ship has a feature where each release is assigned a driver.
This driver is responsible for a bunch of things that we couldn’t or shouldn’t automate. Here’s what they’re responsible for:
- Schedule changes
- Shepherding ‘ready to ships’ from other engineers
- Investigating showstopping bugs before release
Everything else? That’s automated. Branching, release candidate generation, submission to iTunes Connect — even staged rollout on Google Play! But, we’ve learned from automation going awry before. By default, some things are set to manual. There are others for which Ship explicitly does not allow automation, such as continuing staged rollout on Google Play. Things like this should involve and require human interaction. For everything else that is automated, we added a failsafe: at any time, a driver can disable all the crons and take over driving from autopilot:
When a driver wants to do something manually, they don’t need access to iTunes Connect or Google Play, as each of these things is made accessible as a button. A really nice side effect of this is that we don’t have to worry about provisioning folks for either app store, and we have a clear log of every release-related action taken by drivers.
Drivers are assigned once a release moves onto master, and are semi-randomly selected based on previous drivers and engineers that have committed to previous releases. Once assigned, we send them an onboarding email letting them know what their responsibilities are:
Ready to Ship Again
The driver can remain mostly dormant until the day of branching. A couple hours before we branch, it’s the driver’s responsibility to make sure that all the impacting engineers are ready to ship, and to orchestrate efforts when they’re not. After we’re ready, the driver’s responsibility is to remain available as a point-of-contact while final testing takes place. If an issue comes up, the driver may be consulted for steps to resolve.
And then, assuming all goes well, comes release day. The driver can opt to manually release, or let the cron do this for them — they’ll get notified if something goes wrong, either way. Then a day after we release, the driver looks at all of our dashboards, logs, and graphs to confirm the health of the release.
But not all releases are planned. Things fail, and that’s expected. It’s naïve to assume some serious bug won’t ship with an app release. There are plenty of things that can and will be the subject of a post-mortem. When one of those things happens, any engineer can spawn a bugfix release off the most-recently-released mainline release.
The engineer that requests this bugfix gets assigned as the driver for that release. Once they branch the release, they make the necessary bugfixes (others can join in to add bugfixes too, if they coordinate with the driver) in the release’s branch, build a release candidate, test it, and get it ready for production. The driver can then release it at will.
Releases are actually quite complicated.
A release starts off as an abstract thing that will occur in the future. It then becomes a concrete thing actively collecting changes via commits on master in git. After this period of collecting commits, the release is considered complete and moves into its own dedicated branch. The release candidate is then built from this dedicated branch, thoroughly tested, and moved into production. The release itself then concludes as an unmerged branch.
Once a release branches, the next future release moves onto master. Each release is its own state machine, where the development and branching states overlap between successive releases.
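To make the overlap between successive releases concrete, here is a minimal Python sketch of a per-release state machine. It is not Ship's actual code: the state names (other than "Testing", which appears earlier in this post), the `Release` class, and the `branch` helper are all assumptions made for illustration, and version 4.65.0 is a hypothetical successor to the 4.64.0 used in the example above.

```python
from enum import Enum


class State(Enum):
    # Illustrative state names; only "Testing" is named in the post.
    FUTURE = "future"            # an abstract release that will occur later
    DEVELOPMENT = "development"  # collecting commits on master
    BRANCHED = "branched"        # frozen onto its own dedicated branch
    TESTING = "testing"          # release candidate under test
    IN_REVIEW = "in_review"      # submitted to the app store
    APPROVED = "approved"
    RELEASED = "released"


# Enum iteration follows definition order, so this is the allowed progression.
ORDER = list(State)


class Release:
    def __init__(self, version):
        self.version = version
        self.state = State.FUTURE
        self.history = [State.FUTURE]

    def advance(self):
        """Move to the next state; refuse to move past RELEASED."""
        index = ORDER.index(self.state)
        if index == len(ORDER) - 1:
            raise ValueError(f"{self.version} is already released")
        self.state = ORDER[index + 1]
        self.history.append(self.state)
        return self.state


def branch(current, upcoming):
    """Branch `current` off master and move `upcoming` onto master."""
    if current.state is not State.DEVELOPMENT:
        raise ValueError("only a release in development can branch")
    if upcoming.state is not State.FUTURE:
        raise ValueError("the next release must still be in the future")
    current.advance()   # DEVELOPMENT -> BRANCHED
    upcoming.advance()  # FUTURE -> DEVELOPMENT


r464 = Release("4.64.0")
r465 = Release("4.65.0")  # hypothetical next release
r464.advance()            # 4.64.0 starts collecting commits on master
branch(r464, r465)        # the two state machines move in the same step
```

Modelling each release as its own object keeps the overlap explicit: branching is the one event that changes two releases at once.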
Notifications: Slack and Email
Plugged into the output of Ship are notifications. Because there are so many points of interest en route to production, it’s really important that the right people are notified at the right times. So we use the state machine of Ship to send out notifications to engineers (and other subscribers) based on how much they asked to know, and how they impacted the release. We also allow anyone to sign up for notifications around a release. This is used by product managers, designers, support teams, engineering managers, and more. Our communications are very targeted to those that need or want them.
In terms of what they asked to know, we made it very simple to get detailed emails about state changes to a release:
In terms of how they impacted the release, we need to get that data from somewhere else.
We mentioned data Ship receives from outside sources. At Etsy, we use GitHub for our source control. Our apps have repos per-platform (Android and iOS). In order to keep Ship’s knowledge of releases up-to-date, we set up GitHub Webhooks to notify Ship whenever changes are pushed to the repo. We listen for two changes in particular: pushes to master, and pushes to any release branch.
When Ship gets notified, it iterates through the commits and uses the author, changed paths, and commit message to determine which app (buyer or seller) the commit affects, and which release we should attribute this change to. Ship then takes all of that and combines it into a state that represents every engineer’s impact on a given release. Is that engineer “user-impacting” or “dark” (our term for changes that aren’t live)? Ship then uses this state to determine who is a member of what release, and who should get notified about what events.
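A minimal sketch of that attribution step might look like the following. The fields read from the payload (`ref`, `commits`, `author`, `added`, `modified`, `removed`, `message`) are part of GitHub's push event. Everything else is an assumption made for illustration rather than a description of Ship's real rules: the release branch prefix, the `seller/` path convention, and the `[dark]` commit-message tag are all hypothetical.

```python
RELEASE_PREFIX = "refs/heads/release/"  # hypothetical branch naming
SELLER_PATH = "seller/"                 # hypothetical repo layout
DARK_MARKER = "[dark]"                  # hypothetical commit-message tag


def attribute_push(payload, master_release):
    """Return (release, app, author, impact) rows for one push event."""
    ref = payload["ref"]
    if ref == "refs/heads/master":
        release = master_release
    elif ref.startswith(RELEASE_PREFIX):
        release = ref[len(RELEASE_PREFIX):]
    else:
        return []  # only master and release branches matter here

    rows = []
    for commit in payload["commits"]:
        paths = commit["added"] + commit["modified"] + commit["removed"]
        is_seller = any(path.startswith(SELLER_PATH) for path in paths)
        app = "seller" if is_seller else "buyer"
        impact = "dark" if DARK_MARKER in commit["message"] else "user-impacting"
        rows.append((release, app, commit["author"]["email"], impact))
    return rows


def membership(rows):
    """Collapse rows so one user-impacting commit outweighs any dark ones."""
    members = {}
    for release, app, author, impact in rows:
        key = (release, app, author)
        if members.get(key) != "user-impacting":
            members[key] = impact
    return members


payload = {
    "ref": "refs/heads/master",
    "commits": [
        {"message": "Fix cart total",
         "author": {"email": "alicia@example.com"},
         "added": [], "modified": ["buyer/Cart.swift"], "removed": []},
        {"message": "[dark] New onboarding flow",
         "author": {"email": "alicia@example.com"},
         "added": ["buyer/Onboarding.swift"], "modified": [], "removed": []},
    ],
}
members = membership(attribute_push(payload, "4.64.0"))
```

Here Alicia ends up as a user-impacting member of 4.64.0, because one of her two commits is live even though the other is dark.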
Additionally, at any point during a release, an engineer can change their status. They may want to do this if they want to receive more information about a release, or if Ship misunderstood one of their commits as being impacting to the release.
Everything up until now has explained how Ship keeps track of things. But there’s been no explanation for how some of the automated actions affecting the app repo or things outside Etsy occur.
We have a home-grown tool for managing deploys called Deployinator, and we added app support. It can now perform mutating interactions with the app repos, as well as all the deploy actions related to Google Play and iTunes Connect. This is where we build the testing candidates, release candidate, branch the release, submit to iTunes Connect, and much more.
We opted to use Deployinator for a number of reasons:
- Etsy engineers are already familiar with it
- It’s our go-to environment for wrapping up a build process into a button
- Good for things that need individual run logs, and clear failures
In our custom stack, we have crons. This is how we branch on Tuesday evening (assuming everyone is ready). This is where we interface with Google Play and iTunes Connect. We make use of Google Play’s official API in a custom Python module we wrote, and for iTunes Connect we use Spaceship to interface with the unofficial API.
The end result of Ship is that we’ve distributed release management. Etsy no longer has any dedicated release managers. But it does have an engineer who used to be one — and I even get to drive a release every now and then.
People cannot be fully automated away. That applies to our web deploys, and is equally true for app releases. Our new process works within that reality. It’s unique because it pushes the limit of what we thought could be automated. Yet, at the same time, it empowers our app engineers more than ever before. Engineers control when a release goes to prod. Engineers decide if we’re ready to branch. Engineers hit the buttons.
And that’s what Ship is really about. It empowers our engineers to deliver the best apps for our users. Ship puts engineers at the helm.
When a user searches for an item on Etsy, they don’t always type what they mean. Sometimes they type the query jewlery when they’re looking for jewelry; sometimes they just accidentally hit an extra key and type dresss instead of dress. To make their online shopping experience successful, we need to identify and fix these mistakes, and display search results for the canonical spelling of the actual intended query. With this motivation in mind, and through the efforts of the Data Science, Linguistic Tools, and the Search Infrastructure teams, we overhauled the way we do spelling correction at Etsy late last year.
The older service was based on a static mapping of misspelled words to their respective corrections, which was updated only infrequently. It would split a search query into its tokens, and replace any token that was identified as a misspelling with its correction. Although the service allowed a very fast way of retrieving a correction, it was not an ideal solution for a few reasons:
- It could only correct misspellings that had already been identified and added to the map; it would leave previously unseen misspellings uncorrected.
- There was no systematic process in place to update the misspelling-correction pairs in the map.
- There was no mechanism of inferring the word from the context of surrounding words. For instance, if someone intended to type butter dish, but instead typed buttor dish, the older system would correct that to button dish since buttor is a more commonly observed misspelling for button at Etsy.
- Additionally, the older system would not allow one-to-many mappings from misspellings to corrections. Storing both buttor -> button and buttor -> butter mappings simultaneously would not be possible without modifying the underlying data structure.
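The limitations above are easy to see in a toy version of a static-map corrector. This is only a sketch of the general approach, not the old service's code: the `CORRECTIONS` table and the `correct` function are illustrative names, and `jewlry` is a made-up example of a misspelling that is not in the map.

```python
# A dict maps each misspelling to exactly one correction, so
# buttor -> butter cannot be stored alongside buttor -> button.
CORRECTIONS = {
    "jewlery": "jewelry",
    "dresss": "dress",
    "buttor": "button",
}


def correct(query):
    """Replace each known misspelling token by token, ignoring context."""
    return " ".join(CORRECTIONS.get(token, token) for token in query.split())


known = correct("jewlery box")     # the misspelling is in the map
no_context = correct("buttor dish")  # neighbor "dish" is never consulted
unseen = correct("jewlry box")     # not in the map, so left untouched
```

The lookup is fast, but `no_context` comes back as "button dish" and `unseen` is returned unchanged.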
We, therefore, decided to upgrade the spelling correction service to one based on a statistical model. This model learns from historical user data on Etsy’s website and uses the context of surrounding words to offer the most probable correction.
Although this blog post outlines some aspects of the infrastructure components of the service, its main focus is on describing the statistical model used by it.
We use a model that is based upon the Noisy Channel Model, which was historically used to infer telegraph messages that got distorted over the line. In the context of a user typing an incorrectly spelled word on Etsy, the “distortion” could be from accidental typos or a result of the user not knowing the correct spelling. In terms of probabilities, our goal is to determine the probability of the correct word, conditional on the word typed by the user. That probability, as per Bayes’ rule, is proportional to the product of two probabilities:
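The original figure for this relation is not reproduced here. In the standard Noisy Channel formulation, which is what the text describes, it would read as follows, with the labels "correct" and "typed" being our own shorthand:

```latex
P(\text{correct} \mid \text{typed}) \;\propto\; P(\text{typed} \mid \text{correct}) \cdot P(\text{correct})
```

The first factor on the right is the probability of the user producing this particular misspelling when they meant the correct word, and the second is how likely the correct word is to begin with.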
We are able to estimate the probabilities on the right-hand side of the relation above, using historical searches performed by our users. We describe this in more detail below.
To extend this idea to multi-word phrases, we use a simple Hidden Markov Model (HMM) which determines the most probable sequence of tokens to suggest as a spelling correction. Markov refers to being able to predict the future state of the system based solely on its present state. Hidden refers to the system having hidden states that we are trying to discover. In our case, the hidden states are the correct spellings of the words the user actually typed.
The main components of an HMM are explained via the figure below. Assume, for the purpose of this post, that the user searched for Flower Girl Baske when they intended to search for Flower Girl Basket. The main components of an HMM are then as follows:
- Observed states: the words explicitly typed by the user, Flower Girl Baske (represented as circles).
- Hidden states: the correct spellings of the words the user intended to type, Flower Girl Basket (represented as squares).
- Emission probability: the conditional probability of observing a state given the hidden state.
- Transition probability: the probability of observing a state conditional upon the immediately previous observed state. For instance, in the figure below, this is the probability of transitioning from Girl to Baske, i.e., P(Baske|Girl). The fact that this probability does not depend on having observed Flower, the state that precedes Girl, illustrates the Markov property of the model.
Once the spelling correction service receives the query, it first splits the query into three tokens, and then suggests possible corrections for each of the tokens as shown in the figure below.
In each column, one of the correction possibilities represents the true hidden state of the token that was typed (emitted) by the user. The most probable sequence can be thought of as identifying the true hidden state for each token given the context of the previous token. To accomplish this, we need to know probabilities from two distributions: first, the probability of typing a misspelling given the correct spelling of the word, the emission probability, and, second, the probability of transitioning from one observed word to another, the transition probability.
Once we are able to calculate all the emission probabilities, and the transition probabilities needed, as described in the sections below, we can determine the most probable sequence using a common dynamic programming algorithm known as the Viterbi Algorithm.
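For readers who have not seen it before, here is a compact Python sketch of the Viterbi Algorithm applied to the running example. It is a textbook version, not the service's implementation. The candidate lists and every probability below are invented for illustration (only Flow, Girl, and Basket are mentioned in this post as candidates), and the `start` table stands in for the probability of a word opening a query.

```python
def viterbi(columns, emission, transition, start):
    """Return (best_sequence, probability).

    columns:    list of (typed_token, [candidate corrections])
    emission:   {(typed, candidate): P(typed | candidate)}
    transition: {(previous_candidate, candidate): P(candidate | previous)}
    start:      {candidate: P(candidate begins a query)}
    """
    typed, candidates = columns[0]
    # best[c] = (probability of the best path ending in c, that path)
    best = {
        c: (start.get(c, 0.0) * emission.get((typed, c), 0.0), [c])
        for c in candidates
    }
    for typed, candidates in columns[1:]:
        following = {}
        for c in candidates:
            # Pick the best predecessor for candidate c.
            prob, path = max(
                ((p * transition.get((prev, c), 0.0), pth)
                 for prev, (p, pth) in best.items()),
                key=lambda item: item[0],
            )
            following[c] = (prob * emission.get((typed, c), 0.0), path + [c])
        best = following
    prob, path = max(best.values(), key=lambda item: item[0])
    return path, prob


# Invented numbers, chosen only to make the example readable.
columns = [
    ("flower", ["flower", "flow"]),
    ("girl", ["girl", "grill"]),
    ("baske", ["basket", "bask", "bake"]),
]
start = {"flower": 0.7, "flow": 0.3}
emission = {
    ("flower", "flower"): 0.9, ("flower", "flow"): 0.01,
    ("girl", "girl"): 0.9, ("girl", "grill"): 0.02,
    ("baske", "basket"): 0.05, ("baske", "bask"): 0.08, ("baske", "bake"): 0.06,
}
transition = {
    ("flower", "girl"): 0.3, ("flower", "grill"): 0.001,
    ("flow", "girl"): 0.01, ("flow", "grill"): 0.002,
    ("girl", "basket"): 0.2, ("girl", "bask"): 0.001, ("girl", "bake"): 0.005,
    ("grill", "basket"): 0.01, ("grill", "bask"): 0.001, ("grill", "bake"): 0.002,
}

sequence, probability = viterbi(columns, emission, transition, start)
```

With these numbers the sequence comes out as flower, girl, basket, even though "bask" has the higher emission probability for "baske": the transition from "girl" to "basket" dominates.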
The emission probability is estimated through an Error Model, which is a statistical model created by inferring historical corrections provided by our own users while making searches on Etsy. The heuristic we used is based on two conditions: a user’s search being followed immediately by another search similar to the first, and with the second search then leading to a listing click. We align these tokens to each other in a way that minimizes the number of edits needed on the misspelled query to transform it into the corrected second query.
We then make aligned splits on the characters, and calculate counts and associated conditional probabilities. We generate character level probabilities for four types of operations: substitution, insertion, deletion, and transposition. Of these, insertion and deletion also require us to keep track of the context of neighboring letters: the probability of adding a letter after another, or removing a letter appearing after another, respectively.
To continue with the earlier example, when the user typed baske, the emitted token, they were probably trying to type basket, the hidden state for baske, which corresponds to the probability P(Baske|Basket). Error probabilities for tokens, such as these, are calculated by multiplying the character-level probabilities which are assumed to be independent. For instance, the probability of correcting the letter e by appending the letter t to it is given by:
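The original formula is not reproduced here. Reconstructed from the description that follows, and writing count(·) for the number of matching corrections in the dataset, it would be:

```latex
P(et \mid e) = \frac{\operatorname{count}(e \to et)}{\sum_{x} \operatorname{count}(e \to ex)}
```

where x ranges over every letter that was appended after an e.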
Here, the numerator is the number of times in our dataset where we see an e being corrected to an et, and the denominator is the number of corrections where any letter was appended after e. Since we assume that the probability associated with a character remaining unchanged is 1, in our case, the probability P(Baske|Basket) is solely dependent on the insertion probability P(et|e).
Transition probabilities are estimated through a language model which is determined by calculating the unigram and bigram token frequencies seen in our historical search queries.
A specific instance of that, from the chosen example, would be the probability of going from one token in the Flower (first) column, say Flow, to a token, say Girl, in the Girl (second) column, which is represented as P(Girl|Flow). We are able to determine that probability from the following ratio of bigram token counts to unigram token counts:
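Written out, that ratio is count(Flow Girl) / count(Flow). A small Python sketch of the counting involved is shown below. The function names and the handful of queries are made up for illustration, and the sketch applies no smoothing for unseen words.

```python
from collections import Counter


def build_counts(queries):
    """Count unigram and bigram tokens across a list of search queries."""
    unigrams, bigrams = Counter(), Counter()
    for query in queries:
        tokens = query.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def transition_probability(prev, nxt, unigrams, bigrams):
    """P(nxt | prev) = count(prev nxt) / count(prev); 0.0 if prev is unseen."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, nxt)] / unigrams[prev]


# Made-up historical queries.
queries = [
    "flower girl basket",
    "flower girl dress",
    "flower crown",
    "flow chart",
]
unigrams, bigrams = build_counts(queries)
p_girl_given_flower = transition_probability("flower", "girl", unigrams, bigrams)
p_girl_given_flow = transition_probability("flow", "girl", unigrams, bigrams)
```

In this toy data "flower" appears three times and is followed by "girl" twice, so P(Girl|Flower) is 2/3, while P(Girl|Flow) is 0 because "flow girl" never occurs.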
The error models and language models are generated through Scalding-powered jobs that run on our production Hadoop cluster.
Viterbi Algorithm Heuristic
To generate inferences using these models, we employ the Viterbi Algorithm to determine the optimal query, i.e., sequence of tokens to serve as a spelling correction. The main idea is to present the most probable sequence as the spelling correction. The iterative process goes, as per the figure above, from the first column of tokens to the last. At the nth iteration, we have the sequence with the maximum probability available from the previous iteration, and we choose the token from the nth column which increases the maximum probability from the previous iteration the most.
Let’s explain this more concretely by describing the third iteration for our main example: assume that we already have Flower girl as the most probable phrase at the second iteration. Now we pick the token from the third column that corresponds to the maximum of the following transition probability and emission probability products:
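The original formula is not reproduced here. From the description, the choice at the third iteration can be written as follows, where w ranges over the candidate tokens in the third column:

```latex
\hat{w}_3 = \arg\max_{w \,\in\, \text{column 3}} \; P(w \mid \text{Girl}) \cdot P(\text{Baske} \mid w)
```

The first factor is the transition probability from the language model and the second is the emission probability from the error model.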
We can see from the figure above that the transition probability from Girl to Basket is high enough to overcome the lower emission probability of someone typing Baske when they mean to type Basket, and that, consequently, Basket is the word we want to add to the existing Flower Girl phrase.
We now know that Flower Girl Basket is the most probable correction as predicted by our models. The decision about whether we suggest this correction to the user, and how we suggest it, is made on the basis of its confidence score. We describe that process a little later in the post.
Hyperparameters and Model Training
We added a few hyperparameters to the bare bones model we have described so far. They are as follows:
- We added an exponent to the emission probability because we want the model to have some flexibility in weighting the error model component in relation to the language model component.
- Our threshold for offering a correction is a linear function of the number of tokens in the query. We, therefore, have a slope and the intercept of the threshold as a pair of hyperparameters.
- Finally, we also have a parameter corresponding to the probability of words our language model does not know about.
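One way to picture how these hyperparameters could plug into the model is sketched below. The names, the example values, and in particular the way the unknown-word probability is used as a fallback are assumptions made for illustration. The post only states that these parameters exist, not how they are wired in.

```python
from dataclasses import dataclass


@dataclass
class Hyperparameters:
    emission_exponent: float         # weights the error model vs. the language model
    threshold_slope: float           # threshold grows linearly with query length
    threshold_intercept: float
    unknown_word_probability: float  # used when the language model has no estimate


def step_score(params, transition, emission):
    """Score one step; `transition` is None for words the language model lacks."""
    if transition is None:
        transition = params.unknown_word_probability
    return transition * emission ** params.emission_exponent


def correction_threshold(params, num_tokens):
    """Linear threshold in the number of query tokens."""
    return params.threshold_slope * num_tokens + params.threshold_intercept


# Placeholder values, not tuned ones.
params = Hyperparameters(
    emission_exponent=0.5,
    threshold_slope=0.1,
    threshold_intercept=1.0,
    unknown_word_probability=1e-6,
)
known_step = step_score(params, transition=0.2, emission=0.25)
unknown_step = step_score(params, transition=None, emission=0.25)
three_token_threshold = correction_threshold(params, 3)
```

An exponent below 1 flattens the emission probabilities, which gives the language model relatively more say. An exponent above 1 does the opposite.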
Our training data set consists of two types of examples: the first kind are of the form misspelling -> correct spelling, like jewlery -> jewelry, while the second kind are of the form correct spelling -> _, like dress -> _. The second type corresponds to queries that are already correct, and therefore don’t need a correction. This setup enables our model to distinguish between the populations of both correct and incorrect spellings, and to offer corrections for the latter.
We tune our hyperparameters via 10-fold cross-validation using a hill-climbing algorithm that maximizes the f-score on our training data set. The f-score is the harmonic mean of the precision and the recall. Precision measures how many of the corrections offered by the system are correct and, for our data set, is defined as:
Recall is a measure of coverage — it tries to answer the question, “how many of the spelling corrections that we should be offering are we offering?”. It is defined as:
Optimizing on the f-score allows us to increase coverage of the spelling corrections we offer while keeping the number of invalid corrections offered in check.
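As a rough illustration, the three metrics can be computed over a data set of the shape described above. The exact definitions used for our data set are not reproduced in this text, so the sketch assumes the usual ones: precision is correct corrections over corrections offered, and recall is correct corrections over examples that needed one. The `evaluate` function and the toy predictor are hypothetical.

```python
def evaluate(examples, predict):
    """Return (precision, recall, f_score).

    examples: list of (query, expected), where expected is None when the
              query is already spelled correctly
    predict:  function returning a suggested correction, or None
    """
    offered = correct = needed = 0
    for query, expected in examples:
        suggestion = predict(query)
        if expected is not None:
            needed += 1
        if suggestion is not None:
            offered += 1
            if suggestion == expected:
                correct += 1
    precision = correct / offered if offered else 0.0
    recall = correct / needed if needed else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


examples = [
    ("jewlery", "jewelry"),  # misspelling -> correct spelling
    ("dresss", "dress"),
    ("dress", None),         # correct spelling -> _
    ("ring", None),
]
# A toy predictor: fixes one misspelling, misses one, and wrongly "fixes" dress.
toy_predict = {"jewlery": "jewelry", "dress": "dresses"}.get

precision, recall, f_score = evaluate(examples, toy_predict)
```

The toy predictor offers two corrections, one of them right, and finds one of the two misspellings, so precision, recall, and f-score are all 0.5.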
Serving the Correction
We serve three types of corrections at Etsy based on the confidence we have in the correction. We use the following odds-ratio as our confidence score:
If the confidence is greater than a certain threshold, we display search results corresponding to the correction, along with a “Search instead for” link to see the results for the original query instead. If the confidence is below a certain threshold, we offer results of the original query, with a “Did you mean” option to the suggested correction. If we don’t have any results for the original query, but do have results for the corrected query, we display results for the correction independent of the confidence score.
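The three cases can be summarized in a few lines of Python. This is a sketch under stated assumptions rather than the production logic: it uses a single threshold for both the high- and low-confidence cases, checks the no-results rule first since it applies regardless of confidence, and the function name and return labels are made up.

```python
def choose_display(confidence, threshold, original_count, corrected_count):
    """Decide which results to show for a query and its suggested correction."""
    # No results for what was typed, but results for the correction:
    # show the correction regardless of the confidence score.
    if original_count == 0 and corrected_count > 0:
        return "corrected_results"
    if confidence > threshold:
        # High confidence: corrected results plus a link back to the original.
        return "corrected_results_with_search_instead_link"
    # Low confidence: original results plus a "Did you mean" suggestion.
    return "original_results_with_did_you_mean_link"


high = choose_display(confidence=9.0, threshold=2.0,
                      original_count=12, corrected_count=4800)
low = choose_display(confidence=1.2, threshold=2.0,
                     original_count=350, corrected_count=4800)
empty = choose_display(confidence=1.2, threshold=2.0,
                       original_count=0, corrected_count=4800)
```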
Spelling Correction Service
With the new service in place, when a search is made on Etsy, we fire off requests to both the new Java-based spelling service and the existing search service. The response from the spelling service includes both the correction and its confidence score, and how we display the correction, based on the confidence score, is described in the previous section. For high-confidence corrections, we make an additional request for search results with the corrected query, and display those results if their count is greater than a certain threshold.
When the spelling service receives a query, it first splits it into its constituent tokens. It then makes independent suggestions for each of the tokens, and, finally, strings together from those suggestions a sequence of the most probable tokens. This corrected sequence is sent back to the web stack as a response to the spelling service request.
The correction suggestions for each token are generated by Lucene’s DirectSpellChecker class which queries an index generated from tokens that are generated through a Hadoop job. The Hadoop job counts tokens from historical queries, rejecting any tokens that appear too infrequently or those that appear only in search queries with very few search results. We have a daily cron that ships the tokens, along with the generated Error and Language model files, from HDFS to the boxes that host the spelling correction service. A periodic rolling restart of the service across all boxes ensures that the freshest models and index tokens are picked up by the service.
The original implementation of the model was limited in the sense that it could not suggest corrections that were, at the token level, more than two edits away from the original token. We have subsequently implemented an auxiliary framework for suggesting correction tokens that fixes this issue.
Another limitation was related to splitting and compounding of tokens in the original query to make suggestions. For instance, we were not able to suggest earring as a suggestion for ear ring. Some of our coworkers are working on modifying the service to accommodate corrections of this type.
Although we do use supervised learning to train the hyperparameters of our models, since launching the service we have additional inputs from our users with which we can improve our models. Specifically, users clicking the “Did you mean” link on our low-confidence corrections results page provide us with explicit positive feedback, while clicks on the “Search instead for” link on our high-confidence corrections results page provide us with negative feedback. The next major evolution of the model would be to explicitly use this feedback to improve corrections.
(This project was a collaboration between Melanie Gin and Benjamin Russell on the Linguistic Tools team who built out the infrastructure and front-end; Caitlin Cellier and Mohit Nayyar on the Data Science team who worked on implementing the statistical model; and Zhi-Da Zhong on the Search Infrastructure team who consulted on infrastructure design. A special thanks to Caitlin whose presentation some of these figures are copied from.)
In April of 2016 Etsy launched Pattern, a new product that gives Etsy sellers the ability to create their own hosted e-commerce website. With an easy-setup experience, modern and stylish themes, and guest checkout, sellers can merchandise and manage their brand identity outside of the Etsy.com retail marketplace while leveraging all of Etsy’s e-commerce tools.
The ability to point a custom domain to a Pattern site is an especially popular feature; many Pattern sites use their own domain name, either registered directly on the Pattern dashboard, or linked to Pattern from a third-party registrar.
At launch, Pattern shops with custom domains were served over HTTP, while checkouts and other secure actions happened over secure connections with Etsy.com. This model isn’t ideal though; Google ranks pages with SSL slightly higher, and plans to increase the bump it gives to sites with SSL. That’s a big plus for our sellers. Plus, it’s the year 2017 – we don’t want to be serving pages over boring old HTTP if we can help it … and we really, really like little green lock icons.
In this post we’ll be looking at some interesting challenges you run into when building a system designed to serve HTTPS traffic for hundreds of thousands of domains.
How to HTTPS
First, let’s talk about how HTTPS works, and how we might set it up for a single domain.
Let’s not talk that much about how HTTPS works though, because that would take a long time. Very briefly: by HTTPS we mean HTTP over SSL, and by SSL we mean TLS. All of this is a way for your website to communicate securely with clients. When a client first begins communicating with your site, you provide them with a certificate, and the client verifies that your certificate comes from a trusted certificate authority. (This is a gross over-simplification, here’s a very interesting post with more detail.)
Suffice to say, if you want HTTPS, you need a certificate, and if you want a certificate you need to get one from a certificate authority. Until fairly recently, this is the point where you had to open up your wallet. There were a handful of certificate authorities, each with an array of confusing options and pricey up-sells.
You pick whichever feels right to you, you get out your credit card, and you wind up with a certificate that’s valid for one to three years. Some CAs offer APIs, but three years is a relatively long time, so chances are you do this manually. Plus, who even knows which cell in which SSL pricing matrix gets you API access?
From here, if you’re managing a limited number of domains, typically what you do is upload your certificate and private key to your load balancer (or CDN) and it handles TLS termination for you. Clients communicate with your load balancer over HTTPS using your domain’s public certificate, and your load balancer makes internal requests to your application.
And that’s it! Your clients have little green padlocks in their address bars, and your application is serving payloads just like it used to.
More certificates, more problems
Manually issuing certificates works well enough if your application is served off a handful of top-level domains. If you need certificates for lots and lots of TLDs, it’s not going to work. Have we mentioned that Pattern is available for the very reasonable price of $15 per month? A real baseline requirement for us is we can’t pay more for a certificate than someone pays for a year of Pattern. And we certainly don’t have time to issue all those certificates by hand.
Until fairly recently, our only options at this point would have been 1) do a deal with one of the big certificate authorities that offers API access, or 2) become our own certificate authority. These are both expensive options (we don’t like expensive options).
Let’s Encrypt to the rescue
Luckily for us, in April of 2016, the Internet Security Research Group (ISRG), which includes founders from the Mozilla Foundation and the Electronic Frontier Foundation, launched a new certificate authority named Let’s Encrypt.
The ISRG has also designed a communication protocol, Automated Certificate Management Environment (ACME), that covers the process of issuing and renewing certificates for TLS termination. Let’s Encrypt provides its service for free, and exclusively via an API that implements the ACME protocol. (You can read about it in more detail here.)
This is great for our project: we found a certificate authority we feel really good about, and since it implements a freely published protocol, there are great open source libraries out there for interacting with it.
Building out certificate issuance
Let’s re-examine our problem now that we have a good way to get certificates:
- We have a large quantity of custom domains attached to existing Pattern sites; we need to generate certificates for all of those.
- We have a constant stream of new domain registrations for new Pattern sites; we’ll need to get new certificates on a rolling basis.
- Let’s Encrypt certificates last 90 days; we need to renew our SSL certificates fairly frequently.
All of these problems are solvable now! We chose an open source ACME client library for PHP called AcmePHP. Another exciting thing about Let’s Encrypt using an open protocol is that there are open source server implementations as well as client implementations. So we were able to spin up an internal equivalent of Let’s Encrypt using Boulder, its open source ACME server, and do our development and testing inside our private network, without worrying about rate limits.
We built a service to handle this ACME logic, store the certificates we receive, and keep track of which domains we have certificates for, and we named it CertService. It’s a service that deals with certificates, you see. CertService communicates with Let’s Encrypt, and also exposes an API for other internal Etsy services. We’ll go into more detail on that later in this post.
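As a rough illustration of the rolling renewal requirement listed above, here is a minimal Python sketch of how a service might decide which certificates are due. The 90-day lifetime comes from the text; the 30-day renewal margin, the function names, and the example domains are assumptions made for illustration, not details of CertService.

```python
from datetime import date, timedelta

CERT_LIFETIME = timedelta(days=90)  # Let's Encrypt lifetime, per the text
RENEW_BEFORE = timedelta(days=30)   # assumed safety margin, not from the text


def needs_renewal(issued_on: date, today: date) -> bool:
    """Return True once a certificate is inside the renewal window."""
    expires_on = issued_on + CERT_LIFETIME
    return today >= expires_on - RENEW_BEFORE


def domains_to_renew(issued: dict[str, date], today: date) -> list[str]:
    """List domains whose certificates should be renewed today, sorted by name."""
    return sorted(d for d, issued_on in issued.items() if needs_renewal(issued_on, today))


# Hypothetical issuance dates for two custom domains.
issued = {
    "custom-domain-1.com": date(2017, 1, 1),
    "custom-domain-n.com": date(2017, 3, 1),
}
due = domains_to_renew(issued, today=date(2017, 3, 5))
```

Running a check like this daily, rather than renewing everything at once, is one way to keep issuance spread evenly over time.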
TLS termination for lots and lots of domains
Now that we’ve built out a service that can have certificates issued, we need a place to put them, and we need to use them to do TLS termination for HTTPS requests to Pattern custom domains.
We handle this for *.etsy.com by putting our certificates directly onto our load balancers and giving them to our CDNs. In planning this project, we thought about hundreds of thousands of certificates, each with a 90-day lifetime. If we did a good job spacing out issuing and renewing our certificates, that would add up to thousands of daily write operations to our load balancers and CDNs. That’s not a rate we were comfortable with; load balancer and CDN configuration changes are relatively high-risk operations.
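The "thousands of daily write operations" estimate can be checked with some quick arithmetic. This is a back-of-the-envelope sketch that assumes each certificate is rewritten exactly once per 90-day lifetime and that renewals are spread perfectly evenly; the certificate counts are illustrative, not Etsy's actual numbers.

```python
def daily_cert_writes(num_certs: int, lifetime_days: int = 90) -> float:
    """Average certificate writes per day if renewals are spread evenly."""
    return num_certs / lifetime_days


# Illustrative certificate counts in the "hundreds of thousands" range.
for n in (100_000, 300_000, 500_000):
    print(f"{n:>7} certs -> ~{daily_cert_writes(n):,.0f} writes/day")
```

Under those assumptions the rate works out to roughly 1,100 to 5,600 writes per day, and renewing before the full 90 days are up would push it higher.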
What we did instead is create a pool of proxy servers, and use our load balancer to distribute HTTPS traffic to them. The proxy hosts handle the client SSL termination, and proxy internal requests to our web servers, much like the load balancer does for www.etsy.com.
Our proxy hosts run Apache and we leverage mod_ssl and mod_proxy to do TLS termination and proxy requests. To preserve client IP addresses, we make use of the PROXY protocol on our load balancer and mod_proxy_protocol on the proxy hosts. We’re also using mod_macro to avoid ever having to write out hundreds of thousands of virtual host declarations for hundreds of thousands of domains. All put together, it looks something like this:
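<Macro VHost $domain>
  <VirtualHost *:443>
    ServerName $domain
    # mod_ssl directives for $domain's certificate and key go here
    ProxyPass / https://internal-web-vip/
    ProxyPassReverse / https://internal-web-vip/
  </VirtualHost>
</Macro>
Use VHost custom-domain-1.com
Use VHost custom-domain-n.com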
To connect all this together, our proxy hosts periodically query CertService for a list of recently modified custom domains. Each host then 1) fetches the new certificates from CertService, 2) writes them to disk, 3) regenerates a config like the one above, and 4) does a graceful restart of Apache. These restarts are staggered across our proxy pool so all but one of the hosts are available and receiving requests from the load balancer at any given time (fingers crossed).
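Steps 3 and 4 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the host names, and the fixed gap between restarts are hypothetical, and the real proxy hosts may generate their config and schedule restarts quite differently.

```python
def render_vhost_uses(domains: list[str]) -> str:
    """Render one mod_macro `Use VHost` line per custom domain."""
    lines = [f"Use VHost {domain}" for domain in sorted(set(domains))]
    return "\n".join(lines) + "\n"


def restart_offsets(hosts: list[str], gap_seconds: int) -> dict[str, int]:
    """Give each host its own restart delay so restarts do not overlap.

    Assumes a graceful restart finishes in less than `gap_seconds`.
    """
    return {host: i * gap_seconds for i, host in enumerate(hosts)}


# Hypothetical inputs: domains reported by the certificate service, and a proxy pool.
config = render_vhost_uses(["custom-domain-n.com", "custom-domain-1.com"])
offsets = restart_offsets(["proxy-1", "proxy-2", "proxy-3"], gap_seconds=60)
```

Deduplicating and sorting the domain list keeps the generated file stable between runs, which makes it easy to skip the restart when nothing has changed.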
How do we securely store lots of certificates?
Now that we’ve figured out how to programmatically request and renew SSL certificates via LetsEncrypt, we need to store these certificates in a secure way. To do this, there are some guarantees we need to make:
- Private keys are stored in a database segmented from other types of data
- Private keys are encrypted at rest and never leave CertService in plaintext
- SSL key pair generation and LetsEncrypt communications take place only on trusted hosts
- Private keys can be retrieved only by the SSL terminating hosts
Guarantee #1 is a no-brainer. If an attacker were to compromise a datastore containing thousands of SSL private keys in plaintext, they would be able to intercept critical data being sent to thousands of custom domains. Since security is about raising costs for an attacker – that is, making it harder for an attacker to succeed – we employ a number of techniques to secure our keys. Our first layer of defense is at an infrastructure level: private keys are stored in a MySQL database away from the rest of the network. We use iptables to limit who can connect to the MySQL server, and given that CertService is the only client that needs access, the scope is really narrow. This vastly reduces attack surface, especially in cases where an attacker is looking to pivot from another compromised server on the network. Iptables is also then used to lock down who can communicate with the CertService API; adding constraints to connectivity on top of a secure authentication scheme makes retrieving certificates that much more difficult. That addresses Guarantee #4.
Now that we’ve locked down access to the database, we need to make sure the private keys are stored encrypted. For this, we make use of a concept known as a hybrid cryptosystem. Hybrid cryptosystems, in a nutshell, combine asymmetric (public-key crypto) and symmetric cryptosystems. If you’re familiar with SSL, much of how we handle crypto here is analogous to Session Keys.
At the start of this process, we have two pieces of data: the SSL private key and its corresponding public key – the certificate. We don’t particularly care about the certificate since that is public by definition. We start by generating a domain specific AES-256 key and encrypt the SSL private key. This only technically addresses the issue of not having plaintext on disk; the encrypted SSL private key is stored right next to the AES key, which can be used to both encrypt and decrypt. An attacker who could steal the encrypted keys could also steal the AES key. To address this, we encrypt the AES key with a CertService public key. Now we have an encrypted SSL private key (encrypted with the AES-256 key) and an encrypted AES key (encrypted with CertService’s RSA-2048 public key). Now not only are keys truly stored encrypted on disk, they also cannot be decrypted at all on CertService. This means if an attacker were to break CertService’s authentication scheme, the most they would receive is an encrypted SSL private key; they would still need the CertService private key – available only on the SSL terminating hosts – to decrypt it. Now we’ve fully taken care of Guarantee #2.
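The wrapping scheme described above can be sketched with the third-party Python `cryptography` package, which is assumed to be installed. The AES-256 and RSA-2048 key sizes come from the text; the use of AES-GCM, OAEP padding with SHA-256, and all function and variable names are assumptions made for this illustration, not a description of CertService's actual code.

```python
import os

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Assumed padding choice; the post does not say which RSA padding is used.
OAEP = padding.OAEP(
    mgf=padding.MGF1(algorithm=hashes.SHA256()),
    algorithm=hashes.SHA256(),
    label=None,
)


def wrap_private_key(tls_key_pem: bytes, wrapping_public_key):
    """Encrypt a TLS private key with a fresh AES-256 key, then wrap that key with RSA."""
    aes_key = AESGCM.generate_key(bit_length=256)  # domain-specific symmetric key
    nonce = os.urandom(12)
    encrypted_tls_key = AESGCM(aes_key).encrypt(nonce, tls_key_pem, None)
    encrypted_aes_key = wrapping_public_key.encrypt(aes_key, OAEP)
    return nonce, encrypted_tls_key, encrypted_aes_key


def unwrap_private_key(nonce, encrypted_tls_key, encrypted_aes_key, wrapping_private_key):
    """Reverse the process; only the holder of the RSA private key can do this."""
    aes_key = wrapping_private_key.decrypt(encrypted_aes_key, OAEP)
    return AESGCM(aes_key).decrypt(nonce, encrypted_tls_key, None)


# Usage: in the design described above, the RSA private key would live only on
# the TLS-terminating hosts, while the storing service holds just the public key.
wrapping_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
placeholder_key = b"-----BEGIN PRIVATE KEY-----\nnot a real key\n-----END PRIVATE KEY-----\n"
nonce, enc_tls_key, enc_aes_key = wrap_private_key(placeholder_key, wrapping_key.public_key())
recovered = unwrap_private_key(nonce, enc_tls_key, enc_aes_key, wrapping_key)
```

The point of the split is that the side doing the wrapping never needs the RSA private key, so stealing everything it stores still leaves an attacker without a way to decrypt.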
The only Guarantee that remains is #3. If key generation were compromised, an attacker would be able to grab private keys before they were encrypted and stored. If LetsEncrypt communication were compromised, an attacker could use our keys to generate certificates for domains we’ve already authorized (they could technically authorize new ones, but that would be significantly more difficult) or even revoke certificates. Both of these cases would render the entire system untrustworthy. Instead, we limit this functionality to CertService and expose it as an API; that way, if the web server handling Pattern requests were broken into, the attacker would not be able to affect critical LetsEncrypt flows.
One of our stretch goals is to look into deploying HSMs. If there are bugs in the underlying software, the integrity of the entire system could be compromised thus voiding any guarantees we try to keep. While bugs are inevitable, moving critical cryptographic functions into secure hardware will mitigate their impact.
No cryptosystem is perfect, but we’ve reached our goal of significantly increasing attacker cost. On top of that, we’ve supplemented the cryptosystem with our usual host based alerting and monitoring. So not only will an attacker have to jump through several hoops to get those SSL key pairs, they will also have to do it without being detected.
After the build
With all of that wrapped up, we had a system to issue large numbers of certificates, securely store this data, and terminate TLS requests with it. At this point Etsy brought in a third-party security firm to do a round of penetration testing. This process finished up without finding any substantive security weaknesses, which gives us an added level of confidence in our system.
Once we’ve gained enough confidence, we will enable HSTS. This should be the final goal of any SSL rollout as it forces browsers to use encryption for all future communication. Without it, downgrade attacks could be used to intercept traffic and hijack session cookies.
Every Pattern site and linked domain now has a valid certificate stored in our system and ready to facilitate secure requests. This feature is rolled out across Pattern and all Pattern traffic is now pushed over HTTPS!
(This project was a collaboration between Ram Nadella, Andy Yaco-Mink and Nick Steele on the Pattern team; Omar and Ken Lee from Security; and Will Gallego from Operations. Big thanks to Dennis Olvany, Keyur Govande and Mike Adler for all their help.)
We recently ran a successful split-testing title tag experiment to improve our search engine optimization (SEO), the results and methodology of which we shared in a previous Code as Craft post. In this post, we wanted to share some of our further learnings from our SEO testing. We decided to double down on the success of our previous experiment by running a series of further SEO experiments, one of which included changes to our:
- Title tags
- Meta descriptions
- H1s
We found three surprising results:
- Shortening our title tags showed improved performance in terms of visits (and other key metrics)
- Meta descriptions had a statistically significant impact on organic search traffic
- H1s had a statistically significant impact on organic search traffic
Some notes on our methodology
For full details on the setup and methodology of our SEO split testing, please see the previous Code as Craft post referenced above.
For this particular test, we used six unique treatments and two control groups. The two control groups remained consistent with each other before and after the experiment.
To derive the exact estimated causal changes effected in visits, we most often used Causal Impact modeling as implemented in the CausalImpact package by Google to standardize the test buckets against the control buckets. In some cases, the difference in differences method was used because it was more reliable in estimating effect sizes when working with strong seasonality swings.
This experiment was also impacted by strong seasonality and external event effects related to the holidays, sports events and the US elections. The statistical modeling in our final experiment analysis was adjusted for these effects to ensure accurate measurement of the causal effects of the test variants.
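To make the difference in differences method mentioned above concrete, here is a minimal Python sketch. The visit counts are invented purely for illustration and are not results from this experiment; the function name is also hypothetical.

```python
def diff_in_diff(treat_pre: float, treat_post: float,
                 ctrl_pre: float, ctrl_post: float) -> float:
    """Change in the treatment bucket minus the change in the control bucket."""
    return (treat_post - treat_pre) - (ctrl_post - ctrl_pre)


# Hypothetical average daily visits before and after a change ships.
effect = diff_in_diff(treat_pre=1000, treat_post=1300, ctrl_pre=1000, ctrl_post=1200)
```

In this made-up case both buckets grew, perhaps from seasonality, but the treatment grew by 100 more visits per day than the control, and that difference is the estimated effect.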
Takeaway #1: Short Title Tags Win Again
The results of this experiment aligned with the findings from our previous SEO title tag experiments, where it appeared shorter title tags drove more visits. We have so far validated our hypothesis that shorter title tags perform better in multiple separate experiments including many different variations, and we now feel quite confident that shorter title tags perform better (as measured by organic traffic) for Etsy. We hypothesize that this effect could be taking place through a number of different causal mechanisms:
- Lower Levenshtein distance and/or higher percentage match to target search queries rewarded by Google’s search algorithm and thereby improving Etsy’s rankings in Google search results. Per Wikipedia: “the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.”
- Shorter title tags, consisting only of the target search keyword, appear more relevant/enticing to search users
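For readers unfamiliar with the Levenshtein distance quoted above, here is a short Python sketch of the standard dynamic-programming computation. The search query and title tags are hypothetical examples, and nothing here is a claim about how Google actually scores titles.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


# Hypothetical query and title tags.
query = "leather wallet"
short_title = "leather wallets"
long_title = "leather wallets | unique handmade gifts"
short_distance = levenshtein(query, short_title)
long_distance = levenshtein(query, long_title)
```

A title that is little more than the target keyword sits only an edit or two away from the query, while every extra word in a longer title adds to the distance.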
Takeaway #2: Meta Descriptions Matter
We found that changes in the meta description of a page can lead to statistically significant changes to visits. It appeared that longer and descriptive meta descriptions performed better and that conversely, shorter, terse meta descriptions performed worse. We hypothesize that longer meta descriptions might perform better via two possible causal mechanisms:
- Longer meta descriptions take up more real estate in a search results page, improving CTRs
- Longer meta descriptions give an appearance of more authority or more content, improving CTRs
Takeaway #3: H1s matter
We found that an H1 change can have a statistically significant impact on organic search traffic. However, changes in the H1 section of a page appear to interact with changes in title tags in hard-to-predict ways. For example, in this experiment, a title tag change in a certain variant increased visits. However, when the title tag change was combined with an H1 change, the positive effect of the title tag change was dulled, even though an H1 change by itself in a different variant led to slight increases in visits. This highlights the importance of SEO testing before rolling out even seemingly minor changes to Etsy pages.
Accounting For Unexpected Events
The Donald Trump Effect
One example of an event we had to control for (among a number of others) was the “Donald Trump effect”. We observed a large bucket skew on November 9 and 10, 2016 in one of our test groups. Upon investigation, it was found that the skew was due to large increases (+2000% to +5180%) in daily visits to pages related to “Donald Trump” the day after the US Presidential elections. Although the spikes in traffic to these pages were short lived, lasting only several days, they nevertheless did have the potential to unduly bias or reduce the statistical significance of the results of our experiment. These pages were therefore removed and controlled for when conducting the causal impact analyses for this experiment.
Our SEO experiments illustrate the importance of running SEO tests before making changes to a web page and continue to offer surprising results. However, it is important to note that the results of our experiments are only true for Etsy and do not necessarily reflect best practices for sites generally. We would therefore encourage everyone to discover the strategy that works best for their website through rigorous SEO testing.
In 2012, I wrote a post for the Code As Craft blog about how we approach learning from accidents and mistakes at Etsy. I wrote about the perspectives and concepts behind what is known (in the world of Systems Safety and Human Factors) as the New View on “human error.” I also wrote about what it means for an organization to take a different approach, philosophically, to learn from accidents, and that Etsy was such an organization.
That post’s purpose was to conceptually point in a new direction, and was, necessarily, void of pragmatic guidance, advice, or suggestions on how to operationalize this perspective. Since then, we at Etsy have continued to explore and evolve our understanding of what taking this approach means, practically. For many organizations engaged in software engineering, the group “post-mortem” debriefing meeting (and accompanying documentation) is where the rubber meets the road.
Many responded to that original post with a question:
“Ok, you’ve convinced me that we have to take a different view on mistakes and accidents. Now: how do you do that?”
As a first step to answer that question, we’ve developed a new Debriefing Facilitation Guide which we are open-sourcing and publishing.
We wrote this document for two reasons.
The first is to state emphatically that we believe a “post-mortem” debriefing should be considered first and foremost a learning opportunity, not a fixing one. All too often, when teams get together to discuss an event, they walk into the room with a story they’ve already rationalized about what happened. This urge to point to a “root” cause is seductive — it allows us to believe that we’ve understood the event well enough, and can move on towards fixing things.
We believe this view is insufficient at best, and harmful at worst, and that gathering multiple diverse perspectives provides valuable context, without which you are only creating an illusion of understanding. Systems safety researcher Nancy Leveson, in her excellent book Engineering A Safer World, has this to say on the topic:
“An accident model should encourage a broad view of accident mechanisms that expands the investigation beyond the proximate events: A narrow focus on operator actions, physical component failures, and technology may lead to ignoring some of the most important factors in terms of preventing future accidents. The whole concept of ‘root cause’ needs to be reconsidered.”
In other words, if we don’t pay attention to where and how we “look” to understand an event by considering a debriefing a true exploration with open minds, we can easily miss out on truly valuable understanding. How and where we pay attention to this learning opportunity begins with the debriefing facilitator.
The second reason is to help develop debriefing facilitation skills in our field. We wanted to provide practical guidance for debriefing facilitators as they think about preparing for, conducting, and navigating a post-event debriefing. We believe that organizational learning can only happen when objective data about the event (the type of data that you might put into a template or form) is placed into context with the subjective data that can only be gleaned by the skillful constructing of dialogue with the multiple, diverse perspectives in the room.
The Questions Are As Important As The Answers
In his book “Pre-Accident Investigations: Better Questions,” Todd Conklin sheds light on the idea that the focus on understanding complex work is not in the answers we assert, but in the questions we ask:
“The skill is not in knowing the right answer. Right answers are pretty easy. The skill is in asking the right question. The question is everything.”
What we learn from an event depends on the questions we ask as facilitators, not just the objective data we gather and put into a document. It is very easy to assume that the narrative of an accident can be drawn up from one person’s singular perspective, and that the challenge is what to do about it moving forward.
We do not believe that to be true. Here’s a narrative taken from real debriefing notes, generalized for this post:
“I don’t know,” the engineer said, when asked what happened. “I just wasn’t paying attention, I guess. This is on me. I’m sorry, everyone.” The outage had lasted only 9 minutes, but to the engineer it felt like a lifetime. The group felt a strange but comforting relief, ready to fill in the incident report with ‘engineer error’ and continue on with their morning work.
The facilitator was not ready to call it a ‘closed case.’
“Take us back to before you deployed, Phil…what do you remember? This looks like the change you prepped to deploy…” The facilitator displayed the code diff on the big monitor in the conference room.
Phil looked closely at the red and green lines on the screen and replied, “Yep, that’s it. I asked Steve and Lisa for a code review, and they both said it looked good.” Steve and Lisa nodded their heads, sheepishly.
“So after you got the thumbs-up from Steve and Lisa…what happened next?” the facilitator continued.
“Well, I checked it in, like I always do,” Phil replied. “The tests automatically run, so I waited for them to finish.” He paused for a moment. “I looked on the page that shows the test results, like this…” Phil brought up the page in his browser, on the large screen.
“Is that a new dashboard?” Lisa asked from the back of the room.
“Yeah, after we upgraded the Jenkins install, we redesigned the default test results page to the previous colors because the new one was hard to read,” replied Sam, the team lead for the automated testing team.
“The page says eight tests failed.” Lisa replied. Everyone in the room squinted.
“No, it says zero tests failed, see…?” Phil said, moving closer to the monitor.
Phil hit control-+ on his laptop, increasing the size of the text on the screen. “Oh my. I swear that it said zero tests failed when I deployed.”
The facilitator looked at the rest of the group in the conference room. “How many folks in the room saw this number as a zero when Phil first put it up on the screen?” Most of the group’s hands went up. Lisa smiled.
“It looked like a zero to me too,” the facilitator said.
“Huh. I think because this small font puts slashes in its zeros and it’s in italics, an eight looks a lot like a zero,” Sam said, taking notes. “We should change that.”
As a facilitator, it would be easy to stop asking questions at the mea culpa given by Phil. Without asking him to describe how he normally does his work, by bringing us back to what he was doing at the time, what he was focused on, what led him to believe that deploying was going to happen without issue, we might never have considered that the automated test results page could use some design changes to make the number of failed tests clearer, visually.
In another case, an outage involved a very complicated set of non-obvious interactions between multiple components. During the debriefing, the facilitator asked each of the engineers who designed the systems and were familiar with the architecture to draw on a whiteboard the picture that they had in their minds when they think about how it all is expected to work.
When seen together, each of the diagrams from the different engineers painted a fuller picture of how the components worked together than if there was only one engineer attempting to draw out a “comprehensive” and “canonical” diagram.
The process of drawing this diagram together also brought the engineers to say aloud what they were and were not sure about, and that enabled others in the room to repair those uncertainties and misunderstandings.
Both of these cases support the perspective we take at Etsy, which is that debriefings can (and should!) serve multiple purposes, not just a simplistic hunt for remediation items.
By placing the focus explicitly on learning first, a well-run debriefing has the potential to educate the organization on not just what didn’t work well on the day of the event in question, but a whole lot more.
The key, then, to having well-run debriefings is to start treating facilitation as a skill in which to invest time and effort.
Influencing the Industry
We believe that publishing this guide will help other organizations think about the way that they train and nurture the skill of debriefing facilitation. Moreover, we also hope that it can continue the greater dialogue in the industry about learning from accidents in software-rich environments.
In 2014, knowing how critical I believe this topic to be, someone alerted me to the United States Digital Service Play Books repo on GitHub, and an issue opened about incident review and post-mortems. I commented there that this was relevant to our interests at Etsy, and that at some point I would reach back out with work we’ve done in this area.
It took two years, but now we’re following up and hoping to reopen the dialogue.
Our Debriefing Facilitation Guide repo is here, and we hope that it is useful for other organizations.
Note: This article was adapted from an internal Etsy newsletter earlier this year. As the holidays roll around, it seemed like a timely opportunity to share what we do with a larger audience.
As the calendar year draws to a close, people’s thoughts often turn to fun activities like spending time with family and friends and enjoying pumpkin or peppermint flavored treats. But for retailers, the holiday season is an intense and critically important period for the business.
The months of November and December compose nearly a fifth of all US retail sales and pretty much every retailer needs to undertake special measures during the holidays, from big sales promotions to stocking up on popular items, to hiring additional staff to stock inventory and reduce wait times at checkout.
A lot of these measures apply as well to digital retailers, with the added risk of the entire site running slowly or not at all. In 2015, Neiman Marcus experienced an extended outage on Black Friday and Target and PayPal were intermittently down on Cyber Monday.
Etsy is no stranger to this holiday intensity. This is the biggest shopping season of the year for us and we typically receive more site visits than at other times, which translate into more orders. Over the years, our product, engineering, and member-facing organizations have developed practices and approaches to support our community during the intensity of the holidays.
How Etsy Handles the Holidays
The increase in site traffic and transactions impacts many areas of the business. Inbound support emails and Non-Delivery Cases reach a peak in December, and the Trust and Safety team ramps down outside project work and hiring efforts to focus on providing exceptional support.
“Emotions tend to be more heightened around the holidays” says Corinne Haxton Pavlovic, head of the trust and safety team at Etsy, “There’s a lot on the line for everyone this time of year – highs can feel higher and lows can feel lower. We have to really dig into our EQ and make sure that we’re staying neutral and empathetic.”
For our sellers, the holidays can be an exciting but scary time: “the Etsy sales equivalent of Laird Hamilton surfing a 70-ft wave off Oahu’s North Shore” says Joseph Stanek, seller account manager. Stanek works with a portfolio of top Etsy sellers to advise and support their business growth. He’s found that many sellers spend an enormous amount of effort on holiday sales promotion, and are then hit with a record number of orders. They’re pushed to increase their shipping and fulfillment capabilities, which can serve “as a kind of graduation” as they level up to a new tier of business.
With huge numbers of buyers browsing for the perfect gift, and sellers working hard to manage orders, it’s critically important for Etsy’s platform to be as clear and reliable as possible. That’s why the period from mid-October through December 31st is a time to be exceptionally careful and more conservative than usual about how we make changes to the site. We affectionately refer to this period as “Slush.”
The Origins of Slush
The actual term “Slush” is a play on the phrase “code freeze,” which is when a piece of software is deemed finished and thus no new changes are being made to the code base. “Code freezes help to ensure that the system will continue to operate without disruptions,” says Robert Tekiela at CTO Insights. They are a way to prevent new bugs from being created and are “commonly used in the retail industry during holiday shopping season when systems load is at a peak.”
Since at Etsy, we still push code during the holidays, just more carefully, it’s not a true code freeze, but a cold, melty mixture of water and ice. Hence, Slush.
According to Jason Wong, Slush at Etsy got started sometime after current CEO Chad Dickerson became CTO in the fall of 2008. As an engineering director of infrastructure, Jason has been a key part of Etsy’s platform stability since joining in 2010. Back then, Etsy’s infrastructure was less robust and the team was still figuring out how to effectively support the already high levels of traffic on Etsy.com. There was not yet a process for managing code deployments during the holidays and the site experienced more crashes than it does today.
Said Wong, “the question was: during the holiday season, a high traffic, high visibility time when we made a significant portion of our [gross merchandise sales], how do we stabilize the site? That’s where Slush got started.”
Here’s the slightly redacted email from Chad that kicked off the idea of Slush.
From: Chad Dickerson
Date: Fri, Oct 31, 2008 at 4:08 PM
Subject: holiday “slush” (i.e. freeze) — need your input
To: adminName and adminName
Cc: adminName, adminName, and adminName
adminName / adminName,
adminName, adminName, and I met recently to discuss a holiday “freeze” beginning at end of day on November 14. We’re calling it a “slush” because there are certain types of projects that we can still do without making critical changes to the database or introducing bugs. The goal with setting this freeze is to eliminate the distractions of any projects not on the must-do list so we can focus on the most important projects exclusively.
adminName, adminName, adminName, and I met earlier this week to mutually agree on what projects we need to complete before the freeze/slush beginning at the end of the day on 11/14. We came up with a list of “must do” projects that need to get done by 11/14, and “nice to have” projects if we complete the must-dos:
[ Link to Document Detailing Slush Plan ]
There are couple of projects that we’ve discussed already on the list, like Super Etsy Mini and BuyHandmade blog. I wanted to make sure that any “must do” projects in your worlds are reflected and prioritized. On Monday, we’re going to start doing daily standups at 11:30am to track progress against the agreed-upon list of projects leading up to 11/14. Since there will be 9.5 working days to execute, we need to freeze the list itself by Monday. Can you review the list and let us know if you have additions that we are missing that we should discuss? Thanks.
Learning to Build Safely
In the early days, Slush was far more strict, in part because Etsy’s infrastructure was not as robust and mature as it is today. We operated off of a federated database model, which in theory was meant to prevent one database crash from affecting another, but in practice, it was hard to keep clusters from affecting one another and site stability suffered as a result. This technical approach also made it hard to understand what went wrong and how the team could fix it.
Engineers went from deploying five times a day to once a day. Feature flags were tested thoroughly so that major features like convos or add to cart could be turned off without shutting down the site.
Over the past few years, a major effort was made to get Etsy’s one really big box called master onto a sharded database model. With a sharded system, data is distributed across a series of smaller active-active pairs so that if a single database goes down, there is an exact replica with the data intact. This system is faster, more scalable, and more resilient than the prior method of simply storing all the data in one really big box. In 2016, we successfully migrated all of our key databases, including receipts, transactions, and many others, “to the shards” and decommissioned the old master database.
Developing continuous deployment was also a major feat which allowed Etsy to develop A/B testing and feature flagging. These technical efforts, in conjunction with our culture of examining failures through blameless postmortems, have allowed Etsy to get better at building safely. Today our engineering staff and systems are studied by organizations around the world.
Slush Today and Tomorrow
Within the engineering organization, there’s often a senior staff member who helps organize Slush. The role is an informal one, meant to share best practices and encourage the product org to be mindful of the higher stakes of the holidays. Tim Falzone, an engineering manager on Infrastructure, took on this role in 2015 and presented a few slides at the September Engineering All-Hands which highlight the way we handle Slush today.
Today, Slush means that major buyer and seller facing feature changes are put on hold, or pushed “dark,” where they are hidden behind config flags and not shown publicly. Additionally, engineers get more people to review their pull requests of code changes. These extra precautions are taken to ensure that the site runs quickly and with minimal errors or downtime even with the increased traffic.
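To make the idea of pushing a change “dark” concrete, here is a minimal sketch of gating a code path behind a config flag. The flag names, the `FLAGS` dictionary, and the `is_enabled` helper are hypothetical illustrations, not Etsy’s actual configuration system.

```python
# Hypothetical flag configuration; the names are illustrative only.
FLAGS = {
    "new_checkout_flow": {"enabled": False},                   # pushed dark
    "convos": {"enabled": True},                               # can be switched off
    "redesigned_cart": {"enabled": True, "staff_only": True},  # staff see it first
}


def is_enabled(flag_name, user_is_staff=False):
    """Return True if the feature behind flag_name should be shown."""
    flag = FLAGS.get(flag_name)
    if flag is None or not flag.get("enabled", False):
        return False  # unknown or disabled flags stay dark
    if flag.get("staff_only", False):
        return user_is_staff
    return True


def render_cart(user_is_staff=False):
    # The new code is deployed, but hidden unless the flag allows it.
    if is_enabled("redesigned_cart", user_is_staff):
        return "redesigned cart"
    return "classic cart"
```

With a setup like this, the code for a feature can ship during Slush while the public experience stays unchanged until the flag is flipped.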
Falzone says that now, Slush is less about not breaking the site and more about preventing disruptions for members. “You could easily make something that works flawlessly but tanks conversion or otherwise sends buyers away, or is really slow,” he explained. For sellers, managing a huge wave of orders means relying on muscle memory of how Etsy works, which means that the holidays are a bad time to change the workflow or otherwise add friction for our sellers, who often become profitable on their business for the year during this time.
As Etsy grows, Slush will continue to evolve. A more powerful platform also means more points of integration. More traffic means more pressure on parts of the platform.
Even as we work to secure more headroom for our infrastructure and develop tooling to stress-test our systems, we will always be challenged in new ways. Though we’ve come a long way, Slush will continue to be a helpful reminder to move safely during a critical time of the year for our members and our organization.
This is the third post in a series of three about Etsy’s API, the abstract interface to our logic and data. In the last posts we covered how we built a new API framework, and we clearly identified the gains in terms of performance and shared abstraction layer between languages and devices. But how did we make an entire engineering organization switch to the new framework?
How did we achieve the cultural transformation to API first? How do we avoid this being the new thing that everyone knows about, but no one has the time to try?
How did we “sell” it?
In our case, we had multiple strategies that worked together.
The first one was communication: It was simple to write new endpoints, and the gains were very clear: Both the performance gain through concurrent data fetch, and the possibility to share an endpoint with the apps, were huge selling points.
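As a rough illustration of why concurrent data fetch was such a selling point, the sketch below fetches two components of a page at the same time instead of one after the other. It uses Python threads as a stand-in for the parallel requests the framework makes; the `fetch_listing` and `fetch_shop` functions and their return values are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_listing(listing_id):
    time.sleep(0.05)  # stand-in for a network call
    return {"listing_id": listing_id, "title": "Hand-knit mittens"}


def fetch_shop(shop_id):
    time.sleep(0.05)  # stand-in for a network call
    return {"shop_id": shop_id, "name": "Example Shop"}


def fetch_listing_page(listing_id, shop_id):
    """Fetch both components concurrently and combine them into one response."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        listing_future = pool.submit(fetch_listing, listing_id)
        shop_future = pool.submit(fetch_shop, shop_id)
        return {
            "listing": listing_future.result(),
            "shop": shop_future.result(),
        }
```

Because the two calls overlap, the combined response takes roughly as long as the slower call rather than the sum of both.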
We partnered with the Product organisation to make clear that they needed to include a standard question in their project templates “Is this being built on APIv3? If not, why not?”, which enforced the company strategy to adopt Mobile First development.
We had pilot groups that tried out the new API framework for new product features, and we partnered closely with the Activity Feed team to help them adopt APIv3. This resulted in early converts who were strong advocates for wider adoption. The benefits were not only clear to communicate but also so compelling that we didn’t need to do much to sell teams on them: they could see what the Activity Feed team had accomplished.
We had evolving documentation about the architecture and how to use the framework, in the API handbook. We had a codelab, which engineers could use for self study. In the lab you learn to build endpoints by building a little library app.
We had workshops in which people did the code lab together with experienced API developers.
We had tooling to learn about the system, such as the distributed tracing tool CrossStitch, the fanout manager informing about the complexity of the fetch, and datafindr for general information about API endpoints and example calls and results.
Once it’s going, too fast to keep up
Once word spread, so many people were making new endpoints that it was almost impossible for us to keep up. Initially, we alerted on each new endpoint, but we had to switch this off because of the rate at which new endpoints were being built.
The biggest motivator for adoption was that we communicated the gains. Everyone became interested in the performance gains and the cross-device-sharing through abstraction, which was a motivating incentive to switch. Also, there was a potential speedup through caching of endpoints.
White space gets filled with code
Immediately after adoption started, we saw some misunderstandings in the code that we did not foresee. Examples include magic hiding in traits, inheritance between endpoints even though they contain just declarative static functions, and really complex code in bespoke endpoints that should have been library code. The minimalist code required to write an endpoint didn’t make explicit what was missing by accidental omission vs. by deliberate design decision.
This also led to confusion about which building blocks of endpoints are mandatory, about how to opt into a service, naming conventions, the location in the file system, and the complete route of an endpoint.
This caused some cleanup work for the API team, but in fact it was a good thing, caused by fast and wide adoption of a new system, and is probably inevitable in that case.
Documentation is critical
We addressed the problems by improving our documentation, and creating an interactive endpoint generator that creates a file in the right place, with stub functions, and outputs the route.
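A generator along these lines might look like the sketch below. It is only an illustration of the three behaviors described above (file in the right place, stub functions, printed route); the directory layout, the route format, the stub contents, and the `generate_endpoint` name are all assumptions, not the actual tool.

```python
import os
import tempfile

# Hypothetical stub written into every generated endpoint file.
STUB_TEMPLATE = '''class {class_name}:
    """Endpoint stub generated for route {route}."""

    @staticmethod
    def route():
        return "{route}"

    @staticmethod
    def result_type():
        raise NotImplementedError("declare the result type")

    @staticmethod
    def handle(params):
        raise NotImplementedError("implement the endpoint")
'''


def generate_endpoint(root_dir, namespace, name):
    """Write a stub endpoint file under root_dir and return (path, route)."""
    route = "/v3/{}/{}".format(namespace.lower(), name.lower())
    directory = os.path.join(root_dir, "endpoints", namespace.lower())
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, name.lower() + ".py")
    if os.path.exists(path):
        raise FileExistsError(path)  # never overwrite an existing endpoint
    with open(path, "w") as handle:
        handle.write(STUB_TEMPLATE.format(class_name=name, route=route))
    return path, route


def demo():
    """Generate one endpoint in a temporary directory."""
    return generate_endpoint(tempfile.mkdtemp(), "Shops", "Listings")
```

Generating the file removes the guesswork about naming conventions and location, and the stubs make the mandatory building blocks explicit.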
Also, we added format checks for endpoints in Jenkins. It was helpful to have the API code lab as a narrative introduction to gain practical experience fast, while also developing the API handbook as a reference manual. Two goals, two documents.
Pockets of Patterns
When we internally announced the project in which we unified the format for how an API endpoint is written, we started getting emails about what developers wanted from the API. And also: Mails about patterns that had emerged within the API framework without our knowledge. People had found and implemented solutions for their specific use cases. Our discussions had opened up a space to evaluate those patterns, and share them with all API endpoint developers!
We’ve learned that we need to stay on top of new patterns as they emerge, and keep talking to developers about their needs. A subtle, but powerful paradigm shift:
We, as the API framework developers, listen to endpoint developers and address everyone’s needs by evolving the framework. Instead of having developers using our API framework as a service, we are serving them.
Also, we learned to trust our fellow developers even more! We underestimated their curiosity and willingness to try out new things, we underestimated the adoption rate, and we also underestimated how creative they would be with finding solutions for their specific use cases that we did not plan for. What an awesome surprise. 😀
Types too late + types too loose D:
Also, we underestimated everyone’s willingness to do the work of typing their endpoint result. In the beginning, we made specifying a resultType optional, because we feared making it mandatory would slow the adoption. And if a type was specified, we only sampled the type errors, to not make correct typing a hurdle during API endpoint development, but rather a “nice to have” hint for when things go wrong. Not a guarantee. In retrospect, we could have saved ourselves a lot of extra work if we had made the resultType mandatory, and if we had made the type errors more prominent from the beginning.
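The difference between the two modes can be sketched as follows. This is a simplified illustration, assuming a result type expressed as a mapping of field names to Python types; the function names, the `sample_rate` parameter, and the `mandatory` switch are hypothetical.

```python
import random


def type_errors(result, result_type):
    """Compare a result dict against a {field: type} declaration."""
    errors = []
    for field, expected in result_type.items():
        if field not in result:
            errors.append("missing field: " + field)
        elif not isinstance(result[field], expected):
            errors.append("{}: expected {}, got {}".format(
                field, expected.__name__, type(result[field]).__name__))
    return errors


def check_endpoint_result(result, result_type=None, sample_rate=0.01,
                          mandatory=False, rng=random.random):
    """Sampled, optional checking by default; strict checking if mandatory."""
    if result_type is None:
        if mandatory:
            raise TypeError("endpoint must declare a result type")
        return []  # optional typing: nothing to check
    if not mandatory and rng() >= sample_rate:
        return []  # sampled mode: most calls are not checked at all
    errors = type_errors(result, result_type)
    if mandatory and errors:
        raise TypeError("; ".join(errors))  # a guarantee, not a hint
    return errors
```

In the sampled mode a badly typed endpoint can run for a long time before anyone notices, which is the extra work described above.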
Etsy’s developers are generally happy about result types. They helped to implement coarse-grained gradual typing, and they actively pushed us towards making the result types mandatory so that they could rely on them as a guarantee.
Work in progress
There remain four open problems we all touched upon:
- Active Cache Invalidation – it’s hard.
- Caching the API at different geographic locations (Edge)
- Opening up v3 to 3rd party developers
A question that comes with the developer adoption question is: What are we thinking in terms of preparing for third parties? This is a valid question, and we did not fully answer it yet, because we switched to generated client code in the meantime.
Even though we started announcing the new API v3 on our developer mailing lists, all third party apps are still on version 2. The platform for v3 is ready to open up to 3rd parties, as soon as Etsy as a whole decides to make the switch.
What did we learn? How were we transformed?
This story is a case study of how the API first approach transformed the architecture, and also transformed how we work with and think about our API at Etsy. It covers a lot of ground and is influenced by our infrastructure and history. The story of architectural decisions and adoption is transferable to other systems though. So what did we learn from the decisions? Or from the decisions not being made yet? Were there surprises?
We learned that we can seamlessly grow a system from a hack day project to a live system. Over time, a domain specific language of endpoints evolved, and the system grew according to the endpoint developers’ needs.
Also, we learned a great deal about cURL and the corresponding PHP extension! Not only does it allow us to make parallel requests, but it also lets us check on, control, and modify the in-flight requests in a non-blocking way.
Another realization is that we should have thought about caching early on. We added it in a hurry and are still working on active cache invalidation. So far, we can only use timeouts, which limits the class of endpoints that can be cached. Also, we have to think some more about variations due to different locales, which might only vary for parts of the response.
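For readers unfamiliar with timeout-based caching, the following sketch shows the idea and its limitation: entries simply expire after a fixed time, and nothing invalidates them earlier when the underlying data changes. The class name, the cache key (endpoint plus locale), and the injectable clock are assumptions made for illustration.

```python
import time


class TimeoutCache:
    """Cache endpoint responses per (endpoint, locale) until a timeout passes."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # (endpoint, locale) -> (expires_at, value)

    def get_or_compute(self, endpoint, locale, compute):
        key = (endpoint, locale)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: serve the cached response
        value = compute()    # expired or missing: recompute and store
        self._entries[key] = (now + self._ttl, value)
        return value
```

Keying on the whole locale means a response is cached once per locale even if only a small part of it varies, which is one of the open questions mentioned above.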
A huge positive surprise was the HHVM experiment. By teeing traffic and trying out a completely new system, we solved our performance problems to this day.
The textbook approach says: design APIs contract first. As we have seen, we circumvent that part by using automated client generation. Components help us with abstraction, but even the bespoke layer is very malleable. This is an interesting trick.
Another surprising learning is that we should trust our developers not only to adopt and adapt to the new API framework, but also to go all the way and make typing of endpoints mandatory from the start.
Also, we should keep the conversation ongoing to find out how the API framework can serve developers and their needs, instead of just designing it as a rigid service. We even found solutions that they had created themselves, and the framework could officially adopt them for everyone’s benefit.
Did it work? How did it end? Did we succeed?
Let’s assess this with a quote from Etsy’s Andrew Morrison, from before all this work started:
“We desperately need to figure out a scheme for allowing concurrency or else we’re going to have performance problems forever.”
YUP! We solved this, and a few unexpected things along the way. Despite initial problems with the extra layer, we figured out unconventional solutions by experimenting and organically growing the system towards the developers’ needs.
What we are discussing now is how to shift some of the complexity back towards the client. Maybe GraphQL is an interesting approach? Right now, the clients don’t know how the queries will be executed. This makes sense if you have services and teams and clear-cut boundaries and interfaces, and if you have a contract in the form of an API. Our approach is currently not structured like that.
But could we compile an alternative, more knowledgeable PHP client, that lifts the composability from the HTTP layer into the API consumer code, in cases where we are our own consumer and create a website of the same tree structure? It’s clear in this case how many endpoints we are calling, and as long as it’s us, it’s safe to lift this control beyond the client, into the view.
‘Crashcan’ (think trashcan, but for crashes) is Etsy’s internal application for mobile crash analytics. We use an external partner to collect and store crash and exception data from our mobile apps, which we then ingest via their API. Crashcan is a single-page web app that refines and better exposes the crash data we receive from our external partner.
Crashcan gives us extra analysis of our crashes on top of what our partner offers. We can make less cautious assumptions about our stack traces internally, which allows us to group more of them together. It connects crashes to our user data and internal tooling. It allows us to search for crashes in a range of versions. It’s provided a good balance between building our own solution from scratch and completely relying on an external partner. We get the ability to customize our mobile crash reporting without having to maintain an entire crash reporting infrastructure.
Error Reporting – The Web vs. Mobile
Unfortunately, collecting mobile crashes and exceptions is (of necessity) quite different from most error reporting on the web, in several ways:
- On the web (especially the desktop web), we can be fairly confident that a user is online – they’re less prone to flaky or slow connections. Plus, users don’t expect to be able to access the web when they don’t have a connection.
- ‘Crashes’ are very different on the web, so many exceptions and errors are less severe. Sure, users may need to refresh a page, but it’s rare that a web page will crash their browser.
- We can watch graphs and logs live as we deploy on the web (hooray, continuous deployment!) – and it’s clear if hundreds or thousands of exceptions start pouring in. With our mobile apps, however, we have to wait for users to install new versions after a release makes it to the App Store. Only then do we get to see exceptions.
- With mobile, when a crash occurs, we normally can’t send the crash until the app is launched again (with a data connection) – in some cases, this can be days or weeks.
- App crashes are costly to the user, as the app crashing loses the user’s state. On the web, even if a page breaks for some reason, the user keeps their browser history.
- With the web, there’s one version of Etsy at any point in time. It’s updated continuously, and every user is running the latest version, always. With the apps, we have to deal with multiple versions at once, and we have to wait for users to update.
With these differences, it’s been important to approach the analysis of crashes and exceptions differently than we do on the web.
A Crash Analytics Wish List
Many of the issues mentioned above were handled by our external partner. But while this external partner provides a good overview of our mobile crashes, there were still some bits of functionality that would make analyzing crashes significantly easier for us. Some of the functionality we really wanted was:
- An easy way to filter broad ranges of app versions – like being able to specify 4.0-4.4 to find all crashes for versions between 4.0.0.0 and 4.4.999.999.
- Links between users’ accounts and specific crashes – like “This user reported they experienced this crash… Let’s find it.” This coupling with our user database allows us to better determine who is experiencing a crash – is it just sellers? Just buyers?
- Better crash de-duplication, specifically handling different versions of our apps and different versions of mobile operating systems. For example, crash traces may be almost identical, but with different line numbers or method names depending on the app version. But if they originate in the same place, we want to group them all together.
- Crash categorization – such as NullPointerExceptions versus OutOfMemory errors on Android – because some types of crashes are fairly easy to fix, while others (like OutOfMemory errors) are often systemic and unactionable.
- Custom alerting with various criteria – like when we experience a new crash with this keyword, or when an old version of our app suddenly experiences new API errors.
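The version range filter from the first wish list item can be sketched as below. This is a minimal illustration that assumes four-part numeric versions, pads the lower bound with zeros, and pads the upper bound with 999; the function names are hypothetical.

```python
def parse_version(text):
    """Turn '4.2.1' into a tuple of integers: (4, 2, 1)."""
    return tuple(int(part) for part in text.split("."))


def version_range(spec, width=4, max_component=999):
    """Expand a spec like '4.0-4.4' into inclusive lower and upper bounds."""
    low_text, high_text = spec.split("-")
    low = parse_version(low_text)
    high = parse_version(high_text)
    low = low + (0,) * (width - len(low))
    high = high + (max_component,) * (width - len(high))
    return low, high


def in_range(version, spec, width=4):
    """Return True if version falls inside the range described by spec."""
    low, high = version_range(spec, width=width)
    parsed = parse_version(version)
    parsed = parsed + (0,) * (width - len(parsed))
    return low <= parsed <= high
```

Comparing tuples of integers avoids the classic string comparison mistake where "4.10" sorts before "4.9".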
It seemed like it’d be fairly straightforward to build our own application, using the data we were already collecting, to implement this functionality. We wanted to augment the data we receive from our external partner with our own data and couple it with our own internal tooling. We also wanted to provide any interested Etsy employees with a simple way to view the overall health of our apps. So that’s exactly what we chose to do.
Crashcan’s structure was a pretty wide-open space. All it really needed to do was provide crash ingestion from an API, process the crash a bit, and expose it via a simple user interface (it sounds a lot like many technologies, actually). So while the options for technologies and methodologies were open, we ultimately decided to keep it simple.
By using PHP, Etsy’s primary development language, we keep the barrier to entry for developers at Etsy low. We used as much modern PHP as possible, with Composer handling our dependency management. MySQL handles the data storage, with Doctrine ORM providing the interface to the database.
Ingesting the data was the first hurdle. Before handling anything else, we needed to make sure that we could actually (1) get the data we wanted and (2) keep up with the number of crashes that we wanted to ingest, without breaking down our system. After all, if you can’t get the data you want and you can’t do so reliably, there’s really no point.
After analyzing the API endpoints we had at our fingertips (yay, documentation!), we determined that we could get all the data we wanted. The architecture needed to allow us to:
- Determine whether we already have a crash (regardless of whether it has been deduplicated on our end)
- Keep track of deduplicated crashes, and link them to the originating crash from the external provider
- Run complex queries to combine data
- Analyze whether crashes are meeting specific thresholds, like whether a new crash has occurred at least n times
- Count crashes by category
- Filter everything by version range and other criteria
In the end, we developed a schema that allowed us to fulfill all those needs while remaining quick in response to queries:
To actually ingest the data from our external provider, we run a cron job every minute that checks for new crashes. This cron runs a simple loop – it loads new crashes from a paginated endpoint, looping through each page and each crash in turn. Each crash is added to a queue so that we can update it asynchronously.
We run a series of workers that run continuously, monitoring the queue for incoming crashes. As these workers run, they each pick a crash off the queue and process it. This includes several steps: first checking whether we have the crash already, then updating it if we have it or creating a new crash if we don’t. We also go through each crash’s occurrences to make sure that we’re recording each one and tying it to an existing user if one exists. The flowchart below demonstrates how these workers process crashes.
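The cron loop and the worker step described above can be sketched roughly as follows. This is a simplified illustration, not Crashcan’s code: an in-memory `deque` stands in for the real queue, a dictionary stands in for the database, and the field names (`id`, `title`, `occurrences`, `device_id`) are assumptions.

```python
from collections import deque


def enqueue_new_crashes(pages, queue):
    """Cron step: walk every page and every crash, and queue each crash."""
    queued = 0
    for page in pages:      # each page is a list of crash dicts
        for crash in page:
            queue.append(crash)
            queued += 1
    return queued


def process_next(queue, store, users_by_device):
    """Worker step: take one crash off the queue and create or update it."""
    crash = queue.popleft()
    record = store.get(crash["id"])
    if record is None:
        # We have not seen this crash before, so create it.
        record = {"id": crash["id"], "occurrences": {}}
        store[crash["id"]] = record
    record["title"] = crash["title"]
    for occurrence in crash["occurrences"]:
        # Record each occurrence once and tie it to a user if one is known.
        record["occurrences"][occurrence["id"]] = {
            "user_id": users_by_device.get(occurrence["device_id"]),
        }
    return record
```

Because the worker creates or updates by crash ID, queuing the same crash twice is harmless, which keeps the cron loop simple.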
Monitoring & Alerting
After building Crashcan’s initial design and getting crashes ingesting correctly, we quickly realized that we needed utilities to monitor the data ingestion and to alert us when something went wrong. Initially, we had to manually compare crash counts in Crashcan with those that our external provider offered in their user interface. Obviously, this was neither convenient nor sustainable, so we began integrating StatsD and Nagios. To check that we were still ingesting all our crashes, we also wrote a script to perform a sort of ‘spot-check’ of our data against our external provider’s – which fails if our data differs too much from theirs.
We created a simple dashboard, linked to StatsD, that allows us to see at-a-glance if the ingestion is going well – or if we’re encountering errors, slowness, or hitting our API limit. While we plan to improve our alerting infrastructure over time, this has been serving us well for now – though before we got our monitoring in a good state, we hit some big snags that kept us from being able to use Crashcan for weeks at a time. There’s an important lesson there: plan for monitoring and alerting from the beginning.
When deciding on Crashcan’s structure, we decided to focus first on building a stable, straightforward API. This would enable us to expose our crash data to both users and other applications – with one interface for accessing the data. This meant that it was simple for us to build Crashcan’s user interface as a Single Page Application. Very few of the disadvantages of single page applications applied in Crashcan’s case, since our users would only be other Etsy employees. Building a robust API also enabled us to share the data easily with other applications inside Etsy – most especially with our internal app release platform.
When an Etsy engineer accesses Crashcan, we aim to present them with the most broadly applicable information first – the overall health of an app. This is presented through an overview of frequent crashes, common crash categories, and new crashes, along with a graph showing all crashes for the app split out by version. This makes it much easier to spot big new crashes or problematic releases. The engineer then has the option to narrow the scope of their search and view the details of specific crashes.
While we’ve finished Crashcan v1 with much of the core functionality and gotten it in a stable enough state that we can depend on its data, there’s still quite a bit that we’d like to improve. For example, we haven’t even begun to implement a couple of the items we mentioned in our wish list, like custom alerting. Second, the user interface could do with some bugfixes and refinement. Right now, it’s in a mostly-usable state that other Etsy engineers can at least tolerate, but it’s not stable or refined enough that we’d be comfortable releasing it to a wider audience.
Additionally, our crash deduplication is still rudimentary. It only performs simple (and expensive) string comparisons to find similar crashes. We’d like to implement more advanced and more efficient crash deduplication using crash signature generation. This would give us a much more reliable way of determining when crashes are related, therefore providing a more accurate picture of how common each crash is.
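One possible shape for signature generation is sketched below, assuming Java-style stack traces. It strips line numbers from each frame, keeps the exception type and the top few frames, and hashes the result, so that traces differing only in line numbers collapse to the same signature. The normalization rules, the `top_frames` cutoff, and the example package names in the comments are assumptions, not a description of what Crashcan does.

```python
import hashlib
import re

# Matches the ':123)' line-number suffix of a frame such as
# 'at com.example.app.CartFragment.onResume(CartFragment.java:120)'.
_LINE_NUMBER = re.compile(r":\d+\)")


def crash_signature(stack_trace, top_frames=5):
    """Return a stable hash for a stack trace, ignoring line numbers."""
    lines = [line.strip() for line in stack_trace.strip().splitlines()]
    # 'java.lang.NullPointerException: message' -> exception type only.
    exception_type = lines[0].split(":", 1)[0] if lines else ""
    frames = [_LINE_NUMBER.sub(")", line)
              for line in lines if line.startswith("at ")]
    normalized = "\n".join([exception_type] + frames[:top_frames])
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()
```

Looking up a crash by signature is a single indexed comparison, in contrast to comparing every pair of traces as strings.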
Most of the pain points in Crashcan’s development weren’t new or especially unexpected, but they serve as a valuable reminder of some important considerations when building new applications.
- Build with monitoring and alerting in mind from the beginning. We could’ve avoided a several-week-long lapse in functionality had we focused on building in monitoring from the beginning.
- Don’t be afraid to consult with others on structural or technical decisions, and then just make a decision. It’s something that I’ve always struggled with – but getting blocked on making decisions or digging too deep into the minutiae of every decision is a great way to waste time.
- Document your assumptions – especially when dealing with APIs – as small assumptions can turn into big deals later on. This is what led to our biggest failure – we mistakenly assumed that crash timestamps were accurate. When a crash said it had occurred 4 days in the future, our app stopped updating, because it was checking only crashes occurring after that crash.
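The timestamp failure in the last point suggests a simple guard: never let a crash timestamp from the future advance the ingestion cursor. The sketch below is one way to express that, assuming timezone-aware timestamps; the function name and the five-minute clock skew allowance are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def next_cursor(current_cursor, crash_timestamps, now,
                max_skew=timedelta(minutes=5)):
    """Return the timestamp to resume ingestion from.

    Timestamps later than now + max_skew are treated as untrustworthy
    and are not allowed to move the cursor forward.
    """
    limit = now + max_skew
    trusted = [ts for ts in crash_timestamps if ts <= limit]
    return max([current_cursor] + trusted)
```

With this guard, a crash that claims to have occurred four days in the future is still ingested but no longer stalls the updates that follow it.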
External search engines like Google and Bing are a major source of traffic for Etsy, especially for our longer-tail, harder to find items, and thus Search Engine Optimization (SEO) is important in driving efficient listing discovery on our platform.
We want to make sure that our SEO strategy is data-driven and that we can be highly confident that whatever changes we implement will bring about positive results. At Etsy, we constantly run experiments to optimize the user experience and discovery across our platform, and we therefore naturally turned to experimentation for improving our SEO performance. While it is relatively simple to set up an experiment on-site on our own pages and apps, running experiments with SEO required changing how Etsy’s pages appeared in search engine results, over which we did not have direct control.
To overcome this limitation, we designed a slightly modified experimental design framework that allows us to effectively test how changes to our pages affect our SEO performance. This post explains the methodology behind our SEO testing, the challenges we have come across, and how we have resolved them.
For one of our experiments, we hypothesized that changing the titles our pages displayed in search results (a.k.a. ‘title tags’) could increase their clickthrough rate. Etsy has millions of pages generated off of user generated content that were suitable for a test. Many of these pages also receive the majority of their traffic through SEO.
Below is an example of a template we used when setting up a recent SEO title tag experiment.
We were inspired by SEO tests at Pinterest and Thumbtack and decided to set up a similar experiment where we randomly assigned our pages into different groups and applied different title tag phrasings shown above. We would measure the success of each test group by how much traffic it drove relative to the control groups. In this experiment, we also set up two control groups to have a higher degree of confidence in our results and to be able to quality check our randomized sampling once the experiment began.
We took a small sample of pages of a similar type while ensuring that our sample was large enough to allow us to reach statistical significance within a reasonable amount of time.
Because visits to individual pages are highly volatile, with many outliers and fluctuations from day to day, we had to create relatively large groups of 1000 pages each to expect to reach significance quickly. Furthermore, because of the high degree of variance across our pages, simple random sampling of our pages into test groups was creating test groups different from each other in a statistically significant way even before the experiment began.
To ensure our test groups were more comparable to each other, we used stratified sampling: we first ranked the pages to be included in the test by visits, broke them down into ntile groups, and then randomly assigned the pages from each ntile group to the test groups, making sure that each test group received a page from each ntile group. This ensured that our test groups were consistently representative of the overall sample and more reliably similar to each other.
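The assignment procedure can be sketched as follows. This is a simplified version that makes one assumption explicit: each ntile (stratum) contains exactly as many pages as there are groups, so every group receives one page from every stratum. The function name, group names, and fixed seed are illustrative.

```python
import random


def stratified_assign(page_visits, group_names, seed=0):
    """Assign pages to groups so each group gets one page per visit stratum.

    page_visits maps a page identifier to its historical visit count.
    """
    rng = random.Random(seed)
    # Rank pages from most to least visited.
    ranked = sorted(page_visits, key=page_visits.get, reverse=True)
    size = len(group_names)
    assignment = {name: [] for name in group_names}
    for start in range(0, len(ranked), size):
        stratum = ranked[start:start + size]  # pages with similar traffic
        order = list(group_names)
        rng.shuffle(order)                    # randomize within the stratum
        for name, page in zip(order, stratum):
            assignment[name].append(page)
    return assignment
```

Because every group draws from every traffic level, a handful of very high-traffic pages cannot end up concentrated in a single group.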
We then looked at the statistical metrics for each test group over the preceding time period, calculating the mean and standard deviation values by month and running t-tests to ensure the groups were not different from each other in a statistically significant way. All test groups passed this test.
Estimating Causal Impact
Although the test groups in our experiment were not different from each other at a statistically significant level before the experiment, there were small differences that prevented the estimation of the exact causal impact post treatment. For example, test group XYZ might see an increase relative to control B, but if control B was slightly better than test group XYZ even before the experiment began, simply taking the difference between the two groups would not be the best estimate of the difference the treatment had effected.
One common approach to resolve this problem is to calculate the difference of differences between the test and control groups pre- and post-treatment.
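As a small worked example of the difference of differences, the sketch below uses made-up visit counts purely for illustration; they are not results from the experiment.

```python
def difference_in_differences(test_pre, test_post, control_pre, control_post):
    """Change in the test group minus the change in the control group."""
    return (test_post - test_pre) - (control_post - control_pre)


# Hypothetical daily visits before and after the treatment.
test_pre, test_post = 100, 130
control_pre, control_post = 105, 115

naive_estimate = test_post - control_post  # ignores the pre-existing gap
did_estimate = difference_in_differences(
    test_pre, test_post, control_pre, control_post)
```

Here the naive post-treatment comparison gives 15 visits, while the difference of differences gives 20, because it accounts for the control group having started 5 visits ahead.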
While this approach would have worked well, it might have created two different estimated treatment effect sizes when comparing the test groups against the two different control groups. We decided that, instead, using Bayesian structural time series analysis to create a synthetic control group incorporating information from both the control groups would provide a cleaner analysis of the results.
In this approach, a machine learning model is trained using pre-treatment data to predict the performance of each test group based on its covariance relative to its predictors — in our case, the two control groups. Once the model is trained, it is used to generate the counterfactual, synthetic control groups for each of the test groups, simulating what would have happened had the treatment not been applied.
The causal impact analysis in this experiment was implemented using the CausalImpact package by Google.
We started seeing the effects of our test treatments as soon as a few days after the experiment start date. Even seemingly very subtle title tag changes resulted in large and statistically significant changes in traffic to our pages.
In some test groups, we saw significant gains in traffic.
While in others, we saw no change.
And in some others, we even saw a strong negative change in traffic.
The two control groups in this test showed no statistically significant difference compared to each other after the experiment. Although a slight change was detected, the effect did not reach significance.
Post-experiment rollout validation
Once we identified the best performing title tag, the treatment was rolled out across all test groups. The other groups experienced similar lifts in traffic and the variance across buckets disappeared, further validating our results.
The fact that our two control groups saw no change when compared to each other, and the fact that the other buckets experienced the same improvement in performance once the best performing treatment was applied to them, gave us a strong basis for confidence in the validity of our results.
It appeared in our results that shorter title tags performed better than longer ones. This might be because for shorter, better targeted title tags, there is a higher probability of a percentage match (that could be calculated using a metric like the Levenshtein Distance between the search query and the title tag) against any given user’s search query on Google.
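To make the Levenshtein-based percentage match concrete, here is a minimal sketch. The edit distance implementation is the standard dynamic programming algorithm; the `similarity` function is one possible way to turn it into a percentage and is an assumption for illustration, not a description of how Google ranks pages.

```python
def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(query, title):
    """Return a match score between 0.0 and 1.0 (1.0 means identical)."""
    longest = max(len(query), len(title))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(query.lower(), title.lower()) / longest
```

Under this measure, a short title that closely mirrors the query scores higher than a long title that contains the same words plus extra text.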
In a similar hypothesis, it might be that using well-targeted title tags that are more textually similar to common search terms helps to increase percentage match to Google search terms and therefore improves ranking.
However, it is likely that different strategies work well for different websites, and we would recommend rigorous testing to uncover the best SEO strategy tailored for each individual case.
- Have two control groups for A-A testing. This allowed us to have much greater confidence in our results.
- The CausalImpact package can be used to easily account for small differences in test vs. control groups and estimate the differences of treatments more accurately.
- For title tags, it is most likely a best practice to use phrasing and wording that maximizes the probability of a low Levenshtein distance match against popular target search queries on Google.
Visualization of Stratified Sampling
This post is based on a talk and workshop that Toria and Ian gave at Etsy’s Dublin office in August.
Etsy has a strong set of beliefs that underpins our engineering culture. We believe in code as craft. We believe that if it moves, you should graph it. And we believe that when you’ve got some working code ready, you should “just ship” it.
This practice of “just shipping” is known as continuous deployment. We make small changes frequently, and we hide them behind “config flags” that let us test our work incrementally before a full feature launch. Etsy engineers collectively deploy code to the production site as many as 70 times per day.
Now imagine for a minute that you’re an engineer at an organization doing continuous deployment. You’ve got a small change ready to deploy. Your code is good. Tests pass. It’s all been reviewed. But every time you try to deploy, something goes wrong. This happens all the time, but only to you. Every time you try to deploy, you have to spend half an hour trying to fix the deploy system. No one else is motivated to fix anything because it works just fine for them. The deploy system is better for everyone because of your investigations, but fixing the deploy system isn’t part of your job. You just want to ship code!
What would be great is if some other engineers would pitch in and do the work too, so that you have more time to do your actual job. What you need are allies.
Surprise! That was a thinly-veiled metaphor for what it feels like to be a member of an underrepresented group trying to improve their work environment. Relying on members of minority groups to shoulder the burden of diversity issues is just as flawed as expecting one person to do all the work to fix a broken deploy system. You can’t excel at your job when you spend half your time dealing with other stuff. We need ways of spreading the load. We need allies. And we hope that’s why you’re reading this now.
So what is an ally? Let’s start by defining some important terms so that we’re all on the same page.
Women, men, and non-binary people
At Etsy, we recognize that gender is non-binary: it lies on a spectrum. When we use the term “men” here, we’re talking about anybody who identifies as a man and experiences the benefits of male privilege. When we say “women”, we’re talking about anybody who identifies as a woman. Some people don’t fall into either of these categories: they are non-binary. Gender discrimination impacts these people too, and as such you’ll see references to them throughout this post.
Much of the discrimination that people face depends on how society identifies their gender, rather than how they themselves identify. A person with a beard is likely to be treated like a man regardless of their chosen gender, but they still have to deal with bias and prejudice in their daily life.
“Feminism is the radical notion that women are people.” — Marie Shear
Of course women (and non-binary folks) are people. But while we think “of course they’re people”, we tend to overlook the countless ways in which society as a whole undervalues women and their work: lower wages for the same work or overall lower wages in industries dominated by women, portrayals of women as prizes to be won or objects to have, and the ignoring or ridiculing of problems faced by women, to name but a few.
As you learn more about feminism, another term you’ll see is “intersectional feminism”. Intersectionality is the recognition that people are complex beings with multiple axes of identity. Although we’re talking primarily about gender here, a person’s identity is not solely defined by their gender. Intersectional feminism acknowledges that we can’t solve problems for all women without considering that women have different experiences based on their race, religion, sexuality, gender expression, or able-bodiedness.
Good news! Allyship is also intersectional! If you’re white, you can serve as an ally to people of color. If you can see, you can serve as an ally to people with vision loss. If you’re a man, you can serve as an ally to women. If you’re cisgender, you can serve as an ally to folks who are trans, non-binary, or genderqueer.
Consider intersectionality throughout this post. Ask yourself how these techniques for allyship can be applied for other underrepresented groups.
The idea of privilege is often a massive stumbling block for people. We rebel against the idea that we have had an unfair advantage in life. “I had to work hard,” you’ll hear people claim. “I’ve struggled for everything I’ve got.”
Privilege does not mean you had it easy. It means you had it easier. If a man grows up in poverty, and drags himself out of it, that’s impressive. That’s hard. If he’d been a woman, he’d have had to do all the same things, while also fighting society’s expectations of what women can or should do. Privilege is what you don’t have to deal with.
In the opening example, everyone else ships more than you—not because they’re better than you, but because they don’t have to deal with the additional nonsense that you do.
Understanding privilege—and understanding and accepting your own privilege—is a vital part of becoming an effective ally. You’re not being asked to beat yourself up about it, you’re being asked to empathize with others who are less privileged so that you can do something about it.
Along with “privilege”, “patriarchy” is another term that trips people up. It brings to mind a shadowy cabal of men pulling strings and malevolently excluding women. This is… silly.
Instead, the term “patriarchy” refers to structural sexism and gender discrimination. We are raised in a society that historically and systematically favors men over women. This colors everything we do and everything we see. We’re surrounded by the fruits of this bias, steeped in it from birth. Just one example: studies show that, from an early age, girls are held to higher standards of politeness, while boys are expected to speak dominantly and assertively, producing power imbalances in conversations that continue through to our adult interactions.
Patriarchy perpetuates itself. Not through conscious malevolence (most of the time), but because male-dominated power structures tend to stack the deck against women gaining power, and so produce more male-dominated power structures.
The perpetuation of the patriarchy is rooted in unconscious bias. These are biases we don’t even realize we have, but which influence how we think and act. They are instilled in us over the years by repetitive stimuli from our environment.
Consider the following story: “A man and his son were in a car accident. The man died on the way to the hospital, but the boy was rushed into surgery. The surgeon said: ‘I can’t operate! That’s my son!’”
The first time most people are presented with this, they fail to realize the surgeon is the boy’s mother. Mental blind spots like this one show that we are all a little bit sexist. (As a side note, this thought experiment has been around for many years. In recent years, respondents have often thought the surgeon was the boy’s other father. They are more willing to accept a gay male couple than a female surgeon.)
Dr. Catherine Ashcraft from the National Center for Women and Information Technology (NCWIT) gave a lecture at Etsy on unconscious bias. She talked about some experiments for quantifying gender bias. The NCWIT staff took these tests, and all the participants were found to be unconsciously biased against women. To repeat: women, working for the National Center for Women and Information Technology, working to bring gender diversity to our sector, were all biased against women.
We are all trained, over time, to have these habitual, instinctive responses to situations. When these unconscious biases are challenged, we tend to react negatively. For example, women who adopt more traditionally male behaviors and speech patterns in the workplace are often perceived more negatively than women who fit society’s expectations.
What we can do, however, is make conscious corrections. We can actively try to overcome these unconscious biases.
The best way to combat unconscious bias is to recognize that it exists and identify when it’s happening. It’s easy to identify and “call out” overtly sexist behavior, but what about the more subtle and ambiguous stuff?
Casual phrases like “you’re really good at sports for a girl!” or “going out with the guys tonight; leaving the old ball and chain at home!”, using gendered phrases like “the ops guys”, speaking over women in meetings, repeating their ideas as your own, expecting them to do clerical work like note-taking, or standing over them at a desk in a dominant position: these are examples of microaggressions. They’re the “little things” that, examined individually, don’t always seem like a big enough deal to make a fuss over. “Maybe it was a joke?” “Maybe he didn’t mean it that way?” “It’s just an expression!”
But microaggressions are cumulative. Over time, these subtle comments build and reinforce traditional power structures by reminding women and non-binary individuals of their position in society.
We must notice these subtle, often unconscious microaggressions in others—and in ourselves—in order to correct them.
And that brings us to “ally”: the key part of this post. An ally is a member of a privileged group (in this case, men) who works to enable opportunity, access, and equality for members of a non-privileged group (in this case, women and non-binary people). They are using their privilege, their advantages, to bring about change.
How can allies help?
So… centuries—millennia!—of systematic discrimination against women. Biases baked into us from birth! Society fundamentally biased against women! This is an overwhelming problem. It’s hard to know where to start.
Like any large, complex problem, begin by breaking it down into smaller, more manageable parts. Start at your workplace. If you can make a difference there, you not only improve the lives of the less-privileged people you work with, but you also improve your working environment. Research shows that more diverse teams, with more diverse perspectives and experiences, make better decisions and build better products.
Start today. You’ve read this far, so you’re already interested in making a difference. Don’t wait until you’re an “expert” on feminist theory to start speaking up. Just start trying. And just like with continuous deployment, when you mess up (and everyone does), get feedback, listen, learn, fix the problem, and try again.
Now you’re ready to start, but as a member of a privileged group, what can you do? What do allies offer?
- Power and authority. In male-dominated power structures, men tend to have more powerful and influential positions and voices. Use those voices to speak on behalf of those with less privilege.
- Access. Our networks tend to look like ourselves, so men tend to have networks full of other (powerful) men. Provide access to these networks.
- Amplification. In addition to speaking up on behalf of women and non-binary individuals, men should amplify and endorse their words and achievements.
- Modeling. Allies model good behavior and interactions, such as talking openly with others about gender discrimination or being vocal about addressing their unconscious biases.
- Teaching. Expose other privileged people to the concepts you learn about. People from marginalized groups are often expected to do the teaching and allies can share the load.
Ten Steps to Being An Effective Ally
Being an ally is a constant learning experience. It isn’t a fixed state, and it isn’t a badge that you earn (or take), sew onto your sleeve, and wear from then on. Being open to feedback and demonstrating that you’re willing to accept and learn from criticism is vital. More than anything, “ally” is a status accorded to you by those that you’re trying to help, based on your words and actions.
So, how do we ally?
1. Educate yourself
There are a ton of resources out there for you to learn from. Make the effort to educate yourself, rather than demanding that marginalized people explain things to you. You wouldn’t ask Rasmus Lerdorf, inventor of the PHP programming language, to explain basic PHP concepts. You would Google it. You would go out and find the articles, tutorials, and forum threads that already exist for beginners. There is already material for you to learn from: go out, find it, and read it. (We’ve created a reading list that would make a great starting point.)
While you’re reading, be aware that feminism isn’t a monolithic block of thought. There is a wide variety of viewpoints on the topic. Be sensitive to the possibility that what you’ve learned is just one viewpoint.
As an ally, you will never stop learning. Keep actively seeking out new writing and material so that you can deepen your understanding.
2. Expand your network
A great way to expand your understanding of feminism and gender issues is to expand and diversify your network. Make sure to follow your female and non-binary colleagues on social media. Then, make a habit of following the other folks they retweet or mention.
If you’d like to introduce yourself to a woman at your workplace or at a conference, do so! Just remember to keep the discussion technical and on-topic: talk to them because you’d like to know more about that new machine learning model they implemented, not because you need more diverse friends.
3. Listen and believe
Now that you have a good number of women and non-binary folks in your network, listen to them! Arguably the biggest thing you can do as an ally is to listen. Listen to the stories of the difficulties they’ve faced and the problems they’re experiencing in the workplace. When you hear their stories, especially ones that don’t fit with your mental model of your workplace or environment, believe them. No “aren’t you over-reacting?” No “I think you’ve misunderstood.” If they tell you there’s a problem, there’s a problem. So listen.
After listening, ask how you can help. Ask how you can support them in resolving the problem. It doesn’t have to be you doing the solving—your colleagues aren’t helpless damsels in distress—but your support can be invaluable.
One of the most difficult things to listen to is criticism of yourself and your actions. You still need to listen and believe and learn.
But just because you haven’t been told there’s a problem, that doesn’t mean there isn’t one. Speaking about experiences of discrimination is often very difficult, because it tends to be very, very risky. Marginalized people who report discrimination often find that doing so negatively impacts their careers. When they raise issues, they get labeled complainers or trouble-makers, while those they complain about see no consequences or repercussions for their actions.
Remember that you have no reason to expect that they will share their stories or concerns with you. These are not conversations that men, even well-meaning allies, should initiate. Don’t ask for these conversations, but when they happen, listen and believe.
4. Notice the small stuff
Your colleagues aren’t going to tell you about every bad experience; in fact, they won’t tell you about most of them. You can help by noticing problems by yourself and addressing them.
Microaggressions—the small stuff—are some of the most subtly toxic behaviors that women and non-binary people have to deal with. Microaggressions slowly eat away at their self-confidence and patience.
When you see some small inequity, mention it. If a colleague interrupts a woman, say, “I’d like to hear what __ was saying”. If a colleague assumes a woman will take notes, say, “I think __ could have some useful insights on this topic—could somebody else take notes so that she can participate more actively?” or possibly, “Have we considered a formal note-taking rotation to ensure that we’re not making gendered assumptions about who will do the clerical work?”
Try to also consider whether your comment will put the colleague who suffered the inequity in an uncomfortable position. If you’re not sure what to do, you should wait, talk to them privately, then defer to their decision on what further action should be taken. They may wish for you to speak with the person directly on their behalf or they may prefer for you to go to a manager. They may not want you to do anything at all (perhaps because they have plans to address this on their own). Sometimes the very recognition of the microaggression is enough! Remember: they don’t need you to save them, but your support and validation can be very valuable.
If you raise things like this with a colleague, it may feel “nitpicky”. It will certainly feel uncomfortable. But many productive and important conversations are uncomfortable! As an ally, you should be prepared to shoulder a bit of the discomfort and awkwardness that women and non-binary people experience every day.
You can also work on “anti-microaggressions”—small acts to nudge the culture in the opposite direction. Examples might include making sure a diverse range of people are featured in your illustrations, slide decks, user stories, etc., or that you pay attention to gendered language in your tools.
5. Teach others
Another way you can share the load is by teaching. Women and non-binary individuals are constantly expected to teach others about feminism and gender issues. It can be a great burden. Help them out by doing some of the teaching.
Have honest conversations with the people you work with, particularly if you observe behaviors that you know (or suspect) may have a discriminatory effect on your other colleagues. Remember that, most of the time, these behaviors are unconscious, or learned in different work environments. Talking about negative behaviors without blame and educating the men you work with helps them become better colleagues, and in the vast majority of cases it’ll be well-received.
In addition to educating other men, encourage them to speak up if they see instances of bias. The more men there are working on this, the easier it will be to make your workplace a more egalitarian environment.
6. Amplify and endorse
There is no point in having equality of numbers if there is no equality of influence. As such, we have to make sure that people from underrepresented groups are heard in meetings, that they have a chance to speak, and that their views are considered and respected. The frustration of not being able to contribute, or being ignored or belittled, is a fast track to quitting.
One type of unconscious bias is called “listener bias”. We are socialized to think that women talk more in general, and so tend to significantly overestimate the actual amount of time women spend talking in discussions, to the extent that we can think that women are dominating conversation when in fact men are doing most of the talking. As always, be aware of this unconscious bias. Correct for it by inviting your female and non-binary colleagues to offer their opinion in a meeting.
Make sure that women and non-binary individuals in your company have the opportunity to work on high-profile projects. If you make staffing decisions, pay attention to gender bias when considering who gets what role. If you’re not making those decisions, you can still advocate and lobby for them within your organization. Support and encourage them, but don’t micromanage them, or do all the work for them. Trust their expertise. You hired them, so they must be talented. If you don’t make use of their talents, not only do you lose out in the short term, but they’ll also eventually quit and you’ll lose out massively in the long term.
Also make sure they get credit for their accomplishments and contributions. Make sure they get to brag about what they’ve achieved. Approve of this behavior, rather than branding them as arrogant or conceited. Remember that society tends to consider modesty a virtue for women, but not men.
Amplify their voices outside of your workplace, too. If you’re invited to speak in public, ask yourself if there’s a woman or non-binary individual equally—or more!—qualified to speak on the topic. Pay attention to gender balance in panels and speaker line-ups at conferences you’re planning to participate in. Ask the organizer why their panel lacks diversity. Ask to see their Code of Conduct, and if they don’t have one, encourage them to change that. Consider not attending events without a code of conduct or refusing to sit on a panel that only includes men.
Social media is another excellent way to increase the visibility of underrepresented genders. If you’ve followed the advice earlier, you’re following women and non-binary folks on social media and have diversified your network, but consider also retweeting and promoting them. If they share a blog post, consider retweeting them instead of writing your own tweet with a link to the same content. Amplify their voices. Even small acts like retweeting can greatly increase their visibility and introduce your followers to more diverse opinions and ideas.
7. Recruit fairly
You know what else helps with gender diversity? Having more diverse people on staff! This might feel like it’s easier said than done, but there are concrete steps you can take to increase the gender diversity of your team.
The first step, which we’ve already addressed, is expanding your network. We tend to do a lot of recruitment from our personal networks, so having a diverse network can make a tremendous impact on the variety of candidates we can recruit.
Take the time and effort to review your job postings for gendered language: could your words make someone feel excluded or unqualified? Look at where your jobs are advertised: are you going to reach a diverse audience?
After you’ve established a diverse pool of applicants, you need to make sure the rest of the process is as fair and unbiased as possible.
When reviewing résumés, be explicitly aware of your unconscious biases to make sure you don’t filter candidates out for the wrong reasons. This doesn’t mean you’re purposely rejecting someone just because you think they’re female. Rather, you might reject someone because they haven’t described their accomplishments the way you might expect. Remember: women are conditioned to be modest and may under-report all the good stuff they’ve done.
There may be other reasons why they don’t conform to your preconceptions of the “ideal candidate”. For example, maybe you’d expect someone with their experience to have a long history of giving conference talks, but they haven’t been speaking at conferences because they perceive conferences as hostile environments.
When it comes to interview time, be mindful of the fact that there are a myriad of ways to be a successful employee and different candidates will excel in different environments. Using a diverse set of interview styles is beneficial for all candidates. Not everyone does well in aggressive “knowledge test”-style interviews. Some are better on a whiteboard, some are better at a keyboard, others respond well to discussion.
This is not to say that we should lower the bar for recruitment; rather, we should accept that we may be using the wrong measuring stick. Expecting everyone to act and respond in a particular way is the very opposite of recruiting for diverse viewpoints and experiences.
On the subject of recruiting women, it’s worth addressing “the pipeline problem”. This is the idea that we can’t hire more women because women aren’t studying computer science. This is somewhat correct, but entirely misleading. Women are not earning computer science degrees at the same rate as men, it’s true, but the number of women active in the industry is much lower than the total number of women with relevant degrees (and that’s not counting the women who are capable self-taught programmers). Today, women earn 18% of CS degrees. In 1984, they earned 37% of CS degrees. The women who graduated then are only in their 50s and could still be active in the industry. What happened to them? Clearly, the pipeline is not the only problem.
What good does it do us if we hire a load of great women and non-binary people, then they all quit because they arrive in a toxic work environment? What if the pipeline leads to a sewage plant?
8. Model and support sustainable work
In tech, particularly, women quit the industry completely with much higher frequency than men. They often leave not just because of sexist behaviors directly, but for a variety of complex reasons.
The expectations of the workplace can place an unreasonable load on all employees. Men are generally expected to meet those demands at the expense of family and personal life, while women are expected to do the opposite. The assumption that women will not have the time to meet these unreasonable demands is one way that society justifies the wage gap. Then, if a couple decides that one of them should stay home to care for the family, who do you think typically quits their job? The woman! Because we pay her less! But we pay her less because we expected her to leave!
In order to keep women in the industry, we need to pay them equally. More than that: we need to create a culture that supports sustainable work in a way that doesn’t pit employees’ personal and professional lives against each other. In doing so, companies invest in employees’ overall health, happiness, and engagement in their work. Your company may have unlimited vacation time or flexible working arrangements, but do your employees feel comfortable actually using those benefits?
Allies can help by actively participating in and supporting a sustainable work culture. They can normalize behaviors such as taking vacation, taking time for family, not working all hours, etc. Etsy’s CEO Chad Dickerson, for example, took full advantage of Etsy’s parental leave benefits (5 weeks at the time, now 26 weeks for all new parents) to help care for his family. More leaders should demonstrate that you can lead robust personal and professional lives that can enhance and support each other.
People of all genders should certainly still be able to opt out of the workplace to concentrate on their families, but that should be a choice, rather than an ultimatum.
9. Don’t lead, follow
Allies are there to share the load, not to take the lead. Allies simply haven’t lived the same experiences as those with whom they are allied. No amount of listening and learning will give you first-hand understanding of a person’s experiences!
Men are typically used to leading and taking charge, but women and non-binary individuals are perfectly capable of fighting their battles and defending themselves: they don’t need a man to step in to save them. What they need from men is support and understanding to make it easier, and for men to do their part so that eventually those battles don’t have to be fought in the first place.
10. Show up
Show up. Every day. Allyship isn’t something you can do in your spare time or only when it’s convenient for you. It’s effort, it’s work—often hard work. Show up, every day, and don’t let it slip.
Showing up includes a healthy dose of self-reflection and self-awareness. Think carefully about your own actions and behaviors—remember that unconscious bias is deeply entrenched and will rear up when you least expect it.
And don’t stop at supporting women and non-binary people at work. Learn about the issues faced by other underrepresented groups and how to apply your allyship skills to supporting them too.
Don’t expect a cookie, though. Actively working to correct injustices should be the baseline, not something special you deserve to be rewarded for. Do the work because the work matters, not because it looks good on your résumé, and give credit to those who helped you get there.
Being an ally is hard. It takes time and work and effort. Fundamentally, men could avoid this time and work and effort. Society doesn’t expect men to be allies. Men have the privilege of being able to ignore these problems if they want to. We hope this post has helped to persuade you that being an ally is important, but also achievable. You can make a difference—a huge difference—if you step up.
The material for this post was inspired (and immeasurably improved) by many women and non-binary people—at Etsy and beyond—who shared their knowledge and experience with us. We’re grateful for their time and effort.
We’d also like to acknowledge the contributions and feedback from men at Etsy who have reflected on their successes—and failures—as allies and shared what they’ve learned.
We also owe a debt to some of the resources made available by NCWIT and The Ada Initiative, as well as the countless people who have written books, blog posts, and talks that have helped us gain a better understanding of this complex topic.
This post references a number of external studies and articles on the research behind issues of diversity in tech and society in general, which are listed below. For more information on the business of allyship, check out our list of recommended reading for allies.
- Understanding the Gender Pay Gap, Payscale
- Argument Cultures and Unregulated Aggression, Heddleston
- Women bosses more likely to be called ‘bitchy’, ‘emotional’ and ‘bossy’, Sheffield
- The abrasiveness trap, Snyder
- We’re Making the Wrong Case for Diversity in Silicon Valley, Pittinsky
- The Persistence of Retaliation Against Employees, Knezevich
- I’m a Slack designer, and my world changed when I made an emoji with brown skin like mine, Brito
- Save us, Princess!, Lettvin
- Women at the White House have started using a simple, clever trick to get heard, Werber
- Prattle of the sexes, Hammond
- Speaker sex and perceived apportionment of talk, Cutler, Scott
- Code of Conduct 101 + FAQ, Dryden
- Why Women Don’t Apply for Jobs Unless They’re 100% Qualified, Mohr
- If you think women in tech is just a pipeline problem, you haven’t been paying attention, Thomas
- Women computer science grads: The bump before the decline, Mitchell
- Women In Tech: The Facts, Ashcraft, McClain, Eger
- Why men fear paternity leave, Paquette
- Strong Families, Strong Business: A Step Forward in Parental Leave at Etsy, Gorman