SEO Title Tag Optimization at Etsy: Experimental Design and Causal Inference

Posted on October 25, 2016

External search engines like Google and Bing are a major source of traffic for Etsy, especially for our longer-tail, harder-to-find items, and thus Search Engine Optimization (SEO) is important in driving efficient listing discovery on our platform.

We want to make sure that our SEO strategy is data-driven and that we can be highly confident that whatever changes we implement will bring about positive results. At Etsy, we constantly run experiments to optimize the user experience and discovery across our platform, and we therefore naturally turned to experimentation for improving our SEO performance. While it is relatively simple to set up an experiment on-site on our own pages and apps, running experiments with SEO required changing how Etsy’s pages appeared in search engine results, over which we did not have direct control.

To overcome this limitation, we designed a slightly modified experimental design framework that allows us to effectively test how changes to our pages affect our SEO performance. This post explains the methodology behind our SEO testing, the challenges we have come across, and how we have resolved them.

Experiment Methodology

For one of our experiments, we hypothesized that changing the titles our pages displayed in search results (a.k.a. ‘title tags’) could increase their clickthrough rate. Etsy has millions of pages generated from user-generated content that were suitable for a test. Many of these pages also receive the majority of their traffic through SEO.

Below is an example of a template we used when setting up a recent SEO title tag experiment.


We were inspired by SEO tests at Pinterest and Thumbtack and decided to set up a similar experiment where we randomly assigned our pages into different groups and applied the different title tag phrasings shown above. We would measure the success of each test group by how much traffic it drove relative to the control groups. In this experiment, we also set up two control groups to have a higher degree of confidence in our results and to be able to quality check our randomized sampling once the experiment began.


We took a small sample of pages of a similar type while ensuring that our sample was large enough to allow us to reach statistical significance within a reasonable amount of time.


Because visits to individual pages are highly volatile, with many outliers and fluctuations from day to day, we had to create relatively large groups of 1000 pages each to expect to reach significance quickly. Furthermore, because of the high degree of variance across our pages, simple random sampling of our pages into test groups was creating test groups different from each other in a statistically significant way even before the experiment began.

To ensure our test groups were more comparable to each other, we used stratified sampling: we first ranked the pages in the test by visits, broke them down into ntile groups, and then randomly assigned the pages from each ntile group to the test groups, making sure each test group received a page from every ntile group. This ensured that our test groups were consistently representative of the overall sample and more reliably similar to each other.
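The stratified assignment described above can be sketched in a few lines of Python. This is only an illustration, not the code we ran: the function name, the seed, and the choice to make each ntile exactly as large as the number of groups (so every group gets one page per ntile, with any leftover pages dropped) are assumptions.

```python
import random
from collections import defaultdict


def stratified_assign(page_visits, group_names, seed=0):
    """Assign pages to groups so each group draws one page from every stratum.

    page_visits: dict mapping page id -> historical visit count.
    group_names: list of test/control group labels.
    Assumption: each stratum holds exactly len(group_names) pages.
    """
    rng = random.Random(seed)
    # Rank pages by visits, highest first.
    ranked = sorted(page_visits, key=page_visits.get, reverse=True)
    k = len(group_names)
    assignment = defaultdict(list)
    # Walk the ranking in consecutive strata of k pages; drop any remainder.
    for start in range(0, len(ranked) - len(ranked) % k, k):
        stratum = ranked[start:start + k]
        # Shuffle within the stratum so group membership is random.
        rng.shuffle(stratum)
        for group, page in zip(group_names, stratum):
            assignment[group].append(page)
    return dict(assignment)
```

Because every group receives exactly one page from each band of the ranking, no group can end up with a disproportionate share of the highest-traffic pages.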


We then looked at the statistical metrics for each test group over the preceding time period, calculating the mean and standard deviation values by month and running t-tests to ensure the groups were not different from each other in a statistically significant way. All test groups passed this test.
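As a rough illustration of this pre-experiment check, the sketch below computes a two-sample t statistic from per-group visit samples using only the Python standard library. The post does not say which t-test variant was used; Welch's unequal-variance form is assumed here, and the function name is hypothetical. Turning the statistic into a p-value would normally be left to a statistics library.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    mean_a, mean_b = statistics.fmean(sample_a), statistics.fmean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each group.
    se_a, se_b = var_a / n_a, var_b / n_b
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df
```

A t statistic close to zero for every pair of groups over the pre-period is what "not different in a statistically significant way" looks like in practice.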


Estimating Causal Impact

Although the test groups in our experiment were not different from each other at a statistically significant level before the experiment, there were small differences that prevented the estimation of the exact causal impact post treatment. For example, test group XYZ might see an increase relative to control B, but if control B was performing slightly better than test group XYZ even before the experiment began, simply taking the difference between the two groups would not be the best estimate of the effect of the treatment.

One common approach to resolve this problem is to calculate the difference of differences between the test and control groups pre- and post-treatment.
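A minimal sketch of the difference-of-differences calculation, assuming each argument is a list of daily visit counts for one group in one period (the function and argument names are illustrative):

```python
from statistics import fmean


def diff_in_diff(test_pre, test_post, control_pre, control_post):
    """Change in the test group minus change in the control group."""
    test_change = fmean(test_post) - fmean(test_pre)
    control_change = fmean(control_post) - fmean(control_pre)
    # Subtracting the control's change removes trends shared by both groups.
    return test_change - control_change
```

For instance, if the test group's average rises from 100 to 130 visits while the control's rises from 110 to 120, the estimate is 30 minus 10, or 20 visits, rather than the naive post-period gap of 10.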

While this approach would have worked well, it might have created two different estimated treatment effect sizes when comparing the test groups against the two different control groups. We decided that, instead, using Bayesian structural time series analysis to create a synthetic control group incorporating information from both the control groups would provide a cleaner analysis of the results.

In this approach, a machine learning model is trained using pre-treatment data to predict the performance of each test group based on its covariance relative to its predictors — in our case, the two control groups. Once the model is trained, it is used to generate the counterfactual, synthetic control groups for each of the test groups, simulating what would have happened had the treatment not been applied.

The causal impact analysis in this experiment was implemented using the CausalImpact package by Google.
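CausalImpact is an R package that fits a Bayesian structural time series model. The Python sketch below is not that model: it swaps in ordinary least squares on the two control series purely to illustrate the idea of training on pre-treatment data and then projecting a counterfactual. All function names are hypothetical, and unlike CausalImpact it produces no uncertainty intervals.

```python
def fit_ols(rows, targets):
    """Least-squares coefficients for targets ~ rows (rows include an intercept)."""
    n = len(rows[0])
    # Build the normal equations (X^T X) beta = X^T y as an augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(rows, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def estimate_lift(test, control_a, control_b, treatment_start):
    """Average post-treatment gap between observed and counterfactual traffic."""
    design = [[1.0, a, b] for a, b in zip(control_a, control_b)]
    # Train only on the pre-treatment period.
    beta = fit_ols(design[:treatment_start], test[:treatment_start])
    # Project what the test group would have done without the treatment.
    counterfactual = [sum(c * x for c, x in zip(beta, row)) for row in design]
    gaps = [obs - pred for obs, pred in
            zip(test[treatment_start:], counterfactual[treatment_start:])]
    return sum(gaps) / len(gaps)
```

Because both control groups enter the model as predictors, each test group gets a single synthetic control and therefore a single effect estimate.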


We started seeing the effects of our test treatments as soon as a few days after the experiment start date. Even seemingly very subtle title tag changes resulted in large and statistically significant changes in traffic to our pages.

In some test groups, we saw significant gains in traffic.


While in others, we saw no change.


And in some others, we even saw a strong negative change in traffic.


A-A Testing

The two control groups in this test showed no statistically significant difference compared to each other after the experiment. Although a slight change was detected, the effect did not reach significance.

Post-experiment rollout validation

Once we identified the best performing title tag, the treatment was rolled out across all test groups. The other groups experienced similar lifts in traffic and the variance across buckets disappeared, further validating our results.


The fact that our two control groups saw no change when compared to each other, and also the fact that the other buckets experienced the same improvement in performance once the best performing treatment was applied to them, gave us a strong basis for confidence in the validity of our results.


It appeared in our results that shorter title tags performed better than longer ones. This might be because for shorter, better-targeted title tags, there is a higher probability of a percentage match (that could be calculated using a metric like the Levenshtein distance between the search query and the title tag) against any given user’s search query on Google.
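To make the "percentage match" idea concrete, here is one way it could be computed from the Levenshtein distance. The normalization (one minus the distance divided by the longer string's length) and the function names are assumptions for illustration; we do not know how Google actually scores the similarity between a query and a title tag.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def percent_match(query, title):
    """Similarity in [0, 1]: 1 minus edit distance over the longer length."""
    longest = max(len(query), len(title))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(query.lower(), title.lower()) / longest
```

Under this definition, every extra word in a title that the searcher did not type adds to the edit distance, so a short title that closely mirrors the query scores higher than a long one that merely contains it.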

In a similar hypothesis, it might be that using well-targeted title tags that are more textually similar to common search terms helps to increase percentage match to Google search terms and therefore improves ranking.

However, it is likely that different strategies work well for different websites, and we would recommend rigorous testing to uncover the best SEO strategy tailored for each individual case.



Image Credits:

Visualization of Stratified Sampling


Being an Effective Ally to Women and Non-Binary People

Posted on October 19, 2016

This post is based on a talk and workshop that Toria and Ian gave at Etsy’s Dublin office in August.

Etsy has a strong set of beliefs that underpins our engineering culture. We believe in code as craft. We believe that if it moves, you should graph it. And we believe that when you’ve got some working code ready, you should “just ship” it.

This practice of “just shipping” is known as continuous deployment. We make small changes frequently, and we hide them behind “config flags” that let us test our work incrementally before a full feature launch. Etsy engineers collectively deploy code to the production site as many as 70 times per day.

Now imagine for a minute that you’re an engineer at an organization doing continuous deployment. You’ve got a small change ready to deploy. Your code is good. Tests pass. It’s all been reviewed. But every time you try to deploy, something goes wrong. This happens all the time, but only to you. Every time you try to deploy, you have to spend half an hour trying to fix the deploy system. No one else is motivated to fix anything because it works just fine for them. The deploy system is better for everyone because of your investigations, but fixing the deploy system isn’t part of your job. You just want to ship code!

What would be great is if some other engineers would pitch in and do the work too, so that you have more time to do your actual job. What you need are allies.

Surprise! That was a thinly-veiled metaphor for what it feels like to be a member of an underrepresented group trying to improve their work environment. Relying on members of minority groups to shoulder the burden of diversity issues is just as flawed as expecting one person to do all the work to fix a broken deploy system. You can’t excel at your job when you spend half your time dealing with other stuff. We need ways of spreading the load. We need allies. And we hope that’s why you’re reading this now.


So what is an ally? Let’s start by defining some important terms so that we’re all on the same page.

Women, men, and non-binary people

At Etsy, we recognize that gender is non-binary: it lies on a spectrum. When we use the term “men” here, we’re talking about anybody who identifies as a man and experiences the benefits of male privilege. When we say “women”, we’re talking about anybody who identifies as a woman. Some people don’t fall into either of these categories: they are non-binary. Gender discrimination impacts these people too, and as such you’ll see references to them throughout this post.

Much of the discrimination that people face depends on how society identifies their gender, rather than how they themselves identify. A person with a beard is likely to be treated like a man regardless of their chosen gender, but they still have to deal with bias and prejudice in their daily life.


“Feminism is the radical notion that women are people.” — Marie Shear

Of course women (and non-binary folks) are people. But while we think “of course they’re people”, we tend to overlook the countless ways in which society as a whole undervalues women and their work: lower wages for the same work or overall lower wages in industries dominated by women, portrayals of women as prizes to be won or objects to have, and the ignoring or ridiculing of problems faced by women, to name but a few.

Intersectional feminism

As you learn more about feminism, another term you’ll see is “intersectional feminism”. Intersectionality is the recognition that people are complex beings with multiple axes of identity. Although we’re talking primarily about gender here, a person’s identity is not solely defined by their gender. Intersectional feminism acknowledges that we can’t solve problems for all women without considering that women have different experiences based on their race, religion, sexuality, gender expression, or able-bodiedness.

Good news! Allyship is also intersectional! If you’re white, you can serve as an ally to people of color. If you can see, you can serve as an ally to people with vision loss. If you’re a man, you can serve as an ally to women. If you’re cisgender, you can serve as an ally to folks who are trans, non-binary, or genderqueer.

Consider intersectionality throughout this post. Ask yourself how these techniques for allyship can be applied for other underrepresented groups.


The idea of privilege is often a massive stumbling block for people. We rebel against the idea that we have had an unfair advantage in life. “I had to work hard,” you’ll hear people claim. “I’ve struggled for everything I’ve got.”

Privilege does not mean you had it easy. It means you had it easier. If a man grows up in poverty, and drags himself out of it, that’s impressive. That’s hard. If he’d been a woman, he’d have had to do all the same things, while also fighting society’s expectations of what women can or should do. Privilege is what you don’t have to deal with.

In the opening example, everyone else ships more than you—not because they’re better than you, but because they don’t have to deal with the additional nonsense that you do.

Understanding privilege—and understanding and accepting your own privilege—is a vital part of becoming an effective ally. You’re not being asked to beat yourself up about it, you’re being asked to empathize with others who are less privileged so that you can do something about it.


Along with “privilege”, “patriarchy” is another term that trips people up. It brings to mind a shadowy cabal of men pulling strings and malevolently excluding women. This is… silly.

Instead, the term “patriarchy” refers to structural sexism and gender discrimination. We are raised in a society that historically and systematically favors men over women. This colors everything we do and everything we see. We’re surrounded by the fruits of this bias, steeped in it from birth. Just one example: studies show that, from an early age, girls are held to higher standards of politeness, while boys are expected to speak dominantly and assertively, producing power imbalances in conversations that continue through to our adult interactions.

Patriarchy perpetuates itself. Not through conscious malevolence (most of the time), but because male-dominated power structures tend to stack the deck against women gaining power, and so produce more male-dominated power structures.

Unconscious bias

The perpetuation of the patriarchy is rooted in unconscious bias. These are biases we don’t even realize we have, but which influence how we think and act. They are instilled in us over the years by repetitive stimuli from our environment.

Consider the following story: “A man and his son were in a car accident. The man died on the way to the hospital, but the boy was rushed into surgery. The surgeon said: ‘I can’t operate! That’s my son!’”

The first time most people are presented with this, they fail to realize the surgeon is the boy’s mother. Mental blind spots like this one show that we are all a little bit sexist. (As a side note, this thought experiment has been around for many years. In recent years, respondents have often thought the surgeon was the boy’s other father. They are more willing to accept a gay male couple than a female surgeon.)

Dr. Catherine Ashcraft from the National Center for Women and Information Technology (NCWIT) gave a lecture at Etsy on unconscious bias. She talked about some experiments for quantifying gender bias. The NCWIT staff took these tests, and all the participants were found to be unconsciously biased against women. To repeat: women, working for the National Center for Women and Information Technology, working to bring gender diversity to our sector, were all biased against women.

We are all trained, over time, to have these habitual, instinctive responses to situations. When these unconscious biases are challenged, we tend to react negatively. For example, women who adopt more traditionally male behaviors and speech patterns in the workplace are often perceived more negatively than women who fit society’s expectations.

What we can do, however, is make conscious corrections. We can actively try to overcome these unconscious biases.


The best way to combat unconscious bias is to recognize that it exists and identify when it’s happening. It’s easy to identify and “call out” overtly sexist behavior, but what about the more subtle and ambiguous stuff?

Casual phrases like “you’re really good at sports for a girl!” or “going out with the guys tonight; leaving the old ball and chain at home!”, using gendered phrases like “the ops guys”, speaking over women in meetings, repeating their ideas as your own, expecting them to do clerical work like note-taking, or standing over them at a desk in a dominant position: these are examples of microaggressions. They’re the “little things” that, examined individually, don’t always seem like a big enough deal to make a fuss over. “Maybe it was a joke?” “Maybe he didn’t mean it that way?” “It’s just an expression!”

But microaggressions are cumulative. Over time, these subtle comments build and reinforce traditional power structures by reminding women and non-binary individuals of their position in society.

We must notice these subtle, often unconscious microaggressions in others—and in ourselves—in order to correct them.


And that brings us to “ally”: the key part of this post. An ally is a member of a privileged group (in this case, men) who works to enable opportunity, access, and equality for members of a non-privileged group (in this case, women and non-binary people). They are using their privilege, their advantages, to bring about change.

How can allies help?

So… centuries—millennia!—of systematic discrimination against women. Biases baked into us from birth! Society fundamentally biased against women! This is an overwhelming problem. It’s hard to know where to start.

Like any large, complex problem, begin by breaking it down into smaller, more manageable parts. Start at your workplace. If you can make a difference there, you not only improve the lives of the less-privileged people you work with, but you also improve your working environment. Research shows that more diverse teams, with more diverse perspectives and experiences, make better decisions and build better products.

Start today. You’ve read this far, so you’re already interested in making a difference. Don’t wait until you’re an “expert” on feminist theory to start speaking up. Just start trying. And just like with continuous deployment, when you mess up (and everyone does), get feedback, listen, learn, fix the problem, and try again.

Now you’re ready to start, but as a member of a privileged group, what can you do? What do allies offer?

Ten Steps to Being An Effective Ally

Being an ally is a constant learning experience. Being an ally isn’t a fixed state; it’s not a badge you earn (or take) and sew onto your sleeve so that you’re an ally from then on. Being open to feedback and demonstrating that you’re willing to accept and learn from criticism is vital. More than anything, “ally” is a status accorded to you by those that you’re trying to help, based on your words and actions.

So, how do we ally?

1. Educate yourself

There are a ton of resources out there for you to learn from. Make the effort to educate yourself, rather than demanding that marginalized people explain things to you. You wouldn’t ask Rasmus Lerdorf, inventor of the PHP programming language, to explain basic PHP concepts. You would Google it. You would go out and find the articles, tutorials, and forum threads that already exist for beginners. There is already material for you to learn from: go out, find it, and read it. (We’ve created a reading list that would make a great starting point.)

While you’re reading, be aware that feminism isn’t a monolithic block of thought. There is a wide variety of viewpoints on the topic. Be sensitive to the possibility that what you’ve learned is just one viewpoint.

As an ally, you will never stop learning. Keep actively seeking out new writing and material so that you can deepen your understanding.

2. Expand your network

A great way to expand your understanding of feminism and gender issues is to expand and diversify your network. Make sure to follow your female and non-binary colleagues on social media. Then, make a habit of following the other folks they retweet or mention.

If you’d like to introduce yourself to a woman at your workplace or at a conference, do so! Just remember to keep the discussion technical and on-topic: talk to them because you’d like to know more about that new machine learning model they implemented, not because you need more diverse friends.

3. Listen and believe

Now that you have a good number of women and non-binary folks in your network, listen to them! Arguably the biggest thing you can do as an ally is to listen. Listen to the stories of the difficulties they’ve faced and the problems they’re experiencing in the workplace. When you hear their stories, especially ones that don’t fit with your mental model of your workplace or environment, believe them. No “aren’t you over-reacting?” No “I think you’ve misunderstood.” If they tell you there’s a problem, there’s a problem. So listen.

After listening, ask how you can help. Ask how you can support them in resolving the problem. It doesn’t have to be you doing the solving—your colleagues aren’t helpless damsels in distress—but your support can be invaluable.

One of the most difficult things to listen to is criticism of yourself and your actions. You still need to listen and believe and learn.

But just because you haven’t been told there’s a problem, that doesn’t mean there isn’t one. Speaking about experiences of discrimination is often very difficult, because it tends to be very, very risky. Marginalized people who report discrimination often find that doing so negatively impacts their careers. When they raise issues, they get labeled complainers or trouble-makers, while those they complain about see no consequences or repercussions for their actions.

Remember that you have no reason to expect that they will share their stories or concerns with you. These are not conversations that men, even well-meaning allies, should initiate. Don’t ask for these conversations, but when they happen, listen and believe.

4. Notice the small stuff

Your colleagues aren’t going to tell you about every bad experience; in fact, they won’t tell you about most of them. You can help by noticing problems by yourself and addressing them.

Microaggressions—the small stuff—are some of the most subtly toxic behaviors that women and non-binary people have to deal with. Microaggressions slowly eat away at their self-confidence and patience.

When you see some small inequity, mention it. If a colleague interrupts a woman, say, “I’d like to hear what __ was saying”. If a colleague assumes a woman will take notes, say, “I think __ could have some useful insights on this topic—could somebody else take notes so that she can participate more actively?” or possibly, “Have we considered a formal note-taking rotation to ensure that we’re not making gendered assumptions about who will do the clerical work?”

Try to also consider whether your comment will put the colleague who suffered the inequity in an uncomfortable position. If you’re not sure what to do, you should wait, talk to them privately, then defer to their decision on what further action should be taken. They may wish for you to speak with the person directly on their behalf or they may prefer for you to go to a manager. They may not want you to do anything at all (perhaps because they have plans to address this on their own). Sometimes the very recognition of the microaggression is enough! Remember: they don’t need you to save them, but your support and validation can be very valuable.

If you raise things like this with a colleague, it may feel “nitpicky”. It will certainly feel uncomfortable. But many productive and important conversations are uncomfortable! As an ally, you should be prepared to shoulder a bit of the discomfort and awkwardness that women and non-binary people experience every day.

You can also work on “anti-microaggressions”—small acts to nudge the culture in the opposite direction. Examples might include making sure a diverse range of people are featured in your illustrations, slide decks, user stories, etc., or that you pay attention to gendered language in your tools.

5. Teach others

Another way you can share the load is by teaching. Women and non-binary individuals are constantly expected to teach others about feminism and gender issues. It can be a great burden. Help them out by doing some of the teaching.

Have honest conversations with the people you work with, particularly if you observe behaviors that you know (or suspect) may have a discriminatory effect on your other colleagues. Remember that, most of the time, these behaviors are unconscious, or learned in different work environments. Talking about negative behaviors without blame and educating the men you work with helps them become better colleagues, and in the vast majority of cases it’ll be well-received.

In addition to educating other men, encourage them to speak up if they see instances of bias. The more men there are working on this, the easier it will be to make your workplace a more egalitarian environment.

6. Amplify and endorse

There is no point in having equality of numbers if there is no equality of influence. As such, we have to make sure that people from underrepresented groups are heard in meetings, that they have a chance to speak, and that their views are considered and respected. The frustration of not being able to contribute, or being ignored or belittled, is a fast track to quitting.

One type of unconscious bias is called “listener bias”. We are socialized to think that women talk more in general, and so tend to significantly overestimate the actual amount of time women spend talking in discussions, to the extent that we can think that women are dominating conversation when in fact men are doing most of the talking. As always, be aware of this unconscious bias. Correct for it by inviting your female and non-binary colleagues to offer their opinion in a meeting.

Make sure that women and non-binary individuals in your company have the opportunity to work on high-profile projects. If you make staffing decisions, pay attention to gender bias when considering who gets what role. If you’re not making those decisions, you can still advocate and lobby for them within your organization. Support and encourage them, but don’t micromanage them, or do all the work for them. Trust their expertise. You hired them, so they must be talented. If you don’t make use of their talents, not only do you lose out in the short term, but they’ll also eventually quit and you’ll lose out massively in the long term.

Also make sure they get credit for their accomplishments and contributions. Make sure they get to brag about what they’ve achieved. Approve of this behavior, rather than branding them as arrogant or conceited. Remember that society tends to consider modesty a virtue for women, but not men.

Amplify their voices outside of your workplace, too. If you’re invited to speak in public, ask yourself if there’s a woman or non-binary individual equally—or more!—qualified to speak on the topic. Pay attention to gender balance in panels and speaker line-ups at conferences you’re planning to participate in. Ask the organizer why their panel lacks diversity. Ask to see their Code of Conduct, and if they don’t have one, encourage them to change that. Consider not attending events without a code of conduct or refusing to sit on a panel that only includes men.

Social media is another excellent way to increase the visibility of underrepresented genders. If you’ve followed the advice earlier, you’re following women and non-binary folks on social media and have diversified your network, but consider also retweeting and promoting them. If they share a blog post, consider retweeting them instead of writing your own tweet with a link to the same content. Amplify their voices. Even small acts like retweeting can greatly increase their visibility and introduce your followers to more diverse opinions and ideas.

7. Recruit fairly

You know what else helps with gender diversity? Having more diverse people on staff! This might feel like it’s easier said than done, but there are concrete steps you can take to increase the gender diversity of your team.

The first step, which we’ve already addressed, is expanding your network. We tend to do a lot of recruitment from our personal networks, so having a diverse network can make a tremendous impact on the variety of candidates we can recruit.

Take the time and effort to review your job postings for gendered language: could your words make someone feel excluded or unqualified? Look at where your jobs are advertised: are you going to reach a diverse audience?

After you’ve established a diverse pool of applicants, you need to make sure the rest of the process is as fair and unbiased as possible.

When reviewing résumés, be explicitly aware of your unconscious biases to make sure you don’t filter candidates out for the wrong reasons. This doesn’t mean you’re purposely rejecting someone just because you think they’re female. Rather, you might reject someone because they haven’t described their accomplishments the way you might expect. Remember: women are conditioned to be modest and may under-report all the good stuff they’ve done.

There may be other reasons why they don’t conform to your preconceptions of the “ideal candidate”. For example, maybe you’d expect someone with their experience to have a long history of giving conference talks, but they haven’t been speaking at conferences because they perceive conferences as hostile environments.

When it comes to interview time, be mindful of the fact that there are a myriad of ways to be a successful employee and different candidates will excel in different environments. Using a diverse set of interview styles is beneficial for all candidates. Not everyone does well in aggressive “knowledge test”-style interviews. Some are better on a whiteboard, some are better at a keyboard, others respond well to discussion.

This is not to say that we should lower the bar for recruitment; rather, we should accept that we may be using the wrong measuring stick. Expecting everyone to act and respond in a particular way is the very opposite of recruiting for diverse viewpoints and experiences.

On the subject of recruiting women, it’s worth addressing “the pipeline problem”. This is the idea that we can’t hire more women because women aren’t studying computer science. This is somewhat correct, but entirely misleading. Women are not achieving computer science degrees at the same rate as men, it’s true, but the number of women active in the industry is much lower than the total number of women with relevant degrees (and that’s not counting the women who are capable self-taught programmers). Today, women earn 18% of CS degrees. In 1984, they earned 37% of CS degrees. These women are only in their 50s and still active in the industry. What happened to them? Clearly, the pipeline is not the only problem.

What good does it do us if we hire a load of great women and non-binary people, then they all quit because they arrive in a toxic work environment? What if the pipeline leads to a sewage plant?

8. Model and support sustainable work

In tech, particularly, women quit the industry completely with much higher frequency than men. They often leave not just because of sexist behaviors directly, but for a variety of complex reasons.

The expectations of the workplace can place an unreasonable load on all employees. Men are generally expected to meet those demands at the expense of family and personal life, while women are expected to do the opposite. The assumption that women will not have the time to meet these unreasonable demands is one way that society justifies the wage gap. Then, if a couple decides that one of them should stay home to care for the family, who do you think typically quits their job? The woman! Because we pay her less! But we pay her less because we expected her to leave!

In order to keep women in the industry, we need to pay them equally. More than that: we need to create a culture that supports sustainable work in a way that doesn’t pit employees’ personal and professional lives against each other. In doing so, companies invest in employees’ overall health, happiness, and engagement in their work. Your company may have unlimited vacation time or flexible working arrangements, but do your employees feel comfortable actually using those benefits?

Allies can help by actively participating in and supporting a sustainable work culture. They can normalize behaviors such as taking vacation, taking time for family, not working all hours, etc. Etsy’s CEO Chad Dickerson, for example, took full advantage of Etsy’s parental leave benefits (5 weeks at the time, now 26 weeks for all new parents) to help care for his family. More leaders should demonstrate that you can lead robust personal and professional lives that can enhance and support each other.

People of all genders should certainly still be able to opt out of the workplace to concentrate on their families, but that should be a choice, rather than an ultimatum.

9. Don’t lead, follow

Allies are there to share the load, not to take the lead. Allies simply haven’t lived the same experiences as those with whom they are allied. No amount of listening and learning will give you first-hand understanding of a person’s experiences!

Men are typically used to leading and taking charge, but women and non-binary individuals are perfectly capable of fighting their battles and defending themselves: they don’t need a man to step in to save them. What they need from men is support and understanding to make it easier, and for men to do their part so that eventually those battles don’t have to be fought in the first place.

10. Show up

Show up. Every day. Allyship isn’t something you can do in your spare time or only when it’s convenient for you. It’s effort, it’s work—often hard work. Show up, every day, and don’t let it slip.

Showing up includes a healthy dose of self-reflection and self-awareness. Think carefully about your own actions and behaviors—remember that unconscious bias is deeply entrenched and will rear up when you least expect it.

And don’t stop at supporting women and non-binary people at work. Learn about the issues faced by other underrepresented groups and how to apply your allyship skills to supporting them too.

Don’t expect a cookie, though. Actively working to correct injustices should be the baseline, not something special you deserve to be rewarded for. Do the work because the work matters, not because it looks good on your résumé, and give credit to those who helped you get there.

Being an ally is hard. It takes time and work and effort. Fundamentally, men could avoid this time and work and effort. Society doesn’t expect men to be allies. Men have the privilege of being able to ignore these problems if they want to. We hope this post has helped to persuade you that being an ally is important, but also achievable. You can make a difference—a huge difference—if you step up.


The material for this post was inspired (and immeasurably improved) by many women and non-binary people—at Etsy and beyond—who shared their knowledge and experience with us. We’re grateful for their time and effort.

We’d also like to acknowledge the contributions and feedback from men at Etsy who have reflected on their successes—and failures—as allies and shared what they’ve learned.

We also owe a debt to some of the resources made available by NCWIT and The Ada Initiative, as well as the countless people who have written books, blog posts, and talks that have helped us gain a better understanding of this complex topic.


This post references a number of external studies and articles on the research behind issues of diversity in tech and society in general, which are listed below. For more information on the business of allyship, check out our list of recommended reading for allies.


API First Transformation at Etsy – Operations

Posted by on September 26, 2016 / 6 Comments

This is the second post in a series of three about Etsy’s API, the abstract interface to our logic and data. The previous post is about concurrency in Etsy’s API infrastructure. This post covers the operational side of the API infrastructure.

Operations: Architecture Implications

How do the decisions for developing Etsy’s API that we discussed in the first post relate to Etsy’s general architecture? We’re all about Do It Yourself at Etsy. A cloud is just other people’s computers, and not in the spirit of DIY; that’s why we would rather run our own datacenter with our own hardware.

Also, Etsy kicks it old school and runs on a LAMP stack. Linux, Apache, MySQL, PHP. We’ve already talked about PHP being a strictly sequential, single-threaded, shared-nothing environment, leading to our choice of parallel cURL. In the PHP world, everything runs through a front controller, for example index.php. In that file, we have to include other PHP files if we need them, and to make that easier, we usually use an autoloader to help with dependencies.

Every web request gets a new PHP environment in its own instance of the PHP interpreter. The process of setting up that environment is called bootstrap. This bootstrapping process is a fixed cost in terms of CPU work, regardless of the actual work required by the request. By enabling multiple, concurrent HTTP sub-requests to fetch data for a single client request, this fixed cost was multiplied. Additionally, this concurrency encouraged more work to be done within the same wall clock time. Developers built more diverse features and experiences, but at the cost of using more back-end resources. We had a problem.

Problem: PHP time to request + racking servers D:

As more teams adopted the new approach to build features in our apps and on the web, more and more backend resources were being consumed, primarily in terms of CPU usage from PHP. In response, we added more compute capacity, over time growing the API to four times the number of servers prior to API v3. Continuing down this path we would have exhausted space and power in our datacenters. This was not a long term solution.

To solve this, we tried several strategies at once. First, we skipped some work by allowing developers to mark some subrequests as optional. This approach was abandoned because people used it as a graceful error recovery mechanism, triggering an alternate subrequest, rather than for optional data fetches. This didn’t help us reduce the overall work required for a given client request.

Also, we spent some time optimizing the bootstrap process. The bootstrap tax is paid by all requests and subrequests, making it a good place to focus our efforts. This initially showed benefit with some low hanging fruit, but it was a moving target in a changing codebase, requiring constant work to maintain a low bootstrap tax. We needed other ways of doing less work.

A big step forward was the introduction of HTTP response caching. We had to add caching quickly, and first tried the same cache we use for image serving, Apache Traffic Server. While it is great for caching large image files, it didn’t work as well for smaller, latency-sensitive API responses. We settled on Varnish, which is fast and easy to configure for our needs. Not all endpoints are cached, but for cached ones, Varnish will serve the same response many times. We accept staleness for a small 10 to 15 minute period, drastically reducing the amount of work required for these requests. For the cacheable case, Varnish handles thousands of requests per second with an 80% hit rate. Because the API framework requires input parameters to be explicit in the HTTP request, this meshed well with introducing the caching tier. The framework also transparently handles locale, passing the user’s language, currency and region with every subrequest, which Varnish uses to manage variants.
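To illustrate why explicit input parameters and locale variants mesh well with a caching tier, here is a minimal Python sketch of a time-limited cache keyed on route, parameters, and locale. It is not Varnish configuration and not Etsy’s code: `cache_key`, `TtlCache`, and the injected clock are hypothetical names chosen for this example.

```python
import time
from typing import Callable, Dict, Tuple


def cache_key(route: str, params: Dict[str, str], locale: Tuple[str, str, str]) -> str:
    """Build a deterministic key from the route, explicit inputs, and locale variant."""
    query = "&".join(f"{name}={params[name]}" for name in sorted(params))
    language, currency, region = locale
    return f"{route}?{query}|{language}|{currency}|{region}"


class TtlCache:
    """In-memory stand-in for an HTTP cache that tolerates bounded staleness."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.entries: Dict[str, Tuple[float, str]] = {}
        self.hits = 0
        self.misses = 0

    def fetch(self, key: str, compute: Callable[[], str]) -> str:
        """Return a fresh cached value, or call the backend and store its response."""
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = compute()
        self.entries[key] = (now, value)
        return value
```

Because every input is part of the key, two requests that differ only in parameter order share an entry, while a request with a different language, currency, or region gets its own variant.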

The biggest step forward came from a courageous experiment. Dan from the core team looked at bigger organizations that faced the same problem, and tried out Facebook’s HHVM on our API cluster. And got a rocketship. We could do the same work, but faster, solving this issue for us entirely. The performance gain from HHVM was a catalyst for performance improvements that made it into PHP7. We are now completely switched over to PHP7 everywhere, but it’s unclear what we would have done without HHVM back in the day.

In conclusion, concurrency proved to be great for logical aggregation of components, and not so great for performance optimization. Better database access would be better for that.

Problem: Balancing the load

If we have a tree-like request with sub-requests, a simple solution would be to route this initial request via a load balancer into a pool, and then run all subrequests on the same machine. This leads to a lumpy distribution of work. The next step from here is one uniform pool, with the subrequests routed back into that pool again, to get a good balance across the cluster. Over time (and because we experimented with HHVM), we created three pools that correspond to the three logical tiers of endpoints. In this way, we can monitor and scale each class of endpoints separately, even though each node in all three clusters works the same way.

Where would this not work?

If we sit back and think about this for a bit – how is this architecture specific to Etsy’s ecosystem? Where wouldn’t it work? What are the known problems?
The most obvious gaping hole is that we have no API versioning. How do we even get away with that? We solve this by keeping our public API small and our internal API very, very fluid. Since we control both ends of the internal API via client generation and meta-endpoints, the intermediate language of domain objects is free to evolve. It’s tied into our continuous deployment system, moving along with up to 60 deploys per day, and the client is constantly in flux for the internal API.

And as long as it’s internal at Etsy, even the outside layer of bespoke AJAX endpoints is very malleable and matures over time.
Of course this is different for the Apps and the third party API, but those branch off after maturing on the internal API service over several years. Software development companies who focus on an extensive public API or even have that as the main service could not work in this way. They would need an internal place to let the API endpoints mature, which we do on the internal API service that is powering Etsy.

We know there are very technical solutions to version changes being used in our industry, such as ESPN having a JSON schema, and publishing just a schema change, like a diff, which can be smaller than 100k. That’s really exciting, but we’re just not at the point where this is our most important priority, since we don’t have too many API consumers at Etsy yet. We ourselves are our biggest consumer, and generated clients shield us from the versioning problem for now, while giving us the benefit of a monorepo-like ecosystem, in which we can refactor without boundaries between PHP and JavaScript.

Operations: Tooling

Let’s talk about tooling that we built to learn more about the behavior of our code in practice. Most of the tools that we developed for API v3 are around monitoring the new distributed system.

CrossStitch: Distributed tracing

As we know, with API v2, we had the problem that an almost arbitrary amount of single-threaded work could be generated based on the query parameters, and this was really hard to monitor. Moving from the single-threaded execution model to a concurrent model triggering multiple API requests was even more challenging to monitor. You can still profile individual requests with the usual logging and metrics, but it’s hard to get the entire picture. Child requests need to be tied back to the original request that triggered them, but they themselves might be executed elsewhere on the cluster.


To visualize this, we built a tool for distributed tracing of requests, called CrossStitch. It’s a waterfall diagram of the time spent on different tasks when loading a page, such as HTTP requests, cache queries, database queries, and so on. In darker purple, you can see different HTTP requests being kicked off for a shop’s homepage. For example, the request for the shop’s about page runs in parallel with requests for other API components.
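The core idea of tying child requests back to the request that triggered them can be sketched in a few lines of Python. This is an illustration under assumptions, not CrossStitch itself: the `Span` fields, `group_by_root`, and the text-based `waterfall` are hypothetical, and a real tracer would collect spans from many machines.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    root_id: str              # ID of the client request that started everything
    span_id: str              # ID of this unit of work
    parent_id: Optional[str]  # None for the root span
    name: str
    start_ms: int
    end_ms: int


def group_by_root(spans: List[Span]) -> Dict[str, List[Span]]:
    """Collect spans from anywhere on the cluster into one trace per root request."""
    traces: Dict[str, List[Span]] = {}
    for span in spans:
        traces.setdefault(span.root_id, []).append(span)
    for trace in traces.values():
        trace.sort(key=lambda s: (s.start_ms, s.span_id))
    return traces


def waterfall(trace: List[Span], ms_per_char: int = 10) -> List[str]:
    """Render one trace as text rows, offset by start time and sized by duration."""
    origin = min(s.start_ms for s in trace)
    rows = []
    for s in trace:
        offset = (s.start_ms - origin) // ms_per_char
        width = max(1, (s.end_ms - s.start_ms) // ms_per_char)
        rows.append(" " * offset + "#" * width + " " + s.name)
    return rows
```

Rows that start at the same offset correspond to subrequests running in parallel, which is the pattern the waterfall diagram makes visible.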

Fanout Manager: Feedback on fanout limit exhaustion for developers

Bespoke API calls can create HTTP request fanout to concurrent components, which in turn can create fanout to atomic component endpoints. This can create a strain on the API and database servers that is not easy for an endpoint developer to be aware of when building something in the development environment or rolling out a change to a small percentage of users.

The fanout manager aims to put a ceiling on the overall resource requests that are in flight, and we are doing this in the scheduler part of the curl callback orchestrator by keeping track of sub-requests in memcached. When a new request hits the API server, a key based on the unique user ID of that root request is put into memcached. This key works as a counter of parallel in-flight requests for that specific root request. The key is passed on to the concurrent and component endpoint subrequests. When the scheduler runs a subrequest, it increments the counter for that key. When the request gets a response and its slot is freed in the scheduler, the counter for the key is decremented. So we always know how many total subrequests are in flight for one root request at the same time.

In a distributed system like this, multiple requests can be competing for the same slot. We have a problem that requires a lock.
To avoid the lock overhead, we circumvent the distributed locking problem by relying on memcached’s atomic increment and decrement operation. We optimistically first increment the memcached key counter, and then check whether the operation was valid and we actually got the slot. Sometimes we have to decrement again because this optimistic assumption is wrong, but in that case we are waiting for other requests to finish anyway and the extra operation makes no difference.
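The optimistic increment-then-check pattern can be sketched as follows. This is a minimal Python illustration, not Etsy’s scheduler: `FakeMemcached` is an in-process stand-in that only mimics atomic increment and decrement, and `FanoutLimiter`, `try_acquire`, and `release` are hypothetical names.

```python
import threading
from typing import Dict


class FakeMemcached:
    """In-process stand-in offering atomic incr/decr, as a memcached server would."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counters: Dict[str, int] = {}

    def incr(self, key: str) -> int:
        with self._lock:
            self._counters[key] = self._counters.get(key, 0) + 1
            return self._counters[key]

    def decr(self, key: str) -> int:
        with self._lock:
            self._counters[key] = max(0, self._counters.get(key, 0) - 1)
            return self._counters[key]


class FanoutLimiter:
    """Caps in-flight subrequests per root request without a distributed lock."""

    def __init__(self, store: FakeMemcached, limit: int) -> None:
        self.store = store
        self.limit = limit

    def try_acquire(self, root_key: str) -> bool:
        # Optimistically claim a slot first, then check whether the claim was valid.
        in_flight = self.store.incr(root_key)
        if in_flight > self.limit:
            # The assumption was wrong: give the slot back; the caller waits and retries.
            self.store.decr(root_key)
            return False
        return True

    def release(self, root_key: str) -> None:
        # Called when a subrequest has received its response.
        self.store.decr(root_key)
```

The extra decrement on a failed claim is the price of skipping the lock. It only happens when the limit is already reached, so the caller would have had to wait anyway.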

If an endpoint has too many sub-requests in flight, it just waits before being able to make the next request. This provides good feedback for our developers about the complexity of the work before the endpoint goes into production. Also, the fanout limit can be hand-tweaked for specific cases in production, where we absolutely need to fetch a lot of data, and a higher number of parallel requests speeds up that fetching process.

Automated documentation of endpoints: datafindr

We also have a tool for automated documentation of new endpoints, called datafindr. It shows endpoints and typed resources, and example calls to them, based on a nightly snapshot of the API landscape.

Wanted: Endpoint decommission tool

Writing new endpoints is easy in our framework, but decommissioning existing endpoints is hard. How can we find out whether an existing endpoint is still being used?

Right now we don’t have such a tool. To decommission an existing endpoint, we have to explicitly log whether that specific endpoint is called, and wait for an uncertain period of time, until we feel confident enough to decide that no one is using it any more. However, in theory it should be possible to develop a tool that monitors which endpoints become inactive, and tells us how long we have to wait to be statistically confident that an endpoint is out of use and safe to remove.
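One way to make that waiting period concrete is sketched below in Python. It rests on an assumption the post does not make: that calls to an endpoint still in use arrive as a Poisson process at its historical rate. Under that model, the probability of seeing zero calls in t days is exp(-rate * t), so we can wait until that probability drops below a chosen threshold. The function name and the default threshold are hypothetical.

```python
import math


def days_of_silence_needed(historical_calls_per_day: float, alpha: float = 0.01) -> float:
    """Days without any call after which 'still in use at the old rate' is implausible.

    Assumes Poisson arrivals: P(no calls in t days) = exp(-rate * t) <= alpha,
    which rearranges to t >= ln(1 / alpha) / rate.
    """
    if historical_calls_per_day <= 0:
        raise ValueError("historical rate must be positive")
    if not 0 < alpha < 1:
        raise ValueError("alpha must be between 0 and 1")
    return math.log(1 / alpha) / historical_calls_per_day
```

The model shows why rarely used endpoints are the hard case: the lower the historical rate, the longer the silence has to last before it means anything.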

This is the second post in a series of three about Etsy’s API, the abstract interface to our logic and data. The next post will be published in two weeks, and will cover the adoption process of the new API platform among Etsy’s developers. How do you make an organization switch to a new technology and how did this work in the case of Etsy’s API transformation?


Introducing 411: A new open source framework for handling alerting

Posted by and on September 15, 2016 / 2 Comments

Back in 2014, Etsy started using the ELK (Elasticsearch, Logstash & Kibana) stack. We’ve previously written about how we use saved searches as a reactive security mechanism. When we made the transition to ELK, we noticed there was no way to automatically schedule searches and be notified on the results. Today, we’re introducing our open source solution to this problem: 411.


411 is a query scheduler: it executes saved Elasticsearch queries against your cluster, formats the results, and sends them to you as an alert. Our motivation behind creating 411 was to enable us to easily create customizable alerts to enhance our ability to react to important security events.

As a part of that customizability, 411 gives you lots of options for applying filters to the results that are returned from a search. This includes things like removing duplicate alerts, throttling the number of alerts, as well as the ability to forward the alerts on to other systems like JIRA or a webhook. 411 also provides a robust way to handle responding to multiple alerts as well as an audit log to help keep track of changes to searches and alerts.
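To give a feel for how result filters such as deduplication and throttling can be chained, here is a small Python sketch. It is not 411’s code or API: `dedupe`, `throttle`, and `run_filters` are hypothetical names, and alerts are modeled as plain dictionaries for simplicity.

```python
from typing import Callable, Dict, List

Alert = Dict[str, str]
Filter = Callable[[List[Alert]], List[Alert]]


def dedupe(key: str) -> Filter:
    """Keep only the first alert for each distinct value of the given field."""
    def apply(alerts: List[Alert]) -> List[Alert]:
        seen = set()
        kept = []
        for alert in alerts:
            value = alert.get(key)
            if value not in seen:
                seen.add(value)
                kept.append(alert)
        return kept
    return apply


def throttle(max_alerts: int) -> Filter:
    """Pass through at most max_alerts alerts per run."""
    def apply(alerts: List[Alert]) -> List[Alert]:
        return alerts[:max_alerts]
    return apply


def run_filters(alerts: List[Alert], filters: List[Filter]) -> List[Alert]:
    """Apply each filter in order to the output of the previous one."""
    for apply in filters:
        alerts = apply(alerts)
    return alerts
```

Order matters in a chain like this: deduplicating before throttling means the throttle budget is spent on distinct alerts rather than on repeats.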


In addition to the default search functionality that utilizes ELK, we’re also including some additional search types out of the box. HTTP is a lightweight Nagios alternative that you can use to alert on service outages, while Graphite allows you to query a Graphite server and alert when thresholds are exceeded (more information on Etsy’s Graphite setup can be found here).

We aimed to make the code as modular as possible to make it easy to extend. We hope the examples we’ve provided are useful and informative for anyone developing their own extended functionality for 411. If you do create new extensions for 411, we would welcome the chance to review your pull request and integrate the new functionality into 411!

Etsy’s commitment to Open Source means we use the same version of 411 as what’s available on GitHub, so you can expect regular project updates. We’ve heard a lot of different ideas about functionality and search types people would like to see, based on our talk at DEF CON, so we’d appreciate your feedback on how you plan to use 411 and which features you’d like to see!


This post was written as a collaboration between Ken Lee and Kai Zhong. You can follow Ken on Twitter at @kennysan and you can follow Kai on Twitter at @sixhundredns.


API First Transformation at Etsy – Concurrency

Posted by on September 6, 2016 / 7 Comments

At Etsy we have been doing some pioneering work with our Web APIs. We switched to API-first design, have experimented with concurrency handling in our composition layer, introduced strong typing into our API design, experimented with code generation, and built distributed tracing tools for API as part of this project.

We faced a common challenge: much of our logic was implemented twice. All of the code that was built for the website then had to be rebuilt in our API to be used by our iOS and Android apps.


Problem: repeated logic between platforms

We wanted an approach where we built everything on reusable API components that could be shared between the web and apps. Unfortunately our existing API framework couldn’t support this shared approach. The solution we settled on was to abandon the existing framework and rebuild it from scratch.

Follow along this case study of building an API First architecture, in which functional changes are expressed on the API level before integrating them into the website. Hear what problems prompted this drastic change. Learn which new tools we had to build to be able to work with the new system and what mistakes we made along the way. Finally, how did it end? How did the team adopt the new system and have we succeeded in our goals of API First?

This post will be the first post in a series about our current API infrastructure, which we call version 3. The series is based on a talk at QCon New York. The first post will cover concurrency, the second post will cover operations and the third post the human aspects of our API transition.


First problem: More devices & platforms (also: JavaScript)

If we look into the future, it comes with lots of devices. Mainframes became desktop computers, which became portable laptops and tablets, smart phones and watches.

This trend has been going on for a while, and in order to not reinvent the world on each different device, we started sharing data via an internal API years ago.

The first version of Etsy’s API was a gateway for Flash widgets. The second one was a JSON RESTful API for 3rd parties and internal use. It was tightly coupled to the underlying database schema, and it empowered clients to make customized complex requests. It was so powerful that when we introduced our first iPad App, we did not need to write any new endpoints, and could build it solely on existing ones. Clients could request multiple resources at once, for example request shop data and also include listing data from that shop, and they could specify fields to trim down the response to just the required data. Very powerful.

Second Problem: Performance & complexity control

With great power comes great responsibility, and this approach had some drawbacks. The server code was simple, but we did not know the incoming parameters. We gave the clients control over the complexity of the request via the request parameters. This obviously had implications on server-side performance. And measuring the performance was difficult, because it was not clear if an increased response time was due to the performance of our backend, or because the client requested more resources.

Third Problem: Repetition & inconsistency

Years of changing patterns and an evolving complex codebase with MVC architecture led to bad habits: data fetch during template rendering, and logic in the templates. Our API was for AJAX, whereas the backend code was in PHP. We did not have the logic in one place that was reusable for both the Web and API. This led to inconsistencies between API and pre-API web.

The schema of the API resource was a snapshot of the data model at the time of exposing it via the endpoint. This one-to-one mapping caused problems with data migrations, as the API resource was “frozen in time”. Should it change with the model? How long should the old resource structure be supported?

Requirements for API-first

We re-discussed the requirements for our API. If performance, manifesting for the user as latency from request to response, was a problem, what was the bottleneck?

First, there is the time to glass: the time until we see something on our device’s screen. Ilya Grigorik uses the term in his talk “breaking the 1000 milliseconds time to glass”, where he states that, due to mobile network speed, we have only 100 milliseconds on the server side if we want to stay in budget. The second problem is that we, at Etsy, come from a sequential-shared-nothing-php-world. No built-in concurrency. How can we parallelize and reuse our work, while still keeping the network footprint low?

V2 vs V3

API v2: repeated logic between platforms          API v3: reusable components

Another requirement was to rethink how we approach caching. The previous version of the API was memcached only, caching calls including parameters, which led to a granularity problem. And one last requirement was to solve the problem starting from what we know and what we’re good at – building our own solutions in PHP.

Shaping our mental model

Based on these learnings, we piece-by-piece architected a new version, called API Version 3. REST resources worked well for both mobile apps and traditional web, so that was a keeper. A new idea was to decouple the endpoints from the framework that hosts them. Minimize the endpoints’ responsibilities to:

.. and that’s about it.

We have one very simple, declarative file for each endpoint.

Everything else is architected away on purpose: StatsD error monitoring, endpoint input and output type checks, and the compilation of the full routes. All of this is handled by the framework. Authentication and access control are also handled there, based on the class of endpoint that the developer has chosen.

Enter the meta-endpoint

We picked up the industry idea from Netflix and eBay of server-side composition of resources into device-view-specific resources. In other words: allowing a second layer of endpoints that are consumers of our own API, requesting and aggregating other endpoints. This means the server itself is also a client of the API, making the server more complex, while giving it more control with an extra layer for code execution. This improves performance of the client, because it only needs to make one single request – the biggest bottleneck if we want to have a responsive mobile interface!

These requests used our generated PHP client, and they used cURL. cURL? Let’s talk about this for a bit. And let’s take a step back. The interesting question is how to bring concurrency into the single-threaded world of PHP.

cURL is cool

We’re in an HTTP context, so what about making additional HTTP requests for concurrency? We examined whether this could be done with cURL.

Some time in 2013, Paul tweeted

“curl_multi_info_read() is my new event loop.”

In a hack week project, Paul and Matt from Etsy’s core team figured out that we could in fact achieve concurrency in the HTTP layer, through parallel cURL calls with curl_multi_info_read. The HTTP layer is an interesting layer for this, since there are many existing solutions for routing, load balancing and caching.


In addition to cURL, we added logic to establish dependencies on requests to other endpoints, which we call proxies. We are running the requests when the corresponding proxy becomes unblocked, similar to an event loop, which you might know from NodeJS. The whole concurrency dependency analysis and scheduling is encapsulated within one piece of software, which we call the curl callback orchestrator.

This is great, because from the endpoint author’s point of view the code looks sequential and single-threaded and is just a list of proxy calls to other endpoints. We’re getting closer to a declarative style, expressing our intent, and the orchestrator figures out how to schedule the calls that are necessary for the complete result.
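The scheduling idea, running each call as soon as the calls it depends on have finished, can be sketched with Python’s asyncio. This is an analogy under assumptions, not the curl callback orchestrator: Etsy’s version is written in PHP on top of parallel cURL, and `run_calls`, `calls`, and `depends_on` are hypothetical names. The sketch also assumes the dependency graph has no cycles.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

# A call receives the results of its dependencies, keyed by dependency name.
Fetch = Callable[[Dict[str, object]], Awaitable[object]]


async def run_calls(calls: Dict[str, Fetch],
                    depends_on: Dict[str, List[str]]) -> Dict[str, object]:
    """Run every call as soon as the calls it depends on have finished."""
    tasks: Dict[str, asyncio.Future] = {}

    async def run(name: str) -> object:
        deps = depends_on.get(name, [])
        # Wait only for upstream results; independent calls proceed concurrently.
        values = await asyncio.gather(*(tasks[dep] for dep in deps))
        return await calls[name](dict(zip(deps, values)))

    # All tasks are registered before any of them starts executing.
    for name in calls:
        tasks[name] = asyncio.ensure_future(run(name))
    results = await asyncio.gather(*tasks.values())
    return dict(zip(tasks.keys(), results))


async def fetch_shop(_: Dict[str, object]) -> object:
    await asyncio.sleep(0.01)  # stands in for an HTTP subrequest
    return {"shop_id": 7}


async def fetch_listings(deps: Dict[str, object]) -> object:
    shop = deps["shop"]
    return [f"listing-for-shop-{shop['shop_id']}"]


async def fetch_favorites(_: Dict[str, object]) -> object:
    return ["fav-1"]
```

The caller only declares what depends on what, for example `asyncio.run(run_calls({"shop": fetch_shop, "listings": fetch_listings, "favorites": fetch_favorites}, {"listings": ["shop"]}))`, and the scheduler decides the order. That is the declarative feel the endpoint author gets.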

You Wouldn’t Reimplement an API

Ok, so we had some good observations about the previous versions of our API, and we have a working prototype for concurrency via cURL.

How did we grow an entire new API framework from here?

Perspectives and Services

Two concepts are special about Etsy’s API v3: perspectives and services.

Perspectives clarify data access rules and give us security hints on what code is permitted for each perspective. They express on whose behalf an API call is being made. So, for example, the Public perspective shows data that a logged-out user would be able to see on

The Member perspective is for calls made on behalf of a particular Etsy member. The user ID is determined via the user cookie or OAuth token, dependent on the Service, which we will talk about below. The Shop perspective is similar to the member perspective but is for a shop. The framework will verify that the given shop is owned by the authenticated user. The Admin perspective is like the member perspective but for Etsy Admin. We occasionally want to take actions from our own servers that may not fit the other perspectives. For this we have the Infrastructure perspective. It is only available on the private internal API and can be used for things such as dataset loading. The application perspective is for calls made on behalf of a particular API application. It contains the application data for the verified API key.


While perspectives express on whose behalf a call is being made, the service indicates from where the call is being made. A service can also be thought of as the entry point into the API framework. Each service has its own requirements regarding authentication. Endpoints are included in some services by default. Other services are opt-in, and each endpoint has to declare whether it wants to be exposed on those opt-in services.

The Ajax service is accessible from pages that run JavaScript on The Admin service is accessible from pages that run JavaScript on our internal admin tools platform. The internal service is used by other API services that are already inside of our API cluster network. The Apps service is accessible from our native apps in iOS and Android. The 3rd party service is for 3rd party app developers. The services separate different application domains.

An example API call

Let’s look at an example request to the homepage. We know what the homepage looks like: sections of information that might be interesting for me, as a potential buyer. Up at the top are the listings that I favorited, then some picks that Etsy’s recommendation algorithms picked for me, new items from my favorite shops, activity from my friends, and so on. I think about it as something like this.

If we look at the data in more detail, we see even more structure. It’s like a tree, growing from left to right.

Our setup of network and servers mirrors the structure of the API call. It starts with an HTTP request from my browser to Etsy’s web server. From there, a bespoke API request is made to our API server, requesting a personalized version of the homepage data. Internally, this request consists of multiple concurrent components, which are themselves fetched via API requests. My favorites, for example, are a concurrent component, because they consist of a large number of listing cards that can be fetched in parallel.

So we can imagine an API request as a multi-level tree, kicking off other API requests and constructing an overall result from the results of those subrequests.

Domain specific language of API endpoints

The project that got me started diving deep into Etsy’s API v3 framework was striving to unify the syntax of API endpoints. This was really fun and involved big, automated changes to unify the API codebase. In the past, there were multiple styles in which endpoints could be written. To unify them, we carved out a language of endpoint building blocks.


Some building blocks are mandatory for each endpoint. Each endpoint needs to declare its route, so we know where it should be found on the web. Also, it needs a human readable description, and a resultType.

The result type describes what type of data the endpoint returns. All data we return is JSON encoded, but here we can say that we return a primitive data type, such as a string or a boolean inside that encoding. Or we could return what we call “a typed resource” – a compound type that refers to a specific component of the Etsy application domain, such as a ListingCard.

And then there is the handle function. In there, every endpoint runs the code that it needs to run, to build its response.

Optional building blocks of an API endpoint are also possible. declareInput is only necessary if the endpoint does actually need input parameters. If it doesn’t, the function can be left out.

The includedServices function allows an endpoint to opt into specific services. The EtsyApps service is opt-in for example, so if you want to make your endpoint available on the apps, you have to opt into the EtsyApps service via this function.

And then there is the cacheTtlSeconds function, which allows you to specify whether an endpoint should be cached, and what its time to live should be.
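As a rough analogue of these building blocks, here is a Python sketch of a declarative endpoint with mandatory and optional parts. The real framework is written in PHP and its exact syntax is not shown in this post, so everything here is an assumption for illustration: the `Endpoint` base class, the snake_case method names, the `ShopAbout` example, and the `missing_blocks` check.

```python
from typing import Dict, List, Optional


class Endpoint:
    """Base class: mandatory blocks must be overridden, optional ones have defaults."""
    route: str = ""
    description: str = ""
    result_type: str = ""

    def declare_input(self) -> Dict[str, str]:
        return {}      # optional: no input parameters by default

    def included_services(self) -> List[str]:
        return []      # optional: no opt-in services by default

    def cache_ttl_seconds(self) -> Optional[int]:
        return None    # optional: not cached by default

    def handle(self, params: Dict[str, object]) -> object:
        raise NotImplementedError


class ShopAbout(Endpoint):
    """Hypothetical endpoint declaring every building block."""
    route = "/shops/:shop_id/about"
    description = "Returns the about text for a shop."
    result_type = "string"

    def declare_input(self) -> Dict[str, str]:
        return {"shop_id": "int"}

    def included_services(self) -> List[str]:
        return ["EtsyApps"]  # opt in to the apps service

    def cache_ttl_seconds(self) -> Optional[int]:
        return 600

    def handle(self, params: Dict[str, object]) -> object:
        return f"About shop {params['shop_id']}"


def missing_blocks(endpoint: Endpoint) -> List[str]:
    """List the mandatory building blocks an endpoint failed to declare."""
    missing = [name for name in ("route", "description", "result_type")
               if not getattr(endpoint, name)]
    if type(endpoint).handle is Endpoint.handle:
        missing.append("handle")
    return missing
```

Keeping the endpoint to declarations plus one `handle` function is what lets a framework take over routing, type checks, and caching on the endpoint’s behalf.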

Input and output: Typed parameters, typed result

The first step when a request is being routed to the endpoint, is the setup of the input parameters. We create an input object based on the request’s URL and the endpoint’s declareInput function.

The input declaration tells us how to check for optional or mandatory input parameters, which are parsed according to a pattern in the route. If a parameter is missing or of the wrong type, the framework returns an HTTP error code and message. The input declaration specifies a type for each parameter, such as a string or a user ID. The types are Etsy-specific, and each one comes with its own validation function which is being run by the framework. According to the perspective, information about the logged in user, the logged in admin, shop, or authenticated app is being checked as well, and added to the input object.
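The flow from route pattern and input declaration to a validated input object might look like the following Python sketch. It is an illustration only: the `:name` placeholder syntax, the `VALIDATORS` table, the `InputError` class, and the 400 and 404 status codes are assumptions, not details given in the post.

```python
import re
from typing import Callable, Dict, Tuple


class InputError(Exception):
    """Carries the HTTP status and message a framework could return to the client."""

    def __init__(self, status: int, message: str) -> None:
        super().__init__(message)
        self.status = status


def parse_user_id(raw: str) -> int:
    if not raw.isdigit() or int(raw) <= 0:
        raise ValueError("expected a positive integer ID")
    return int(raw)


def parse_string(raw: str) -> str:
    return raw


# Each declared type comes with its own validation function.
VALIDATORS: Dict[str, Callable[[str], object]] = {
    "user_id": parse_user_id,
    "string": parse_string,
}


def build_input(route: str, path: str, query: Dict[str, str],
                declaration: Dict[str, Tuple[str, bool]]) -> Dict[str, object]:
    """Parse and validate parameters; declaration maps name to (type, required)."""
    pattern = "^" + re.sub(r":(\w+)", r"(?P<\1>[^/]+)", route) + "$"
    match = re.match(pattern, path)
    if match is None:
        raise InputError(404, f"path {path!r} does not match route {route!r}")
    raw = {**query, **match.groupdict()}
    parsed: Dict[str, object] = {}
    for name, (type_name, required) in declaration.items():
        if name not in raw:
            if required:
                raise InputError(400, f"missing parameter {name!r}")
            continue
        try:
            parsed[name] = VALIDATORS[type_name](raw[name])
        except ValueError as error:
            raise InputError(400, f"invalid {name!r}: {error}") from error
    return parsed
```

Because validation lives in the type rather than in each endpoint, an endpoint’s `handle` code can rely on receiving values that are already checked and converted.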

Each endpoint specifies its own output type via the resultType function. Currently, those types are optional and vary in their level of detail. We encourage developers to either return a primitive datatype, or to build a compound type, called typed resource, corresponding to the shape of the data that their endpoint returns. Type guarantees are useful for the API clients, and bring us one step closer to having guarantees on our data from the browser input field to the database record.

To make our framework complete, we’re still missing some action on both ends. How does an API request get routed to an endpoint? And how can we make an API request from our code, for example inside a meta-endpoint or in JavaScript when our site uses AJAX?

Tooling: API compiler

We need two more pieces of software, which we can automatically compile based on the endpoint declaration files. This is the job of the API compiler. Initially, this was a script that took the routes from the endpoint declarations, together with the service and perspective information, and compiled these into full routes for Apache by modifying the .htaccess files. Performance concerns were alleviated by splitting up the work and files by perspective.

Over time, we also added a second part: the generation of API client code in PHP and in JavaScript. The code is generated using a mustache template, which is a template language for websites, but works well in this context, too. Before we deploy code, we check via Jenkins that the compiled routes and client code are up to date. In this way, we control both ends of the API stack from the database access code to the outer shape of the endpoint landscape, which is reflected in changes to the client. And we neatly tie this into our continuous deployment process.
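As a rough illustration of the compile-and-check idea, here is a small Python sketch. The declaration format, the service names, and the URL layout are all assumptions made for the example; the real declarations are PHP classes and the real output is .htaccess rules and generated client code.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical endpoint declarations, reduced to the fields the compiler needs.
ENDPOINTS = [
    {"route": "/listings/:listing_id", "perspective": "public", "services": ["v3", "apps"]},
    {"route": "/shops/:shop_id/stats", "perspective": "shop", "services": ["v3"]},
]

def compile_routes(endpoints: List[dict]) -> Dict[str, List[str]]:
    """Group full routes by perspective, one output 'file' per perspective."""
    by_perspective: Dict[str, List[str]] = defaultdict(list)
    for endpoint in endpoints:
        for service in endpoint["services"]:
            # Assumed URL layout: /<service>/<perspective><route>
            full = f"/{service}/{endpoint['perspective']}{endpoint['route']}"
            by_perspective[endpoint["perspective"]].append(full)
    return {name: sorted(routes) for name, routes in by_perspective.items()}

def is_up_to_date(endpoints: List[dict], checked_in: Dict[str, List[str]]) -> bool:
    """Pre-deploy check: regenerate the output and compare with what is checked in."""
    return compile_routes(endpoints) == checked_in

compiled = compile_routes(ENDPOINTS)
print(compiled)
print(is_up_to_date(ENDPOINTS, compiled))
```

Because the output is fully derived from the declarations, a stale file can always be detected by regenerating and comparing, which is the kind of check a CI job can run before every deploy.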

This is the first post in a series of three about Etsy’s API, the abstract interface to our logic and data. The next post covers the operational side of Etsy’s API.


Recommended Reading for Allies

Posted by and on August 10, 2016 / 3 Comments

Etsy believes in the power of diversity. We believe that having diverse perspectives will help us make better decisions and build better products. We also know that it’s not enough to just recruit diverse talent: we’ve got to retain it!

A key to retaining diverse talent is fostering a supportive work environment. There are a lot of major organizational changes that can help (flexible work arrangements, equal pay, and opportunities for growth and leadership to name a few), but what can you—the individual—really do to help?

Clippy the Ally

It sounds like you want to be an ally! An ally is a person in a position of privilege who offers to share the power, access, and authority that come with that privilege with members of a non-privileged group.

Diversity is intersectional, not limited to gender, race, or any other single axis of identity. Great news: Allyship is intersectional as well! If you’re a man, you can serve as an ally to women. If you’re white, you can serve as an ally to people of color. If you can see, you can serve as an ally to people with vision loss. Anyone can use their privilege to create opportunities for people more marginalized than themselves.

On August 11 in Dublin, Etsy software engineers Toria Gibbs and Ian Malpass will be running a workshop on being an effective male ally to people who identify as women and other underrepresented populations in tech.

One important strategy for being an effective ally is self-education. Women are frequently expected to teach introductory feminism and entertain discussions on “being a woman in tech” with anyone who asks. It’s a great burden to shoulder and frankly a waste of their time. You wouldn’t ask Rasmus to teach you how to write a Hello World program in PHP, right? No! You would go out and find the articles, tutorials, and forum threads that already exist for beginners.

With that, we introduce our list of recommended reading for allies.

Introductory feminism

Why do we need feminism? Analogies on privilege

On allyship

Studies, reports

Opinion pieces, personal experiences

Blogs, magazines

Other fun stuff


While this list is not exhaustive, it should be more than enough to get you started on your journey. Happy learning!
If you’re interested in hosting your own event to promote male allyship, we recommend checking out NCWIT’s Male Allies and Advocates Toolkit or Ada Initiative’s Ally Skills Workshop.

You can read more about Etsy’s diversity in our latest annual Diversity and Equality Progress Report.
Update 08/12/2016: Slides from Toria and Ian’s presentation are now available on Speaker Deck!


Q1 2016 Site Performance Report

Posted by , and on April 28, 2016 / 1 Comment

Spring has sprung and we’re back to share how Etsy’s performance fared in Q1 2016. In order to analyze how the site’s performance has changed over the quarter, we collected data from a week in March and compared it to a week’s worth of data from December. A common trend emerged from both our back-end and front-end testing this quarter: we saw significant site-wide performance improvements across the board.

Several members of Etsy’s performance team have joined forces to chronicle the highlights of this quarter’s report. Moishe Lettvin will start us off with a recap of the server-side performance, Natalya Hoota will discuss the synthetic front-end changes and Allison McKnight will address the real user monitoring portion of the report. Let’s take a look at the numbers and all the juicy context.

Server-Side Performance

Server-side time measures how long it takes our servers to build pages. These measurements don’t include any client-side time — this measurement is the amount of time it takes from receiving an HTTP request for a page to returning the response.

As always, we start with these metrics because they represent the absolute lowest bound for how long it could take to show a page to the user, and changes in these metrics will be reflected in all our other metrics. This data is calculated from a random sample of our webserver logs.

Happily, this quarter we saw site-wide performance improvements, due to our upgrade to PHP7. While our server-side time isn’t spent solely running PHP code (we make calls to services like memcache, MySQL, Redis, etc.), we saw significant performance gains on all our pages.


One of the primary ways that PHP7 increases performance is by decreasing memory usage. In the graph below, the blue & green lines represent 95th percentile of memory usage on a set of our PHP5 web servers, while the red line indicates the memory usage on a set of our PHP7 web servers during our switch-over from PHP5 to PHP7. This is two days’ worth of data; the variance in memory usage (more visible in the PHP5 data) is due to daily variation in server load. Note also that the y-axis origin is 0 on this graph — the decrease shown in this graph is to scale!


Another interesting thing that’s visible in the box plots above is how the homepage server-side timing changed. It sped up across the board, but the distribution widened significantly. The reason for this is that the homepage gets many signed-out views as well as signed-in views, as contrasted with, for instance, the cart view page. In addition, the homepage is much more customized for signed-in users than, say, the listing page. This level of customization requires calls to our storage systems, which weren’t sped up by the PHP7 upgrade. Therefore, the signed-in requests didn’t speed up as much as the signed-out requests, which are constrained almost entirely by PHP speed. In the density plot below, you can see that the bimodal distribution of the homepage timing became more distinct in Q1 (red) vs Q4 of last year (blue):


The Baseline page gives us the clearest view into the gains from PHP7 — the page does very little outside of running PHP code, and you can see in the first chart above that the biggest impact was on the Baseline page. The median time went from 80ms to 29ms, and the variance decreased.

Synthetic Start Render

We collect synthetic monitoring results (i.e., results obtained by web browser emulation of scripted web visits) in order to cross-check our findings from the two other sets of data. We use a third party provider, Catchpoint, to record start render time, the moment when a user first sees content appearing on the screen, as a reference point.

Synthetic tests showed that all pages got significantly faster in Q1.


It is worth noting that the data for synthetic tests is collected for signed-out web requests. As previously mentioned, this type of request involves fetching less state from storage, which highlights PHP7 wins.

I noticed the unusually far-reaching outliers on the shop page and search pages and decided to investigate further. At first, I generated scatterplots for the two pages in question, which clarified that outliers on their own did not have a story to tell — no clustering patterns or high volumes on any particular day. Having a better data sanitation process would have eliminated the points in question altogether.

In order to get a better visual for the synthetic data we used, I took comparative scatterplots of all six pages we monitor. I noticed a reduction of failed tests (marked by red diamonds) that happened sometime between Q4 and present day. Remarkably, we have never isolated that data set before. A closer look revealed an important source of many current failures: fetching third party resources was taking longer than the maximum time allotted by Catchpoint. That encouraged us to consider a new metric for monitoring the impact of third party resources.


Real User Page Load Time

Gathering front-end timing metrics from real users allows us to see the full range of our pages’ performance in the wild. As in past reports, we’re using page load metrics collected by mPulse. The data used to generate these box plots is the median page load time for each minute of one week in each quarter.


We see that for the most part, page load times have gone down for each page. The boxes and whiskers of each plot have moved down, and the outliers tend to be faster as well. But it seems that we’ve also picked up some very slow outliers: each page has a single outlier of over 10 seconds, while the rest of the data points for all pages are far under that. Where did these slow load times come from?

Because each of our RUM data points represents one minute in the week and there is exactly one extremely slow outlier for each page, it makes sense that these points might be all from the same time. Sure enough, looking at the raw numbers shows that all of the extreme outliers are from the same minute!

This slow period happened when a set of boxes that queue Gearman jobs were taken out to be patched for the glibc vulnerability that surfaced this quarter. We use Gearman to run a number of asynchronous jobs that are used to generate our pages, so when the boxes were taken out and fewer resources were available for these jobs, back-end times suffered.

One interesting thing to note is that we actually didn’t notice when this happened. The server-side times (and therefore also the front-end times) for most of our pages suffered an extreme regression, but we weren’t notified. This is actually by design!

Sometimes we experience a blip in page performance that recovers very quickly (as was the case with this short-lived regression). It makes little sense to scramble to understand an issue that will automatically resolve within a few minutes — by the time we’ve read and digested the alert, it may have already resolved itself — so we have a delay built in to our alerts. We are only alerted to a regression if a certain amount of time has passed and the problem hasn’t been resolved (for individual pages, this is 40 minutes; when a majority of our pages’ performance degrades all at once, the delay is much shorter).
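The effect of the delay is easy to see in a small sketch. This is not our alerting code; `first_alert_minute` is a hypothetical helper that assumes we have one regressed/not-regressed flag per minute and uses the 40-minute delay mentioned above.

```python
from typing import List, Optional

def first_alert_minute(regressed: List[bool], delay_minutes: int) -> Optional[int]:
    """Return the minute index at which an alert would fire, or None.

    regressed[i] is True when the page was regressed during minute i.
    An alert fires only once the regression has lasted delay_minutes in a row.
    """
    streak = 0
    for minute, is_regressed in enumerate(regressed):
        streak = streak + 1 if is_regressed else 0
        if streak >= delay_minutes:
            return minute
    return None

blip = [False] * 10 + [True] * 3 + [False] * 60   # recovers after 3 minutes
outage = [False] * 10 + [True] * 45               # still regressed after 40 minutes

print(first_alert_minute(blip, 40))    # None: the blip never alerts
print(first_alert_minute(outage, 40))  # 49: fires on the 40th regressed minute
```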

This delay ensures that we won’t scramble to respond to an alert if the problem will immediately fix itself, but it does mean that we don’t have any insight into extreme but short-lived regressions like this one, and that means we’re missing out on some important information about our pages’ performance. If something like this happens once a week, is it still something that we can ignore? Maybe something like this happens once a day — without tracking short-lived regressions, we don’t know. Going forward, we will investigate different ways of tracking short-lived regressions like this one.

You may also have noticed that while the slowdown that produced these outliers originated on the server side, the outliers are missing from both the server-side and synthetic data. This is because of the different collection methods that we use for each type of data. Our server-side and synthetic front-end datasets each contain 1,000 data points sampled from visits throughout the week (with the exception of the baseline page, which has a smaller server-side dataset because it receives fewer visits than other pages). This averages to only about 143 data points per day, far under one data point per minute, and so it’s likely that no data from the short regression made it into the synthetic or server-side datasets at all. Our front-end RUM data, on the other hand, has one data point (the median page load time) for every minute, so it was guaranteed that the regression would be represented in that dataset as long as at least 50% of views were affected.
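To put a rough number on "likely", here is a back-of-the-envelope calculation. It assumes the 1,000 samples are spread uniformly over the week, which real traffic is not, so treat the result as an order-of-magnitude estimate rather than a measurement.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080

def per_day(samples_per_week: int) -> float:
    """Average number of sampled data points per day."""
    return samples_per_week / 7

def chance_minute_sampled(samples_per_week: int) -> float:
    """Probability that at least one sample lands in one given minute,
    assuming samples fall uniformly and independently across the week."""
    miss_one = 1 - 1 / MINUTES_PER_WEEK
    return 1 - miss_one ** samples_per_week

print(per_day(1000))
print(chance_minute_sampled(1000))
```

Under that assumption, a single bad minute has a bit less than a 10% chance of showing up in a 1,000-point weekly sample, while a per-minute median cannot miss it.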

The nuances in the ways that we collect each metric are certainly very interesting, and each method has its pros and cons (for example, heavily sampling our server-side and synthetic front-end data leads to a narrower view of our pages’ performance, but collecting medians for our front-end RUM data to display in box plots is perhaps statistically unsound). We plan to continue iterating on this process to make it more appropriate to the report and more uniform across our different monitoring stacks.


In the first quarter of 2016, performance improvements resulting from the server-side upgrade to PHP7 trickled down through all our data sets and the faster back-end times translated to speed ups in page load times for users. As always, the process of analyzing the data for the sections above uncovered some interesting stories and patterns that we may have otherwise overlooked. It is important to remember that the smaller stories and patterns are just as valuable as learning opportunities as the big wins and losses.


How Etsy Formats Currency

Posted by on April 19, 2016 / 21 Comments

Imagine how you would feel if you went into a grocery store, and the prices were gibberish (“1,00.21 $” or “$100.A”). Would you feel confident buying from this store?

Etsy does business in more than 200 regions and 9 languages. It’s important that our member experience is consistent and credible in all regions, which means we have to format prices correctly for all members.

In this post, I’ll cover:

In order to follow along, you need to know one important thing: Currency formatting depends on three attributes: the currency, the member’s location, and the member’s language.

Examples of bad currency formatting

Here are some examples of bad currency formatting:

If you don’t know why the examples above are confusing, read on.

What’s wrong with: A member browsing in German goes to your site and sees an item for sale for “1,000.21 €”?

The first example is the easiest. If a member is browsing in German, the commas and decimals in a price should be flipped. So “1,000.21 €” should really be formatted as “1.000,21 €”. This isn’t very confusing (as a German member, you can figure out what the price is *supposed* to be), but it is a bad experience.

By the way, if you are in Germany, using Euros, but browsing in English, what would you expect to see? Answer: “€1,000.21”. The separators and symbol position are based on language here, not region.

What’s wrong with: A Japanese member sees an item selling for “¥ 847,809.34”?

Japanese Yen doesn’t have a fractional part. There’s no such thing as half a Yen. So “¥ 847,809.34” could mean “¥ 847,809”, or “¥ 84,780,934” or something else entirely.

What’s wrong with: A Canadian member sees “$1.00”?

If your site is US-based, this can be confusing. Does “$” mean Canadian dollar or US dollar here? A simple fix is to add the currency code at the end: “$1.00 USD”.

How to format currency correctly

Etsy's locale settings picker


Formatting currency for international members is hard. Etsy supports browsing in 9 languages, 23 currencies, and hundreds of regions. Luckily, we don’t have to figure out the right way to format in all of these combinations, because the nice folks at CLDR have done it for us. CLDR is a massive database of formatting styles that gets updated twice a year. The data gets packaged up into a portable library called libicu. Libicu is available everywhere, including mobile phones. If you want to format currency, you can use CLDR data to do it.

For each language + region + currency combination, CLDR gives you:

A typical pattern looks like this:

A cldr pattern (#,##0.00)


This is the pattern for German + Germany + Euros. It tells you:

NOTE: the pattern does *not* tell you what the decimal and grouping separators are. CLDR gives you those separately; they are not a part of the pattern.

Now you can use this information to format a value:

#,##0.## translates to 1.000,21

If you want to format prices using CLDR, your language might have libraries to do it for you already. PHP has NumberFormatter, for example. JavaScript has Intl.NumberFormat.
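To show how a pattern and the separately supplied separators combine, here is a deliberately simplified sketch in Python. It only understands the basic shapes used in this post (an optional `¤` symbol placeholder around `#,##0` or `#,##0.00`), and the function names are made up for the example. A real implementation should lean on an ICU-backed library instead.

```python
from decimal import Decimal, ROUND_HALF_UP

def format_number(value: Decimal, fraction_digits: int,
                  decimal_sep: str, grouping_sep: str) -> str:
    """Group digits in threes and use a fixed number of fraction digits."""
    quantum = Decimal(1).scaleb(-fraction_digits)  # e.g. Decimal("0.01")
    rounded = value.quantize(quantum, rounding=ROUND_HALF_UP)
    # Python formats with "," and "."; swap in the locale's separators in one pass.
    text = f"{rounded:,.{fraction_digits}f}"
    return text.translate(str.maketrans({",": grouping_sep, ".": decimal_sep}))

def format_currency(value: Decimal, pattern: str, symbol: str,
                    decimal_sep: str, grouping_sep: str) -> str:
    """Apply a simple CLDR-style pattern; \u00a4 marks where the symbol goes."""
    number_part = pattern.replace("\u00a4", "").strip()  # e.g. "#,##0.00"
    fraction_digits = len(number_part.split(".")[1]) if "." in number_part else 0
    number = format_number(value, fraction_digits, decimal_sep, grouping_sep)
    return pattern.replace(number_part, number).replace("\u00a4", symbol)

# German: symbol last, "." groups and "," separates decimals.
print(format_currency(Decimal("1000.21"), "#,##0.00 \u00a4", "\u20ac", ",", "."))
# English: symbol first, "," groups and "." separates decimals.
print(format_currency(Decimal("1000.21"), "\u00a4#,##0.00", "\u20ac", ".", ","))
# Yen: a pattern with no fraction digits.
print(format_currency(Decimal("847809"), "\u00a4#,##0", "\u00a5", ".", ","))
```

Keeping the separators out of the pattern is what lets the same pattern string serve locales that differ only in punctuation.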

Practical implementation decisions

CLDR is great, but it is not the ultimate authority. It is a collaborative project, which means that anyone can add currency data to CLDR, and then everyone votes on whether the data looks correct or not. People can also vote to change existing currency data.

CLDR data is not a precise thing; it is fluid and changing. Sometimes you need to customize CLDR for your use case. Here are the customizations we made.

The problem with currencies that use a dollar sign ($)

We use CLDR to format currency at Etsy, but we’ve made some changes to it. One issue in particular has really bugged us. Dollar currencies are really hard to work with. The symbol for CAD (Canadian dollars) is “$” in Canada, but it is “CA$” in the US and everywhere else to avoid confusion with US Dollars. So if we followed CLDR, Canadian members would see “$1.00”. But our Canadian members might know that Etsy is a US-based company, in which case “$” would be ambiguous to them — it could mean either Canadian dollars or US dollars. Here is how we choose a currency symbol to avoid confusion while still meeting member expectations:

What symbol does Etsy use for dollar-based currencies?


Here is the value “1000.21” formatted in different currency + region combinations:


You might be wondering, why not just add the currency code to the end of the price? For example, it could be “$1,000.21 USD” for US dollars, and “$1,000.21 CAD” for Canadian dollars. This is also explicit, and it doesn’t require complicated logic to change the currency symbol. But this approach has another issue: redundancy.

Suppose we did add the currency code at the end everywhere to address the CAD problem. Euros would get formatted as “1.000,21 € EUR”, but the “€ EUR” is redundant. Even worse, the Swiss Franc doesn’t have a currency symbol, so CLDR recommends using the currency code as the currency symbol. That means members using Swiss Francs would see “1.000,21 CHF CHF”, which is definitely redundant:

Adding the currency code at the end is explicit, but doesn’t meet member expectations. Our German members said they didn’t like how “1.000,21 € EUR” looked.

In the end Etsy decided not to show the currency code. Instead, we change the currency symbol as needed to avoid confusion.


Listing price with settings English / Canada / Canadian dollars


Overriding CLDR data

Here’s a simple case where we overrode CLDR formatting. We are a website, so of course we want our prices to be wrapped in html tags so that they can be styled appropriately. For example, on our listings manager, we want to format price input boxes correctly based on locale:




It’s hard to wrap a price in html tags *after* you have done the formatting: sometimes the symbol is at the end, sometimes there’s a space between the symbol and value, and sometimes there isn’t, etc etc. To make this work, the html tags need to be a part of the pattern, so we need to be able to override the CLDR patterns directly.

Ultimately we ended up overriding a lot of the default CLDR data:

Different libraries offered different levels of support for this. PHP’s NumberFormatter lets you override the pattern and symbol. JavaScript’s Intl.NumberFormat lets you override neither. None of the libraries had support for wrapping html tags around the output. In the end, we wrote our own JavaScript library and added wrappers for the rest.

Consistent formatting across platforms

We had to format currency in PHP, JavaScript, and in our iOS and Android apps. PHP, JavaScript, iOS and Android all had different versions of libicu, and so they had different CLDR data. How do we format consistently across these platforms? We went with a dual plan of attack: write tests that are the same across platforms, and make sure all CLDR overrides get shared between platforms.

We wrote a script that would export all our CLDR overrides as JSON / XML / plist. Every time the overrides change, we run the script to generate new data for all platforms. Here’s what our JSON file looks like right now (excerpt):

    "de_AU": {
        "symbol": {
            "AUD": "AU$",
            "BRL": "R$",
            "CAD": "CA$"
        },
        "decimal_separator": ",",
        "grouping_separator": ".",
        "pattern": {
            "AUD": "#,##0.00 \u00a4",
            "BRL": "#,##0.00 \u00a4",
            "CAD": "#,##0.00 \u00a4"
        }
    }

We wrote another script to generate test fixtures, which look like this (excerpt):

"test_symbol&&!code&&!html": {
    "de": {
        "DE": {
            "EUR": {
                "100000": "1.000,00 \u20ac",
                "100021": "1.000,21 \u20ac"
            }
        },
        "US": {
            "EUR": {
                "100000": "1.000,00 \u20ac",
                "100021": "1.000,21 \u20ac"
            },
            "USD": {
                "100000": "1.000,00 $",
                "100021": "1.000,21 $"
            }
        }
    }
}

This test says that given these settings:

We have hundreds of tests in total to check every combination of language/region/currency code with symbol shown vs. hidden, formatted as text vs. html, etc. These expected values get checked against the output of the currency formatters on all platforms, so we know that they all format currency correctly and consistently. Any time an override changes (for example, changing the symbol for CAD to be “CA$” in all regions), we update the CLDR data file so that the new override gets spread to all platforms. Then we update the test fixtures and re-run the tests to make sure the override worked on all platforms.
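A shared fixture like this can be driven by a very small runner on each platform. The sketch below is in Python for brevity and is not our actual test harness: `format_de` is a hypothetical German-only stand-in for a platform's formatter, and reading the fixture keys as amounts in minor units (so "100021" means 1000.21) is an inference from the excerpt.

```python
from typing import Callable, Dict, List, Tuple

Fixture = Dict[str, Dict[str, Dict[str, Dict[str, str]]]]

# language -> region -> currency -> amount in minor units -> expected string
FIXTURE: Fixture = {
    "de": {
        "DE": {"EUR": {"100000": "1.000,00 \u20ac", "100021": "1.000,21 \u20ac"}},
        "US": {"USD": {"100000": "1.000,00 $", "100021": "1.000,21 $"}},
    }
}

def format_de(language: str, region: str, currency: str, minor_units: int) -> str:
    """Hypothetical stand-in for one platform's formatter (German only)."""
    symbol = {"EUR": "\u20ac", "USD": "$"}[currency]
    whole, cents = divmod(minor_units, 100)
    grouped = f"{whole:,}".replace(",", ".")
    return f"{grouped},{cents:02d} {symbol}"

def run_fixture(fixture: Fixture,
                formatter: Callable[[str, str, str, int], str]) -> List[Tuple[str, str, str]]:
    """Return (case, expected, actual) for every mismatch; empty means all passed."""
    failures = []
    for language, regions in fixture.items():
        for region, currencies in regions.items():
            for currency, cases in currencies.items():
                for amount, expected in cases.items():
                    actual = formatter(language, region, currency, int(amount))
                    if actual != expected:
                        case = f"{language}/{region}/{currency}/{amount}"
                        failures.append((case, expected, actual))
    return failures

print(run_fixture(FIXTURE, format_de))
```

Because the expected strings live in data rather than in test code, each platform only needs its own thin runner, and every platform is checked against exactly the same expectations.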


No more “¥ 847,809.34”! Formatting currency is hard. If you want to do it correctly, use the CLDR data, but make sure that you override it when necessary based on your unique circumstances. I hope our changes lead to a better experience for international members. Thanks for reading!


Building a Translation Memory to Improve Machine Translation Coverage and Quality

Posted by and on March 22, 2016 / 2 Comments

Machine Translation at Etsy

At Etsy, it is important our global member base can communicate with one another, even when they don’t speak the same language. Whether users are browsing listings, messaging other users, or posting comments in the forums, machine translation is a valuable tool for facilitating multilingual interactions on our site and in our apps.

Listing descriptions account for the bulk of text we machine translate. With over 35 million active listings at an average length of nearly 1,000 characters, and 10 supported site languages, we need to translate a lot of content—and that’s just for listings. We also provide machine translation for listing reviews, forum posts, and conversations (messaging between members). We send text we need to translate to a third party machine translation service, and given the associated cost, there is a limit to the number of characters we can translate per month.

Listing Review

An example listing review translation.

While a user can request a listing translation if we don’t already have one (we call this on-demand translation), translating a listing beforehand and showing a visitor the translation automatically (we call this pre-translation) provides a more fluid browsing experience. Pre-translation also allows listings to surface in search results in multiple languages, both for searches on Etsy and on external search engines like Google.

The Benefits of a Translation Memory

Many of the strings we machine translate from one language to another are text segments we’ve seen before. Our most common segments are used in millions of listings, with a relatively small subset of distinct segments accounting for a very large proportion of the content. For example, the sentence “Thanks for looking!” appears in around 500,000 active listings on Etsy, and has appeared in over 3 million now inactive listings.

Zipfian Shape

Frequency and rank of text segments (titles, tags, and description paragraphs) appearing in listings on Etsy. The distribution of segments roughly conforms to a classic Zipfian shape, where a string’s rank is inversely proportional to its frequency.

In the past, a single text segment that appeared in thousands of listings on Etsy would be re-translated once for every listing. It would also be re-translated any time a seller edited a listing. This was a problem: our translation budget was being spent on millions of repeat translations that would be better used to translate unique content into more languages.

To solve this problem, we built a translation memory. At its simplest, a translation memory stores a text segment in one language and a corresponding translation of that segment in another language. Storing strings in a translation memory allows us to serve translations for these strings from our own databases, rather than making repeated requests to the translation service.

Storing these translations for later reuse has two main benefits:

  1. Coverage: By storing common translations in the translation memory and serving them ourselves, we drastically reduce the number of duplicate segments we send to the translation service. This process lets us translate seven times more content for the same cost.

  2. Quality: We’re able to see which text segments are most commonly used on Etsy and have these segments human translated. Overriding these common segments with human translations improves the overall quality of our translations.

Initial Considerations

We had two main concerns when planning the translation memory architecture:

  1. Capacity: The more text segments we store in the translation memory, the greater our coverage. However, storing every paragraph from each of our more than 35 million active listings, and a translation of that paragraph for each of our 10 supported languages, would mean a huge database table. Historically, Etsy has rarely had tables exceeding a few billion rows, and we wanted to keep that maximum limit here.

  2. Deletions: The translation service’s quality is continually improving, and to take full advantage of these improvements we need to periodically refresh entries in the translation memory by deleting older translations. We wanted to be able to delete several hundred million rows on a monthly basis without straining system resources.

The Translation Memory Architecture

Our Translation Memory consists of several separate services, each handling different tasks. A full diagram of the pipeline is below:

TM Overview Diagram

* The external translation service is Microsoft Translator.

A brief overview of each step:

  1. Splitting into segments: The first step of the translation pipeline is splitting blocks of text into individual segments. The two main choices here were splitting by sentence or splitting by paragraph. We chose the latter for a few reasons. Splitting by sentence gave us more granularity, but our estimated Translation Memory hit rate was only 5% higher with sentences versus paragraphs. The increased hit rate wasn’t high enough to warrant the extra logic needed to split by sentence, nor the multi-fold database size increase to store every sentence, instead of just every paragraph. Moreover, although automatic sentence boundary detection systems can be quite good, a recent study evaluated the most popular systems on user-generated content and found that accuracy peaked at around 95%. In contrast, using newline characters to split paragraphs is straightforward and an almost error-free way to segment text.

  2. Excluder: The excluder is the first service we run translations through. It removes any text we don’t want to translate. For now this means lines containing only links, numbers, or special characters.

  3. Human Translation Memory (HTM): Before looking for a machine translation, we check first for an existing human translation. Human translations are provided by Etsy’s professional translators (the same people who translate Etsy’s static site content). These strings are stored in a separate table from the Machine Translation Memory and are updated using an internal tool we built, pictured below.

Human TM Interface

  4. Machine Translation Memory (MTM): We use sharded MySQL tables to store our machine translation entries. Sharded tables are a well-established pattern at Etsy, and the system works especially well for handling the large row count needed to accommodate all the text segments. As mentioned earlier, we periodically want to delete older entries in the MTM to clear out unused translations, and make way for improved translations from the translation service. We partition the MTM table by date to accommodate these bulk deletions. Partitioning allows us to drop all the translations from a certain month without worrying about straining system resources or causing lag in our master-master pairs.

  5. External Translation Service: If there is new translatable content that doesn’t exist in either our HTM or MTM, we send it to the translation service. Once translated, we store the segment in the MTM so it can be used again later.

  6. Re-stitching segments: Once each of the segments has passed through one of our four services, we stitch them all back together in the proper order.
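The steps above can be sketched in a few lines. This is a simplified illustration, not our production code: the in-memory dictionaries stand in for the HTM and MTM tables, the `external` callable stands in for the translation service, the excluder rule is an approximation of "only links, numbers, or special characters", and a real memory would also key entries by language pair.

```python
import re
from typing import Callable, Dict, List

URL = re.compile(r"https?://\S+")

def is_excluded(segment: str) -> bool:
    """Assumed excluder rule: nothing alphabetic is left once links are removed."""
    without_links = URL.sub("", segment)
    return not any(ch.isalpha() for ch in without_links)

def translate_text(text: str, htm: Dict[str, str], mtm: Dict[str, str],
                   external: Callable[[str], str]) -> str:
    out: List[str] = []
    for segment in text.split("\n"):       # 1. split into paragraphs
        if is_excluded(segment):            # 2. excluder: pass through untouched
            out.append(segment)
        elif segment in htm:                # 3. human translation memory
            out.append(htm[segment])
        elif segment in mtm:                # 4. machine translation memory
            out.append(mtm[segment])
        else:                               # 5. external service, then store in the MTM
            mtm[segment] = external(segment)
            out.append(mtm[segment])
    return "\n".join(out)                   # 6. re-stitch in the original order

calls: List[str] = []

def fake_service(segment: str) -> str:
    """Stand-in for the third-party service; records each paid request."""
    calls.append(segment)
    return f"[fr] {segment}"

htm = {"Thanks for looking!": "Merci pour la visite !"}
mtm: Dict[str, str] = {}
listing = "Handmade mug\nhttp://example.com\nThanks for looking!"

print(translate_text(listing, htm, mtm, fake_service))
print(translate_text(listing, htm, mtm, fake_service))
print(len(calls))
```

Running the same listing through twice only reaches the external service once, which is where the savings come from.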

The Results

We implemented the Excluder, HTM, and MTM in that order. Implementing the Excluder first allowed us to refine the text splitting, restitching, and monitoring aspects of the pipeline before worrying about data access. Next we built the HTM and populated it with several hundred translations of the most common terms on Etsy. Finally, at the end of November 2015, we began storing and serving translations from the MTM.

Translation Memory Rampup Graph

Coverage: As you can see from the graphs above, we now only send out 14% of our translations to the translation service, and the rest we can handle internally. Practically, this means we can pre-translate over seven times more text on the same budget. Prior to implementing the translation memory, we pre-translated all non-English listings into English, and a majority of the rest of our listings into French and German. With the translation memory in place, we are pre-translating all eligible listings into English, French, German, Italian, and Japanese, with plans to scale to additional languages.
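The "seven times" figure follows directly from the 14%: if only that fraction of translations is paid for, the same budget stretches to the reciprocal of that fraction. A quick check, with a hypothetical helper name:

```python
def coverage_multiplier(fraction_sent_out: float) -> float:
    """How much more text fits in the same budget when only this fraction is paid for."""
    return 1 / fraction_sent_out

print(round(coverage_multiplier(0.14), 1))  # 7.1
```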

Quality: Around 1% of our translations (by character count) are now served by the human translation memory. These HTM segments are mostly listing tags. These tags are important for search results and are easily mis-translated by an MT system because they lack the context a human translator can infer more easily. Additionally, human translators are better at conveying the colloquial tone often used by sellers in their listing descriptions. With the HTM in place, the most common paragraph on Etsy, “Thanks for looking!” is human translated into the friendlier, “Merci pour la visite !” rather than the awkward, “Merci pour la recherche !” The English equivalent of this difference would be, “Thanks for visiting!” versus “Thanks for researching!”

Monitoring: Since a majority of our translation requests are now routed to the MTM rather than the third-party translation service, we monitor our translations to make sure they are sufficiently similar to those served by the translation service. To do this, we sample 0.1% of the translations served from the MTM and send an asynchronous call to the translation service to provide a reference translation of the string. Then we log the similarity (the percentage of characters in common) and Levenshtein distance (also known as edit distance) between the two translations. As shown in the graph below, we track these metrics to ensure the stored MTM translations don’t drift too far from the original third party translations.


For comparison, as you can see below, the similarity for HTM translations is not as high, reflecting the fact that these translations were not originally drawn from the third party translation service.
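The sampling and comparison step described under Monitoring could be sketched roughly as follows. The Levenshtein function is the standard dynamic-programming edit distance. The similarity measure is an assumption: the post only describes it as the percentage of characters in common, so this sketch substitutes Python's `difflib.SequenceMatcher` ratio as a stand-in. The function names are likewise hypothetical.

```python
import random
from difflib import SequenceMatcher

SAMPLE_RATE = 0.001  # sample 0.1% of translations served from the MTM


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    """Assumed similarity metric: SequenceMatcher ratio as a percentage."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def should_sample(rng=random) -> bool:
    """Decide whether this MTM hit is checked against the service."""
    return rng.random() < SAMPLE_RATE


def compare(stored: str, reference: str) -> dict:
    """Metrics to log for one sampled translation."""
    return {
        "similarity": similarity_percent(stored, reference),
        "levenshtein": levenshtein(stored, reference),
    }
```

In a setup like this, the reference translation would be fetched asynchronously so that monitoring never delays the response served from the MTM.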


Additional Benefits

Correcting mis-translations: Machine translation engines are trained on large amounts of data, and sometimes this data contains mistakes. The translation memory gives us more granular control over the translated content we serve, allowing us to override incorrect translations while the translation service we use works on a fix. Below is an example where “Realistic bird” is mis-translated into German as “Islamicrevolutionservice.”

Realistic Bird Mis-translation

With the translation memory, we can easily correct problematic translations like this by adding an entry to the human translation memory with the original listing title and the correct German translation.

Respecting sellers’ paragraph choices: Handling paragraph splitting ourselves had the additional benefit of improving the quality of translation for many of our listings. Etsy sellers frequently include lists of attributes and other information without punctuation in their listings. For example:

Dimensioni 24×18 cm
Spedizione in una scatola protettiva in legno
Verrà fornito il codice di monitoraggio (tracking code)

The translation service often combines these lists into a single sentence, producing a translation like this:

Size 24 x 18 cm in a Shipping box wooden protective supplies the tracking code (tracking code)

By splitting on paragraphs, our sellers’ choice of where to put line breaks is now always retained in the translated output, generating a more accurate (and visually appealing) translation like this:

Size 24 x 18 cm
Shipping in a protective wooden box
You will be given the tracking code (tracking code)

Splitting on paragraphs prior to sending strings out for translation is an improvement we could have made independent of the translation memory, but it came automatically with the infrastructure needed to build the project.


Greater accuracy for listing translations means buyers can find the items they’re looking for more easily, and sellers’ listings are more faithfully represented when translated. To continue improving quality, over the next month we are rolling out a machine translation engine trained on top of the translation service’s generic engine. A machine translation engine customized with Etsy-specific data, in conjunction with more human translated content, will produce higher-quality translations that more closely reflect the colloquialisms of our sellers.

Building a community-centric, global marketplace is a core tenet of Etsy’s mission. Machine translation is far from perfect, but it can be a valuable tool when fostering an online community built around human interaction. The translation memory allows us to bring this internationalized Etsy experience to more users in more languages, making it easier to connect buyers and sellers from around the world.


Putting the Dev in Devops: Bringing Software Engineering to Operations Infrastructure Tooling

Posted on February 22, 2016

At Etsy, the vast majority of our computing happens on physical servers that live in our own data centers. Since we don’t do much in the cloud, we’ve developed tools to automate away some of the most tedious aspects of managing physical infrastructure. This tooling helps us take new hardware from initial power on to being production-ready in a matter of minutes, saving time and energy for both data center technicians racking hardware and engineers who need to bring up new servers. It was only recently, however, that this toolset started getting the love and attention that really exemplifies the idea of code as craft.

The Indigo Tool Suite

The original idea for this set of tools came from a presentation on Scalable System Operations that a few members of the ops team saw at Velocity in 2012. Inspired by the Collins system that Tumblr had developed, but disappointed that it wasn’t yet open source at the time or able to work out of the box with our particular stack of infrastructure tools, the Etsy ops team started writing our own. In homage to Tumblr’s Phil Collins tribute, we named the first Ruby script of our own operations toolset after his bandmate Peter Gabriel. As that one script grew into many, that naming scheme continued, with the full suite and all its components eventually being named after Gabriel and his songs.

While many of the technical details of the architecture and design of the tool suite as it exists today are beyond the scope of this post, here is a brief overview of the different components that currently exist. These tools can be broken up into two general categories based on who uses them. The first are components used by our data center team, who handle things like unboxing and racking new servers as well as hardware maintenance and upgrades:

The other set of tools are primarily used by engineers working in the office, enabling them to take boxes that have already been set up by the data center team and sledgehammer and get them ready to be used for specific tasks:

The interface to install a new server with the Gabriel tool

While many of the details of the inner workings of this automation tooling could be a blog post in and of themselves, the key aspect of the system for this post is how interconnected it is. Sledgehammer’s unattended mode, which has saved our data center team hundreds—if not thousands—of hours of adding server information to RackTables by hand, depends on the sledgehammer payload, sledgehammer executor, API, and the shared libraries that all these tools use all working together perfectly. If any one part of that combination isn’t working with the others, the whole thing breaks, which gets in the way of people, especially our awesome data center team, getting their work done.

The Problem

Over the years, many, many features have been added to Indigo, and as members of the operations team worked to add those features, they tried to avoid breaking things in the process. But testing had never been high on Indigo’s list of priorities – when people started working on it, they thought of it more as a collection of ops scripts that “just work” than as a software engineering project. Time constraints sometimes played a role as well – for example, sledgehammer’s unattended mode in all its complex glory was rolled out in one afternoon, because a large portion of our recent data center move was scheduled for the next day and it was more important at that point to get that feature out for the DC team to use than it was to write tests.

For years, the only way of testing Indigo’s functionality was to push changes to production and see what broke—certainly not an ideal process! A lack of visibility into what was being changed compounded the frustration with this process.

When I started working on Indigo, I was one of the first people with a formal computer science background to touch that code, so one of the first things I thought of was adding unit tests, like we have for so much of the other code we write at Etsy. I soon discovered that, because the majority of the Indigo code had been written without testability in mind, I would have to do some significant refactoring before I could even start writing unit tests. That meant we first had to lay some groundwork so we could refactor without being too disruptive to other users of these tools. Refactoring first without any way to test the impact of my changes on the data center team was just asking for everyone involved to have a bad time.

Adding Tests (and Testability)

Some of the most impactful changes we’ve made recently have been around finding ways to test the previously untestable unattended sledgehammer components. Our biggest wins in this area have been:


payload: "sledgehammer-payload-0.5-test-1.x86_64.rpm"
unattended: "true"
unattended_run_recipient: ""
indigo_url: ""

With changes like these in place, we are able to have much more confidence that our changes won’t break the unattended sledgehammer tool that is so critical for our data center team. This enables us to more effectively refactor the Indigo codebase, whether that be to improve it in general or to make it more testable.

I gave a presentation at OpsSchool, our internal series of lectures designed to educate people on a variety of operations-related topics inspired by, on how to change the Indigo code to make it better suited to unit testing. Unit testing itself is beyond the scope of this post, but for us, this has meant things like changing method signatures so that objects that might be mocked or stubbed out can be passed in during tests, or splitting up large gnarly methods that grew organically along with the Indigo codebase over the past few years into smaller, more testable pieces. This way, other people on the team are able to help write unit tests for all of Indigo’s shared library code as well.
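As an illustration of the kind of signature change described above, here is a small sketch in Python (Indigo itself is written in Ruby, and none of these names come from the Indigo codebase). The collaborator that would normally talk to a real inventory system is passed in as a parameter, so a test can substitute a mock for it.

```python
from unittest.mock import Mock


class InventoryClient:
    """Hypothetical client for an inventory system; not a real Indigo class."""

    def add_host(self, hostname: str, rack: str) -> bool:
        raise NotImplementedError("would make a network call in production")


def register_server(hostname: str, rack: str, inventory=None) -> bool:
    """Normalize the hostname and record the server in the inventory.

    The inventory client is injected so tests can pass a stand-in;
    production callers can omit it and get the real client.
    """
    if not hostname:
        raise ValueError("hostname is required")
    inventory = inventory if inventory is not None else InventoryClient()
    return inventory.add_host(hostname.lower(), rack)


# In a unit test, a mock replaces the real client entirely.
fake_inventory = Mock(spec=InventoryClient)
fake_inventory.add_host.return_value = True

assert register_server("WEB01", "r12", inventory=fake_inventory) is True
fake_inventory.add_host.assert_called_once_with("web01", "r12")
```

The same idea applies to splitting up large methods: each smaller piece takes its dependencies as arguments and can be exercised without touching real hardware or services.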

Deploying, Monitoring, and Planning

As mentioned previously, one of the biggest headaches with this tooling had been keeping all the different moving pieces in sync when people were making changes. To fix this, we decided to build on the work that had already been put into Deployinator by our dev tools team. We created an Indigo Deployinator stack that, among other things, ensures that the shared libraries, API, command line tools, and sledgehammer payload are all deployed at the same time. It keeps these deploys in sync, handles the building of the payload RPM, and restarts all the Indigo services to make sure that we never again run into issues where the payload stops working because it didn’t get updated when one of its shared library files did, or vice versa.

Additionally, it automatically emails release notes to everyone who uses the Indigo toolset, including our data center team. These release notes, generated from the git commit logs for all the commits being pushed out with a given deploy, provide some much-needed visibility into how the tools are changing. Of course, this meant making sure everyone was on the same page with writing commit messages that will be useful in this context! This way the data center folks, geographically removed from the ops team making these changes, have a heads up when things might be changing with the tools they use.
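A rough sketch of how release notes might be generated from git commit logs is shown below. This is not how Deployinator does it: the function names, the log format, and the layout of the notes are assumptions, and sending the email is left out.

```python
import subprocess
from typing import List


def commits_between(old_rev: str, new_rev: str, repo_dir: str = ".") -> List[str]:
    """Return one 'hash author: subject' line per commit in old_rev..new_rev."""
    output = subprocess.run(
        ["git", "log", "--no-merges", "--pretty=format:%h %an: %s",
         f"{old_rev}..{new_rev}"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return [line for line in output.splitlines() if line.strip()]


def format_release_notes(stack: str, commits: List[str]) -> str:
    """Build a plain-text summary suitable for the body of an email."""
    if not commits:
        return f"{stack} deploy: no changes"
    lines = [f"{stack} deploy: {len(commits)} change(s)", ""]
    lines += [f"  * {commit}" for commit in commits]
    return "\n".join(lines)
```

Since the commit subject line is all the reader sees, notes generated this way are only as useful as the commit messages behind them.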

Finally, we’re changing how we approach the continued development and maintenance of this software going forward. Indigo started out as a single Ruby script and evolved into a complex interconnected set of tools, but for a while the in-depth knowledge of all the tools and their interconnections existed solely in the heads of a couple of people. Going forward, we’re documenting not only how to use the tools but how to develop and test them, and encouraging more members of the team to get involved with this work so that no individual is a single point of knowledge. We’re keeping testability in mind as we write more code, so that we don’t end up with any more code that has to be refactored before it can even be tested. And we’re developing with an eye for the future, planning what features will be added and which bugs are highest priority to fix, and always keeping in mind how the work we do will impact the people who use these tools the most.


As operations engineers, we don’t think of ourselves as developers, but there’s a lot we can learn from our friends in the development world. Instead of always writing code willy-nilly as needed, we should be planning how to best develop the tooling we use, making sure to be considerate of future-us who will have to maintain and debug this code months or even years down the line.

Tools to provision hardware in a data center need tests and documentation just as much as consumer-facing product code. I’m excited to show that operations engineers can embrace the craftsmanship of software engineering to make our tools more robust and scalable.