Code as Craft: Understanding the Role of Style in E-commerce Shopping

Posted on August 2, 2019

Aesthetic style is key to many purchasing decisions. When considering an item for purchase, buyers need to be aligned not only with the functional aspects of an item’s specification (e.g. description, category, ratings), but also with its aesthetic aspects (e.g. modern, classical, retro). Style is important at Etsy, where we have more than 60 million items, hundreds of thousands of which differ by style and aesthetic. We strive to understand the style preferences of our buyers in order to surface content that best fits their tastes.

Our chosen approach to encoding the aesthetic aspects of an item is to label it with one of a discrete set of “styles”, of which “rustic”, “farmhouse”, and “boho” are examples. Manually labeling millions of listings with a style class is not feasible, especially in a marketplace that is constantly changing, so we wanted to build a machine learning model that best predicts and captures listings’ styles. Furthermore, in order to serve style-inspired listings to our users, we leveraged the style predictor to develop a mechanism for forecasting user style preferences.

Style Model Implementation

Merchandising experts identified style categories.

For this task, the style labels come from a set of classes identified by our merchandising experts. Our style model is a machine learning model which, given a listing and its features (text and images), outputs a style label. The model was designed to output not only these discrete style labels but also a multidimensional vector representing the general style aspects of a listing. Unlike a discrete label (“naval”, “art-deco”, “inspirational”), which can only be one class, the style vector encodes how a listing can be represented by all of these style classes in varying proportions. While the discrete style labels can be used in predictive tasks to recommend items to users from particular style classes (say, filtering a user’s recommended listings to just “art-deco”), the style vector is intended to serve as a machine learning signal for our other recommendation models. For example, on a listing page on Etsy we recommend similar items; this model can now surface items that are not only functionally equivalent (a “couch” for another “couch”) but potentially items from the same style (a “mid-century couch” for a “mid-century dining table”).

The first step in building our listing style prediction model was preparing a training dataset. For this, we worked with Etsy’s in-house merchandising experts to identify a list of 43 style classes. We then leveraged search visit logs to construct a “ground truth” dataset of items labeled with these style classes. For example, listings that receive a click, add-to-cart, or purchase event for the search query “boho” are assigned the “boho” class label. This gave us a labeled dataset large enough to train a style predictor model.
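As a rough illustration of this weak-labeling step, here is a minimal Python sketch; the log schema and field names are assumptions for illustration, not our production pipeline.

# Minimal sketch of deriving style labels from search-visit logs
# (field names are illustrative, not Etsy's actual log schema).
STYLE_QUERIES = {"boho", "rustic", "farmhouse"}          # subset of the 43 expert-defined styles
POSITIVE_EVENTS = {"click", "add_to_cart", "purchase"}   # engagement events that imply relevance

def label_listings(log_rows):
    """Assign a style label to a listing when it was engaged with for a style-named query."""
    labels = {}
    for row in log_rows:
        query = row["query"].strip().lower()
        if query in STYLE_QUERIES and row["event"] in POSITIVE_EVENTS:
            labels[row["listing_id"]] = query
    return labels

print(label_listings([{"query": "Boho", "event": "click", "listing_id": 123}]))  # {123: 'boho'}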

Style Deep Neural Network

Once we had a ground truth dataset, our task was to build a listing style predictor that could classify any listing into one of the 43 styles (strictly speaking, 42 styles plus an “everything else” catch-all). For this task, we used a two-layer neural network to combine the image and text features in a non-linear fashion. The image features are extracted from the primary image of a listing using a retrained ResNet model. The text features are TF-IDF values computed on the titles and tags of the items. The image and text vectors are concatenated and fed as input to the neural network, which learns non-linear relationships between text and image features that best predict a listing’s style. The network was trained on a GPU machine on Google Cloud, and we experimented with the architecture and different learning parameters until we got the best validation and test accuracy.
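A minimal sketch of this kind of two-layer network, written in PyTorch; the feature dimensions, hidden size, and layer choices are assumptions for illustration, not the exact architecture we shipped.

import torch
import torch.nn as nn

class StyleNetwork(nn.Module):
    """Two-layer network over concatenated image (ResNet) and text (TF-IDF) features."""
    def __init__(self, img_dim=2048, txt_dim=5000, hidden_dim=256, n_styles=43):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(img_dim + txt_dim, hidden_dim), nn.ReLU())
        self.output = nn.Linear(hidden_dim, n_styles)

    def forward(self, img_feats, txt_feats):
        x = torch.cat([img_feats, txt_feats], dim=1)  # fuse image and text features
        style_embedding = self.hidden(x)              # penultimate layer: reusable style vector
        logits = self.output(style_embedding)         # scores over the 43 style classes
        return logits, style_embedding

model = StyleNetwork()
logits, emb = model(torch.randn(4, 2048), torch.randn(4, 5000))  # a batch of 4 listings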


By explicitly taking style into account, the nearest neighbors are more style-aligned.

User Style

As described above, the style model lets us extract a low-dimensional embedding vector that captures this stylistic information for a listing, taken from the penultimate layer of the neural network. We computed this style embedding vector for every listing in Etsy’s corpus.

Given these listing style embeddings, we wanted to understand users’ long-term style preferences and represent them as a weighted average of the 42 articulated style labels. For every user, subject to their privacy preferences, we first gathered the listings they purchased, favorited, clicked, or added to cart over the past three months. We then averaged the style vectors of all of these listings to arrive at a final style representation for each user.
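A sketch of that averaging step, assuming the listing style vectors have already been produced by the style model; any weighting by interaction type is omitted here.

import numpy as np

def user_style_embedding(interacted_listing_ids, listing_style_vectors):
    """Average the style vectors of the listings a user purchased, favorited,
    clicked, or added to cart over the lookback window."""
    vectors = [listing_style_vectors[lid]
               for lid in interacted_listing_ids if lid in listing_style_vectors]
    if not vectors:
        return None  # no usable interaction history for this user
    return np.mean(vectors, axis=0)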

Building Style-aware User Recommendations

There are different recommendation modules on Etsy, some of which are personalized for each user. We wanted to leverage user style embeddings to provide more personalized recommendations. Our recommendation modules use a two-stage system: we first generate a candidate set, the set of listings most likely to be relevant to a user, and then apply a personalized ranker to obtain the final list of recommendations. Recommendations may be provided at varying levels of personalization based on a number of factors, including a user’s privacy settings.

In this very first iteration of user-style-aware recommendations, we apply our user style understanding to generate a candidate set based on user style embeddings and the taxonomies they most recently interacted with. This candidate set is used for the Our Picks For You module on the homepage. The idea is to combine an understanding of a user’s long-term style preferences with their recent interest in certain taxonomies.

This work can be broken down into three steps:

1. Given a user’s style embedding, take the top 3 styles with the highest probability as the “predicted user styles”. The taxonomies a user most recently interacted with are also useful because they indicate recent interests and shopping missions.

2. Given a taxonomy, sort all the listings in that taxonomy by their prediction scores for each style, from high to low, and take the top 100 listings.

“Minimal” listings in “Home & Living”

“Floral” listings in “Home & Living”

3. Taxonomy-style validation: check whether a style makes sense for a certain taxonomy, e.g. “Hygge” is not a valid style for jewelry.

These become the style-based recommendations for a user; a sketch of the three steps appears below.
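Here is a hedged sketch of those three steps; the data structures and the validity table are illustrative assumptions rather than our production code.

def style_candidates(user_style_vector, style_names, recent_taxonomies,
                     listings_by_taxonomy, valid_styles_by_taxonomy,
                     top_k_styles=3, top_n=100):
    """Combine a user's top predicted styles with their recently browsed taxonomies."""
    # Step 1: the user's top styles by predicted probability.
    top_styles = [style_names[i] for i in
                  sorted(range(len(user_style_vector)),
                         key=lambda i: user_style_vector[i], reverse=True)[:top_k_styles]]
    candidates = []
    for taxonomy in recent_taxonomies:
        for style in top_styles:
            # Step 3: skip style/taxonomy pairs that don't make sense (e.g. "Hygge" jewelry).
            if style not in valid_styles_by_taxonomy.get(taxonomy, set()):
                continue
            # Step 2: top-N listings in this taxonomy, ranked by their score for this style.
            ranked = sorted(listings_by_taxonomy[taxonomy],
                            key=lambda listing: listing["style_scores"][style], reverse=True)
            candidates.extend(ranked[:top_n])
    return candidates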

1-4: boho + bags_and_purses.backpacks
5-7: boho + weddings.clothing
8,13,16: minimal + bags_and_purses.backpacks

Style Analysis 

We were keen to use our style model to answer questions about our users’ sense of style. Our questions ranged from “How are style and taxonomy related? Do they have a lot in common?” and “Do users care about style while buying items?” to “How do style trends change across the year?”. Our style model enables us to answer at least some of these questions and helps us better understand our users. To answer them and dig further, we leveraged the style model and the generated embeddings to analyze transaction data.

We then looked at the seasonality behind shopping for different styles on Etsy, starting with unit sales and purchase rates of different styles across the year. We observed that most of our styles are clearly influenced by seasonality. For example, the “Romantic” style peaks in February because of Valentine’s Day, and the “Inspirational” style peaks during graduation season. We ran a statistical stationarity test on the unit-sales time series of each style and found that the majority of styles were non-stationary, meaning they show different shopping trends throughout the year rather than constant unit sales. This provided further evidence that users’ tastes follow different trends across the year.
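To make the stationarity check concrete, here is a small sketch using the augmented Dickey-Fuller test from statsmodels on a synthetic, seasonal unit-sales series; the specific test and data we used internally may differ.

import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
weeks = np.arange(52)
# Synthetic weekly unit sales with a seasonal peak (e.g. a February bump for "Romantic").
unit_sales = 1000 + 400 * np.exp(-((weeks - 6) ** 2) / 20.0) + rng.normal(0, 30, size=52)

adf_stat, p_value, *_ = adfuller(unit_sales)
print(f"ADF statistic = {adf_stat:.2f}, p-value = {p_value:.3f}")
# A high p-value means we cannot reject non-stationarity: sales trends change over the year.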




Using the style embeddings to study user purchase patterns not only gave us strong evidence that users care about style, but also inspired us to further incorporate style into our machine learning products in the future.

Etsy is a marketplace for millions of unique and creative goods. Thus, our mission as machine learning practitioners is to build pathways that connect the curiosity of our buyers with the creativity of our sellers. Understanding both listing and user styles is another one of our novel building blocks to achieve this goal.

For further details on this work, you can read our paper published at KDD 2019.

Authors: Aakash Sabharwal, Jingyuan (Julia) Zhou & Diane Hu



An Introduction to Structured Data at Etsy

Posted on July 31, 2019

Etsy has an uncontrolled inventory; unlike many marketplaces, we offer an unlimited array of one-of-a-kind items, rather than a defined set of uniform goods. Etsy sellers are free to list any policy-compliant item that falls within the three broad buckets of craft supplies, handmade, and vintage. Our lack of standardization, of course, is what makes Etsy special, but it also makes learning about our inventory challenging. That’s where structured data comes in.

Structured vs. Unstructured Data

Structured data is data that exists in a defined relationship to other data. The relation can be articulated through a tree, graph, hierarchy, or other standardized schema and vocabulary. Conversely, unstructured data does not exist within a standardized framework and has no formal relationship to other data in a given space.

For the purposes of structured data at Etsy, the data are the product listings, and they are structured according to our conception of where in the marketplace they belong. That understanding is expressed through the taxonomy.

Etsy’s taxonomy is a collection of hierarchies comprised of 6,000+ categories (ex. Boots), 400+ attributes (ex. Women’s shoe size), 3,500+ values (ex. 7.5), and 90+ scales (ex. US/Canada). These hierarchies form the foundation of 3,500+ filters and countless category-specific shopping experiences on the site. The taxonomy imposes a more controlled view of the uncontrolled inventory — one that engineers can use to help buyers find what they are looking for. 

Building the Taxonomy

The Etsy taxonomy is represented in JSON files, with each category’s JSON containing information about its place in the hierarchy and the attributes, values, and scales for items in that category. Together, these determine what questions will be asked of the seller for listings in that category (Figure A, Box 1), and what filters will be shown to buyers for searches in that category (Figure A, Box 2).

Figure A 
A snippet of the JSON representation of the Jewelry > Rings > Bands category
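Since Figure A is an image, here is a purely illustrative sketch (as a Python dict) of the kind of information such a category file carries; the field names and values are assumptions, not Etsy’s actual schema.

# Illustrative only: not Etsy's real taxonomy schema.
bands_category = {
    "name": "Bands",
    "path": "Jewelry > Rings > Bands",
    "attributes": [
        {
            "name": "Occasion",                    # seller-facing question (Figure A, Box 1)
            "values": ["Wedding", "Anniversary"],  # standardized values
            "scale": None,                         # e.g. "US/Canada" for sized attributes
        },
    ],
    "buyer_filters": ["Occasion", "Band width"],   # search filters (Figure A, Box 2)
}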

The taxonomists at Etsy are able to alter the taxonomy hierarchies using an internal tool. This tool supports some unique behaviors of our taxonomy, like inheritance. This means that if a category has a particular filter, then all of its subcategories will inherit that filter as well.

Figure B
Sections of the Jewelry > Rings > Bands category as it appears in our internal taxonomy tool 

Gathering Structured Data: The Seller Perspective

One of the primary ways that we currently collect structured data is through the listing creation process, since that is our best opportunity to learn about each listing from the person who is most familiar with it: the seller!

Sellers create new listings using the Shop Manager. The first step in the listing process is to choose a category for the listing from within the taxonomy. Using auto-complete suggestions, sellers can select the most appropriate category from all of the categories available. 

Figure C 
Category suggestions for “ring”

At this stage in the listing creation process, optional attribute fields appear in the Shop Manager. This is also enabled by the taxonomy JSON, in that the fields correspond with the category selected by the seller (see Figure A, Box 1). This behavior ensures that we are only collecting relevant attribute data for each category and simplifies the process for sellers. Promoting this use of standardized data also reduces the need for overloaded listing titles and descriptions by giving sellers a designated space to tell buyers about the details of their products. Data collected during the listing creation process appears on the listing page, highlighting for the buyer some of the key, standardized details of the listing.

Figure D
Some of the attribute fields that appear for listings in Jewelry > Rings > Bands (see Figure A, Box 1 for the JSON that powers the Occasion attribute)

Making Use of Structured Data: The Buyer Perspective

Much of the buyer experience is a product of the structured data that has been provided by our sellers. For instance, a given Etsy search yields category-specific filters on the left-hand navigation of the search results page. 

Figure E
Some of the filters that appear upon searching for “Rings”

Those filters should look familiar! (see Figure D) They are functions of the taxonomy. The search query gets classified to a taxonomy category through a big data job, and filters affiliated with that category are displayed to the user (see Figure F below). These filters allow the buyer to narrow down their search more easily and make sense of the listings displayed.

Figure F
The code that displays category-specific filters upon checking that the classified category has buyer filters defined in its JSON (see Figure A, Box 2 for a sample filter JSON)
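Figure F is likewise an image, so here is a hedged Python sketch of the flow it describes; the function names and taxonomy structure are assumptions, not Etsy’s actual code.

def filters_for_query(query, classify_query, taxonomy):
    """Classify the search query to a taxonomy category, then return that
    category's buyer filters if any are defined in its JSON."""
    category_id = classify_query(query)       # output of the offline classification job
    category = taxonomy.get(category_id, {})
    return category.get("buyer_filters", [])  # empty list -> no category-specific filters shown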

Structuring Unstructured Data

There are countless ways of deriving structured data that go beyond seller input. First, there are ways of converting unstructured data that has already been provided, like listing titles or descriptions, into structured data. Also, we can use machine learning to learn about our listings and reduce our dependence on seller input. We can, for example, learn about the color of a listing through the image provided; we can also infer more nuanced data about a listing, like its seasonality or occasion.

We can continue to measure the relevance of our structured data through metrics like the depth of our inventory categorization within our taxonomy hierarchies and the completeness of our inventory’s attribution.

All of these efforts allow us to continue to build deeper category-specific shopping experiences powered by structured data. By investing in better understanding our inventory, we create deeper connections between our sellers and our buyers.


What It’s Like to Intern at Etsy – Part I

Posted on March 18, 2019

I secretly like seeing people’s surprise when I tell them that I chose to intern at Etsy because it was the only company that asked for a cover letter. I enjoyed every second of filling out my Etsy software engineering internship application because I felt like I was really telling my story to a company that cared about my whole self. I interned at Etsy during summer 2016 and started working full-time after I graduated from college in 2017. The human touch embedded in Etsy’s engineering culture, business strategy and company vision is still the number one thing I am proud of.

Over the past three years, I have gotten many questions about what it’s like to intern and have my first job out of college at Etsy. It always gives me a warm feeling when students are curious and excited about careers at Etsy, and I think it’s time we give this question answers that will live on the interweb.

This past winter, I met five interns that Etsy hosted for WiTNY (Women in Technology and Entrepreneurship in New York)’s Winternship program. At the end of their three-week internships, they were super excited to share their experiences. One of the winterns, Nia Laureano, wrote a fantastic recap of her time at Etsy, and I thought it would be a great way to start sharing the Etsy internship experience!

Inventing a Process: Five Interns Navigate a Complex Problem Thanks to Etsy’s Relentlessly Human Touch

by Nia Laureano

Interning at Etsy is a unique experience because so much about Etsy’s identity has to be understood in order to move forward with any sort of work. For three weeks in January, a team of four girls and I joined the Etsy family as interns and were tasked with solving an issue being faced by a product team that helps the buyers.

Coming into our first day, the details of our task were overwhelmingly foreign to us. The subject we were dealing with was Etsy’s listing page. It’s a complicated page, due to the fact that 60 million listings exist on Etsy and they are all vastly different. When engineers make changes to the listing page, it is difficult to test their code against every possible variation of the page that exists. Sometimes a variation slips their mind; they forget to account for it, which could potentially cause the page to break when they push code. This is what engineers call an edge case, and our job was to create a tool that allows Etsy engineers to test for edge cases more thoroughly. Specifically, we were asked to create a reference for them to easily find listings that match different criteria and variations to test code against. But solving a problem that we barely understood ourselves seemed daunting, if not impossible.

The entirety of our first week was spent immersing ourselves in the context of this world we were working in. We strolled through the typical workflow of an Etsy engineer, trying to imagine where our solution would fit neatly into the puzzle. We spoke to engineers about their frustrations to get to the root of their needs. We became engineers by being thrust into the process of pushing code to Etsy’s repository. We couldn’t commit to our craft without first understanding how these employees live and work; then we had to imagine what we could do to make their world better.

After interviewing several engineers, we realized that they each had their own ways of testing for edge cases. “I just have two folders of bookmarks that have some links in them,” said one engineer. “But I’m not sure what other people use.” It was surprising to hear that engineers weren’t sure what other people on their team were doing. We realized, at this point, that the problem wasn’t a faulty process — there was no process to begin with. It was up to us to invent a process, or at least establish a basic standard when it comes to testing for edge cases.

In ideation, the solutions we envisioned ranged dramatically. Something as basic as a spreadsheet would have been helpful, but we also dreamed bigger. We thought about creating an automated Etsy shop that auto-generates listings that represent the edge cases that needed to be tested. We wanted to create something ambitious, but it also had to be something we could attain in three weeks. Ultimately, we focused on creating a solution that would deliver on three crucial needs of our engineers: structure, convenience and confidence.

Structure. While some engineers relied on their own bookmarks or spreadsheets to keep track of edge cases, some relied on sheer memory, or asking their coworkers via Slack. Testing for something that could potentially break the listing page, we realized, shouldn’t be such a structureless process. Our solution needed to provide an element of uniformity; it needed to eliminate that glaring unawareness about what other teammates were doing. It needed to be a unifier.

Convenience. In order to make a tool that was accessible and easy to use, we needed to identify and understand the environment in which engineers complete the bulk of their work, because that’s where we would want our tool to live. We quickly noticed one common thread woven through the workflow of not only Etsy’s engineers, but the company as a whole: our messaging platform, Slack. We observed that so much important work at Etsy is already accomplished via Slack; it’s where employees collaborate and even push code. It made perfect sense for our solution to be integrated within the environment that was already so lived-in.

Confidence. Bugs are inevitable, but our engineers deserve to feel confident that the code they are pushing is as clean as it can be. The more edge cases they can test for, the more certain they can feel that their code is quality and fully functional. Therefore, our solution had to be thorough and reliable; it had to be something engineers could trust.

After three weeks, our project was completed in two phases. Our first phase was focused on creating a spreadsheet. This was the skeleton of our final product, which mirrored the anatomy of the listing page itself. To build this, we broke down the different components of the listing page and identified all of the variations that could occur within those components. Then, we spent several days creating almost one hundred of our own listings on Etsy that represented each of those variations. We ended up with a thorough, intuitively structured catalog of edge cases which can now be accessed by anyone at Etsy who needs it.

The second phase of our project was a Slack-integrated bot. Using our spreadsheet as a backbone, we aimed to design a bot that can retrieve edge cases on command via Slack. Engineers can input commands that return single, multiple, or all edge cases they may be looking for. Due to our time constraint, we were only able to create a bot that utilizes test data, but we hope to see a future iteration that is fully integrated with our spreadsheet.

A universe of terminology and culture had to be packed into our brains in order to accomplish what we did in three weeks. Yet, we somehow felt so seamlessly integrated into Etsy’s ecosystem from day one, thanks to the friendly and enthusiastic nature of everyone around us. We were never afraid to ask questions, because no one ever talked down to us or made us feel inferior. There are no mechanisms in place at Etsy that make power dynamics apparent, not even from the perspective of an intern.

Our project was completed not because of crash courses in PHP or because we overloaded on cold brew; it was thanks to the people who nurtured us along the way. It was the prospect of creating something that could make a lasting impact on a company we loved that motivated us. Etsy’s relentlessly human touch makes even the smallest of projects feel meaningful, and it can turn three weeks into an unforgettable experience that I will never stop feeling passionate about.

A note about our internship & our organization:

WiTNY (Women in Technology and Entrepreneurship in New York) is a collaborative initiative between Cornell Tech and CUNY designed to inspire young women to pursue careers in technology. WiTNY offers workshops and programs that teach important skills and provide real work experience.

The Winternship program is a paid, three-week, mini-internship for first- and second-year undergraduate students at CUNY schools, held during their January academic recess. Etsy is one of many companies that participated in the Winternship program this year, taking on a team of five young women and giving them a challenging project to complete while also teaching them about the different roles within a tech company.

 


Executing a Sunset

Posted on February 1, 2019

We all know how exciting it is to build new products: the thrill of a pile of new ideas waiting to be tested, new customers to reach, knotty problems to solve, and dreams of upward-sloping graphs. But what happens when a product is no longer aligned with the trajectory of the company? Often, the product, code, and infrastructure become a lower priority, while the team moves on to the next exciting new venture. In 2018, Etsy sunset three customer-facing products: Etsy Wholesale, Etsy Studio, and Etsy Manufacturing.

In this blog post, we will explore how we sunset these products at Etsy. This process involves a host of stakeholders including marketing, product, customer support, finance and many other teams, but the focus of this blog post is on engineering and the actual execution of the sunset.

Process

Pre-code Deletion

Use Feature Flags and Turn off Traffic

Once we had communicated the sunsets through emails, in-product announcements, and posts in the user forums, we started focusing on execution. Prior to the day of each sunset, we used our feature flagging infrastructure to build a switch to disable access to the interfaces for Wholesale and Manufacturing. Feature flags are an integral part of the continuous deployment process at Etsy, and they reinforce the benefits of small changes and continuous delivery.

On the day of the sunset, all we had to do was deploy a one-line configuration change to flip that feature flag, and the product was shut off.
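The pattern looks roughly like the sketch below (a simplified illustration, not Etsy’s actual feature-flag API): every entry point to the product checks a single flag, so the sunset deploy is one configuration line.

# Simplified illustration of a feature-flag gate; not Etsy's actual flag system.
FEATURE_FLAGS = {
    "wholesale.enabled": False,  # the one-line change flipped on sunset day
}

def wholesale_dashboard(seller_id):
    if not FEATURE_FLAGS.get("wholesale.enabled", False):
        return "Etsy Wholesale has been discontinued."  # sunset notice page
    return f"Rendering the Wholesale dashboard for seller {seller_id}"

print(wholesale_dashboard(42))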

A softer transition is often preferable to a hard turn off. For example, we disabled the ability for buyers to create new orders one month before shutting Etsy Wholesale off. That gave sellers a chance to service the orders that remained on-platform, avoiding a mad-dash at the end.

Export Data for Users

Once the Etsy Wholesale platform was turned off, we created data export files for each seller and buyer with information about every order they received or placed during the five years that the platform was active. Generating and storing these files in one shot allowed us to clean up the wholesale codebase without fear that parts of it would be needed later for exporting data.

Set Up Redirects

We highly recommend redirects through feature flags,  but a hard DNS redirect might be required in some circumstances. The sunset of Etsy Studio was complicated by the fact that in the middle of this project, etsy.com was being migrated from on-premise hosting to the cloud. To reduce complexity and risk for the massive cloud migration project, Etsy Studio had to be shut off before the migration began. On the day before the cloud migration, a DNS redirect was made to forward any request on etsystudio.com to a special page on etsy.com that explained that Etsy Studio was being shut down. Once the DNS change went live, it effectively shut off Etsy Studio completely.

Code Deletion Methodology

Once we confirmed that all three products were no longer receiving traffic, we kicked off the toughest part of the engineering process: deleting all the code. We phased the work in two parts, handling tightly and loosely integrated products separately. Integrations in the most sensitive and dangerous spots were prioritized, and safer deletions were saved for later as we headed into the holiday season (our busiest time of the year).

For Etsy Wholesale and Etsy Manufacturing, we had to remove the code piece by piece because it was tightly integrated with other features on the site. For Etsy Studio, we thought we would be able to delete the code in one massive commit. One benefit of our continuous integration system is that we can try things out, fail, and revert without negatively affecting our users. This proved valuable: when we tried deleting the code in one massive commit, some unit tests for the Etsy codebase started failing. We realized that small dependencies between the code paths had formed over time, so we decided to delete the code in smaller, easier-to-test chunks.
 

A small example of dependencies creeping in where you least expect them.

Challenges: Planning (or lack of it) for slowdowns

Interdependencies

During the sunset process, we didn’t consider how busy other teams would be heading into the holiday season, which slowed down getting code reviews approved. This was especially crucial for us since we were removing and modifying big chunks of code maintained by other teams.

There were also several other big projects in flight while we were trying to delete code across our codebase, and that slowed us down. One example already mentioned is the cloud migration: we couldn’t shut off Etsy Studio using a config flag and had to work around it.

Commit Size and Deploys

To reduce risk, our intention was to keep our commits small, but when you are trying to delete so much code at once, it is hard to keep every commit small. Testing and deploying took at least 50% of our team’s time. Our team made about 413 commits over five months, deleting 275,000 lines of code, which averages out to roughly 665 lines of code deleted per commit, frequently deployed one at a time.

Compliance

We actively think about compliance when building new things, but it is also important to keep compliance requirements in mind when deleting code. Etsy’s SOX compliance system requires that certain files in our codebase are subject to extra controls: when we deploy changes to such files, we need additional reviews and signoffs. Because we made many small commits, we had to go through 44 SOX reviews. Each review requires approval by multiple people, which added, on average, a few hours to each bit of deletion. Similarly, we considered user privacy and data protection when making retention decisions about the sunsetted products, deciding how to make data available for export, and assessing the impact on our terms and policies.

Deleting so much code can be a difficult process. We had to revert changes from production at least five times, which, for the most part, was simple. One of these reverts was complicated by a data corruption issue affecting a small population of sellers, and it took several days of work to write, test, and run a script to fix the problem.

The Outcome

We measured success using the following metrics:

Error logs for Wholesale dropped from thousands a day to fewer than 100 (and eventually to zero).

The roots that these three products had in our systems demonstrated the challenges of building and maintaining a standalone product alongside our core marketplace. The many branching pieces of logic that snuck in made it difficult to reuse lots of existing code. By deleting 275,000 lines of code, we were able to reduce tech debt and remove roadblocks for other engineers.


Why Diversity Is Important to Etsy

Posted on January 7, 2019

We recently published our company’s Guiding Principles. These are five common guideposts that apply to all organizations and departments within Etsy. We spent a great deal of time discussing, brainstorming, and editing them. By one estimate, over 30% of the company had some input at some phase of the process. This was a lot of effort by a lot of people, but it was important work. These principles need to reflect how we currently act while also being aspirational for how we want to behave. They will be used in performance assessments, competency matrices, interview rubrics, career discussions, and in everyday meetings to refocus discussions.

One of the five principles is focused on diversity and inclusion. The principle states:

We embrace differences.

Diverse teams are stronger, and inclusive cultures are more resilient. When we seek out different perspectives, we make better decisions and build better products.

Why would we include diversity and inclusion as one of our top five guiding principles? One reason is that Etsy’s mission is to Keep Commerce Human. Etsy is a very mission-driven company. Many of our employees joined and remain with us because they feel so passionate about the mission. Every day, we keep commerce human by helping creative entrepreneurs find buyers who become committed fans of the seller’s art, crafts, and collections. The sellers themselves are a diverse group of individuals from almost every country in the world. We would have a hard time coming to work if the way we work, develop products, and provide support weren’t done in a manner that supports this mission. Failing to be diverse and inclusive would fail that mission.

Besides aligning with our mission, there are other reasons we want to have diverse teams. Complicated systems have always existed. Complex systems, which feature unpredictable, surprising, and unexpected behaviors, have gone from something found mainly in large systems, such as cities, to almost everything we interact with today. Complex systems are far more difficult to manage than merely complicated ones, as subsystems interact in unexpected ways, making it harder to predict what will happen. Our engineers deal with complex systems on a daily basis. Complexity is a bit of an overloaded term, but scholarly literature generally categorizes it into three major groups, determined by the point of view of the observer: behavioral, structural, and constructive.1 Between the website, mobile apps, and the systems that support development, our engineers interact with highly complex systems from all three perspectives every day. Research has consistently shown that diverse teams are better able to manage complex systems.2

We recently invited Chris Clearfield and András Tilcsik, the authors of Meltdown (Penguin Canada, 2018), to speak with our engineering teams. The book and their talk covered many interesting topics, most based on Charles Perrow’s book Normal Accidents (Princeton University Press; revised ed. 1999). Perhaps the most important topic was based on a series of studies performed by Evan Apfelbaum and his colleagues at MIT, which revealed that as much as we’re predisposed to agree with a group, our willingness to disagree increases dramatically if the group is diverse.3 According to Clearfield and Tilcsik, homogeneity may facilitate “smooth, effortless interactions,” but diversity drives better decisions. Interestingly, it is the diversity itself, and not necessarily the specific contributions of the individuals, that causes greater skepticism, more open and active dialogue, and less group-think. This healthy skepticism is useful in a myriad of situations. One such situation is during pre-mortems, where a project team imagines that a project has failed and works to identify what could potentially lead to such an outcome. This is very different from a postmortem, where the failure has already occurred and the team is dissecting it. Individuals who have been working on a project for weeks or more are often biased by overconfidence and the planning fallacy, and a pre-mortem can help ameliorate these biases, especially when diverse team members participate. We firmly believe that when we seek out different perspectives, we make better decisions, build better products, and manage complex systems better.

Etsy Engineering is also incredibly innovative. One measure of that is the number of open source projects on our GitHub page and the continuing flow of contributions from our engineers in the open source community. We are of course big fans of open source as Etsy, like most modern platforms, wouldn’t exist in its current form without the myriad of people who have solved a problem and published their code under an open source license. But we also view this responsibility to give back as part of our culture. Part of everyone’s job at Etsy is making others better. It has at times been referred to as “generosity of spirit”, which to engineers means that we should be mentoring, teaching, contributing, speaking, writing, etc.  

Another measure of our innovation is our experiment velocity. We often run dozens of simultaneous experiments in order to improve the buyer and seller experiences. Under the mission of keeping commerce human, we strive every day to develop and improve products that enable 37M buyers to search and browse through 50M+ items to find just the right, special piece. As you can imagine, this takes some seriously advanced technologies to work effectively at this scale. And, to get that correct we need to experiment rapidly to see what works and what doesn’t. Fueling this innovation is the diversity of our workforce.

Companies with increased diversity unlock innovation by creating an environment where ideas are heard and employees can find senior-level sponsorship for compelling ideas. Leaders are twice as likely to unleash value-driving insights if they give diverse voices equal opportunity.4

So diversity fits our mission, helps manage complex systems, and drives greater innovation, but how is Etsy doing with respect to diversity? More than 50% of our Executive Team and half of our Board of Directors are women. More than 30% of Etsy engineers identify as women or non-binary, and more than 30% are people of color.5 These numbers are industry-leading, especially when compared to other tech companies that report on “tech roles” rather than the narrower category of “engineering” roles. Even though we’re proud of our progress, we’re not fully satisfied. In October 2017, we announced a diversity impact goal to “meaningfully increase representation of underrepresented groups and ensure equity in Etsy’s workforce.” To advance this goal, we are focused on recruiting, hiring, retention, employee development, mentorship, sponsorship, and building an inclusive culture.

We have been working diligently on our recruiting and hiring processes. We’ve rewritten job descriptions, replaced some manual steps in the process with third-party vendors, and changed the order of steps in the interview process, all in an effort to recruit and hire the very best engineers without bias. We have also allocated funding and people in order to sponsor and attend conferences focused on underrepresented groups in tech. We’ll share our 2018 progress in Q1 2019.

Once engineers are onboard, we want them to bring their whole selves to work in an inclusive environment that allows them to thrive and be their best. One thing that we do to help with this is to promote and partner directly with employee resource groups (ERGs). Our ERGs include Asian Resource Community, Black Resource and Identity Group at Etsy, Jewish People at Etsy, Hispanic Latinx Network, Parents ERG, Queer@Etsy, and Women and NonBinary People in Tech. If you’re not familiar with ERGs, their mission and goals are to create a positive and inclusive workplace culture where employees from underrepresented backgrounds, lifestyles, and abilities have access to programs that foster a sense of community, contribute to professional development, and amplify diverse voices within our organization. Each of these ERGs has an executive sponsor. This ensures that there is a communication channel with upper management. It also highlights the value that we place upon the support that these groups provide.    

We are also focused on retaining our engineers. One of the things that we do to help in this area is to monitor for discrepancies that might indicate bias. During our compensation, assessment, and promotion cycles, we evaluate for inconsistencies. We perform this analysis both internally and through the use of third parties.  

Etsy Engineering has been a leader and innovator in the broader tech industry with regard to technology and process. We also want to be a leader in the industry with regard to diversity and inclusion. It is not only the right thing to do, but also the right thing to do for our business. If this sounds exciting to you, we’d love to talk; just click here to learn more.

 

Endnotes:

1 Wade, J., & Heydari, B. (2014). Complexity: Definition and reduction techniques. In Proceedings of the Poster Workshop at the 2014 Complex Systems Design & Management International Conference.
2 Sargut, G., & McGrath, R. G. (2011). Learning to live with complexity. Harvard Business Review, 89(9), 68–76.
3 Apfelbaum, E. P., Phillips, K. W., & Richeson, J. A. (2014). Rethinking the baseline in diversity research: Should we be explaining the effects of homogeneity? Perspectives on Psychological Science, 9(3), 235–244.
4 Hewlett, S. A., Marshall, M., & Sherbin, L. (2013). How diversity can drive innovation. Harvard Business Review.
5 Etsy Impact Update (August 2018). https://extfiles.etsy.com/Impact/2017EtsyImpactUpdate.pdf


boundary-layer : Declarative Airflow Workflows

Posted on November 14, 2018

When Etsy decided last year to migrate our operations to Google Cloud Platform (GCP), one of our primary motivations was to enable our machine learning teams with scalable resources and the latest big-data and ML technologies. Early in the cloud migration process, we convened a cross-functional team between the Data Engineering and Machine Learning Infrastructure groups in order to design and build a new data platform focused on this goal.

One of the first choices our team faced was how to coordinate and schedule jobs across a menagerie of new technologies. Apache Airflow (incubating) was the obvious choice due to its existing integrations with GCP, its customizability, and its strong open-source community; however, we faced a number of open questions that had to be addressed in order to give us confidence in Airflow as a long-term solution.

First, Etsy had well over 100 existing Hadoop workflows, all written for the Apache Oozie scheduler. How would we migrate these to Airflow? Furthermore, how would we maintain equivalent copies of Oozie and Airflow workflows in parallel during the development and validation phases of the migration, without requiring our data scientists to pause their development work?

Second, writing workflows in Airflow (expressed in python as directed acyclic graphs, or DAGs) is non-trivial, requiring new and specialized knowledge. How would we train our dozens of internal data platform users to write Airflow DAGs? How would we provide automated testing capabilities to ensure that DAGs are valid before pushing them to our Airflow instances? How would we ensure that common best-practices are used by all team members? And how would we maintain and update those DAGs as new practices are adopted and new features made available?

Today we are pleased to introduce boundary-layer, the tool that we conceived and built to address these challenges, and that we have released as open source to share with the Airflow community.

Introduction: Declarative Workflows

Boundary-layer is a tool that enables data scientists and engineers to write Airflow workflows in a declarative fashion, as YAML files rather than as python. Boundary-layer validates workflows by checking that all of the operators are properly parameterized, all of the parameters have the proper names and types, there are no cyclic dependencies, etc. It then translates the workflows into DAGs in python, for native consumption by Airflow.

Here is an example of a very simple boundary-layer workflow:

name: my-dag-1

default_task_args:
  start_date: '2018-10-01'

operators:
- name: print-hello
  type: bash
  properties:
    bash_command: "echo hello"
- name: print-world
  type: bash
  upstream_dependencies:
  - print-hello
  properties:
    bash_command: "echo world"

Boundary-layer translates this into python as a DAG with 2 nodes, each consisting of a BashOperator configured with the provided properties, as well as some auto-inserted parameters:

# Auto-generated by boundary-layer

import os
from airflow import DAG

import datetime

from airflow.operators.bash_operator import BashOperator

DEFAULT_TASK_ARGS = {
        'start_date': '2018-10-01',
    }

dag = DAG(
        dag_id = 'my_dag_1',
        default_args = DEFAULT_TASK_ARGS,
    )

print_hello = BashOperator(
        dag = (dag),
        bash_command = 'echo hello',
        start_date = (datetime.datetime(2018, 10, 1, 0, 0)),
        task_id = 'print_hello',
    )


print_world = BashOperator(
        dag = (dag),
        bash_command = 'echo world',
        start_date = (datetime.datetime(2018, 10, 1, 0, 0)),
        task_id = 'print_world',
    )

print_world.set_upstream(print_hello)

Note that boundary-layer inserted all of the boilerplate of python class imports and basic DAG and operator configuration. Additionally, it validated parameter names and types according to schemas, and applied type conversions when applicable (in this case, it converted date strings to datetime objects).

Generators

Moving from python-based to configuration-based workflows naturally imposes a functionality penalty. One particularly valuable feature of python-based DAGs is the ability to construct them dynamically: for example, nodes can be added and customized by iterating over a list of values. We make extensive use of this functionality ourselves, so it was important to build a mechanism into boundary-layer to enable it.

Boundary-layer generators are the mechanism we designed for dynamic workflow construction. Generators are complete, distinct sub-workflows that take a single, flexibly-typed parameter as input. Each generator must prescribe a mechanism for generating a list of values: for example, lists of items can be retrieved from an API via an HTTP GET request. The python code written by boundary-layer will iterate over the list of generator parameter values and create one instance of the generator sub-workflow for each value. Below is an example of a workflow that incorporates a generator:

name: my-dag-2

default_task_args:
  start_date: '2018-10-01'

generators:
- name: retrieve-and-copy-items
  type: requests_json_generator
  target: sense-and-run
  properties:
    url: http://my-url.com/my/file/list.json
    list_json_key: items

operators:
- name: print-message
  type: bash
  upstream_dependencies:
  - retrieve-and-copy-items
  properties:
    bash_command: echo "all done"
---
name: sense-and-run

operators:
- name: sensor
  type: gcs_object_sensor
  properties:
    bucket: <<item['bucket']>>
    object: <<item['name']>>
- name: my-job
  type: dataproc_hadoop
  properties:
    cluster_name: my-cluster
    region: us-central1
    main_class: com.etsy.jobs.MyJob
    arguments:
    - <<item['name']>>

This workflow retrieves the content of the specified JSON file, extracts the items field from it, and then iterates over the objects in that list, creating one instance of all of the operators in the sense-and-run sub-graph per object.

Note the inclusion of several strings of the form  << ... >>.  These are boundary-layer verbatim strings, which allow us to insert inline snippets of python into the rendered DAG. The item value is the sub-workflow’s parameter, which is automatically supplied by boundary-layer to each instance of the sub-workflow.

Also note that generators can be used in dependency specifications, as indicated by the print-message operator’s upstream_dependencies block. Generators can even be set to depend on other generators, which boundary-layer will encode efficiently, without creating a combinatorially-exploding set of edges in the DAG.

Advanced features

Under the hood, boundary-layer represents its workflows using the powerful networkx library, and this enables a variety of features that require making computational modifications to the graph, adding usability enhancements that go well beyond the core functionality of Airflow itself.

A few of the simpler features that modify the graph include before and after sections of the workflow, which allow us to specify a set of operators that should always be run upstream or downstream of the primary list of operators. For example, one of our most common patterns in workflow construction is to put various sensors in the before block, so that it is not necessary to specify and maintain explicit upstream dependencies between the sensors and the primary operators. Boundary-layer automatically attaches these sensors and adds the necessary dependency rules to make sure that no primary operators execute until all of the sensors have completed.

Another feature of boundary-layer is the ability to prune nodes out of workflows, while maintaining all dependency relationships between the nodes that remain. This was especially useful during the migration of our Oozie workflows. It allowed us to isolate portions of those workflows for running in Airflow and gradually add more portions in stages, until the workflows were fully migrated, without ever having to create the portioned workflows as separate entities.

One of the most useful advanced features of boundary-layer is its treatment of managed resources. We make extensive use of ephemeral, workflow-scoped Dataproc clusters on the Etsy data platform. These clusters are created by Airflow, shared by various jobs that Airflow schedules, and then deleted by Airflow once those jobs are complete. Airflow itself provides no first-class support for managed resources, which can be tricky to configure properly: we must make sure that the resources are not created before they are needed, and that they are deleted as soon as they are not needed anymore, in order to avoid accruing costs for idle clusters. Boundary-layer handles this automatically, computing the appropriate places in the DAG into which to splice the resource-create and resource-destroy operations. This makes it simple to add new jobs or remove old ones, without having to worry about keeping the cluster-create and cluster-destroy steps always installed in the proper locations in the workflow.

Below is an example of a boundary-layer workflow that uses Dataproc resources:

name: my-dag-3

default_task_args:
  start_date: '2018-10-01'
  project_id: my-gcp-project

resources:
- name: dataproc-cluster
  type: dataproc_cluster
  properties:
    cluster_name: my-cluster
    region: us-east1
    num_workers: 128

before:
- name: sensor
  type: gcs_object_sensor
  properties:
    bucket: my-bucket
    object: my-object

operators:
- name: my-job-1
  type: dataproc_hadoop
  requires_resources:
  - dataproc-cluster
  properties:
    main_class: com.etsy.foo.FooJob
- name: my-job-2
  type: dataproc_hadoop
  requires_resources:
  - dataproc-cluster
  upstream_dependencies:
  - my-job-1
  properties:
    main_class: com.etsy.bar.BarJob
- name: copy-data
  type: gcs_to_gcs
  upstream_dependencies:
  - my-job-2
  properties:
    source_bucket: my-bucket
    source_object: my-object
    dest_bucket: your-bucket

In this DAG, the gcs_object_sensor runs first, then the cluster is created, then the two hadoop jobs run in sequence, and then the job’s output is copied while the cluster is simultaneously deleted.

Of course, this is just a simple example; we have some complex workflows that manage multiple ephemeral clusters, with rich dependency relationships, all of which are automatically configured by boundary-layer. For example, see the figure below: this is a real workflow that runs some hadoop jobs on one cluster while running some ML training jobs in parallel on an external service, and then finally runs more hadoop jobs on a second cluster. The complexity of the dependencies between the training jobs and downstream jobs required boundary-layer to insert several flow-control operators in order to ensure that the downstream jobs start only once all of the upstream dependencies are met.

Conversion from Oozie

One of our primary initial concerns was the need to be able to migrate our Oozie workflows to Airflow. This had to be an automated process, because we knew we would have to repeatedly convert workflows in order to keep them in-sync between our on-premise cluster and our GCP resources while we developed and built confidence in the new platform. The boundary-layer workflow format is not difficult to reconcile with Oozie’s native configuration formats, so boundary-layer is distributed with a parser that does this conversion automatically. We built tooling to incorporate the converter into our CI/CD processes, and for the duration of our cloud validation and migration period, we maintained perfect fidelity between on-premise Oozie and cloud-based Airflow DAGs.

Extensibility

A final requirement that we targeted in the development of boundary-layer is that it must be easy to add new types of operators, generators, or resources. It must not be difficult to modify or add to the operator schemas or the configuration settings for the resource and generator abstractions. After all, Airflow’s huge open-source community (including several Etsy engineers!) ensures that its list of supported operators is growing practically every day. In addition, we have our own proprietary set of operators for Etsy-specific purposes, and we must keep the configurations for these out of the public boundary-layer distribution. We satisfied these requirements via two design choices.

First, every operator, generator, or resource is represented by a single configuration file, and these files get packaged up with boundary-layer. Adding a new operator/generator/resource is accomplished simply by adding a new configuration file. Here is an example configuration, in this case for the Airflow BashOperator:

name: bash
operator_class: BashOperator
operator_class_module: airflow.operators.bash_operator
schema_extends: base

parameters_jsonschema:
  properties:
    bash_command:
      type: string
    
    xcom_push:
      type: boolean
    
    env:
      type: object
      additionalProperties:
        type: string
    
    output_encoding:
      type: string
  
  required:
  - bash_command
  
  additionalProperties: false

We use standard JSON Schemas to specify the parameters to the operator, and we use a basic single-inheritance model to centralize the specification of common parameters in the BaseOperator, as is done in the Airflow code itself.

Second, we implemented a plugin mechanism based on python’s setuptools entry points. All of our internal configurations are integrated into boundary-layer via plugins. We package a single default plugin with boundary-layer that contains configurations for common open-source Airflow operators. Other plugins can be added by packaging them into separate python packages, as we have done internally with our Etsy-customized plugin. The plugin mechanism has grown to enable quite extensive workflow customizations, which we use at Etsy to enable the full suite of proprietary modifications used on our platform.
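As a rough sketch of the setuptools entry-point approach, an internal plugin package might register itself along these lines; the entry-point group name and plugin class are assumptions, not boundary-layer’s documented interface.

# setup.py of a hypothetical internal plugin package (names are illustrative).
from setuptools import setup, find_packages

setup(
    name="acme-boundary-layer-plugin",
    version="0.1.0",
    packages=find_packages(),
    entry_points={
        "boundary_layer_plugins": [              # hypothetical entry-point group
            "acme = acme_plugin:AcmePlugin",     # discovered by the host tool at load time
        ],
    },
)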

Conclusion

The boundary-layer project has been a big success for us. All of the nearly-100 workflows that we deploy to our production Airflow instances are written as boundary-layer configurations, and our deployment tools no longer even support python-based DAGs. Boundary-layer’s ability to validate workflow configurations and abstract away implementation details has enabled us to provide a self-service Airflow solution to our data scientists and engineers, without requiring much specialized knowledge of Airflow itself. Over 30 people have contributed to our internal Airflow workflow repository, with minimal process overhead (Jenkins is the only “person” who must approve pull requests), and without having deployed a single invalid DAG.

We are excited to release boundary-layer to the public, in hopes that other teams find it similarly useful. We are committed to supporting it and continuing to add new functionality, so drop us a github issue if you have any requests. And of course, we welcome community contributions as well!


Double-bucketing in A/B Testing

Posted on November 7, 2018

Previously, we’ve posted about the importance we put in Etsy’s experimentation systems for our decision-making process. In a continuation of that theme, this post will dive deep into an interesting edge case we discovered.

We ran an A/B test which required a 5% control variant and a 95% treatment variant, rather than the typical split of 50% for each. Based on the nature of this particular A/B test, we expected a positive change in conversion rate, which is the percent of users that make a purchase.

At the conclusion of the A/B test, we had some unexpected results. Our A/B testing tool, Catapult, showed the treatment variant “losing” to the control variant: a negative change in conversion rate where we expected a positive one.

Due to these unexpected negative results, our Data Analyst team investigated why this was happening. This quote summarizes their findings:

The control variant “benefited” from double-bucketing because given its small size (5% of traffic), receiving an infusion of highly engaged browsers from the treatment provided an outsized lift on its aggregate performance.

With the double-bucketed browsers excluded, the true change in conversion rate is positive, which is the result we expected from the A/B test. Just 0.02% of the total browsers in the A/B test were double-bucketed, yet this small percentage had a large impact on the results. This post covers the details of why that occurred.

Definition of Double-bucketing

So what exactly is double-bucketing?

In an A/B test, a user is shown either the control or the treatment experience. The process of determining which variant a user falls into is called ‘bucketing’. Normally, a user experiences only the control or only the treatment; however, in this A/B test, a tiny percentage of users experienced both variants. We call this error in bucketing ‘double-bucketing’.

Typical user 50/50 bucketing for an A/B test puts ½ of the users into the control variant and ½ into the treatment variant. Those users stay in their bucketed variant. We calculate metrics and run statistical tests by summing all the data for the users in each variant.

However, the double-bucketing error we discovered would place the last two users in both the control and treatment variants, as shown below. Those users’ data is then counted in both variants for statistics on every metric in the experiment.

How browsers are bucketed

Before discussing the cases of double-bucketing that we found, it helps to have a high-level understanding of how A/B test bucketing works at Etsy.

For etsy.com web requests, we use a unique identifier from the user’s browser cookie, which we refer to as the “browser id”. Using the string value from the cookie, our clickstream data logic, named EventPipe, sets the browser id property on each event.

Bucketing is determined by a hash. First we concatenate the name of the A/B test and the browser id.  The name of the A/B test is referred to as the “configuration flag”. That string is hashed using SHA-256 and then converted to an integer between 0 and 99. For a 50% A/B test, if the value is < 50, the browser is bucketed into the treatment variant. Otherwise, the browser is in the control variant.  Because the hashing function is deterministic, the user should be bucketed into the same variant of an experiment as long as the browser cookie remains the same.
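As a rough illustration (the exact concatenation format and the way the hash is reduced to an integer are assumptions, not Etsy’s production code), the scheme looks something like this:

import hashlib

def bucket(config_flag: str, browser_id: str, treatment_pct: int = 50) -> str:
    # Hash the experiment's configuration flag concatenated with the browser id,
    # reduce the digest to an integer in [0, 99], and compare it against the
    # treatment percentage. SHA-256 is deterministic, so the same
    # (flag, browser id) pair always lands in the same variant.
    digest = hashlib.sha256((config_flag + browser_id).encode("utf-8")).hexdigest()
    value = int(digest, 16) % 100
    return "treatment" if value < treatment_pct else "control"

assert bucket("search_ranking_v2", "abc123") == bucket("search_ranking_v2", "abc123")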

EventPipe adds the configuration flag and bucketed variant information to the “ab” property on events.

For an A/B test’s statistics in Catapult, we filter by the configuration flag and then group by the variant.

This bucketing logic is consistent and has worked well for our A/B testing for years.  Although occasionally some experiments wound up with small numbers of double-bucketed users, we didn’t detect a significant impact until this particular A/B test with a 5% control.

Some Example Numbers (fuzzy math)

We’ll use some example numbers with some fuzzy math to understand how the conversion rate was affected so much by only 0.02% double-bucketed browsers.

For most A/B tests, we do 50/50 bucketing between the control variant and treatment variants. For this A/B test, we did a 5% control which puts 95% in the treatment.

If we start with 1M browsers, our 50% A/B test has 500K browsers in both control and treatment variants. Our 5% control A/B test has 50K browsers in the control variant and 950K in the treatment variant.

Let’s assume a 10% conversion rate for easy math. For the 50% A/B test, we have 50K converted browsers in both the control and treatment variant. Our 5% control A/B test has 5K converted browsers in the control variant and 95K in the treatment variant.

For the next step, let’s assume 1% of the converting browsers are double-bucketed. When we add the double-bucketed browsers from the opposite variant to both the numerator and denominator, we get a new conversion rate. For our 50% A/B test, that is 50,500 converted browsers in both the control and treatment variants. The new conversion rate is slightly off from the expected conversion rate but only by 0.1%.

For our 5% control A/B test, the treatment variant’s number of converted browsers only increased by 50 browsers from 95,000 to 95,050. The treatment variant’s new conversion rate still rounds to the expected 10%.

But for our 5% control A/B test, the control variant’s number of converted browsers jumps from 5000 to 5950 browsers. This causes a huge change in the control variant’s conversion rate – from 10% to 12% – while the treatment variant’s conversion rate was unchanged.
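Here is the same fuzzy math spelled out, treating each double-bucketed converting browser as being added to both the numerator and the denominator of the opposite variant:

def conversion_rate(converted, total):
    return converted / total

# 50/50 split: 500K browsers and 50K conversions per variant; 1% of the opposite
# variant's converting browsers (500) leak in through double-bucketing.
fifty_fifty = conversion_rate(50_000 + 500, 500_000 + 500)     # ~10.09%

# 5/95 split: the small control absorbs 1% of the treatment's 95K conversions (950)...
control_5pct = conversion_rate(5_000 + 950, 50_000 + 950)      # ~11.7%, rounds to 12%

# ...while the large treatment only absorbs 1% of the control's 5K conversions (50).
treatment_95pct = conversion_rate(95_000 + 50, 950_000 + 50)   # ~10.0%

print(f"{fifty_fifty:.2%}  {control_5pct:.2%}  {treatment_95pct:.2%}")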

Cases of Double-bucketing

Once we understood that double-bucketing was causing these unexpected results, we started digging into what cases led to double-bucketing of individual browsers. We found two main cases. Since conversion rates were being affected, unsurprisingly both cases involved checkout.

Checkout from new device

When browsing etsy.com while signed out, you can add listings to your cart.

Once you click the “Proceed to checkout” button, you are prompted to sign in.

After you sign in, if we have never seen your browser before, then we email you a security alert that you’ve been signed in from a new device. This is a wise security practice and pretty standard across the internet.

Many years ago, we were doing A/B testing on emails, which were all sent from offline jobs. Gearman (based on http://gearman.org) is our framework for running offline jobs. In Gearman we have no access to cookies and thus cannot get the browser id, but we do have the email address. So override logic was added deep in the email template code to bucket by email address rather than by browser id.

This worked perfectly for A/B testing in emails sent by Gearman, but the override applied to all emails, not just those sent by Gearman. The security email is sent from the sign in request, not from Gearman, yet the override still swapped the bucketing ID from the browser id to the user’s email address. As a result, the same browser could be bucketed into two different variants: once using the browser id and once using the email address.

Since we are no longer using that email system for A/B testing, we were able to simply remove the override call.

Pattern Checkout

Pattern is Etsy’s tool that sellers use to create personalized, separate websites for their businesses.  Pattern shops allow listings to be added to your cart while on the shop’s patternbyetsy.com domain.

The checkout occurs on etsy.com domain instead of the patternbyetsy.com domain. Since the value from the user’s browser cookie is what we bucket on and we cannot share cookies across domains, we have two different hashes used for bucketing.

In order to attribute conversions to Pattern, we have logic to override the browser id with the value from the patternbyetsy.com cookie during the checkout process on etsy.com. This override works for attributing conversions; however, during sign in some bucketing happens before the controllers execute the override logic.

For this case, we chose to remove bucketing data for Pattern visits as this override caused the bucketing logic to put the same user into both the control and treatment variants.

Conclusions

We built a dashboard of double-bucketed browsers per day that helped us track our fixes for double-bucketing.


Capacity planning for Etsy’s web and API clusters

Posted by on October 23, 2018 / 1 Comment

Capacity planning for the web and API clusters powering etsy.com has historically been a once-per-year event for us. The purpose is to gain an understanding of the capacity of the heterogeneous mix of hardware in our datacenters that makes up the clusters. This was usually done a couple of weeks before the time we call Slush. Slush (a word play on code freeze) is the period of approximately two months around the holidays when we deliberately slow down our rate of change on the site without actually stopping all development. We do this to recognize that the time leading up to the holidays is the busiest and most important time for a lot of our sellers, and any breakage results in a higher than usual impact on their revenue. This also means it’s the most important time for us to get capacity estimates right and to make sure we are in a good place to serve traffic throughout the busy holiday season.

During this exercise of forecasting and planning capacity, someone would collect all relevant core metrics (the most important one being requests per second on our Apache httpd infrastructure) from our Ganglia instances and export them to CSV. That timeseries data would then be imported into something that could produce a forecasting model. Excel, R, and python scripts are examples of tools that have been used for this exercise in previous years.

After a reorg of our systems engineering department in 2017, Slush that year was the first time the newly formed Compute team was tasked with capacity planning for our web and api tiers. And as we set out to do this, we had three goals:

First we started with a spreadsheet to track everything that we would be capacity planning for. Then we got an overview of what we had in terms of hardware serving those tiers. We got this from running a knife search like this:

knife search node "roles:WebBase" -a cpu.0.model_name -a cpu.cores -F json

and turning it into CSV via a ruby script, so we could have it in the spreadsheet as well. Now that we had the hardware distribution of our clusters, we gave each model a score so we could rank them and derive performance differences and loadbalancer weighting scores if needed. These performance scores are a rough heuristic to allow relative comparison of different CPUs. It takes into account core count, clock speed and generational improvements (assuming a 20% improvement between processors for the same clock speed and core count). It’s not an exact science at this point but a good enough measure to get us a useful idea of how to compare different hardware generations against each other. Then we assigned each server a performance score, based on that heuristic.
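The post above doesn’t pin down the exact scoring formula, so treat the following as a hypothetical sketch of that heuristic: cores times clock speed, with an assumed 20% uplift per CPU generation at the same core count and clock speed.

def perf_score(cores: int, clock_ghz: float, cpu_generation: int) -> float:
    # Rough heuristic only: newer generations get a compounding 20% bump
    # relative to an arbitrary baseline generation (generation 0).
    return cores * clock_ghz * (1.2 ** cpu_generation)

# Two 16-core, 2.3 GHz boxes that are two CPU generations apart:
old_box = perf_score(cores=16, clock_ghz=2.3, cpu_generation=0)   # 36.8
new_box = perf_score(cores=16, clock_ghz=2.3, cpu_generation=2)   # ~53.0, i.e. ~44% higher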

Next up was the so-called “squeeze testing”. The performance scores weren’t particularly helpful without knowing how much actual work a server with a given score can do on each cluster type. Request work on our frontend web servers is very different from the work on our component API tier, for example. So a performance score of 50 means something very different depending on which cluster we are talking about.

Squeeze testing is the capacity planning exercise of seeing how much performance you can squeeze out of a service, usually by gradually increasing the amount of traffic it receives and observing how much it can handle before exhausting its resources. On an established cluster this is often hard to do, since we can’t arbitrarily add more traffic to the site. That’s why we turned the opposite dial and removed resources (i.e. servers) from a cluster until the cluster (almost) stopped serving traffic appropriately.

So for our web and API clusters this meant removing nodes from the serving pools until the remaining nodes dropped to about 25% idle CPU, and noting the number of requests per second they were serving at that point. 20% idle CPU is the threshold on those tiers where we start to see performance degrade, because the rest of the CPU time is consumed by tasks like context switching and other non-application workloads. Stopping at 25% therefore gives us headroom for some variance in this type of testing, and also means we weren’t hurting actual site performance while doing the squeeze testing.

Now that we had the number of requests per second we could process per unit of performance score, the only thing missing was how much traffic we expected to see in the coming months. In the past this meant, as mentioned above, downloading timeseries data from Ganglia for requests per second for each cluster in every datacenter we had nodes in. That data then needed to be combined to get the total sum of requests we had been serving. We would then stick it into Excel, try a couple of Excel’s curve fitting algorithms, see which looked best, and take the forecasting results based on fit. We have also used R or python for that task in previous years. But it was always a very handcrafted and manual process.

So this time around we wrote a tool to do all this work for us called “Ausblick”. It’s based on Facebook’s prophet and automatically pulls in data from Ganglia based on host and metric regexes, combines the data for each datacenter and then runs forecasting on the timeseries and shows us a nice plot for it. We can also give it a base value and list of hosts with perfscores and ausblick will draw the current capacity of the cluster into the plot as a horizontal red line. Ausblick runs in our Kubernetes cluster and all interactions with the tool are happening through its REST API and an example request looks like this:

% cat conapi.json
{ "title": "conapi cluster",
  "hostregex": "^conapi*",
  "metricsregex": "^apache_requests_per_second",
  "datacenters": ["dc1","dc2"],
  "rpsscore": 9.5,
  "hosts": [
    ["conapi-server01.dc1.etsy.com", 46.4],
    ["conapi-server02.dc1.etsy.com", 46.4],
    ["conapi-server03.dc2.etsy.com", 46.4],
    ["conapi-server04.dc1.etsy.com", 27.6],
    ["conapi-server05.dc2.etsy.com", 46.4],
    ["conapi-server06.dc2.etsy.com", 27.6],
    ["conapi-server06.dc1.etsy.com", 46.4]
  ]
}
% curl -X POST http://ausblick.etsycorp.com/plan -d @conapi.json --header "Content-Type: application/json"
{"plot_url": "/static/conapi_cluster.png"}%

In addition to this API we wrote an integration for our Slack bot to easily generate a new forecast based on current data.

Ausblick Slack integration

And to finish this off with a bunch of graphs, here is what the current forecasting looks like for some of the internal API tiers backing etsy.com:

Ausblick forecast for conapi cluster

Ausblick forecast for compapi cluster

Ausblick has allowed us to democratize the process of capacity forecasting to a large extent and given us the ability to redo forecasting estimates at will. We used this process successfully for last year’s Slush and are in the process of adapting it to our cloud infrastructure after our recent migration of the main etsy.com components to GCP.


Etsy’s experiment with immutable documentation

Posted by on October 10, 2018 / 9 Comments

Introduction

Writing documentation is like trying to hit a moving target. The way a system works changes constantly, so as soon as you write a piece of documentation for it, it starts to get stale. And the systems that need docs the most are the ones being actively used and worked on, which are changing the fastest. So the most important docs go stale the fastest! 1

Etsy has been experimenting with a radical new approach: immutable documentation.

Woah, you just got finished talking about how documentation goes stale! So doesn’t that mean you have to update it all the time? How could you make documentation read-only?

How docs go stale

Let’s back up for a sec. When a bit of a documentation page becomes outdated or incorrect, it typically doesn’t invalidate the entire doc (unless the system itself is deprecated). It’s just a part of the doc with a code snippet, say, which is maybe using an outdated syntax for an API.

For example, we have a command-line tool called dbconnect that lets us query the dev and prod databases from our VMs. Our internal wiki has a doc page that discusses various tools that we use to query the dbs. The part that discusses ‘dbconnect’ goes something like:

 

Querying the database via dbconnect ...

((section 1))
dbconnect is a script to connect to our databases and query them. [...]

((section 2))
The syntax is:

% dbconnect <shard>

 

Section 1 gives context about dbconnect and why it exists, and section 2 gives tactical details of how to use it.

Now say a switch is added so that dbconnect --dev <shard> queries the dev db, and dbconnect --prod <shard> queries the prod db. Section 2 above now needs to be updated, because it’s using outdated syntax for the dbconnect command. But the contextual description in section 1 is still completely valid. So this doc page is now technically stale as a whole because of section 2, but the narrative in section 1 is still very helpful!

In other words, the parts of the doc that are most likely to go stale are the tactical, operational details of the system. How to use the system is constantly changing. But the narrative of why the system exists and the context around it is less likely to change quite so quickly.

 

How to use the system is constantly changing. But the narrative of why the system exists and the context around it is less likely to change quite so quickly.

 

Docs can be separated into how-docs and why-docs

Put another way: ‘code tells how, docs tell why’  2. Code is constantly changing, so the more code you put into your docs, the faster they’ll go stale. To codify this further, let’s use the term “how-doc” for operational details like code snippets, and “why-doc” for narrative, contextual descriptions  3. We can mitigate staleness by limiting the amount we mix the how-docs with the why-docs.

 

We can mitigate staleness by limiting the amount we mix the how-docs with the why-docs.

 

Documenting a command using Etsy’s FYI system

At Etsy we’ve developed a system for adding how-docs directly from Slack. It’s called “FYI”. The purpose of FYI is to make documenting tactical details — commands to run, syntax details, little helpful tidbits — as frictionless as possible.

 

FYI is a system for adding how-docs directly from Slack.

 

Here’s how we’d approach documenting dbconnect using FYIs 4:

Kaley was searching the wiki for how to connect to the dbs from her VM, to no avail. So she asks about it in a Slack channel:

hey @here anyone remember how to connect to the dbs in dev? I forget how. It’s something like dbconnect etsy_shard_001A but that’s not working

When she finds the answer, she adds an FYI using the ?fyi command (using our irccat integration in Slack 5):

?fyi connect to dbs with `dbconnect etsy_shard_000_A` (replace `000` with the shard number). `A` or `B` is the side

Jason sees Kaley add the FYI and mentions you can also use dbconnect to list the databases:

you can also do `dbconnect -l` to get a list of all DBs/shards/etc, and it works for dev-proxy on or off

Kaley then adds the :fyi: Slack reaction (reacji) to his comment to save it as an FYI:

you can also do `dbconnect -l` to get a list of all DBs/shards/etc, and it works for dev-proxy on or off

A few weeks later, Paul-Jean uses the FYI query command ?how to search for info on connecting to the databases, and finds Kaley’s FYI 6:

?how database connect

He then looks up FYIs mentioning dbconnect specifically to discover Jason’s follow-up comment:

?how dbconnect

But he notices that the dbconnect command has been changed since Jason’s FYI was added: there is now a switch to specify whether you want dev or prod databases. So he adds another FYI to supplement Jason’s:

?fyi to get a list of all DBs/shards/etc in dev, use `dbconnect --dev`, and to list prod DBs, use `dbconnect --prod` (default)

Now ?how dbconnect returns Paul-Jean’s FYI first, and Jason’s second:

?how dbconnect

FYIs trade completeness for freshness

Whenever you do a ?how query, matching FYIs are always returned most recent first. So you can always update how-docs for dbconnect by adding an FYI with the keyword “dbconnect” in it. This is crucial, because it means the freshest docs always rise to the top of search results.

FYIs are immutable, so Paul-Jean doesn’t have to worry about changing any FYIs created by Jason. He just adds them as he thinks of them, and the timestamps determine the priority of the results. How-docs change so quickly, it’s easier to just replace them than try to edit them. So they might as well be immutable.
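Our implementation is a PHP script writing to a SQLite database with full-text search (see footnotes 5 and 6), but the core idea fits in a few lines. Here is a hypothetical Python sketch of an append-only, full-text-searchable store that returns the freshest matches first, assuming a SQLite build with the FTS5 extension:

import sqlite3, time

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE fyis USING fts5(text, author, ts UNINDEXED)")

def add_fyi(text, author):
    # FYIs are append-only: corrections are new rows, never edits.
    db.execute("INSERT INTO fyis VALUES (?, ?, ?)", (text, author, time.time()))

def how(query, limit=5):
    # Freshest matches first, so newer FYIs shadow stale ones.
    return db.execute(
        "SELECT text, author FROM fyis WHERE fyis MATCH ? ORDER BY ts DESC LIMIT ?",
        (query, limit),
    ).fetchall()

add_fyi("connect to dbs with `dbconnect etsy_shard_000_A`", "kaley")
add_fyi("to list dev DBs use `dbconnect --dev`, prod is `dbconnect --prod`", "paul-jean")
print(how("dbconnect"))  # the newer FYI comes back first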

 

How-docs change so quickly, it’s easier to just replace them than try to edit them. So they might as well be immutable.

 

Since every FYI has an explicit timestamp, it’s easy to gauge how current they are relative to API versions, OS updates, and other internal milestones. How-docs are inherently stale, so they might as well have a timestamp showing exactly how stale they are.

 

How-docs are inherently stale, so they might as well have a timestamp showing exactly how stale they are.

 

The tradeoff is that FYIs are just short snippets. There’s no room in an FYI to add much context. In other words, FYIs mitigate staleness by trading completeness for freshness.

 

FYIs mitigate staleness by trading completeness for freshness

 

Since FYIs lack context, there’s still a need for why-docs (e.g. a wiki page) about connecting to dev/prod dbs, which mentions the dbconnect command along with other relevant resources. But if the how-docs are largely left in FYIs, those why-docs are less likely to go stale.

So FYIs allow us to decouple how-docs from why-docs. The tactical details are probably what you want in a hurry. The narrative around them is something you sit back and read on a wiki page.

 

FYIs allow us to decouple how-docs from why-docs

What FYIs are

To summarize, FYIs are:

What FYIs are NOT

Similarly, FYIs are NOT:

Conclusions

Etsy has recognized that technical documentation is a mixture of two distinct types: a narrative that explains why a system exists (“why-docs”), and operational details that describe how to use the system (“how-docs”). In trying to overcome the problem of staleness, the crucial observation is that how-docs typically change faster than why-docs do. Therefore the more how-docs are mixed in with why-docs in a doc page, the more likely the page is to go stale.

We’ve leveraged this observation by creating an entirely separate system to hold our how-docs. The FYI system simply allows us to save Slack messages to a persistent data store. When someone posts a useful bit of documentation in a Slack channel, we tag it with the :fyi: reacji to save it as a how-doc. We then search our how-docs directly from Slack using a bot command called ?how.

FYIs are immutable: to update them, we simply add another FYI that is more timely and correct. Since FYIs don’t need to contain narrative, they’re easy to add, and easy to update. The ?how command always returns more recent FYIs first, so fresher matches always have higher priority. In this way, the FYI system combats documentation staleness by trading completeness for freshness.

We believe the separation of operational details from contextual narrative is a useful idea that can be used for documenting all kinds of systems. We’d love to hear how you feel about it! And we’re excited to hear about what tooling you’ve built to make documentation better in your organization. Please get in touch and share what you’ve learned. Documentation is hard! Let’s make it better!

Acknowledgements

The FYI system was designed and implemented by Etsy’s FYI Working Group: Paul-Jean Letourneau, Brad Greenlee, Eleonora Zorzi, Rachel Hsiung, Keyur Govande, and Alec Malstrom. Special thanks to Mike Lang, Rafe Colburn, Sarah Marx, Doug Hudson, and Allison McKnight for their valuable feedback on this post.

References

  1. From “The Golden Rules of Code Documentation”: “It is almost impossible without an extreme amount of discipline, to keep external documentation in-sync with the actual code and/or API.”
  2. Derived from “code tells what, docs tell why” in this HackerNoon post.
  3. The similarity of the terms “how-doc” and “why-doc” to the term here-doc is intentional. For any given command, a here-doc is used to send data into the command in-place, how-docs are a way to document how to use the command, and why-docs are a description of why the command exists to begin with.
  4. You can replicate the FYI system with any method that allows you to save Slack messages to a predefined, searchable location. So for example, one could simply install the Reacji Channeler bot, which lets you assign a Slack reacji of your choosing to cause the message to be copied to a given channel. So you could assign an “fyi” reacji to a new channel called “#fyi”, for example. Then to search your FYIs, you would simply go to the #fyi channel and search the messages there using the Slack search box.
  5. When the :fyi: reacji is added to a Slack message (or the ?fyi irccat command is used), an outgoing webhook sends a POST request to irccat.etsy.com with the message details. This triggers a PHP script to save the message text to a SQLite database, and sends an acknowledgement back to the Slack incoming webhook endpoint. The acknowledgement says “OK! Added your FYI”, so the user knows their FYI has been successfully added to the database.
  6. Searching FYIs using the ?how command uses the same architecture as for adding an FYI, except the PHP script queries the SQLite table, which supports full-text search via the FTS plugin.


How Etsy Handles Peeking in A/B Testing

Posted by and on October 3, 2018 / 1 Comment

Etsy relies heavily on experimentation to improve our decision-making process. We leverage our internal A/B testing tool when we launch new features, polish the look and feel of our site, or even make changes to our search and recommendation algorithms. For years, we at Etsy have prided ourselves on our culture of continuous experimentation. However, as our experimentation platform scales and the velocity of experimentation increases rapidly across the company, we also face a number of new challenges. In this post, we investigate one of these challenges: how to peek at experimental results early in order to increase the velocity of our decision-making without sacrificing the integrity of our results.

The Peeking Problem

In A/B testing, we’re looking to determine if a metric we care about (i.e. percentage of visitors who make a purchase) is different between the control and treatment groups. But when we detect a change in the metric, how do we know if it is real or due to random chance? We can look at the p-value of our statistical test, which indicates the probability we would see the detected difference between groups assuming there is no true difference. When the p-value falls below the significance level threshold we say that the result is statistically significant and we reject the hypothesis that the control and treatment are the same.

So we can just stop the experiment when the hypothesis test for the metric we care about has a p-value of less than 0.05, right? Wrong. In order to draw the strongest conclusions from the p-value in the context of an A/B test, we have to fix the sample size of an experiment in advance and make a decision on the p-value only once. Peeking at the data regularly and stopping an experiment as soon as the p-value dips below 0.05 increases the rate of Type I errors, or false positives, because the false positive risk of each test compounds, increasing the overall probability that you’ll see a false result.

Let’s look at an example to gain a more concrete view of the problem. Suppose we run an experiment where there is no true change between the control and experimental variant, and both have a baseline target metric of 50%. If we use a significance level of 0.1 and there is no peeking (in other words, the sample size needed before a decision is made is determined in advance), then the rate of false positives is 10%. However, if we do peek and check the significance level at every observation, then after 500 observations there is over a 50% chance of incorrectly stating that the treatment is different from the control (Figure 1).

Figure 1: Chances for accepting that A and B are different, with A and B both converting at 50%.
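A minimal simulation, under the same assumptions as Figure 1 (no true difference, both variants converting at 50%, and a naive two-proportion z-test checked after every observation at a 0.1 significance level), shows the inflation directly. This is an illustrative sketch, not the code behind the figure.

import numpy as np

rng = np.random.default_rng(0)

def peeking_false_positive(n_obs=500, p=0.5, z_crit=1.6449):  # z for two-sided alpha = 0.10
    # Simulate one A/A test and report whether a naive z-test on the difference
    # in proportions ever looks "significant" while peeking after each observation.
    a = rng.random(n_obs) < p
    b = rng.random(n_obs) < p
    ca, cb = np.cumsum(a), np.cumsum(b)
    n = np.arange(1, n_obs + 1)
    pooled = (ca + cb) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * (2 / n))
    with np.errstate(divide="ignore", invalid="ignore"):
        z = np.abs(ca / n - cb / n) / se
    return bool(np.max(np.where(se > 0, z, 0.0)) > z_crit)

trials = 2000
rate = sum(peeking_false_positive() for _ in range(trials)) / trials
print(rate)  # typically well above the nominal 0.10, in line with Figure 1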

At this point, you might already have figured that the simplest way to solve the problem would be to fix a sample size in advance and run the experiment until the end before checking the significance level. However, this requires strictly enforced separation between the design and analysis of experiments, which can have large repercussions throughout the experimental process. In the early stages of an experiment, we may miss a bug in the setup or with the feature being tested that will invalidate our results later. If we don’t catch these early, it slows down our experimental process unnecessarily, leaving less time for iterations and real site changes. Another setup issue is that it can be difficult to predict the effect size product teams would like to obtain prior to the experiment, which can make it hard to optimize the sample size in advance. Even assuming we set up our experiment perfectly, there are downstream implications. If an experiment is impacting a metric in a negative way, we want to be aware as soon as possible so we don’t negatively affect our users’ experience. These considerations become even more pronounced when we’re running an experiment on a small population, or in a less trafficked part of the site, where it can take months to reach the target sample size. Across teams, we want to be able to iterate quickly without sacrificing the integrity of our results.

With this in mind, we need to come up with statistical methodology that will give reliable inference while still providing product teams the ability to continuously monitor experiments, especially for our long-running experiments. At Etsy, we tackle this challenge from two sides, user interface and statistical procedures. We made a few user interface changes to our A/B testing tool to prevent our stakeholders from drawing false conclusions, and we implemented a flexible p-value stopping-point in our platform, which takes inspiration from the sequential testing concept in statistics.

It is worth noting that the peeking problem has been studied by many, including industry veterans1, 2, developers of large-scale commercial A/B testing platforms3, 4 and academic researchers5. Moreover, it is hardly a challenge exclusive to A/B testing on the web. The peeking problem has troubled the medical field for a long time; for example, medical scientists could peek at the results and stop a clinical trial early because of initial positive results, leading to flawed interpretations of the data6, 7.

Our Approach

In this section, we dive into the approach that we have designed and adapted to address the peeking problem: transitioning from traditional, fixed-horizon testing to sequential testing, and preventing peeking behaviors through user interface changes.

Sequential Testing with Difference in Converting Visits

Sequential testing, which has been widely used in clinical trials8, 9 and has gained recent popularity for web experimentation10, guarantees that if we end the test when the p-value is below a predefined threshold α, the false positive rate will be no more than α. It does so by computing the probabilities of false positives at each potential stopping point using dynamic programming, assuming that our test statistic is normally distributed. Since we can compute these probabilities, we can adjust the test’s p-value threshold, which in turn changes the false positive chance, at every step so that the total false positive rate stays below the threshold we desire. Therefore, sequential testing enables concluding experiments as soon as the data justifies it, while also keeping our false positive rate in check.

We investigated a few methods, including O’Brien-Fleming, Pocock, and sequential testing using the difference in successful observations. We ultimately settled on the last approach. Using the difference in successful observations, we look at the raw difference in converting visits and stop an experiment when this difference becomes large enough. The difference threshold is only valid until we reach a total number of converted visits. This method is good at detecting small changes and does so quickly, which makes it the most suitable for our needs. Nevertheless, we did consider the cons this method presents as well. Traditional power and significance calculations use the proportion of successes, whereas looking at the difference in converted visits does not take into account the total population size. Because of this, with high baseline target metrics we are more likely to reach the total number of converted visits before we see a large enough difference in converted visits, which means we are more likely to miss a true change in these cases. Furthermore, it requires extra setup when an experiment is not evenly split across variants. We chose to use this method with a few adjustments for these shortcomings so we could increase our speed of detecting real changes between experimental groups.

Our implementation of this method is influenced by the approach Evan Miller described here. This method sets a threshold for the difference between the control and treatment converted visits based on the minimal detectable effect and the target false positive and false negative rates. If the experiment reaches or passes the threshold, we allow the experiment to end early. If this difference is not reached, we assess our results using the standard approach of a power analysis. The combination of these methods creates a continuous p-value threshold curve; we can safely stop an experiment when the p-value is under the curve. This threshold is lower near the beginning of an experiment and converges to our significance level as the experiment reaches our targeted power. This allows us to detect changes more quickly for experiments with low baselines while not missing smaller changes for experiments with high baseline target metrics.
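For a sense of the mechanics, here is a sketch of the stopping rule from Evan Miller’s “Simple Sequential A/B Testing” for an evenly split test: fix the total number of conversions N up front, stop early when the raw difference in converting visits reaches 2√N, and otherwise stop with no winner once N conversions have been observed. Our production rule layers the adjustments described above (the combination with a power analysis and the resulting threshold curve) on top of this basic idea, so treat the code as an illustration of the reference, not of Catapult itself.

import math

def sequential_decision(treatment_conversions: int, control_conversions: int, n_total: int) -> str:
    # Evan Miller's rule for a 50/50 split: n_total is fixed before the
    # experiment starts, and the early-stop threshold is 2 * sqrt(n_total).
    t, c = treatment_conversions, control_conversions
    if t - c >= 2 * math.sqrt(n_total):
        return "stop early: treatment wins"
    if t + c >= n_total:
        return "stop: no detectable winner"
    return "keep collecting data"

print(sequential_decision(640, 560, n_total=10_000))  # 80 < 200, keep going
print(sequential_decision(710, 500, n_total=10_000))  # 210 >= 200, stop early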

Figure 2: Example of a p-value threshold curve.

To validate this approach, we tested it on results from experimental simulations with various baselines and effect sizes using mock experimental conditions. Before implementing, we wanted to understand:

  1. What effect will this have on false positive rates?
  2. What effect does early stopping have on reported effect size and confidence intervals?
  3. How much faster will we get a signal for experiments with true changes between groups?

We found that when using a p-value curve tuned for a 5% false positive rate, our early stopping threshold does not materially increase the false positive rate and we can be confident of a directional change.  

One of the drawbacks of stopping experiments early, however, is that with an effect size under ~5%, we tend to overestimate the impact and widen the confidence interval. To accurately attribute increases in metrics to experimental wins, we developed a haircut formula to apply to the effect size of experiments that we decide to end early. Furthermore, we offset some of this risk by setting a standard of running experiments for at least 7 days to account for different weekend and weekday trends.

Figure 3: Reported Vs. True Effect Size

We tested this method with a series of simulations and saw that for experiments which would take 3 weeks to run assuming a standard power analysis, we could save at least a week in most cases where there was a real change between variants. This helped us feel confident that, even with a slight overestimation of effect size, the approach was worth the time savings for teams with low baseline target metrics who typically struggle with long experimental run times.

Figure 4: Day Savings From Sequential Testing

UI Improvements

In our experimental testing tool, we wanted stakeholders to have access to the metrics and calculations we measure throughout the duration of the experiment. In addition to the p-value, we care about power and the confidence interval. First, power. Teams at Etsy often have to coordinate experiments on the same page, so it is important for teams to have an idea of how long an experiment will have to run assuming no early stopping. We do this by running an experiment until we reach a set power.

Second, the confidence interval (CI) is the range of values within which we are confident the true value of a particular metric falls. In the context of A/B testing, for example, if we ran the experiment millions of times, 90% of the time the true value of the effect size would fall within the 90% CI. There are three things that we care most about in relation to the confidence interval of an effect in an experiment (a small calculation sketch follows the list):

  1. Whether the CI includes zero, because this maps exactly to the decision we would make with the p-value; if the 90% CI includes zero, then the p-value is greater than 0.1. Conversely, if it doesn’t include zero, then the p-value is less than 0.1;
  2. The smaller the CI, the better estimate of the parameter we have;
  3. The farther away from zero the CI is, the more confident we can be that there is a true difference.
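As a small illustration of these three points, here is how a 90% CI on the difference in conversion rates could be computed with a normal approximation and mapped to the zero-check in point 1. This is a simplified stand-in for whatever Catapult actually computes; the inputs are made up.

import math

def diff_ci(conv_c, n_c, conv_t, n_t, z=1.6449):  # z for a 90% two-sided CI
    # Normal-approximation CI for the difference in conversion rates
    # (treatment minus control).
    p_c, p_t = conv_c / n_c, conv_t / n_t
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    diff = p_t - p_c
    return diff - z * se, diff + z * se

lo, hi = diff_ci(conv_c=5_000, n_c=50_000, conv_t=96_500, n_t=950_000)
if lo > 0:
    print("significant increase")       # CI entirely above zero (p < 0.1)
elif hi < 0:
    print("significant decrease")       # CI entirely below zero (p < 0.1)
else:
    print("no detectable change")       # CI spans zero (p > 0.1)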

Previously in our A/B testing tool UI, we displayed statistical data as shown in the table below on the left. The “observed” column indicates results for the control and there is a “% Change” column for each treatment variant. When hovering over a number in the “% Change” column, a popover table appears, showing the observed and actual effect size, confidence level, p-value, and number of days we could expect to have enough data to power the experiment based on our expected effect size. 

Figure 5: User interface before changes.

However, always displaying numerical results in the “% Change” column could lead to stakeholders peeking at data and making an incorrect inference about the success of the experiment. Therefore, we added a row in the hover table to show the power of the test (assuming some fixed effect size), and made the following changes to our user interface:

  1. Show a visualization of the C.I. and color the bar red when the C.I. is entirely negative to indicate a significant decrease, green when the C.I. is entirely positive to indicate a significant increase, and grey when the C.I. spans 0.
  2. Display different messages in the “% Change” column and hover table to indicate the stage the experiment metric is currently in, depending on its power, p-value and calculated flexible p-value threshold. In the “% Change” column, possible messages include “Waiting on data”, “Not enough data”, “No change” and “+/- X %” (to show a significant increase or decrease). In the hover table, possible headers include “metric is not powered”, “there is no detectable change”, “we’re confident we detected a change”, and “directional change is correct but magnitude might be inflated” when early stopping is reached but the metric is not powered yet. A rough sketch of this decision logic follows the list.
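The sketch below is hypothetical: it shows one way the messages above could be chosen from a metric’s power, p-value, and flexible early-stop threshold, and it is not Catapult’s actual display code.

def display_state(powered: bool, p_value: float, flexible_threshold: float,
                  effect_pct: float, has_data: bool = True):
    # Returns (column message, hover header) following the rules sketched above.
    if not has_data:
        return "Waiting on data", "metric is not powered"
    if powered:
        if p_value < 0.1:
            return f"{effect_pct:+.1f} %", "we're confident we detected a change"
        return "No change", "there is no detectable change"
    # Not yet powered: the flexible threshold may still allow early stopping.
    if p_value <= flexible_threshold:
        return f"{effect_pct:+.1f} %", "directional change is correct but magnitude might be inflated"
    return "Not enough data", "metric is not powered"

print(display_state(powered=False, p_value=0.01, flexible_threshold=0.02, effect_pct=2.3))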

Figure 6: User interface after changes.

Even after making these UI changes, making a decision on when to stop an experiment and whether or not to launch it is not always simple. Generally some things we advise our stakeholders to consider are:

  1. Do we have statistically significant results that support our hypothesis?
  2. Do we have statistically significant results that are positive but aren’t what we anticipated?
  3. If we don’t have enough data yet, can we just keep it running or is it blocking other experiments?
  4. Is there anything broken in the product experience that we want to correct, even if the metrics don’t show anything negative?
  5. If we have enough information on the main metrics overall, do we have enough information to iterate? For example, if we want to look at impact on a particular segment, which could be 50% of the traffic, then we’ll need to run the experiment twice as long as we had to in order to look at the overall impact.

We hope that these UI changes will help our stakeholders make better informed decisions while still letting them uncover cases where they have changed something more dramatically than expected and thus can stop the experiment sooner.

Further Discussion

In this section, we discuss a few more issues we examined while designing Etsy’s solutions to peeking.

Trade-off Between Power and Significance

There is a trade-off between Type I (false positive) and Type II (false negative) errors: if we decrease the probability of one of the errors, the probability of the other will increase. For a more detailed explanation, please see this short post. This translates into a trade-off between p-value and power, because if we require stronger evidence to reject the null hypothesis (i.e. a smaller p-value threshold), then there is a smaller chance that we will be able to correctly reject a false null hypothesis, which means decreased power. The different messages we display on the user interface balance this issue to some degree. In the end, it is a choice that we have to make based on our priorities and focus in experimentation.

Weekend vs. Weekday Data Sample Size

At Etsy, the volume of traffic and intent of visitors varies from weekdays to weekends. This is not a concern for the sequential testing approach that we ultimately chose. However, it would be an issue for some other methods that require equal daily data sample size. During our research, we looked into ways to handle the inconsistency in our daily data sample size. We found that the GroupSeq package in R, which enables the construction of group sequential designs and has various alpha spending functions available to choose among, is a good way to account for this.

Other Types of Designs

The sequential sampling method that we have designed is a straightforward form of a stopping rule, modified to best suit our needs and circumstances. However, there are other types of sequential approaches that are more formally defined, such as the Sequential Probability Ratio Test (SPRT), which is utilized by Optimizely’s New Stats Engine4, and the Sequential Generalized Likelihood Ratio test, which has been used in clinical trials11. There has also been debate in both academia and industry about the effectiveness of Bayesian A/B testing in solving the peeking problem2, 5. It is indeed a very interesting problem!

Final Thoughts

Accurate interpretation of statistical data is crucial in making informed decisions about product development. When online experiments have to be run efficiently to save time and cost, we inevitably run into dilemmas unique to our context, and peeking is just one of them. In researching and designing solutions to this problem, we examined some more rigorous theoretical work; however, the characteristics and priorities of online experimentation make it difficult to apply directly. Our approach outlined in this post, even though simple, addresses the root cause of the peeking problem effectively. Looking forward, we think the balance between statistical rigor and practical constraints is what makes online experimentation intriguing and fun to work on, and we at Etsy are very excited about tackling more interesting problems awaiting us.

This work is a collaboration between Callie McRee and Kelly Shen from the Analytics and Analytics Engineering teams. We would like to thank Gerald van den Berg, Emily Robinson, Evan D’Agostini, Anastasia Erbe, Mossab Alsadig, Lushi Li, Allison McKnight, Alexandra Pappas, David Schott and Robert Xu for helpful discussions and feedback.

References

  1. How Not to Run an A/B Test by Evan Miller
  2.  Is Bayesian A/B Testing Immune to Peeking? Not Exactly by David Robinson
  3.  Peeking at A/B tests: why it matters, and what to do about it by Johari et al., KDD’17
  4.  The New Stats Engine by Pekelis, et al., Optimizely
  5.  Continuous monitoring of A/B tests without pain: optional stopping in Bayesian testing by Deng, Lu, et al., CEUR’17
  6.  Trial sans Error: How Pharma-Funded Research Cherry-Picks Positive Results by Ben Goldacre of Scientific American, February 13, 2013
  7.  False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant by Simmons, Simonsohn, et al. (2011), Psychological Science, 22
  8. Interim Analyses and Sequential Testing in Clinical Trials by Nicole Solomon, BIOS 790, Duke University
  9. A Pocock approach to sequential meta-analysis of clinical trials by Shuster, J. J., & Neu, J. (2013), Research Synthesis Methods, 4(3), 10.1002/jrsm.1088
  10.  Simple Sequential A/B Testing by Evan Miller
  11.  Sequential Generalized Likelihood Ratio Tests for Vaccine Safety Evaluation by Shih, M.-C., Lai, T. L., Heyse, J. F. and Chen, J. (2010), Statistics in Medicine, 29: 2698-2708
