Etsy’s Journey to TypeScript

Posted on November 8, 2021

Over the past few years, Etsy’s Web Platform team has spent a lot of time bringing our frontend code up to date. It was only a year and a half ago that we modernized our Javascript build system in order to enable advanced features, things like arrow functions and classes, that have been added to the language since 2015. And while this upgrade meant that we had futureproofed our codebase and could write more idiomatic and scalable Javascript, we knew that we still had room for improvement.

Etsy has been around for over sixteen years. Naturally, our codebase has become quite large; our monorepo has over seventeen thousand JavaScript files in it, spanning many iterations of the site. It can be hard for a developer working in our codebase to know what parts are still considered best practice, and which parts follow legacy patterns or are considered technical debt. The JavaScript language itself complicates this sort of problem even further — in spite of the new syntax features added to the language over the past few years, JavaScript is very flexible and comes with few enforceable limitations on how it is used. This makes it notoriously challenging to write JavaScript without first researching the implementation details of any dependencies you use. While documentation can help alleviate this problem somewhat, it can only do so much to prevent a JavaScript library from being used improperly, which can ultimately lead to unreliable code.

All of these problems (and many more!) were ones that we felt TypeScript might be able to solve for us. TypeScript bills itself as a “superset of Javascript.” In other words, TypeScript is everything in Javascript with the optional addition of types. Types, in programming, are basically ways to declare expectations about the data that moves through code: what kinds of input can be used by a function, what sorts of values a variable can hold. (If you’re not familiar with the concept of types, TypeScript’s handbook has a fantastic introduction.) TypeScript is designed to be easily adopted incrementally in existing Javascript projects, particularly in large codebases where shifting to a new language can be an impossibility. It is exceptionally good at inferring types from the code you’ve already written, and it has a type syntax nuanced enough to properly describe all of the quirks that are common in Javascript. Plus, it’s developed by Microsoft, it’s already in use at companies like Slack and AirBnB, and it is by far the most used and loved flavor of Javascript according to last year’s “State of JS” survey. If we were going to use types to bring some amount of order to our codebase, TypeScript seemed like a really solid bet.

This is why, hot on the heels of a migration to ES6, we started investigating a path to adopting TypeScript. This post is all about how we designed our approach, some of the fun technical challenges that resulted, and what it took to educate an Etsy-sized company in a new programming language.

Adopting TypeScript, at a high level

I don’t want to spend a lot of time selling TypeScript to you because there are plenty of other articles and talks that do a really good job of exactly that. Instead, I want to talk about the effort it took to roll out TypeScript support at Etsy, which involved technical implementations beyond turning Javascript into TypeScript. It also included a great deal of planning, educating, and coordinating. In hindsight, all of these ingredients seem obvious, but getting the details right surfaced a bunch of learning experiences worth sharing. To start, let’s talk about what we wanted our adoption to look like.

Strategies for adoption

TypeScript can be more or less “strict” about checking the types in your codebase. To quote the TypeScript handbook, a stricter TypeScript configuration “results in stronger guarantees of program correctness.” You can adopt TypeScript’s syntax and its strictness as incrementally as you’d like, by design. This feature makes it possible to add TypeScript to all sorts of codebases, but it also makes “migrating a file to TypeScript” a bit of a loosely-defined target. Many files need to be annotated with types in order for TypeScript to fully understand them. There are also plenty of Javascript files that can be turned into valid TypeScript just by changing their extension from .js to .ts. However, even if TypeScript understands a file just fine, that file might benefit from even more specific types which could improve its usefulness to other developers.

There are countless articles from companies of all sizes on their approach to migrating to TypeScript, and all of them provide compelling arguments for different migration strategies. For example, AirBnB automated as much of their migration as possible. Other companies enabled less-strict TypeScript across their projects, adding types to code over time.

Deciding the right approach for Etsy meant answering a few questions about our migration: how strict did we want our TypeScript to be, how much of the codebase did we want to migrate, and how could we make our types as useful as possible to the developers consuming them?

We decided that strictness was a priority; it is a lot of effort to adopt a new language, and if we’re using TypeScript, we may as well take full advantage of its type system (plus, TypeScript’s checker performs better with stricter types). We also knew that Etsy’s codebase is quite large; migrating every single file was probably not a good use of our time, but ensuring we had types for new and frequently updated parts of our site was important. And of course, we wanted our types to be as helpful and as easy to use as possible.

What we went with

Our adoption strategy looked like this:

  1. Make TypeScript as strict as reasonably possible, and migrate the codebase file-by-file.
  2. Add really good types and really good supporting documentation to all of the utilities, components, and tools that product developers use regularly.
  3. Spend time teaching engineers about TypeScript, and enable TypeScript syntax team by team.

Let’s look at each of these points a little more closely.

Gradually migrate to strict TypeScript

Strict TypeScript can prevent a lot of very common errors, so we figured being as strict as possible made the most sense. The downside of this decision is that most of our existing Javascript would need type annotations. It also required us to migrate our codebase file-by-file. If we had tried to convert everything all at once using strict TypeScript, we would have ended up with a lengthy backlog of issues to be addressed. As I mentioned before, our monorepo has over seventeen thousand Javascript files in it, many of which don’t change very often. We chose to focus our efforts on typing actively-developed areas of the site, using the .ts and .js file extensions to clearly delineate which files had reliable types and which ones didn’t.
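
To make the value of strictness concrete, here’s a small illustration (not from our codebase) of the kind of mistake that strict null checking refuses to compile:

// With "strictNullChecks" enabled, TypeScript won't let us use a value
// that might be null without handling the null case first.
const banner = document.querySelector(".shop-banner"); // Element | null

// Error: Object is possibly 'null'.
// banner.scrollIntoView();

if (banner) {
    banner.scrollIntoView(); // OK: the null case has been ruled out
}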

Migrating all at once could make improving existing types logistically difficult, especially in a monorepo. If you import a TypeScript file with an existing suppressed type error in it, should you fix the error? Does it mean that your file’s types need to be different to accommodate potential problems from this dependency? Who owns that dependency, and is it safe to edit? As our team has learned, every bit of ambiguity that can be removed enables an engineer to make an improvement on their own. With an incremental migration, any file ending in .ts or .tsx can be trusted to have reliable types.

Make sure utilities and tools have good TypeScript support

Before our engineers started writing TypeScript, we wanted all of our tooling to support TypeScript and all of our core libraries to have usable, well-defined types. Using an untyped dependency in a TypeScript file can make code hard to work with and can introduce type errors; while TypeScript will try its best to infer the types in a non-TypeScript file, it defaults to “any” if it can’t. Put another way, if an engineer is taking the time to write TypeScript, they should be able to trust that the language is catching their type errors as they write code. Plus, forcing engineers to write types for common utilities while they’re trying to learn a new language and keep up with their team’s roadmap is a good way to get people to resent TypeScript. This was not a trivial amount of work, but it paid off massively. I’ll get into the details about this in the “Technical Details” section below.

Educate and onboard engineers team by team

We spent quite a bit of time on education around TypeScript, and that is the single best decision we made during our migration. Etsy has several hundred engineers, and very few of them had TypeScript experience before this migration (myself included). We realized that, in order for our migration to be successful, people would have to learn how to use TypeScript first. Turning on the switch and telling everyone to have at it would probably leave people confused, overwhelm our team with questions, and hurt the velocity of our product engineers. By onboarding teams gradually, we could work to refine our tooling and educational materials. It also meant that no engineer could write TypeScript without their teammates being able to review their code. Gradual onboarding gave our engineers time to learn TypeScript and to factor it into their roadmaps.

Technical details (the fun stuff)

There were plenty of fun technical challenges during the migration. Surprisingly, the easiest part of adopting TypeScript was adding support for it to our build process. I won’t go into a ton of detail about this because build systems come in many different flavors, but in short, it mostly amounted to teaching our existing build pipeline to compile TypeScript files alongside our Javascript ones.

All of the above took a week or two, and most of that time was spent validating that TypeScript we sent to production didn’t behave strangely. The rest of our tooling around TypeScript took much more time and turned out to be a lot more interesting.

Improving type specificity with typescript-eslint

At Etsy, we make heavy use of custom ESLint linting rules. They catch all sorts of bad patterns for us, help us deprecate old code, and keep our pull request comments on-topic and free of nitpicks. If it’s important, we try to write a lint rule for it. One place that we found an opportunity for linting was in enforcing type specificity, which I generally use to mean “how accurately a type fits the thing it is describing”. 

For example, imagine a function that takes in the name of an HTML tag and returns an HTML element. The function could accept any old string as an argument, but if it uses that string to create an element, then it would be nice to make sure that the string was in fact the name of a real HTML element.

// This function type-checks, but I could pass in literally any string as an argument.
function makeElement(tagName: string): HTMLElement {
   return document.createElement(tagName);
}

// This throws a DOMException at runtime
makeElement("literally anything at all");

If we put in a little effort to make our types more specific, it’ll be a lot easier for other developers to use our function properly. 

// This function makes sure that I pass in a valid HTML tag name as an argument.
// It makes sure that ‘tagName’ is one of the keys in 
// HTMLElementTagNameMap, a built-in type where the keys are tag names 
// and the values are the types of elements.
function makeElement(tagName: keyof HTMLElementTagNameMap): HTMLElement {
   return document.createElement(tagName);
}

// This is now a type error.
makeElement("literally anything at all");

// But this isn't. Excellent!
makeElement("canvas");

Moving to TypeScript meant that we had a lot of new practices we needed to think about and lint for. The typescript-eslint project provided us with a handful of TypeScript-specific rules to take advantage of. For instance, the ban-types rule let us warn against using the generic Element type in favor of the more specific HTMLElement type.
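
As a rough sketch (not our exact ruleset), banning the generic Element type with that rule looks something like this in an ESLint config:

// .eslintrc.js (fragment)
module.exports = {
    rules: {
        "@typescript-eslint/ban-types": [
            "error",
            {
                types: {
                    Element: {
                        message: "Prefer the more specific HTMLElement type.",
                        fixWith: "HTMLElement",
                    },
                },
                extendDefaults: true,
            },
        ],
    },
};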

We also made the (somewhat controversial) decision not to allow non-null assertions and type assertions in our codebase. The former allows a developer to tell TypeScript that something isn’t null when TypeScript thinks it might be, and the latter allows a developer to treat something as whatever type they choose.

// This is a constant that might be ‘null’.
const maybeHello = Math.random() > 0.5 ? "hello" : null;

// The `!` below is a non-null assertion.
// This code type-checks, but fails at runtime whenever maybeHello is null.
const yellingHello = maybeHello!.toUpperCase()
 
// This is a type assertion.
const x = {} as { foo: number };
// This code type-checks, but fails at runtime because x.foo is undefined.
x.foo.toFixed();

Both of these syntax features allow a developer to override TypeScript’s understanding about the type of something. In many cases, they both imply a deeper problem with a type that probably needs to be fixed. By doing away with them, we force our types to be more specific about what they’re describing. For instance, you might be able to use “as” to turn an Element into an HTMLElement, but you likely meant to use an HTMLElement in the first place. TypeScript itself has no way of disabling these language features, but linting allows us to identify them and keep them from being deployed. 

Linting is really useful as a tool to deter folks from bad patterns, but that doesn’t mean that these patterns are universally bad — with every rule, there are exceptions. The nice thing about linting is that it provides a reasonable escape hatch. Should we really, really need to use “as”, we can always add a one-off linting exception.

// NOTE: I promise there is a very good reason for us to use `as` here.
// eslint-disable-next-line @typescript-eslint/consistent-type-assertions
const x = {} as { foo: number };

Adding types to our API

We wanted our developers to write effective TypeScript code, so we needed to make sure that we provided types for as much of the development environment as possible. At first glance, this meant adding types to our reusable design components, helper utilities, and other shared code. But ideally any data a developer might need to access should come with its own types. Almost all of the data on our site passes through the Etsy API, so if we could provide types there, we’d get coverage for much of our codebase very quickly. 

Etsy’s API is implemented in PHP, and we generate both PHP and Javascript configurations for each endpoint to help simplify the process of making a request. In Javascript, we use a light wrapper around the fetch API called EtsyFetch to help facilitate these requests. That all looks loosely like this:

// This function is generated automatically.
function getListingsForShop(shopId, optionalParams = {}) {
   return {
        url: `apiv3/Shop/${shopId}/getListings`,
       optionalParams,
   };
}

// This is our fetch() wrapper, albeit very simplified.
function EtsyFetch(config) {
   const init = configToFetchInit(config);
   return fetch(config.url, init);
}
 
// Here's what a request might look like (ignoring any API error handling).
const shopId = 8675309;
EtsyFetch(getListingsForShop(shopId))
   .then((response) => response.json())
   .then((data) => {
       alert(data.listings.map(({ id }) => id));
   });

This sort of pattern is common throughout our codebase. If we didn’t generate types for our API responses, developers would have to write them all out by hand and hope that they stayed in sync with the actual API. We wanted strict types, but we also didn’t want our developers to have to bend over backwards to get them.

We ended up taking advantage of some work that our own developer API uses to turn our endpoints into OpenAPI specifications. OpenAPI specs are standardized ways to describe API endpoints in a format like JSON. While our developer API used these specs to generate public-facing documentation, we could also take advantage of them to generate TypeScript types for our API’s responses. We spent a lot of time writing and refining an OpenAPI spec generator that would work across all of our internal API endpoints, and then used a library called openapi-typescript to turn those specs into TypeScript types.
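
To give a sense of the shape involved, the generated types end up looking loosely like a map from endpoint names to response types. The fields below are invented for illustration; the real types mirror whatever each endpoint actually returns:

// A hypothetical slice of the generated types for one endpoint.
interface OASGeneratedTypes {
    getListingsForShop: {
        count: number;
        listings: Array<{
            id: number;
            title: string;
            price: string;
        }>;
    };
    // ...one entry per API endpoint
}
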
Once we had TypeScript types generated for all of our endpoints, we still needed to get them into the codebase in a usable way. We decided to weave the generated response types into our generated configs, and then to update EtsyFetch to use these types in the Promise that it returned. Putting all of that together looks roughly like this:

// These types are globally available:
interface EtsyConfig<JSONType> {
   url: string;
}
 
interface TypedResponse<JSONType> extends Response {
   json(): Promise<JSONType>;
}
 
// This is roughly what a generated API config file looks like:
import OASGeneratedTypes from "api/oasGeneratedTypes";
type JSONResponseType = OASGeneratedTypes["getListingsForShop"];
 
function getListingsForShop(shopId: number): EtsyConfig<JSONResponseType> {
   return {
       url: `apiv3/Shop/${shopId}/getListings`,
   };
}
 
// This is (looooosely) what EtsyFetch looks like:
function EtsyFetch<JSONType>(config: EtsyConfig<JSONType>) {
   const init = configToFetchInit(config);
   const response: Promise<TypedResponse<JSONType>> = fetch(config.url, init);
   return response;
}
 
// And this is what our product code looks like:
EtsyFetch(getListingsForShop(shopId))
   .then((response) => response.json())
   .then((data) => {
       data.listings; // "data" is fully typed using the types from our API
   });

The results of this pattern were hugely helpful. Existing calls to EtsyFetch now had strong types out-of-the-box, no changes necessary. Plus, if we were to update our API in a way that would cause a breaking change in our client-side code, our type-checker would fail and that code would never make it to production. 

Typing our API also opened the door for us to use the API as a single source of truth between the backend and the browser. For instance, if we wanted to make sure we had a flag emoji for all of the locales our API supported, we could enforce that using types:

type Locales = OASGeneratedTypes["updateCurrentLocale"]["locales"];
 
const localesToIcons: Record<Locales, string> = {
   "en-us": "🇺🇸",
   "de": "🇩🇪",
   "fr": "🇫🇷",
   "lbn": "🇱🇧",
   //... If a locale is missing here, it would cause a type error.
}

Best of all, none of these features required changes to the workflows of our product engineers. As long as they used a pattern they were already familiar with, they got types for free.

Improving the dev experience by profiling our types 

A big part of rolling out TypeScript was paying really close attention to complaints coming from our engineers. While we were still early in our migration efforts, a few folks mentioned that their editors were sluggish when providing type hints and code completions. Some told us they were waiting almost half a minute for type information to show up when hovering over a variable, for example. This problem was extra confusing considering we could run the typechecker across all of our TS files in well under a minute; surely type information for a single variable shouldn’t be that expensive.

We were lucky enough to get a meeting with some of the maintainers of the TypeScript project. They were interested in seeing TypeScript be successful in a unique codebase like Etsy’s. They were also quite surprised to hear about our editor challenges, and were even more surprised when TypeScript took almost 10 full minutes to check our whole codebase, unmigrated files and all.

After some back-and-forth to make sure we weren’t including more files than we needed, they pointed me to the performance tracing feature that they had just introduced at the time. The trace made it pretty apparent that TypeScript had a problem with one of our types when it tried to typecheck an unmigrated Javascript file. Here is the trace for that file (width here represents time).

As it turned out, we had a circular dependency in our types for an internal utility that helps us create immutable objects. These types had worked flawlessly for all the code we had worked with so far, but had problems with some of its uses in yet-to-be-migrated parts of the codebase that created an infinite type loop. When someone would open a file in those parts of the codebase, or when we ran the typechecker on all of our code, TypeScript would hit the infinite loop, spend a lot of time trying to make sense of that type, and would then give up and log a type error. Fixing that type reduced the time it took to check that one file from almost 46 seconds to less than 1.

This type caused problems in other places as well. After applying the fix, checking the entire codebase took about a third of the time and reduced our memory usage by a whole gig.

If we hadn’t caught this problem, it would have eventually made our tests (and therefore our production deploys) a whole lot slower. It also would have made writing TypeScript really, really unpleasant for everyone.

Education

The biggest hurdle to adopting TypeScript is, without question, getting everyone to learn TypeScript. TypeScript works better the more types there are. If engineers aren’t comfortable writing TypeScript code, fully adopting the language becomes an uphill battle. As I mentioned above, we decided that a team-by-team rollout would be the best way to build some institutional TypeScript muscles.

Groundwork 

We started our rollout by working directly with a small number of teams. We looked for teams that were about to start new projects with relatively flexible deadlines and asked them if they were interested in writing them in TypeScript. While they worked, our only job was to review their pull requests, implement types for modules that they needed, and pair with them as they learned.

During this time, we were able to refine our types and develop documentation tailored specifically to tricky parts of Etsy’s codebase. Because there were only a handful of engineers writing TypeScript, it was easy to get direct feedback from them and untangle issues they ran into quickly. These early teams informed a lot of our linting rules, and they helped make sure our documentation was clear and useful. It also gave us the time we needed to complete some of the technical portions of our migration, like adding types to the API.

Getting teams educated

Once we felt that most of the kinks had been ironed out, we decided to onboard any team that was both interested and ready. In order to prepare teams to write TypeScript, we asked them to first complete some training. We found a course from ExecuteProgram that we thought did a good job of teaching the basics of TypeScript in an interactive and effective way. All members of a team would need to complete this course (or have some amount of equivalent experience) before we considered them ready to onboard. 

To entice people into taking a course on the internet, we worked to get people excited for TypeScript in general. We got in touch with Dan Vanderkam, the author of Effective TypeScript, to see if he’d be interested in giving an internal talk (he said yes, and his talk and book were both superb). Separately, I designed some extremely high-quality virtual badges that we gave to people at the midpoint and at the end of their coursework, just to keep them motivated (and to keep an eye on how quickly people were learning TypeScript).

We then encouraged newly onboarded teams to set some time aside to migrate JS files that their team was responsible for. We found that migrating a file that you’re already familiar with is a great way to learn how to use TypeScript. It’s a direct, hands-on way to work with types that you can then immediately use elsewhere. In fact, we decided against using more sophisticated automatic migration tools (like the one AirBnB wrote) in part because it took away some of these learning opportunities. Plus, an engineer with a little bit of context can migrate a file much more effectively than a script could.

The logistics of a “team-by-team roll-out”

Onboarding teams one at a time meant that we had to prevent individual engineers from writing TypeScript before the rest of their team was ready. This happened more often than you’d think; TypeScript is a really cool language and people were eager to try it out, especially after seeing it being used in corners of our codebase. To prevent this sort of premature adoption, we wrote a simple git commit hook to disallow TypeScript changes from users who weren’t part of a safelist. When a team was ready, we’d simply add them to the safelist.
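
The hook itself didn’t need to be anything fancy. Here’s a minimal sketch of the idea, with a hypothetical safelist file and helper names rather than our actual implementation:

// pre-commit hook, run with Node: block TypeScript changes from authors
// whose teams haven't onboarded yet.
import { execSync } from "child_process";
import { readFileSync } from "fs";

const safelist = new Set(
    readFileSync("typescript-safelist.txt", "utf8").split("\n").filter(Boolean)
);

const author = execSync("git config user.email", { encoding: "utf8" }).trim();
const stagedFiles = execSync("git diff --cached --name-only", { encoding: "utf8" })
    .split("\n")
    .filter(Boolean);

const touchesTypeScript = stagedFiles.some(
    (file) => file.endsWith(".ts") || file.endsWith(".tsx")
);

if (touchesTypeScript && !safelist.has(author)) {
    console.error("Hold off on the TypeScript for now: your team hasn't onboarded yet!");
    process.exit(1);
}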

Separately, we worked hard to develop direct communications with the engineering managers on every team. It’s easy to send out an email to the whole engineering department, but working closely with each manager helped to ensure no one was surprised by our rollout. It also gave us a chance to work through concerns that some teams had, like finding the time to learn a new language. It can be burdensome to mandate a change, especially in a large company, but a small layer of direct communication went a long way (even if it did require a sizable spreadsheet to keep track of all the teams).

Supporting teams after onboarding

Reviewing PRs proved to be a really good way to catch problems early, and it informed a lot of our subsequent linting rules. To help with the migration, we decided to explicitly review every PR with TypeScript in it until the rollout was well underway. We scoped our reviews to the syntax itself and, as we grew, solicited help from engineers who had successfully onboarded. We called the group the TypeScript Advisors, and they became an invaluable source of support to newly-minted TypeScript Engineers.

One of the coolest parts about the rollout was how organically a lot of the learning process happened. Unbeknownst to us, teams held big pairing sessions where they worked through problems or tried to migrate a file together. A few even started book clubs to read through TypeScript books. Migrations like these are certainly a lot of work, but it’s easy to forget how much of that work is done by passionate coworkers and teammates.

Where are we now?

As of earlier this fall, we started requiring all new files to be written in TypeScript. About 25% of our files have types, and that number doesn’t account for deprecated features, internal tools, and dead code. Every team has been successfully onboarded to TypeScript at the time of writing.

“Finishing a migration to TypeScript” isn’t something with a clear definition, especially for a large codebase. While we’ll likely have untyped Javascript files in our repo for a while yet, almost every new feature we ship from here on will be typed. All of that aside, our engineers are already writing and using TypeScript effectively, developing their own tooling, starting really thoughtful conversations about types, and sharing articles and patterns that they found useful. It’s hard to know for sure, but people seem to be enjoying a language that almost no one had experience with this time last year. To us, that feels like a successful migration.

Appendix: A list of learning resources

If our adventure with TypeScript has left you interested in adopting it in your own codebase, here are some resources that I found useful.


Mobius: Adopting JSX While Prioritizing User Experience

Posted on November 4, 2021

Keeping up with the latest developments in frontend architecture can be challenging, especially with a codebase the size and age of Etsy’s. It’s difficult to balance the benefits of adopting newer tools with the cost of migrating to them. When a total rewrite isn’t possible, how can you ensure that old patterns can coexist with new ones, without totally sacrificing user and developer experience, or loading so much additional code that it harms frontend performance?

This frontend architecture dilemma is one we’ve really wrestled with on Etsy’s Frontend Systems team. Our team is responsible for setting frontend best practices for our product engineers and building infrastructure to support them, and our primary focus for the past few years has been developing a new approach to building Javascript-driven user interfaces on Etsy.

We’ve developed a frontend architecture that we think is an effective, pragmatic compromise. We render static HTML using PHP by default, while giving developers the ability to opt into using “component islands”: self-contained, server-side rendered JSX components used only where complex interactivity is needed.

We named this new architecture Mobius, and the rest of this post will be an overview of what we built (and why!), and how it works.

Javascript at Etsy, pre-Mobius

This work began in 2018, when we saw increasing evidence that the frontend development patterns recommended for buyer-facing website features were no longer meeting product engineers’ needs. This concern was specific to buyer features because, historically, Etsy has used very different frontend approaches across the site.

Our marketplace is two-sided—we build a product for both buyers and sellers—and our code reflects this divide, even though all of it is built in the same large monorepo. Pages for buyers are generally static and informational, but sellers managing their shops and fulfilling orders need something more like a desktop application, to provide a low-friction experience while handling a lot of often-changing information. React is a perfect fit for building more complex applications like this, so we had already begun using it for our seller tools years before, in 2015. The performance and code complexity tradeoffs of a React single-page app were a good fit there. 

On the buyer side, where initial page load performance, mobile-friendliness and SEO are higher priorities, a single page app was less appropriate. We explicitly asked product engineers to avoid using React in that context, which meant buyer-facing pages had a much more limited set of frontend tools. Older code was written in jQuery, with newer features being built with modern, ES6+ transpiled Javascript. Most of the markup for buyer features was generated on the server with our internally-built PHP templating system, with small amounts of Javascript for adding limited interactivity.

This is still a totally viable approach to web application development: good SEO and web performance, as well as progressive enhancement, are baked in. However, it was starting to cause friction for Etsy product engineers.

Given all this, being able to use JSX components everywhere in our codebase was one of our most common frontend infrastructure requests from product engineers. It no longer seemed reasonable to stick with our jQuery status quo.

Balancing user experience and developer experience

Supporting JSX components in our buyer frontend made sense, but it wasn’t feasible to rewrite every view in our codebase (we have over two million lines of Javascript code!). Anything we built would have to permanently coexist with our PHP-rendered views. Also, we weren’t happy with the tradeoffs of the “easy” solution of adding client-rendered components, not least the size of the Javascript library we would have to ship to support them.

We dealt with the library-size issue by choosing to render our components with Preact, a React alternative that implements the same API in a much smaller codebase (5kb gzipped, as opposed to ~30kb for React).  We were hesitant about this decision at first because of concerns over long-term API compatibility with React. It felt like a risk to rely on the developers of Preact keeping up with future changes to the React ecosystem, and we worried about being locked into using a less standard library if the implementations drifted apart and our needs changed.

Ultimately we decided the benefits of Preact were worth it. As it happened, even in our large codebase the React compatibility issues turned out to be quite limited in scope. The Preact core team was also extremely responsive to our needs, and their legacy-compat package allowed us to support newer React APIs, like hooks, without rewriting all of our existing code.

Preact ended up working out so well for us that we have migrated our entire codebase to it, including our five-year-old seller tools single-page app. In addition to the performance benefits, Preact’s better backwards compatibility made it easier for us to migrate our React 15 code to Preact v10 than to React 16.

Rendering JSX on the server

Given that most of Etsy’s backend view rendering code is written in PHP, a major architectural problem for adopting server-side rendering was determining how to execute our JS views on the server and get that markup into our PHP view rendering pipeline.

Fortunately for us, we already had a lot of the pieces in place to make this possible. The internal API framework we’ve used for many years relies on curl to make multiple API requests in parallel, and then passes that aggregated data on to our view layer. It doesn’t matter to curl how that HTTP response is generated, so it was possible to also make requests to a Node service to fetch our rendered JSX components as well as our API data without major architectural changes.

Although the idea is straightforward, the implementation was not that easy. Our API request code could handle asynchronous processes, but our view rendering layer assumed synchronous execution, so first it had to be refactored to handle waiting for HTTP rendering requests without blocking. (This was a significant effort, but since it’s very specific to our codebase and this blog post is focused on the frontend architecture, we’ll skip the details here.)

Once we enabled asynchronous rendering on the server, we built a Node service to handle the actual component rendering. Our initial proof of concept used Airbnb’s Hypernova server, and though we eventually replaced it with a homegrown service, we kept the general design approach.

Our PHP views are structured as a tree of individual view classes (not unlike a JSX component tree). Typically the templates for these views use Mustache; our work made it possible for leaf nodes of the view tree to provide a JSX component as a template instead. A PHP view that does so looks something like this:

<?php
class My_View {
  const TEMPLATE = 'path/to/my/component.jsx';

  public function getTemplateData() {
      // this data is passed as props for the initial component render
      return [
          'greeting' => 'Hello Code as Craft!',
          'another-prop' => true,
      ];
  }
}

When the PHP view renderer finds one of these JSX views in a tree, it sends a request to our Node service with the name, asset version and template data of that component, and the service returns the rendered HTML. That HTML is combined with all of the PHP rendered HTML and recursively concatenated to build the final document. 

To provide the right component code to the Node service, we keep a manifest of all server-rendered components and build them into a bundle of those components from our monorepo at deploy time. Each version of this bundle is accessible on our asset hosts, and the Node service fetches the bundles as needed to respond to requests, as well as caching the component class itself in memory to save time on future requests.
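
In spirit (and only in spirit: the real service also handles bundle fetching, caching, and error handling), the render path looks something like the sketch below, where loadComponent stands in for our bundle-loading logic:

import { h, ComponentType } from "preact";
import render from "preact-render-to-string";

interface RenderRequest {
    componentName: string;
    assetVersion: string;
    templateData: Record<string, unknown>;
}

// Stand-in for fetching the versioned component bundle and pulling the
// requested component out of it (cached in memory after the first request).
declare function loadComponent(
    name: string,
    assetVersion: string
): Promise<ComponentType<Record<string, unknown>>>;

async function renderComponent(request: RenderRequest): Promise<string> {
    const Component = await loadComponent(request.componentName, request.assetVersion);
    // The template data from the PHP view becomes the props for the initial render.
    return render(h(Component, request.templateData));
}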

Component islands in the client

We needed an architectural pattern in the client that matched our “render mostly static markup in PHP, opt into JSX when you need it” approach. Jason Miller and Martin Hagemeister from the Preact core team met with us to discuss their previous solutions to this problem. During that meeting, Katie Sylor-Miller, our frontend architect, dubbed this pattern “component islands”: floating pieces of interactive, JSX-powered UI within an otherwise static page.

Shared code finds each island on the page and hydrates the component, typically when it is about to be scrolled into the viewport, which saves some time executing JS on the initial page load. Hydrating lets us avoid a redundant render on the client the first time we execute our JSX code there. The client code instead finds the server-generated markup (based on a unique identifier) and initial data (inlined into the document as JSON) and “adopts” it, meaning there’s no additional DOM mutation until an update is triggered by user interaction.
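
Here’s a rough sketch of that client-side flow. The data-island attribute, the component registry, and the inlined-JSON convention shown here are simplified placeholders rather than our actual markup contract:

import { h, hydrate, ComponentType } from "preact";

// Map from island names (found in the server-rendered markup) to components.
const registry: Record<string, ComponentType<any>> = {
    // "ListingVariations": ListingVariations, ...
};

function hydrateIsland(island: HTMLElement) {
    const Component = registry[island.dataset.island ?? ""];
    if (!Component) return;

    // Initial props are inlined into the document as JSON by the server.
    const inlined = island.querySelector("script[type='application/json']");
    const props = inlined ? JSON.parse(inlined.textContent ?? "{}") : {};

    // "Adopt" the server-rendered markup instead of re-rendering it.
    hydrate(h(Component, props), island);
}

// Only hydrate an island when it's about to scroll into view.
const observer = new IntersectionObserver(
    (entries, obs) => {
        for (const entry of entries) {
            if (entry.isIntersecting && entry.target instanceof HTMLElement) {
                hydrateIsland(entry.target);
                obs.unobserve(entry.target);
            }
        }
    },
    { rootMargin: "200px" }
);

document.querySelectorAll("[data-island]").forEach((el) => observer.observe(el));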

Initially, we assumed islands would be generally isolated from each other and not rely on shared state with other components. As engineers began to build features with Mobius, we realized the reality was more complex, especially when features were being incrementally migrated to JSX. There was a clear need for a standard way for islands to communicate with each other and with preexisting Javascript code we did not want to refactor.

A block wireframe of the Etsy listing page: the listing image carousel in the left column is non-island plain Javascript, while text and a select input in the right column form a Preact component island.

We considered writing our own event bus to handle cross-island communication, but ultimately decided to use Redux instead. Initially we were hesitant to adopt Redux, because it seemed like too much overhead for the limited complexity of most component islands. After working with product engineers on some early features using Mobius, however, we realized the benefits of familiarity and consistency with our single-page app (which uses Redux extensively) made it the right choice.

We created a global Redux store to serve as a way to share data across these components. JSX components connect to the store using Redux Toolkit. Plain Javascript code can also subscribe to store changes and treat it as more of an event bus, while still having access to the same data, allowing us to incrementally rewrite features while still integrating with existing jQuery-based components. Reducers can be lazy loaded as well, to continue to keep initial JS execution time low.

Component islands can connect to Redux using standard Redux Toolkit utilities like createSlice and patterns like using Connect components and selectors to provide access to the store.

import { createSlice } from "@reduxjs/toolkit";

const counterSlice = createSlice({
  name: "counter-slice",
  initialState: {
    count: 0,
  },
  reducers: {
    increment(state) {
      state.count++;
    }
  },
});

To allow existing code to access the Redux store from outside of a component context, we created helper methods for subscribing to store changes.

ReduxStoreManager.subscribeTo(
  "counter-slice",
  (sliceState, prevSliceState) => {
    const { count } = sliceState;
    console.log(`The count is now ${count}.`);
  }
);

// dispatch an action to update the state
// (createSlice prefixes action types with the slice name)
ReduxStoreManager.dispatch({ type: "counter-slice/increment" });

Building the best tool for the job

Mobius has been successfully powering features in production on Etsy.com since late 2020. We’re proud of how it has enabled us to improve developer experience and velocity for product engineers without rewriting huge portions of our frontend codebase or having a large negative impact on frontend performance.

Now that we have this fundamental tooling in place, we’ve begun to explore additional ways to use Preact and server-rendered Javascript effectively. Being able to use JSX everywhere will make it possible to soften the boundaries between our seller- and buyer-facing features. We hope to use SSR to improve performance in our shop management application, and to use Preact as the primary renderer for our design system components (since currently we maintain two implementations, JSX and plain Javascript).

While building out this entire infrastructure was a major undertaking for us, given our large codebase and preexisting systems, the general principle of only using complex Javascript in the client where you need it is widely applicable, and one we hope can be more widely adopted without quite as much effort.

JSX is often seen as an all-or-nothing choice, but we built a hybrid architecture that lets us have the best of both worlds. We integrated server-rendered JSX with our existing PHP view framework so product engineers can use whichever tool is best for their use case without sacrificing performance.

This work was truly a collaborative effort between many Etsy engineering teams. I particularly want to acknowledge the contributions of the entire Frontend Systems team past and present (Steven Washington, Miranda Wang, Sarah Wulf, Amit Snyderman, Sean Gaffney), as well as Katie Sylor-Miller. Also thank you again to the Preact core team for the initial guidance towards this architecture and all the continued support!


Improving the Deployment Experience of a Ten-Year Old Application

Posted on June 15, 2021

In 2018, Etsy migrated its service infrastructure from self-managed data centers to cloud provisioning. (We blogged about it at the time.) The change opened up opportunities for improvements to technical processes across the company. For the Search team, the flexible scaling that comes with a cloud environment allowed us to completely reevaluate a somewhat cumbersome deployment process. Inspired by the existing architectural pattern of canary rollouts, we wrote a new custom tool to supplement our existing deployment infrastructure. 

What we ended up with, after three months of effort, was a more scalable, more developer-friendly, and ultimately a more robust way to roll out changes to Search.

The Blue and the Green

Historically, we deployed our stack on two separate sets of hosts, in what’s known as a blue-green strategy. At any one time, only one set of hosts is live; the other set, or “side,” is dark. Both sides were always fully scaled and ready to serve traffic, but only the live side was accessible to the public internet. 

While simple, a blue-green deployment strategy has some very useful features, not least that the fully scaled dark side gives us a safe place to prepare and test changes before they go live.

We refer to the two sets of hosts as “flip” and “flop”, named after the circuit that is a fundamental building block of modern computers. We point our monolithic PHP web application to whichever side should be active via some lines of code in a configuration file. 

A diagram of what our infrastructure previously looked like. One side (flop, in this example) was always live, and during a deployment we’d move all traffic at once to the other side (flip in this example).

This blue-green deployment method was “lifted and shifted” during Etsy’s migration to the cloud three years ago. The Search team moved the search application to Google Kubernetes Engine (GKE), and flip and flop became two separate production Kubernetes namespaces. 

That change aside, things worked as they always had: deploying Search instantly triggered a redirect of all traffic from one namespace—the live side—to the same services running in the other namespace. To ensure the dark side would always be ready to go, we continued maintaining 200% capacity at all times (100% in each production namespace), just as we had done when we were on-premises.

This original deployment strategy was immensely useful to the team, especially for giving us a secure place to test and prepare major software updates and infrastructural changes. But it wasn’t without its painful aspects. Abruptly shifting all traffic between sides gave us no room to test changes on small amounts of production traffic before going all in. Even when things went right, deploying was stressful. And when things went wrong, engineers had to triage to decide whether to fully revert to the former side. On top of all that, having to permanently maintain double capacity was expensive and inefficient.

Thanks to the flexibility provided by the cloud, once we were safely on GKE we had an opening to rethink our blue-green strategy and address these longstanding issues.

The Canary (Lite)

Our first thought was to adopt a canary deployment strategy. During a “canary rollout”, a small subset of traffic is sent to the new version of a service to determine that it is “safe” before switching all traffic over to the new service.

Why the name? Coal miners used to use canaries to detect carbon monoxide at a level that could hurt a small bird, but not yet hurt a human. Software engineers have adopted a similar—albeit more humane—model to build confidence that new software is safe to serve traffic.

Although Kubernetes’ architecture and flexible scaling mean canary rollouts are a very popular deployment solution, the design of Etsy’s search system meant we couldn’t use any off-the-shelf canary release solutions. We had to build something new for ourselves, a sort of Canary Lite.

We had two key limitations when looking to re-architect our deployment process to incorporate a canary component. 

First, we had no single load balancer or API endpoint where we could control the amount of incoming traffic going to flip versus flop. This made it impossible to do basic canary rollouts using Kubernetes labels on a single Kubernetes deployment for any search service, because Search is made of many disparate Kubernetes deployments. There was no place we could put routing logic to check the labels and route to the canary pods accordingly. 

However, Etsy’s PHP web application is the search application’s only client. This is a common pattern at Etsy, and as a result, load balancing is often managed via configuration directly within the web application itself. Any new deployment solution would either have to manage traffic from the web application to Search from within the web application, or implement some sort of entirely new mesh network (like Istio) to catch and direct all traffic from the web application to Search. Neither of these options was viable in the time frame allotted for this project.

The second limitation was that the search application assumes any single web request will be served by the same version of all search services in the request path. As a result, deployment of any new solution would need to ensure that in-flight search requests would finish being served by the old version of all search services. Even sophisticated canary rollout solutions like Istio require your application to handle version mismatches between different services, which we couldn’t guarantee.

So how could we create a gradual rollout for a new version of all search services, while simultaneously managing load-balancing from the web application to all parts of the rollout AND guaranteeing search services only ever talked to other search services of the same version? There were no off-the-shelf solutions that could solve such an Etsy-specific problem. So we built an entirely new tool, called Switchboard.

Enter Switchboard

Switchboard’s primary function is to manage traffic: it rolls a deployment out to production by gradually increasing the percentage served to the new live side, and proportionally decreasing the amount going to the old one. 

Deployment stages with predefined traffic ratios are hardcoded into the system, and when all pods added during the current rollout stage are fully created and healthy, Switchboard transitions to the next stage. It does this by editing and committing the new traffic percentages to a configuration file within the web application. The web app re-checks this file on every new request and uses the information to load balance search traffic between two different production Kubernetes namespaces, both still called flip and flop.
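
The core loop is simple enough to sketch (here in TypeScript for readability). The stage percentages and helper names below are invented for illustration; the real tool lives alongside the rest of our deployment infrastructure:

// Hardcoded rollout stages: the percentage of search traffic that the
// web application should send to the new live side.
const ROLLOUT_STAGES = [1, 5, 25, 50, 100];

// Stand-ins for committing the new traffic split to the web application's
// config file and for checking the health of the pods added at this stage.
declare function commitTrafficConfig(percentToNewSide: number): Promise<void>;
declare function newPodsAreHealthy(): Promise<boolean>;

async function rollOut(): Promise<void> {
    for (const percent of ROLLOUT_STAGES) {
        // The web app re-reads this config on every request and load balances
        // search traffic between the flip and flop namespaces accordingly.
        await commitTrafficConfig(percent);

        // Don't advance to the next stage until everything added for this one is healthy.
        while (!(await newPodsAreHealthy())) {
            await new Promise((resolve) => setTimeout(resolve, 30_000));
        }
    }
}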

Example of a side switch using Switchboard. Smoke tests are running at 16:57 and 17:07.

Switchboard largely automates the migration of traffic from one search side to the other during a deployment. Smoke tests run at different phases of the deployment, sending both artificially-created and real historical search query requests to the new side. Developers just need to monitor the graphs to make sure the rollout went smoothly. 

The engineer driving the deploy manages Switchboard through a user interface that shows the current state of the rollout and also provides options for pausing or rolling back the deployment.

With Switchboard, we largely rely on Kubernetes’ built-in autoscaling to scale the new cluster during the deployment. We have found that we only need to prescale the cluster to serve 25% of our current capacity before we start sending production traffic to it. Kubernetes’ built-in autoscaling is reactive, and therefore necessarily slower than if we force Search to scale before it needs the extra capacity. As a result, it helps to prescale the new live side so it responds faster to the initial shift as that side goes live and starts to receive traffic. From there, Switchboard lets Kubernetes manage its own autoscaling, simply monitoring the Kubernetes rollout to make sure all services are healthy at the current stage before making the decision to ramp up.

Results

We designed Switchboard to improve the resource consumption of our Search system, and it has done that. But the stepped deployment approach has also resulted in a number of nice workflow improvements for developers. 

Switchboard allows us to keep our overall search VM capacity at or close to 100% rather than the 200% capacity we’d been supporting before. We no longer need to provision double capacity when traffic to Search increases. It’s now much easier to adapt to variations in traffic, since any additional reactive (automatic) or proactive (manual) scaling only needs to reserve compute services for our true capacity instead of double. As a result, there was a noticeable improvement in our cloud VM utilization during the period in which we released Switchboard.

Cloud costs per search request (cloud billing total/number of requests) over several months showing our improved utilization efficiency post-Switchboard.

The second big win from Switchboard is that it has made deploys to our staging environment consistently faster. Our first attempt to move away from the legacy double provisioning approach was to fully scale down the unused search cluster between deploys and then preemptively rescale it as the first step in the next deploy. One problem with this approach was that developers had to wait for all the services inside our Search system to be scaled up enough to accept traffic before they could test in our staging environment.

As you can see in the graph below, deploys to staging have become less bursty since we adopted Switchboard. Switchboard’s stepped scaling means we can send staging traffic to the new live side much faster. In the worst-case scenarios, provisioning a completely new fleet of nodes in the cloud was taking 20 minutes or more. That is 20 minutes that a developer needs to wait before being able to see their changes in staging. 

Time elapsed per staging environment deploy. Each individual line is a single deploy.

Overall, after implementing Switchboard we saw similar increased utilization to our intermediate solution, but without having to compromise on slower deploy times. Switchboard even improved on the utilization efficiency of the intermediate solution.

It’s also easier now to spot and respond to issues during a deploy. Search deploys technically take longer than they did when we maintained two fully scaled clusters, but that additional time is caused by the gradualness of the automated traffic rollout process. A human search deployer typically passively monitors rollout stages without interacting at all. But if they need to, they can and will pause a deploy to examine current results. Search deployers use Switchboard at least once a month to pause a rollout. This is an option that simply wasn’t available to us before. Due to Switchboard’s gradual rollouts and its ability to pause, individual deploys have become safer and more reliable.

In the end, rearchitecting our blue-green deployment process to include a canary-like gradual traffic ramp-up via Switchboard allowed us to make our system more scalable and efficient while also designing for a better developer experience. We were able to successfully adapt our search application’s architecture to take advantage of the flexibility of our Kubernetes and cloud environment.


Increasing experimentation accuracy and speed by using control variates

At Etsy, we strive to nurture a culture of continuous learning and rapid innovation. To ensure that new products and functionalities built by teams — from polishing the look and feel of our app and website, to improving our search and recommendation algorithms — have a positive impact on Etsy’s business objectives and success metrics, virtually all product launch decisions are vetted based on data collected via carefully crafted experiments, also known as A/B tests.

With hundreds of experiments running every day on limited online traffic, our desire for fast development naturally calls for ways to gain insights as early as possible in the lifetime of each experiment, without sacrificing the scientific validity of our conclusions. Among other motivations, this need drove the recent formation of our Online Experimentation Science team: a team made of engineers and statisticians, whose key focus areas include building more advanced and scalable statistical tools for online experiments.

In this article, we share details about our team’s journey to bring the statistical method known as CUPED to Etsy, and how it is now helping other teams make more informed product decisions, as well as shorten the duration of their experiments by up to 20%. We offer some perspectives on what makes such a method possible, what it took us to implement it at scale, and what lessons we have learned along the way.

Is my signal just noise?

In order to fully appreciate the value of a method like CUPED, it helps to understand the key statistical challenges that pertain to A/B testing. Imagine that we have just developed a new algorithm to help users search for items on Etsy, and we would like to assess whether deploying it will increase the fraction of users who end up making a purchase, a metric known as conversion rate.

A/B testing consists in randomly forming 2 groups — A and B — of users, such that users in group A are treated (exposed to the new algorithm, regarded as a treatment) while users in group B are untreated (exposed to the current algorithm). After measuring the conversion rates YA and YB from group A and group B, we can use their difference YA – YB to estimate the effect of the treatment.

There are essentially two facets to our endeavour — detection and attribution. In other words, we are asking ourselves two questions: is the observed difference a real effect rather than statistical noise, and if it is real, can we attribute it to the treatment itself?

Since our estimated difference is based only on a random sample of observations, we have to deal with at least two sources of uncertainty.

The first layer of randomness is introduced by the sampling mechanism. Since we are only using a relatively small subset of the entire population of users, attempting to answer the first question requires the observed difference to be a sufficiently accurate estimator of the unobserved population-wide difference, so that we can distinguish a real effect from a fluke.

The other important layer of randomness comes from the assignment mechanism. Claiming that the effect is caused by the treatment requires groups A and B to be similar in all respects, except for the treatment that each group receives. As an illustrative thought experiment: pretend that we could artificially label each user as either “frequent” or “infrequent” based on how many times they have visited Etsy in the previous month. If, by chance, or rather mischance, a disproportionately large number of “frequent” users were assigned to group A (Figure 1), then it would call into question whether the observed difference in conversion rate is indeed due to an effect from the treatment, or whether it is simply due to the fact that the groups are dissimilar.

Illustration of imbalanced randomization
Figure 1. Two possible realizations of randomly assigning a set of users to either the treated group A or the untreated group B. The colors (blue or green) represent an arbitrary user attribute (e.g. “frequent” or “infrequent”). In scenario (2), an observed difference in conversion rate could be solely due to the dissimilarity of subpopulations between groups A and B, rather than to the existence of a real effect from the treatment. Increasing sample sizes is one possible way to ensure larger chances of forming similar groups and getting closer to the idealized balance pictured in scenario (1).

One solution to the attribution question is to exploit the randomization of the assignments, which guarantees that — except for the treatments received — groups A and B will become more and more similar in every way, on average, as their sample sizes increase. Going one step further, if we somehow understood how the type of a user (e.g. “frequent” or “infrequent”) informs their buying habit, then we could attempt to proactively adjust for group dissimilarities, and correct our naive difference YA – YB by removing the explainable contribution coming from apparent imbalances between groups A and B. This is where CUPED comes into play.

What is CUPED?

CUPED is an acronym for Controlled experiments Using Pre-Experiment Data [1]. It is a method that aims to estimate treatment effects in A/B tests more accurately than simple differences in means. As reviewed in the previous section, we traditionally use the observed difference between the sample means

YA – YB

of two randomly-formed groups A and B to estimate the effect of a treatment on some metric Y of interest (e.g. conversion rate). As hinted earlier, one of the challenges lies in disentangling and quantifying how much of this observed difference is due to a real treatment effect, as opposed to misleading fluctuations due to comparing two subpopulations made of non-identical users. One way to render these latter fluctuations negligible is to increase the number of users in each group. However, the required sample sizes tend to grow proportionally to the variance of the estimator YA – YB, which may be undesirably large in some cases and lead to prohibitively long experiments.

The key idea behind CUPED is not only to play with sample sizes, but also to explain parts of these fluctuations away with the help of other measurable discrepancies between the groups. The CUPED estimator can be written as

(YA – YB) – (XA – XB) β

which corrects the traditional estimator with an additional denoising term. This correction involves the respective sample means (XA and XB) of a well-thought-out vector of user attributes (so-called covariates and symbolized by X), and a vector β of coefficients to be specified. Our earlier example (Figure 1) involved a single binary covariate, but CUPED generalizes the reasoning to multidimensional and continuous variables. Intuitively, the correction term aims to account for how much of the difference in Y is not due to any effect of the treatment, but rather due to differences in other observable attributes (X) of the groups.

By choosing X as a vector of pre-experiment variables (collected prior to each user’s entry into the experiment), we can ensure that the correction term added by CUPED does not introduce any bias. Additionally, the coefficient β can be judiciously optimized so that the variance of the CUPED estimator becomes smaller than the variance of the original estimator, by a reduction factor that relates to the correlation between Y and X. In simple terms, the more information on Y we can obtain from X, the more variance reduction we can achieve with CUPED. In the context of A/B testing, smaller variances imply smaller sample size requirements, hence shorter experiments. Correspondingly, if sample sizes are kept fixed, smaller variances enable larger statistical power for detecting effects (Figure 2).
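
To make the idea concrete, here is a minimal single-covariate sketch in Python (using numpy); it illustrates the textbook CUPED adjustment rather than Etsy’s production code, and the array names are hypothetical:

import numpy as np

def cuped_adjusted_difference(y_a, y_b, x_a, x_b):
    """Single-covariate CUPED: y_* are per-user outcomes, x_* are pre-experiment covariates."""
    y = np.concatenate([y_a, y_b])
    x = np.concatenate([x_a, x_b])
    beta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)    # variance-minimizing coefficient
    naive = y_a.mean() - y_b.mean()                  # traditional estimator
    return naive - (x_a.mean() - x_b.mean()) * beta  # CUPED estimator

With this choice of beta, the variance of the adjusted estimator shrinks by a factor of roughly 1 – corr(X, Y)², which is the sense in which more informative covariates buy more variance reduction.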

Illustration of how variance relates to power
Figure 2. Illustration of how variances relate to statistical power. Each group’s density curve represents how measurements for that group are distributed. The centers of the distributions are kept fixed between the top and bottom panels, while the variances in the bottom panel are smaller than in the top panel. By interpreting variances in relation to margins of error for estimating the metric’s mean of each group, we can see that reducing variances helps to detect a shift between the means of the two groups more confidently.

The benefits and accessibility of CUPED (especially its quantifiable improvement over the traditional estimator, its interpretability, and the simplicity of its mathematical derivation) explain its popularity and widespread adoption by other online experimentation platforms [2, 3, 4, 5].

Implementing CUPED at Etsy

The implementation of CUPED at scale required us to construct a brand new pipeline (Figure 3) for data processing and statistical computation. Our pipeline consists of 3 main steps:

  1. Retrieving (or imputing) pre-experiment data for all users.
  2. Computing CUPED estimators for each group.
  3. Performing statistical tests (t-test) using CUPED estimators.
CUPED pipeline diagram
Figure 3. Schematic diagram of our data processing pipeline for CUPED. The inputs to the pipeline consist of data collected during the experiment, along with historical data from users’ past visits. The pipeline outputs CUPED-based estimates and statistical summaries for each metric of interest.

When a user enters an experiment for the first time, we attempt to fetch the user’s most recent historical data from the preceding few weeks. The window length was chosen to hit a sweet spot: looking far enough back in time for historical data to exist, but not so far back that the pre-experiment data stops being predictive of the in-experiment outcomes. This step involves some careful engineering in order to retrieve (and possibly reconstruct) the historical data from the pre-experiment period at the level of each individual user.

Once the pre-experiment variables are retrieved and formatted, we may proceed with the CUPED adjustments. As it turns out, the optimal choice of coefficient β coincides with the ordinary-least-squares coefficient of a linear regression of Y on X. This relationship enables the efficient computation of CUPED estimators as residuals from linear regressions, which we implemented at scale using Apache Spark’s MLlib [6]. The well-established properties of linear regressions also allowed us to design non-trivial and interpretable simulations for unit testing.

Since CUPED estimators can be expressed as simple differences in means (albeit using adjusted outcomes instead of raw outcomes), we were able to leverage our existing t-testing framework to compute corresponding p-values and confidence intervals. Besides the adjusted difference between the two groups, our pipeline also outputs the group-level estimates YA – (XA – XB) β and YB for further reporting and diagnosis. Note the intentional asymmetry of the expressions, as YA – XA β and YB – XB β would generally be biased estimators of the group-level means, unless the covariates were properly centered.
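
As a rough sketch of steps 2 and 3 of the pipeline, the residual-plus-t-test formulation might look like the following in Python (ordinary least squares via numpy stands in for the Spark MLlib regression we run at scale; column and group labels are hypothetical):

import numpy as np
from scipy import stats

def cuped_ttest(y, X, assignment):
    """Regress Y on pre-experiment covariates X (pooled across groups), form adjusted
    outcomes, then t-test the difference between groups A and B."""
    assignment = np.asarray(assignment)
    X1 = np.column_stack([np.ones(len(y)), X])   # add an intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    adjusted = y - X @ beta[1:]                  # remove the covariate-explained part
    a, b = adjusted[assignment == "A"], adjusted[assignment == "B"]
    return a.mean() - b.mean(), stats.ttest_ind(a, b, equal_var=False)

The first return value equals (YA – YB) – (XA – XB) β, matching the estimator above.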

Impact and food for thought

Overall, CUPED leads to meaningful improvements across our experiments. However, we observe varying degrees of success, e.g. when comparing different types of pages (Figure 4). This can be explained by the fact that different pages may have different amounts of available pre-experiment data, with different degrees of informativeness (e.g. some pages may be more prone to be seen by newer users, on whom we may not have much historical information).

Variance reductions from CUPED
Figure 4. Violin-plot of the reductions in variance achieved by CUPED (higher is better) across experiments, grouped by page types. For a given page type, the height of the violin indicates the range of reductions across experiments on that page, while the width of the violin (at a specific value on the y-axis) conveys the number of experiments (benefitting from that specific value of variance reduction).

In favorable cases, our out-of-the-box CUPED implementation can reduce variances by up to 20%, thus leading to narrower confidence intervals and shorter experiment durations (Figure 5). In more challenging cases where pre-experiment data is largely missing or uninformative, the correction term from CUPED becomes virtually 0, making CUPED estimators revert to their non-CUPED counterparts and hence yield no reduction in variance — but no substantial increase either.

Comparing non-CUPED vs. CUPED
Figure 5. Example of an experiment contrasting non-CUPED (left panel) with CUPED (right panel) results. As intended, CUPED leads to a narrower confidence interval, larger statistical power, and a shorter experiment duration.

On the engineering side of things, one of the lessons we learned from implementing CUPED is the importance of producing and storing experiment data at the appropriate granularity level, so that the retrieval of pre-experiment data can be done efficiently and in a replicable fashion. Scalability also becomes a key desideratum as we expand the application of CUPED to more and more metrics.

Another challenge we overcame was ensuring a smooth delivery of CUPED, both in terms of user experience and communication. To this end, we conducted several user research interviews at different stages of the project, in order to inform our implementation choices and make certain that they aligned with the needs of our partners from the Analytics teams. Integrating the new CUPED estimators into Etsy’s existing experimentation platform — and thus discontinuing their long-established non-CUPED counterparts — was done with careful UX and UI consideration, putting thought into the design and following a meticulous schedule of incremental releases. Our team also invested a lot of effort into creating extra documentation and resources to anticipate possible concerns or misconceptions, as well as to help users better understand what to expect (and, equally importantly, what not to expect) from CUPED.

Finally, from a methodological standpoint, an interesting reflection comes from noticing that CUPED estimators can achieve smaller variances than their non-CUPED counterparts, at essentially no cost in bias. The absence of any bias-variance trade-off may make one feel skeptical of the seemingly one-sided superiority of CUPED, as one may often hear that there is no such thing as a free lunch. However, it is insightful to realize that … this lunch is not free.

In fact, the conceptual dues that we are paying to reap CUPED’s benefits are at least two-fold. First, we are borrowing information from additional data. Although the in-experiment sample size required by CUPED is smaller compared to its non-CUPED rival, the total amount of data effectively used by CUPED (when combining both in-experiment and pre-experiment data) may very well be larger. That cost is somewhat hidden by the fact that we are organically collecting pre-experiment data as a byproduct of our natural experimentation cycle, but it is an important cost to acknowledge nonetheless. Second, CUPED estimators are computationally more expensive than their non-CUPED analogues, since their linear regressions induce additional costs in terms of execution time and algorithmic complexity.

All this to say: the increased accuracy of CUPED is the fruit of sensible efforts that warrant thoughtful consideration (e.g. in the choice of covariates) and realistic expectations (i.e. not every experiment is bound to magically benefit).

We hope that our work on CUPED can serve as an inspiring illustration of the valuable synergy between engineering and statistics.

Acknowledgements

We would like to give our warmest thanks to Alexandra Pappas, MaryKate Guidry, Ercan Yildiz, and Lushi Li for their help and guidance throughout this project.

References

[1] Deng A., Xu Y., Kohavi R., Walker T. (2013). Improving the sensitivity of online controlled experiments by utilizing pre-experiment data.

[2] Xie H., Aurisset J. (2016). Improving the sensitivity of online controlled experiments: case studies at Netflix.

[3] Jackson S. (2018). How Booking.com increases the power of online experiments with CUPED.

[4] Kohavi R., Tang D., Xu Y. (2020). Trustworthy online controlled experiments: a practical guide to A/B testing.

[5] Li J., Tang Y., Bauman J. (2020). Improving experimental power through control using predictions as covariate.

[6] Apache Spark. MLlib [spark.apache.org/mllib].


How We Built A Context-Specific Bidding System for Etsy Ads

Posted by on March 23, 2021 / 1 Comment

Etsy Ads is an on-site advertising product that gives Etsy sellers the opportunity to promote their handcrafted goods to buyers across our website and mobile apps.

Sellers who participate in the Etsy Ads program are acting as bidders in an automated, real-time auction that allows their listings to be highlighted alongside organic search results, when the ad system judges that they’re relevant. Sellers only pay for a promoted listing when a buyer actually clicks on it.

In 2019, when we moved all of our Etsy Ads sellers to automated bid pricing, we knew it was important to have a strong return on the money our sellers spent with us. We are aligned with our sellers in our goal to drive as many converting and profitable clicks as possible. Over the past year, we’ve doubled down on improvements to the bid system, developing a neural-network-powered platform that can determine, at request time, when the system should bid higher or lower on a seller’s behalf to promote their item to a buyer. We call it contextual bidding, and it represents a significant improvement in the flexibility and effectiveness of the automation in Etsy Ads.

First, Some Background

In the past, Etsy Ads gave sellers the option to set static bids for their listings, based on their best estimate of value. That bid price would be used every time one of their listings matched for a search. If a seller only ever wanted to pay 5 cents for an ad click, they could tell us and we would bid 5 cents for them every time.

The majority of our sellers didn’t have strong opinions about per-listing bid prices, and for them we had an algorithmic auto-bidding system to bid on their behalf. But even sellers who set their own bid prices mostly adopted a “set it and forget it” strategy. Their prices never changed, so our system made the same bid for them at the highest-converting times during the weekend as it did on, say, a Tuesday at 4 a.m. Savvier sellers adjusted their bids day-to-day to take advantage of weekly traffic trends. No matter how proactive they were, though, sellers were still limited because they lacked request-time information about searches. And they could only set one bid price across all ad requests.

Two scatter plots display a sample of ads shown to prospective buyers over the course of one day. Each dot represents a single ad shown: blue, if the listing went on to be purchased, red if it did not. The plots are differentiated by whether the buyer and seller are in the same region. The data suggest a higher likelihood of purchase when buyers are in the same region as sellers and a higher rate of purchase after 12pm. A context-specific bidder can use information like this at the time of a search request to adjust bids.

Our data shows that conversion rates differ across many axes: hour-of-day, day-of-week, platform, page type. In other words, context. We migrated our Etsy Ads sellers to algorithmic bid pricing because we wanted to ensure that all of them, regardless of their sophistication with online advertising, had a fair chance to benefit from promoting their goods on Etsy. But to realize that goal, we knew we needed to build a system that could set context-aware bids.

How to Auto-bid

Etsy Ads ranks listings based on their bid price, multiplied by their predicted click-through rate, in what’s known as a second-price auction. This is an auction structure where the winning bid only has to pay $0.01 above what would be needed to outrank the second-highest bidder in the auction.
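
As a simplified illustration of that ranking-and-pricing rule (a sketch only; the field names and numbers are made up, and the real auction prices every ad slot, not just the top one):

def rank_and_price_top_ad(candidates):
    """Rank ads by bid x predicted CTR; the winner pays $0.01 more than the
    minimum bid needed to outrank the runner-up, never more than its own bid."""
    ranked = sorted(candidates, key=lambda c: c["bid"] * c["predicted_ctr"], reverse=True)
    winner, runner_up = ranked[0], ranked[1]
    min_bid_to_win = runner_up["bid"] * runner_up["predicted_ctr"] / winner["predicted_ctr"]
    return winner["listing_id"], round(min(min_bid_to_win + 0.01, winner["bid"]), 2)

ads = [
    {"listing_id": 101, "bid": 0.50, "predicted_ctr": 0.02},
    {"listing_id": 202, "bid": 0.80, "predicted_ctr": 0.01},
]
print(rank_and_price_top_ad(ads))  # (101, 0.41)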

In a second-price auction, it is in the advertiser’s best interest to bid the highest amount they are willing to pay, knowing that they will often end up paying less than that. This is the strategy our auto-bidder adopts: it sets a seller’s bid based on the value they can expect from an ad click.

Expected Value of an Ad Click = Predicted Post-Click Conversion Rate x Expected Order Value

To understand the intuition behind this formula, consider a seller who has a listing priced at $100. If, on average, 1 out of every 10 buyers who click on their listing end up purchasing it, the seller could pay $10 per click and (over time) break even. But of course we do not want our sellers just to break even, we want their ads to generate more money in sales than the advertising costs. So the auto-bidder calculates the expected value of an ad click for each listing and then scales those numbers down to generate bid prices. 

Bid = Expected Value of an Ad Click * Scaling Factor

At a high level, we are assigning a value to an ad click based on factors like the amount of sales that click is expected to generate.
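
Putting the two formulas together, a minimal sketch (the scaling factor shown is purely illustrative, not Etsy’s actual value):

def bid_for_listing(predicted_pccvr, expected_order_value, scaling_factor=0.8):
    """Expected value of a click, scaled down so ads generate more in sales than they cost.
    e.g. a $100 listing with a 10% post-click conversion rate is worth $10 per click."""
    expected_value_of_click = predicted_pccvr * expected_order_value
    return expected_value_of_click * scaling_factor

print(bid_for_listing(0.10, 100.00))  # 8.0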

Obviously, there’s a lot riding on that predicted conversion rate. Even marginal improvement there makes a difference across the whole system. Our old automated bidder was, in a sense, static: it estimated the post-click conversion rate for each listing only once per day. But a contextual bidder, with access to a rich set of attributes for every request, could set bids that were sensitive to how conversion is affected by time of day, day of week, and across platforms. Most importantly, it would be able to take relevance into account, bidding higher or lower based on factors like how the listing ranked for a buyer’s search (all in a way that respects users’ privacy choices.) And it would do all that in real time, with much higher accuracy, thousands of times a second.

Example of a context-specific bid for a “printed towel” search on an iOS device. This listing matched the query, and given its inputs our model predicts that 5 of 1000 clicks should lead to a purchase. With an expected order value of $20, the bid should be around $0.10. The adaptive bid price helps sellers win high purchase-intent traffic and saves budget on low purchase-intent traffic.

Better PC-CVR through Machine Learning

To power our new contextual bidder, we adopted learning-to-rank, a widely applied information-retrieval methodology that uses machine learning to create predictive rankings.

On each search ad request, our system collects the top-N matched candidate listings and predicts a post-click conversion rate for each of them. These predictions are combined with each listing’s expected order value to generate per-listing bids, and the top k results are sorted and returned. Any clicked ads are fed back into our system as training data, with a target label attached that indicates whether or not those ads led to a purchase.

The first version of our new bidding system to reach production used a gradient-boosted decision tree (GBDT) model to do PC-CVR prediction. This model showed substantial improvements over the baseline non-contextual bidding system. 

After we launched with our GBDT-based contextual bidder, we developed an even more accurate model that used a neural network (NN) for PC-CVR prediction. We trained a three-layer fully-connected neural network that used categorical features, numeric features, dense vector embeddings and pre-trained word embeddings fine-tuned to the task. The features are both contextual (e.g. time-of-day) and historical (e.g. 30-day listing purchase rate). 

The figure displays our neural network architecture for a contextual post-click conversion rate prediction model. We use pre-trained language models to generate average embeddings over text fields such as query and title. An embedding layer encodes sparse features like shop ids into shop embeddings. Numeric features are z-score normalized and categorical features are one-hot encoded. The sigmoid output produces a probability of conversion given the clicked ad context. 
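
A sketch of what such a network might look like in PyTorch (layer sizes, feature counts, and names are hypothetical, and the real model also fine-tunes pre-trained word embeddings):

import torch
import torch.nn as nn

class PCCVRModel(nn.Module):
    def __init__(self, n_shops, shop_dim=16, text_dim=64, n_numeric=20, n_categorical=30):
        super().__init__()
        self.shop_emb = nn.Embedding(n_shops, shop_dim)  # sparse shop id -> dense embedding
        in_dim = shop_dim + 2 * text_dim + n_numeric + n_categorical
        self.mlp = nn.Sequential(                        # three fully-connected layers
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, shop_ids, query_emb, title_emb, numeric, categorical_onehot):
        x = torch.cat([self.shop_emb(shop_ids), query_emb, title_emb,
                       numeric, categorical_onehot], dim=1)
        return torch.sigmoid(self.mlp(x))                # P(purchase | clicked ad, context)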

Evaluating Post-Click Conversion Rate 

It’s not always easy to evaluate the quality of the predictions of a machine-learning model. The metrics we really care about are “online” metrics such as click-through-rate (CTR) or  return-on-ad-spend (ROAS). An increase in CTR indicates that we are showing our buyers listings that are more relevant to them. An increase in ROAS indicates that we are giving our sellers a better return on the money they spend with us. 

The only way to measure online metrics is to do an A/B test. But these tests take a long time to run, so it is beneficial to have a strong offline evaluation system that allows us to gain confidence ahead of online experimentation. From a product perspective, offline evaluation metrics can be used as benchmarks to guide roadmaps and ensure each incremental model improvement translates to business objectives. 

For our post-click conversion rate model, we chose to monitor Precision-Recall AUC (PR AUC) and ROC AUC. The difference between the two is that PR AUC is more sensitive to improvements on the positive class, especially when the dataset is extremely imbalanced. In our case, we use PR AUC as our primary metric since we did not downsample the negative examples in our training data.
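
Both metrics are easy to compute offline from held-out clicked ads and model scores; a toy sketch (sklearn’s average_precision_score is a common summary of the precision-recall curve):

import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 0, 0, 1, 0, 0])   # 1 = the clicked ad led to a purchase
y_score = np.array([0.10, 0.20, 0.05, 0.30, 0.80, 0.15, 0.40, 0.70, 0.10, 0.20])

print(average_precision_score(y_true, y_score))  # PR AUC: sensitive to the rare purchase class
print(roc_auc_score(y_true, y_score))            # ROC AUC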

Below we’ve summarized the relative improvements of both models over the baseline non-contextual system. 

The GBDT model outperformed the non-contextual model on both PR AUC (+105.39%) and ROC AUC (+14.59%). These results show that the contextual features added an important signal to the PC-CVR prediction. The large increase in PR AUC indicates that the improved performance was particularly dramatic for the purchase class. 

The online experimental results were also promising. All parties (buyers, sellers and Etsy) benefited from this new bidding system. Buyers saw more relevant results from ads (+6.19% CTR), buyers made more purchases from ads (+3.08% PC-CVR), and sellers received more sales for every dollar they spent on Etsy Ads (+7.02% ROAS).

The neural network model saw even stronger offline and online results, validating our hypothesis that PR AUC was the most important offline metric to track and reaffirming the benefit of continuing to improve our PC-CVR model.

Transforming Post-Click Conversion Predictions to Bid Prices

We were very pleased with the offline performance we were able to get from our PC-CVR model, but we quickly hit a roadblock when we tried to turn our PC-CVR scores into bid prices. 

As the figure on the left shows, the distribution of predicted PC-CVR scores is right skewed and has a long tail. This distribution of scores is not surprising, given that our data is extremely imbalanced and sparse. But the consequence is that we cannot use the raw PC-CVR scores to generate our bid prices because the PC-CVR component would dominate the bidding system. We do not want such gigantic variances in bid prices and we never want to charge our sellers above $2 for a click. 

The graph on the left is the distribution of scores from our PC-CVR model. The distribution on the right is the transformed distribution of scores that we used to generate bid prices. 

To solve these issues, we refit the logit of the predicted PC-CVR with a smoother logistic function:

f(x) = 1 / (1 + exp(-(k · logit(x) + b)))

where x is the predicted post-click conversion rate, logit(x) = ln(x / (1 – x)), and k and b are hyperparameters. Note that if k=1 and b=0, this function is the identity function.

We set the k and b parameters such that the average bid price was unchanged from the old system to the new system. While we did not guarantee that each seller’s bid prices would remain consistent, at least we could guarantee that the average bid prices would not change. The final distribution is shown in the right-hand graph above. 
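
A minimal sketch of the transformation, assuming the logistic form above (with k and b chosen as described):

import numpy as np

def smooth_pccvr(x, k, b):
    """Map a raw PC-CVR score x in (0, 1) to a less skewed score.
    With k=1 and b=0 this reduces to the identity, since sigmoid(logit(x)) == x."""
    logit = np.log(x / (1.0 - x))
    return 1.0 / (1.0 + np.exp(-(k * logit + b)))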

Conclusion

One topic we didn’t dive into in this post is the difficulty of experimenting with our bidding system. For one thing, real money is involved whenever we run an A/B test, which means that we need to be very careful about how we adjust the system. Nor is it possible to truly isolate the variants of an A/B test, because all of the variants use the same pool of available budget. And once we changed our bidding system to use a contextual model, we moved from having to make batch predictions once a day to having to make online predictions up to 12,000 times a second.

We’re planning a follow-up post on some of the machine-learning infrastructure challenges we faced when scaling this system. But for now we’ll just say that the effort was worth it. We took on a lot of responsibility two years ago when we moved all our sellers to our automated bidding system, but we can confidently say that even the most sophisticated of them (and we have some very sophisticated sellers!) are benefitting from our improved bidding system. 


How Etsy Prepared for Historic Volumes of Holiday Traffic in 2020

Posted by on February 25, 2021 / 2 Comments

The Challenge

For Etsy, 2020 was a year of unprecedented and volatile growth. (Simply on a human level it was also a profoundly tragic year, but that’s a different article.) Our site traffic leapt up in the second quarter, when lockdowns went into widespread effect, by an amount it normally would have taken several years to achieve. 

When we looked ahead to the holiday season, we knew we faced real uncertainty. It was difficult to predict how large-scale economic, social and political changes would affect buyer behavior. Would social distancing accelerate the trends toward online shopping? Even if our traffic maintained a normal month-over-month growth rate from August to December, given the plateau we had reached, that would put us on course deep into uncharted territory.

Graph of weekly traffic for 2016-2020. This was intimidating! 

If traffic exceeded our preparations, would we have enough compute resources? Would we hit some hidden scalability bottlenecks? If we over-scaled, we risked wasting money and environmental resources. If we under-scaled, it could have been much worse for our sellers.

For context about Etsy, as of 2020 Q4 we had 81 million active buyers and over 85 million items for sale.

Modulating Our Pace of Change

When we talk about capacity, we have to recognize that ultimately Etsy’s community of sellers and employees is the source of it. The holiday shopping season is always high-stakes, so each year we adapt our normal working order so that we don’t over-burden that capacity.

In short, for a few weeks we discourage deploying changes that might be expected to disrupt sellers, support, or infrastructure. Necessary changes can still get pushed, but the standards for preparation and communication are higher. We call this period “Slush” because it’s not a total freeze, and we’ve written about it before.

For more than a decade, we have maintained Slush as an Etsy holiday tradition. Each year, however, we iterate and adapt the guidance, trying to strike the right balance of constraints. It’s a living tradition.

Modeling History To Inform Capacity Planning

Even though moving to Google Cloud Platform (GCP) in 2018 vastly streamlined our preparations for the holidays, we still rely on good old-fashioned capacity planning. We look at historical trends and system resource usage and communicate our infrastructure projections to GCP many weeks ahead of time to ensure they will have the right types of machines in the right locations. For this purpose, we share upper-bound estimates, because this may become the maximum available to us later. We err on the side of over-communicating with GCP. 

As we approached the final weeks of preparation and the ramp up to Cyber Monday, we wanted to understand whether we were growing faster or slower than a normal year. Because US Thanksgiving is celebrated on the fourth Thursday of November rather than a set date, it moves around in the calendar and it takes a little effort to perform year-to-year comparisons. My teammate Dany Daya built a simple model that looked at daily traffic but normalized the date using “days until Cyber Monday.” 

This would come in very handy as a stable benchmark of normal trends when customer patterns shifted unusually. 
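
The normalization itself is just date arithmetic; a minimal sketch (the model built on top of it is not shown):

import datetime as dt

def cyber_monday(year):
    # US Thanksgiving is the fourth Thursday of November; Cyber Monday is four days later.
    nov1 = dt.date(year, 11, 1)
    first_thursday = nov1 + dt.timedelta(days=(3 - nov1.weekday()) % 7)
    return first_thursday + dt.timedelta(days=21 + 4)

def days_until_cyber_monday(date):
    return (cyber_monday(date.year) - date).days

print(days_until_cyber_monday(dt.date(2020, 11, 1)))  # 29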

Adapting Our “Macro” Load Testing

Though we occasionally write synthetic load tests to better understand Etsy’s scalability, I’m generally skeptical about what we can learn using macro (site-wide) synthetic tests. When you take a very complex system and make it safe to test at scale, you’re often looking at something quite different from normal operation in production. We usually get the most learning per effort by testing discrete components of our site, such as we do with search.

Having experienced effectively multiple years of growth during Q2, and knowing consumer patterns could change unexpectedly, performance at scale was now an even bigger concern. We decided to try a multi-team game day with macro load testing and see what we could learn. We wanted an open-ended test that could help expose bottlenecks as early as possible.

We gathered together the 15-plus teams that are primarily responsible for scaling systems. We talked about how much traffic we wanted to simulate, asked about how the proposed load tests might affect their systems and whether there were any significant risks. 

The many technical insights of these teams deserve their own articles. Many of their systems can be scaled horizontally without much fuss, but the work is a craft: predicting resource requirements, anticipating bottlenecks, and safely increasing/decreasing capacity without disrupting production. All teams had at least one common deadline: request quota increases from GCP for their projects by early September.

We were ready to practice together as part of a multi-team “scale day” in early October. We asked everyone to increase their capacity to handle 3x the traffic of August and then we ran load tests by replaying requests (our systems served production traffic plus synthetic replays). We gradually ramped up requests, looking for signs of increased latency, errors, or system saturation.

While there are limitations to what we can learn from a general site-wide load test, scale day helped us build confidence. Crucially, we confirmed that many systems looked like they could handle scale beyond what we expected at the peak of 2020.

Cresting The Peak

Let’s skip ahead to Cyber Monday, which is typically our busiest day of the year. Throughput on our sharded MySQL infrastructure peaked around 1.5 million queries per second. Memcache throughput peaked over 20M requests per second. Our internal http API cluster served over 300k requests per second. 

Normally, no one deploys on Cyber Monday. Our focus is on responding to any emergent issues as quickly as possible. But 2020 threw us another curve: postal service interruptions meant that our customers were facing widespread package delivery delays. Only a small code change was needed to better inform our buyers about the issue, but we’d be deploying it at the peak hour of our busiest day. And since it would be the first deployment of that day, the entire codebase would need to be compiled from scratch, in production, on more than 1000 hosts.

We debated waiting till morning to push the change, but that wouldn’t have served our customers, and we were confident we could push at any time. Still, as Todd Mazierski and Anu Gulati began the deploy, we started nominating each other for hypothetical Three Armed Sweaters Awards. But the change turned out to be sublimely uneventful. We have been practicing continuous deployment for more than a decade. We have invested in making deployments safe and easy. We know our tools pretty well and we have confidence in them.

Gratitude

We have long maintained a focus on scalability at Etsy, but we all expected to double traffic over a period of years, not just a few months. We certainly did not expect to face these challenges while working entirely distributed during a pandemic. 

We made it to Christmas with fewer operational issues than we’ve experienced in recent memory. I think our success in 2020 underscored some important things about Etsy’s culture and technical practices.

We take pride in operational excellence, meaning that every engineer takes responsibility not just for their own code, but for how it actually operates in production for our users. When there is an outage, we always have more than enough experts on hand to mitigate the issue quickly. When we hold a Blameless Postmortem, everyone shares their story candidly. When we discover a technical or organizational liability, we try to acknowledge it openly rather than hide it. All of this helps to keep our incidents small.

Our approach to systems architecture values long-term continuity, with a focus on a small number of well-understood tools, and that provided us the ability to scale with confidence. So while 2020 had more than its share of surprising circumstances, we could still count on minimal surprises from our tools.


Bringing Personalized Search to Etsy

Posted by on October 29, 2020 / No Responses

The Etsy marketplace brings together shoppers and independent sellers from all over the world. Our unconventional inventory presents unique challenges for product search, given that many of our listings fall outside of standard e-commerce categories. With more than 80 million listings and 3.7 million sellers, Etsy relies on machine learning to help users browse creative, handmade goods in their search results. But what if we could make the search experience even better with results tailored to each user? Enter personalized search results. 

Search results for “tray”: (above) default results, (below) a user who recently interacted with leather goods

When a user logs into the marketplace and searches for items, they signal their preferences by interacting with listings that pique their interest. In personalization, our algorithms train on these signals and learn to predict, per user, the most relevant listings. The resulting personalized model lets individuals’ taste shine through in their search results without compromising performance. Personalization enhances the underlying search model by customizing the ordering of relevant items according to user preference. Using a combination of historical and contextual features, the search ranking model learns to recognize which items have a greater alignment with an individual’s taste.

In the following sections, we describe the Etsy search architecture and pipeline, the features we use to create personalized search results, and the performance of this new model. Finally, we reflect on the challenges from launching our first personalized search model and look ahead to future iterations of personalization. Please note that some sections of the post are more technical and assume knowledge of machine learning basics from the reader. 

Etsy Search Architecture

The search pipeline is separated into two passes: candidate set retrieval and ranking. This ensures that we are returning low-latency results — a crucial component of a good search system. Because the latter ranking step is computationally expensive, we want to use it wisely. So from millions of total listings, the candidate set retrieval step selects the top one thousand items for a query by considering tags, titles, and other seller-provided attributes. This allows us to run the ranking algorithm over less than one percent of all listings. In the end, the ranker will place the most relevant items at the top of the search results page.

Our search ranking algorithm is an ensemble model that uses a gradient boosted decision tree with pairwise formulation. For personalization, we introduce a new set of features that allow us to model user preferences. These features are included in addition to existing features, such as listing price and recency. And as much as we try to avoid impacting latency, the introduction of these personalized features creates new challenges in serving online results. Because the features are specific to each user, the cache utilization rate drops. We address this and other challenges in a later section.

Personalized User Representations

The novelty of personalized search results lies in the new features we pass to the ranker. We categorize personalization features into two groups: historical and contextual features. 

Historical features refer to singular data points about a user’s profile that can succinctly describe shopping habits and behaviors. Are they modern consumers of digital goods, or are they hunting and gathering vintage pieces? Do they carefully deliberate on each individual purchase, or do they follow their lightning-strike gut instinct? We can gather these insights from the number of digital or vintage items purchased and average number of site visits. Historical user features help us put the “person” in personalization. 

Search results for “lamp”

(above) default results, (below) a user who recently interacted with epoxy resin items

In contrast to these numerical features, data can also be represented as a vector, or a list of numbers. For personalization, we refer to these vectored features as contextual features because the listing vector represents a listing with respect to the context of all other listings. In fact, there are many ways to represent a listing as a vector but we use term frequency–inverse document frequency (Tf-Idf), item-interaction embeddings and interaction-based graph embeddings. If you’re unfamiliar with any of these methods, don’t worry! We’ll be diving deeper into the specific vector generation algorithms. 

So how do we capture a user’s preferences from a bunch of listing vectors? One method is to average all the listings a user has clicked on to represent them. In other words, the user contextual vector is simply the average of all the interacted listings’ contextual vectors. 
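
In code, that averaging step might look something like this (a sketch; the listing vectors could come from any of the representations described below):

import numpy as np

def user_contextual_vector(interacted_listing_vectors):
    """Represent a user as the mean of the vectors of listings they interacted with."""
    if not interacted_listing_vectors:
        return None  # no history: the ranker falls back to non-personalized features
    return np.mean(np.stack(interacted_listing_vectors), axis=0)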

We gather historical and contextual features from across our mobile web, desktop and mobile application platforms. This allows us to maximize the amount of information our model can use to personalize search result rankings.

The Many Ways Users Show Us Love

In addition to clicks on a listing from search results, a user has a few other ways to connect with sellers’ items on the marketplace. After a user searches for an item, they can favorite items in the search results page and save them to their own curated collections, they can add an item to their cart while they continue to browse, and once they are satisfied with their selection they can purchase the item.

Each of these interactions has distinct characteristics which help our model generalize and generate more accurate predictions. Clicks are by far the most popular way for buyers to engage with listings, and through sheer quantity provide for a rich source of material to model user behaviors. On the other end, purchase interactions occur less frequently than clicks but contain stronger indications of relevance of an item to the user’s search query. 

The Heart of Personalization

Now, let’s get to the crux of personalization and dig deeper into user contextual features.

Tf-Idf vectors consider listings from a textual standpoint, where words in the seller-provided attributes are weighted according to their importance. These attributes include listing titles, tags, and others. Each word’s importance is derived with respect to its frequency in the immediate listing text, but also the larger corpus of listing texts. This allows us to distinguish a listing from others and capture its unique qualities. When we average the last few months’ worth of listings a user has interacted with, we are averaging the weights of words in those listings to create a single Tf-Idf vector and represent the user. In other words, in Tf-Idf a listing is represented by its most important listing words and a user is represented as an average of those most important words.

Diagram of interaction-based graph embedding

In this example of interaction-based graph embeddings, the queries “dollhouse” and “dolls” resulted in clicks on listing 824770513 on three and five occasions, respectively.

Unlike Tf-Idf, an interaction-based graph embedding can capture the larger interaction context of query and listing journeys. Recall that interactions can be clicks, favorites, add-to-carts or purchases from a user. Let’s say we have a query and some listings that are often clicked with that query. When we match words within the query to the words in the listing texts and weigh the words common to both of them, we can represent listings and queries with the same vocabulary. A common vocabulary is an important quality in textual representations because we can derive degrees of relatedness between queries and listings despite differences in length and purpose. Therefore, if a few different listings are all clicked as a result of the same query, we expect the embeddings for these listings to be similar.

And similar to Tf-Idf, we can simply average the weights of words in the sparse vectors over some time frame. Whereas graph embeddings weave behavior from interaction logs into the vector representation, Tf-Idf only uses available listing text. Put more plainly, for graph embeddings users tell us which queries and listings are related and we model this information by finding overlaps between their words.

Diagram from Learning Item-Interaction Embeddings for User Recommendations

However, focusing on a single interaction type within an embedding can be limiting. In reality, users can have a combination of different interactions within a single shopping session. Item-interaction vectors can learn multiple interactions in the same space. Created by our very own data scientists here at Etsy, item-interaction vectors build upon the methods of word2vec where words occurring in the same context share a higher vector similarity.

The implementation of item-interaction vectors is simple but elegant: we replace words and sentences with item-interaction token pairs and sequences of interacted items. A token pair is formulated as (item, interaction-type) to represent how a user interacted with a specific item or listing. And an ordered list of these tokens represents the sequence of what and how a user interacted with various listings in a session. As a result, item-interaction token pairs that appear in the same context will be considered similar.

Because these listing embeddings are dense vectors, we can easily find similar listings via distance metrics. To summarize item-interaction vectors: as with interaction-based graph embeddings, we let the users guide us in learning which listings are similar. But rather than deriving listing similarities from query and listing relationships, we infer similarity when listings appear in the same sequences of interactions.
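
A rough sketch of the idea using gensim’s word2vec implementation (the session data is made up, and Etsy’s actual training setup and hyperparameters differ):

from gensim.models import Word2Vec

# Each "sentence" is one user session: an ordered list of (item, interaction-type) tokens.
sessions = [
    ["824770513_click", "731204985_favorite", "824770513_purchase"],
    ["112233445_click", "731204985_click", "112233445_cart"],
]
model = Word2Vec(sentences=sessions, vector_size=64, window=5, min_count=1, sg=1)
vec = model.wv["824770513_click"]  # dense embedding for an (item, interaction) token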

Putting It All Together

Let’s take stock of what we have to work with: recent or lifetime look back windows, three types of contextual features (Tf-Idf, graph embedding, item-interaction), and four types of user behavior interactions (click, favorite, add-to-cart, purchase). Mixing and matching these together, we have a grand total of 24 contextual vectors to represent a single user in order to rank items for personalized search results. For example, we can combine an overall time window, item-interaction method, and “favorite” interactions to generate an item-interaction vector that represents a user’s all-time favorite listings.

Search results for “necklace charms blue”

(above) default results, (below) a user who recently interacted with eye charms

In personalized search ranking, when a user enters a query we still do a coarse sweep of the inventory and grab top-related items to a query in candidate set retrieval. But in the ranking of items, we now include our new features. Recall that decision trees take input features in the form of integers or decimals.

To satisfy this requirement, we can pass user historical features straight through to the tree or create new features by combining them with other features beforehand. To include user contextual features in the ranking, we have to compute similarity metrics between users’ contextual vectors and the candidate listing vectors from the candidate retrieval step. We derive Jaccard similarity and token overlap for sparse vectors and cosine similarity for dense vectors. From these metrics we understand which candidate listings are more similar to listings a user has previously interacted with. However, these metrics alone are not sufficient to determine a final ranking.
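
For illustration, those similarity features might be computed along these lines (a sketch; sparse vectors are treated here as sets of active tokens):

import numpy as np

def jaccard(tokens_a, tokens_b):
    """Jaccard similarity between two sparse vectors, viewed as sets of active tokens."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def cosine(vec_a, vec_b):
    """Cosine similarity between two dense embedding vectors."""
    return float(np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))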

Decision trees take these inputs and learn how each feature impacts whether an item will be purchased. We feed user historical features, similarity measures, and other non-personalized features into the tree so it can learn to rank listings from most relevant to least. The expectation is that the most relevant listings are the items a user is more likely to purchase.

Personalized Search Performance    

In online A/B experiments we compared this personalized model with a control and observed an improvement in ranking performance as measured by purchase normalized discounted cumulative gain (NDCG). NDCG captures the goodness of a ranking: if, on average, users purchase items ranked higher on the page, the ranking has a high purchase NDCG. In our experiments, we observed that the NDCG for personalization was especially high for users who have recently and/or often interacted with the marketplace.
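
As a toy illustration of how purchase NDCG rewards ranking purchased items higher (binary relevance and the standard log2 discount):

import numpy as np

def dcg(relevances):
    relevances = np.asarray(relevances, dtype=float)
    return float(np.sum(relevances / np.log2(np.arange(2, relevances.size + 2))))

def ndcg(ranked_relevances):
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

print(ndcg([1, 0, 0, 0]))  # 1.0   (the purchased item was ranked first)
print(ndcg([0, 0, 0, 1]))  # ~0.43 (the purchased item was ranked fourth)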

Search results for “print maxi dress”

(above) default results, (below) a user who recently interacted with African prints

Users also click around less in personalized results, paginate to fewer pages, and buy more items compared to the control model. This indicates that users are finding what they want faster with the personalized variant.

Overall, personalization features play an important role relative to the existing features for our decision tree. Generally speaking, the importance gain of a feature in a decision tree indicates how much a feature contributes to better predictions. In personalization, contextual features prove to be a strong influence in determining a good final ranking. For users with a richer history of interactions, we provide even better personalized results. This is confirmed by a greater number of purchases online and increased NDCG offline for users that have recently purchased items more than once. Vector representations from recent time windows had greater feature importance gain compared to lifetime vectors. This means that users’ recent interactions with listings give a better indication of what the user wants to purchase. 

Out of the three user contextual feature types, the text-based Tf-Idf vectors tend to have higher feature importance gain. This might suggest that ranking items based on seller-provided attributes given a query is the best way to help users find what they are looking for.

We also identify users’ clicked and favorited items as more important signals compared to past cart adds or purchases. This might indicate that if a user purchased an item once, they have less utility for highly similar items later. 

Challenges & Considerations

Conclusion

In this post, we have covered how Etsy achieves personalized search ranking results. Our models learn which listings should rank higher for a user based on their own history as well as others’ history. These features are encapsulated with user historical features and contextual features. Since launching personalization, users have been able to find items they liked more easily, and they often come back for more. At Etsy, we’re focused on connecting our vibrant marketplace of 3.7 million sellers with shoppers around the world. With the introduction of personalized ranking, we hope to maintain course in our mission to keep commerce human. 


Improving our design system through Dark Mode

Posted by on October 21, 2020 / 1 Comment

Etsy recently launched Dark Mode in our iOS and Android buyer apps. Since Dark Mode was introduced system-wide last year in iOS 13 and Android 10, it has quickly become a highly requested feature by our users and an industry standard. Benefits of Dark Mode include reduced eye strain, accessibility improvements, and increased battery life. For the Design Systems team at Etsy, it was the perfect opportunity to test the limits of the new design system in our apps.

In 2019, we brought Etsy’s design system, Collage, to our iOS and Android apps. Around the same time, Apple announced Dark Mode in iOS 13. By implementing Dark Mode, the Design Systems team could not only give users the flexibility to customize the appearance of the Etsy app to match their preferences, but also test the new UI components app-wide and increase adoption of Collage in the process.

Semantic colors

Without semantic colors, Dark Mode wouldn’t have been possible. Semantic colors are colors that are named relative to their purpose in the UI (e.g. primary button color) instead of how they look (e.g. light orange color). Collage used a semantic naming convention for colors from the beginning. This made it relatively easy to support dynamic colors, which are named colors that can have different values when switching between light and dark modes. For example, a dynamic primary text color might be black in light mode and white in Dark Mode.

Dynamic semantic colors opened up the possibility for Dark Mode, but they also led to a more accessible app for everyone. On iOS, we also added support for the Increased Contrast accessibility feature which increases the contrast between text and backgrounds to improve legibility. Any color in the Etsy app can now have up to four values for light/dark modes and regular/increased contrast.

Color generation

To streamline the process for adding new colors, we created a script on iOS that generates all of our color assets and files. With the growing complexity of dynamic colors, having a single source of truth for color definitions is important. On iOS, our source of truth is a property list (a key-value store for app data) of color names and values. We created a script that automatically runs when the app is built and generates all the files we need to represent colors: an asset catalog, convenience variables for accessing colors in code, and a source file for the unit tests. Adding a new color is as simple as adding a line to the property list, and all the relevant files are updated for you. This approach has reduced the time it takes to add a new color and eliminated the risk of inconsistencies across the codebase.

On iOS, a script reads from a property list to generate the color asset catalog and convenience variables.
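
As a rough sketch of the approach (the property list format, color names, and generated output here are all hypothetical; the real script also produces the asset catalog and unit-test sources):

import plistlib

# Hypothetical source of truth, e.g. {"clg-color-text-primary": {"light": "#222222", "dark": "#FFFFFF"}}
with open("Colors.plist", "rb") as f:
    colors = plistlib.load(f)

lines = ["import UIKit", "", "enum CollageColor {"]
for name, values in sorted(colors.items()):
    prop = name.replace("clg-color-", "").replace("-", "_")
    lines.append(
        f'    static let {prop} = UIColor(named: "{name}")!  // light: {values["light"]}, dark: {values["dark"]}'
    )
lines.append("}")
print("\n".join(lines))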

Rethinking elevation

Another design change we made for Dark Mode was rethinking how we represent elevation in our UI components. In light mode, it’s common to add a shadow around your view or dim the background to show that one view is layered above another. In Dark Mode, these approaches aren’t as effective and the platform convention is to slightly lighten the background of your view instead. The Etsy app uses shadows and borders extensively to indicate levels of elevation. For Dark Mode, we removed shadows entirely and used borders much more sparingly. Instead, we followed iOS and Android platform conventions and introduced elevated background colors into our design system. Semantic colors came to the rescue again and we were easily able to use our regular background color in light mode while applying a lighter color in Dark Mode on views that needed it, such as listing cards.

Examples of elevated cards in light and dark modes

Choose your theme

There is no system-level Dark Mode setting available on older versions of Android, but it can be enabled for specific apps that support Themes (an Android feature that allows for UI customization and provides the underlying structure for Dark Mode). This limitation turned into an opportunity for us to provide more customization options for all our users. In both our iOS and Android apps you can personalize the appearance of the Etsy app to your preferences. So if you want to keep your phone in light mode but the Etsy app needs to be easy on the eyes for late night shopping, we’ve got you covered.

Dark Mode in web views

Another obstacle to overcome was our use of web views, a webpage that is displayed within a native app. Web views are used in a handful of places in our iOS and Android apps, and we knew that for a great user experience they needed to work seamlessly in Dark Mode as well. Thankfully, the web engineers on the Design Systems team jumped in to help and devised a solution to this problem. Using the Sass !default syntax for variables, we were able to define default color values for light mode. Then we added Dark Mode variable overrides where we defined our Dark Mode colors. If the webpage is being viewed from within the iOS or Android app with Dark Mode enabled, we load the Dark Mode variables first so the default (light mode) color variables aren’t used because they’ve already been defined for Dark Mode. This approach is easy to maintain and performant, avoiding a long list of style overrides for Dark Mode.

A better design system

Implementing Dark Mode was no small task. It took months of design and engineering effort from the Design Systems team, in collaboration with apps teams across the company. A big thank you to Patrick Montalto, Kate Kelly, Stephanie Sharp, Dennis Kramer, Gabe Kelley, Sam Sherman, Matthew Spencer, Netty Devonshire, Han Cho and Janice Chang. In the end, our users not only got the Dark Mode they’d been asking for, but we also developed a more robust and accessible design system in the process.


Mutation Testing: A Tale of Two Suites

Posted by on August 17, 2020 / 4 Comments

In January of 2020 Etsy engineering embarked upon a highly anticipated initiative. For years our frontend engineers had been using a JavaScript test framework that was developed in house. It utilized Jasmine for assertions and syntax, PhantomJS for the test environment, and a custom, built-from-scratch test runner written in PHP. This setup no longer served our needs for a multitude of reasons:

It was time to reach for an industry standard tool with a strong community and a feature list JavaScript developers have come to expect. We settled on Jest because it met all of those criteria and naturally complemented the areas of our site that are built with React. Within a few months we had all of the necessary groundwork in place to begin a large-scale effort of migrating our legacy tests. This raised an important question — would our test suite be as effective at catching regressions if it was run by Jest as opposed to our legacy test runner?

At first this seemed like a simple question. Our legacy test said:

expect(a).toEqual(b);

And the migrated Jest test said:

expect(a).toEqual(b);

So weren’t they doing the same thing? Maybe. What if one checked shallow equality and the other checked deep equality? And what about assertions that had no corollaries in Jest? Our legacy suite relied on jasmine.ajax and jasmine-jquery, and we would need to propose alternatives for both modules when migrating our tests. All of this opened the door for subtle variations to creep in and make the difference between catching and missing a bug. We could have spent our time poring through the source code of Jasmine and Jest to figure out if these differences really existed, but instead we decided to use Mutation Testing to find them for us.

What is Mutation Testing?

Mutation Testing allows developers to score their test suite based on how many potential bugs it can catch. Since we were testing our JavaScript suite we reached for Stryker, which works roughly the same as any other Mutation Testing framework. Stryker analyzes our source code, makes any number of copies of it, then mutates those copies by programmatically inserting bugs into them. It then runs our unit tests against each “mutant” and sees if the suite fails or passes. If all tests pass, then the mutant has survived. If one or more tests fail, then the mutant has been killed. The more mutants that are killed, the more confidence we have that the suite will catch regressions in our code. After testing all of these potential mutations, Stryker generates a score by dividing the number of mutants that were killed by the total number generated:

Output from running Stryker on a single file

Stryker’s default reporter even displays how it generated the mutants that survived, so it’s easy to identify the gaps in the suite. In this case, two Conditional Expression mutants and a Logical Operator mutant survived. Altogether, Stryker supports roughly thirty possible mutation types, but that list can be whittled down for faster test runs.

The Experiment

Since our hypothesis was that the implementation differences between Jasmine and Jest could affect the Mutation Score of our legacy and new test suites, we began by cataloging every bit of Jasmine-specific syntax in our legacy suite. We then compiled a list of roughly forty test files that we would target for Mutation Testing in order to cover the full syntax catalog. For each file we generated a Mutation Score for its legacy state, converted it to run in our new Jest setup, and generated a Mutation Score again. Our hope was that the new Jest framework would have a Mutation Score as good as or better than our legacy framework.

By limiting the scope of our test to just a few dozen files, we were able to run all mutations Stryker had to offer within a reasonable timeframe. However, the sheer size of our codebase and the sprawling dependency trees in any given feature presented other challenges to this work. As I mentioned before, Stryker copies the source code to be mutated into separate sandbox directories. By default, it copies the entire project into each sandbox, but that was too much for Node.js to handle in our repository:

Error when opening too many files

Stryker allows users to configure an array of files to copy over instead of the entire codebase, but doing so would require us to know the full dependency tree of each file that we hoped to test ahead of time. Instead of figuring that out by hand, we wrote a custom Jest file resolver specifically for our Stryker testing environment. It would attempt to resolve source files from the local directory structure, but it wouldn’t fail immediately if they weren’t found. Instead, our new resolver would reach outside of the Stryker sandbox to find the file in the original directory structure, copy it into the sandbox, and re-initiate the resolution process. This approach saved us time on files with very expansive dependency trees. With that in hand, we pressed on with our experiment.
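A simplified sketch of that resolver is below. This is illustrative only: the real resolver handled more edge cases, and the environment variable and paths shown here are assumptions rather than our actual setup.

// stryker-jest-resolver.js: an illustrative fallback resolver.
const fs = require('fs');
const path = require('path');

// Hypothetical: the original, un-mutated checkout and Stryker's sandbox copy.
const ORIGINAL_ROOT = process.env.ORIGINAL_PROJECT_ROOT;
const SANDBOX_ROOT = process.cwd();

module.exports = (request, options) => {
  try {
    // First, try to resolve the module inside the sandbox as usual.
    return options.defaultResolver(request, options);
  } catch (err) {
    // The file wasn't copied into the sandbox. Resolve it against the
    // original tree from the analogous directory...
    const relativeDir = path.relative(SANDBOX_ROOT, options.basedir);
    const originalPath = options.defaultResolver(request, {
      ...options,
      basedir: path.join(ORIGINAL_ROOT, relativeDir),
    });

    // ...copy it into the sandbox at the matching location...
    const sandboxPath = path.join(
      SANDBOX_ROOT,
      path.relative(ORIGINAL_ROOT, originalPath)
    );
    fs.mkdirSync(path.dirname(sandboxPath), { recursive: true });
    fs.copyFileSync(originalPath, sandboxPath);

    // ...and re-initiate resolution now that the file exists locally.
    return options.defaultResolver(request, options);
  }
};

The sandboxed Jest configuration then points at a module like this one via Jest’s resolver option.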

The Result

Ultimately we found that our new Jest framework had a worse Mutation Score than our legacy framework.

…Wait, What?

It’s true. On average, tests run by our legacy framework received a 55.28% Mutation Score, whereas tests run by our new Jest framework received a 54.35% Mutation Score. In one of our worst cases, the legacy test earned a 35% while the Jest test picked up a measly 16%.

Analyzing The Result

Once we began seeing lower Mutation Scores on a multitude of files, we put a hold on the migration to investigate what sort of mutants were slipping past our new suite. It turned out that most of what our new Jest suite failed to catch were String Literal mutations in our Asynchronous Module Definitions:

Mutant generated by replacing a dependency definition with an empty string
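Concretely, a String Literal mutant of this kind swaps one module ID in the AMD dependency array for an empty string. A hypothetical example (module names invented for illustration):

// Original module definition.
define(['jquery', 'listing-card'], function ($, ListingCard) {
  // ... module body ...
});

// Surviving mutant: one dependency ID replaced with an empty string.
define(['', 'listing-card'], function ($, ListingCard) {
  // ... module body ...
});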

We dug into these failures further and discovered that the real culprit was how the different test runners compiled our code. Our legacy test runner was custom built to handle Etsy’s unique codebase and was tightly coupled to the rest of our infrastructure. When we kicked off tests it would locate all relevant source and test files, run them through our actual webpack build process, then load the resulting code into PhantomJS to execute. When webpack encountered empty strings in the dependency definition it would throw an error and halt the test, effectively catching the bug even if there were no tests that actually relied on that dependency.

Jest, on the other hand, was able to bypass our build system using its file resolver and a handful of custom mappings and transformers. This was one of the big draws of the migration in the first place; decoupling the tests from the build process meant they could execute in a fraction of the time. However, the module we used in Jest to manage dependencies was much more lenient than our actual build system, and empty strings were simply ignored. This meant that unless a test actually relied on the dependency, our Jest setup had no way to alert the tester if it was accidentally left out. Ultimately we decided that this sort of bug was acceptable to let slide. While it would no longer be caught during the testing phase, the code would still be rejected by the build phase of our CI pipeline, thereby preventing the bug from reaching Production.

As we proceeded with the migration we encountered a handful of other cases where the Mutation Scores were markedly different, one of which is particularly notable. We happened upon an asynchronous test that used a done() callback to signify when the test should exit. The test was malformed in that it called done() twice, with assertions between the two calls. In Jest this was no big deal; it happily executed the additional assertions before ending the test. Jasmine was much stricter, though: it stopped the test immediately when it encountered the first callback. As a result, we saw a significant jump in Mutation Score because mutants were suddenly being caught by the dangling assertions. This validated our suspicion that implementation differences between Jasmine and Jest could affect which bugs were caught and which slipped through.
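Stripped down, the malformed test looked something like this (the function, data, and assertions are simplified stand-ins, not the real test):

it('updates the listing after saving', function (done) {
  saveListing(listing, function (result) {
    expect(result.ok).toBe(true);
    done(); // Jasmine ended the spec here, ignoring everything below.

    // Under our Jest setup these "dangling" assertions still ran, which is
    // why mutants in the code they covered were suddenly being caught.
    expect(result.title).toEqual('Hand-carved spoon');
    done();
  });
});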

The Future of Mutation Testing at Etsy

Over the course of this experiment we learned a ton about our testing frameworks and Mutation Testing in general. Stryker generated more than 3,800 mutations for the forty or so files that were tested, which equates to roughly ninety-five mutants, and therefore test runs, per file. In all transparency, that number is likely artificially low, as we ruled out some of the files we had initially identified for testing when we realized they generated many hundreds of mutations. If we assume our calculated average is indicative of all files and account for how long it takes to run our entire Jest suite, then we can estimate that a single-threaded, full Mutation Test of our entire JavaScript codebase would take about five and a half years to complete. Granted, Stryker parallelizes test runs out of the box, and we could potentially see even more performance gains using Jest’s findRelatedTests feature to narrow down which tests are run based on which file was mutated. Even so, it’s difficult to imagine running a full Mutation Test on any regular cadence.

While it may not be feasible for Etsy to test every possible mutant in the codebase, we can still gain insights about our testing practices by applying Mutation Testing at a more granular level. A manageable approach would be to generate a Mutation Score automatically any time a pull request is opened, focusing the testing on only the files that changed. Posting that information on the pull request could help us understand what conditions will cause our unit tests to fail. It’s easy to write an overly lenient test that will pass no matter what, and in some ways that’s more dangerous than having no test at all. If we only look at Code Coverage, such a test boosts our numbers, giving us a false sense of security that bugs will be caught. Mutation Score forces us to confront the limitations of our suite and encourages us to test as effectively as possible.
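A rough sketch of what a PR-scoped run could look like is below. The config shape follows Stryker’s JavaScript configuration file, but the diff base, file filter, and option values are assumptions for illustration rather than a working recommendation.

// stryker.conf.js: a sketch of mutating only a pull request's changed files.
const { execSync } = require('child_process');

// Hypothetical diff base; in practice this would be the PR's target branch.
const changedFiles = execSync('git diff --name-only origin/main...HEAD')
  .toString()
  .split('\n')
  .filter((file) => /\.(js|jsx|ts|tsx)$/.test(file));

module.exports = {
  // Mutate only the files touched by the pull request.
  mutate: changedFiles,
  testRunner: 'jest',
  jest: {
    // Lets the Jest runner use findRelatedTests so only tests that exercise
    // a mutated file are executed for each mutant.
    enableFindRelatedTests: true,
  },
  reporters: ['progress', 'html'],
  concurrency: 4,
};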


How to Pick a Metric as the North Star for Algorithms to Optimize Business KPI? A Causal Inference Approach

Posted by on August 3, 2020 / 1 Comment

This article draws on our paper published at KDD 2020 (Oral Presentation, Selection Rate: 5.8%, 44 out of 756)

Introduction

It is common in the internet industry to develop algorithms that power online products using historical data.  An algorithm that improves evaluation metrics from historical data will be tested against one that has been in production to assess the lift in key performance indicators (KPIs) of the business in online A/B tests.  We refer to metrics calculated using new predictions from an algorithm and historical ground truth as offline evaluation metrics.  In many cases, offline evaluation metrics are different from business KPIs.  For example, a ranking algorithm, which powers search pages on Etsy.com, typically optimizes for relevance by predicting purchase or click probabilities of items.  It could be tested offline (offline A/B tests) for rank-aware evaluation metrics, for example, normalized discounted cumulative gain (NDCG), mean reciprocal rank (MRR) or mean average precision (MAP), which are calculated using predicted ranks of ranking algorithms on the test set of historical purchase or click-through feedback of users.  Most e-commerce platforms, however, deem sitewide gross merchandise sale (GMS) as their business KPI and test for it online.  There could be various reasons not to directly optimize for business KPIs offline or use business KPIs as offline evaluation metrics, such as technical difficulty, business reputation, or user loyalty.  Nonetheless, the discrepancy between offline evaluation metrics and online business KPIs poses a challenge to product owners because it is not clear which offline evaluation metric, among all available ones, is the north star to guide the development of algorithms in order to optimize business KPIs.

The challenge essentially asks for the causal effects of increasing offline evaluation metrics on business KPIs, for example how business KPIs would change for a 10% increase in an offline evaluation metric with all other conditions remaining the same (ceteris paribus).  The north star should be the offline evaluation metric that has the greatest causal effects on business KPIs.  Note that business KPIs are impacted by numerous factors, internal and external, especially macroeconomic situations and sociopolitical environments.  This means, just from changes in offline evaluation metrics, we have no way to predict future business KPIs.  Because we are only able to optimize for offline evaluation metrics to affect business KPIs, we try to infer the change in business KPIs, given all other factors, internal and external, unchanged, from changes in our offline evaluation metrics, based on historical data.  Our task here is causal inference rather than prediction.

Our approach is to introduce online evaluation metrics, the online counterparts of offline evaluation metrics, which measure the performance of online products (see Figure 1).  This allows us to decompose the problem into two parts: the first is the consistency between changes in offline and online evaluation metrics; the second is the causality between online products (assessed by online evaluation metrics) and the business (assessed by online business KPIs).  The first part is solved by the offline A/B test literature through counterfactual estimators of offline evaluation metrics.  Our work focuses on the second part.  The north star should be the offline evaluation metric whose online counterpart has the greatest causal effects on business KPIs.  Hence, the question becomes how business KPIs would change for a 10% increase in an online evaluation metric ceteris paribus.

Figure 1: The Causal Path from Algorithm Trained Offline to Online Business
Note: Offline algorithms power online products, and online products contribute to the business.

Why Causality?

Why do we focus on causality?  Before answering this question, let’s consider another interesting one: which is more intelligent, a thirsty crow or a talking parrot (see Figure 2)?

Figure 2. Thirsty Crow vs. Talking Parrot
Note: The left painting is from The Aesop for Children, by Aesop, illustrated by Milo Winter, http://www.gutenberg.org/etext/19994

In Aesop’s Fables, a thirsty crow found a pitcher with water at the bottom.  The water was beyond the reach of its beak.  The crow intentionally dropped pebbles into the pitcher, which caused the water to rise to the top.  A talking parrot, by contrast, cannot really talk.  After being fed a simple phrase tons of times (big data and machine learning), it can only mimic the speech without understanding its meaning.

The crow is obviously more intelligent than the parrot.  The crow understood the causality between dropped pebbles and rising water and thus leveraged the causality to get the water.  Beyond big data and machine learning (talking parrot), we want our artificial intelligence (AI) system to be as intelligent as the crow.  After understanding the causality between evaluation metric lift and GMS lift, our system can leverage the causality, by lifting the evaluation metric offline, to achieve GMS lift online (see Figure 3).  Understanding and leveraging causality are key topics in current AI research (see, e.g., Bergstein, 2020).

Figure 3: Understanding and Leveraging Causality in Artificial Intelligence

Causal Meta-Mediation Analysis

Online A/B tests are a popular way to measure the causal effects of online product changes on business KPIs.  Unfortunately, they cannot directly tell us the causal effects of increasing offline evaluation metrics on business KPIs.  In online A/B tests, in order to compare the business KPIs caused by different values of an online evaluation metric, we would need to fix the metric at different values for the treatment and control groups.  Take the ranking algorithm as an example.  If we could fix the online NDCG of the search page at 0.22 and 0.2 for the treatment and control groups respectively, then we would know how sitewide GMS would change for a 10% increase in online NDCG at 0.2 ceteris paribus.  However, this experimental design is impossible, because most online evaluation metrics depend on users’ feedback and thus cannot be directly manipulated.

We address the question by developing a novel approach: causal meta-mediation analysis (CMMA).  We model the causality between online evaluation metrics and business KPIs by a dose-response function (DRF) in the potential outcome framework.  The DRF originates from medicine and describes the magnitude of the response of an organism given different doses of a stimulus.  Here we use it to depict the value of a business KPI given different values of an online evaluation metric.  Unlike doses of stimuli, values of online evaluation metrics cannot be directly manipulated.  However, they can differ between treatment and control groups in experiments whose treatments are not algorithms: user interface/user experience (UI/UX) design, marketing, and so on.  This could be due to the “fat hand” nature of online A/B tests, in which a single intervention can change many causal variables at once.  A change to the tested feature, which is not an algorithm, could induce users to change their engagement with algorithm-powered online products, so that the values of online evaluation metrics would change.  For instance, in an experiment on UI design, users might change their search behaviors because of the new UI design, so that the values of online NDCG, which depend on search interaction, would change even though the ranking algorithm does not (see Figure 4).  The evidence suggests that online evaluation metrics can be mediators that partially transmit the causal effects of treatments on business KPIs in experiments whose treatments are not necessarily algorithm-related.  Hence, we formalize the problem as the identification, estimation, and testing of the mediator DRF.

Figure 4: Directed Acyclic Graph of Conceptual Framework

Our novel approach, CMMA, combines mediation analysis and meta-analysis to solve for the mediator DRF.  It relaxes assumptions that are common in the causal mediation literature, such as sequential ignorability (in linear structural equation models) or complete mediation (in instrumental variable approaches), and it extends meta-analysis to solve causal mediation, whereas the meta-analysis literature only learns the distribution of average treatment effects.  We ran extensive simulations, which show that CMMA’s performance is superior to other methods in the literature in terms of unbiasedness and the coverage of confidence intervals.  CMMA uses only experiment-level summary statistics (i.e., meta-data) of many existing experiments, which makes it easy to implement and to scale up.  It can be applied to all experimental-unit-level evaluation metrics or any combination of them.  Because it solves the causality problem of one product by leveraging experiments across all products, CMMA can be particularly useful in real applications for a new product that has shipped online but has few A/B tests.

Application

We apply CMMA to the three most popular rank-aware evaluation metrics (NDCG, MRR, and MAP) to show which one has the greatest causal effect on sitewide GMS for ranking algorithms that power search products.

User-Level Rank-Aware Metrics

We redefine the three rank-aware metrics (NDCG, MAP, MRR) at the user level.  The three metrics were originally defined at the query level in the test-collection evaluation literature of information retrieval (IR).  Because the search engine on Etsy.com is an online product for users, the computation can be adapted to the user level.  We include search sessions with no interaction or feedback in the metric calculation, in keeping with the online business KPI calculations in online A/B tests, which always include visits/users with no interaction or feedback.  Specifically, the three metrics are constructed as follows:

  1. Query-level metrics are computed using rank positions on the search page and user conversion status (binary relevance).  Queries with no conversion have zero values.
  2. The user-level metric is the average of the query-level metrics across all queries the user issues (including queries with no conversion).  Users who do not search or do not convert have zero values.

All three metrics are defined at rank position 48, the lowest position on the first page of search results on Etsy.com.
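As a rough sketch of the construction above (not our production code; the input shape and helper names are invented), user-level NDCG@48 with binary relevance could be computed like this:

// A sketch of user-level NDCG@48 with binary relevance.
// `queries` is assumed to look like:
//   [{ purchasedRanks: [3, 17] }, { purchasedRanks: [] }, ...]
// where purchasedRanks are 1-based positions of converted listings.
const K = 48;

function dcgAtK(ranks, k) {
  return ranks
    .filter((rank) => rank <= k)
    .reduce((sum, rank) => sum + 1 / Math.log2(rank + 1), 0);
}

function queryNdcgAtK(purchasedRanks, k) {
  if (purchasedRanks.length === 0) return 0; // query with no conversion
  const dcg = dcgAtK(purchasedRanks, k);
  // Ideal DCG: all purchased items packed into the top positions.
  const idealRanks = purchasedRanks.map((_, i) => i + 1);
  const idcg = dcgAtK(idealRanks, k);
  return idcg === 0 ? 0 : dcg / idcg;
}

function userNdcgAtK(queries, k = K) {
  if (queries.length === 0) return 0; // user who never searched
  const total = queries.reduce(
    (sum, q) => sum + queryNdcgAtK(q.purchasedRanks, k),
    0
  );
  return total / queries.length; // average over all issued queries
}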

Demo

To demonstrate CMMA, we randomly selected 190 experiments from 2018 and implemented CMMA on the summary results of each experiment (e.g., the average user-level NDCG per user in the treatment and control groups).  Figure 5 shows the results from CMMA.  The vertical axis indicates elasticity, the percentage change of average GMS per user for a 10% increase in an online rank-aware metric with all other conditions that can affect GMS remaining the same.  Lifts in all three rank-aware metrics have positive causal effects on average GMS per user, but they do not perform equally: different values of different rank-aware metrics have different elasticities.  For example, suppose the current values of NDCG, MAP, and MRR of the search page are 0.00210, 0.00156, and 0.00153 respectively.  Then a 10% increase in MRR will cause a higher lift in GMS than a 10% increase in either of the other two metrics ceteris paribus, as indicated by the red dashed lines, and thus MRR should be the north star to guide our development of algorithms.  Because all three metrics are computed from the same input data, they are highly correlated and thus the differences are small.  As new IR evaluation metrics are continuously developed in the literature, we will implement CMMA for more comparisons in the future.

Figure 5: Elasticity from Average Mediator DRF Estimated by CMMA
Note: Elasticity means the percentage change of average GMS per user for a 10% increase in an online rank-aware metric with all other conditions that can affect GMS remaining the same.
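One way to write that definition down, with \(\mu(m)\) denoting the average mediator dose-response function (the expected GMS per user when the online metric is held at value \(m\), all else equal), is the following; the notation here is ours and not necessarily the paper’s:

\[ \text{elasticity}(m) = \frac{\mu(1.1\,m) - \mu(m)}{\mu(m)} \times 100\% \]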

Conclusion

We implement CMMA to identify the north star among rank-aware metrics for search-ranking algorithms.  It is easy to implement and to scale up.  It has helped product and engineering teams achieve efficiency and success in terms of business KPIs in online A/B tests.  We published CMMA at KDD 2020.  We also made a fun video describing the intuition behind CMMA.  Interested readers can refer to our paper for more details and our GitHub repo for the analysis code.
