At Etsy, we strive to nurture a culture of continuous learning and rapid innovation. To ensure that new products and functionalities built by teams — from polishing the look and feel of our app and website, to improving our search and recommendation algorithms — have a positive impact on Etsy’s business objectives and success metrics, virtually all product launch decisions are vetted based on data collected via carefully crafted experiments, also known as A/B tests.
With hundreds of experiments running every day on limited online traffic, our desire for fast development naturally calls for ways to gain insights as early as possible in the lifetime of each experiment, without sacrificing the scientific validity of our conclusions. Among other motivations, this need drove the new formation of our Online Experimentation Science team: a team made of engineers and statisticians, whose key focus areas include building more advanced and scalable statistical tools for online experiments.
In this article, we share details about our team’s journey to bring the statistical method known as CUPED to Etsy, and how it is now helping other teams make more informed product decisions, as well as shorten the duration of their experiments by up to 20%. We offer some perspectives on what makes such a method possible, what it took us to implement it at scale, and what lessons we have learned along the way.
Is my signal just noise?
In order to fully appreciate the value of a method like CUPED, it helps to understand the key statistical challenges that pertain to A/B testing. Imagine that we have just developed a new algorithm to help users search for items on Etsy, and we would like to assess whether deploying it will increase the fraction of users who end up making a purchase, a metric known as conversion rate.
A/B testing consists in randomly forming 2 groups — A and B — of users, such that users in group A are treated (exposed to the new algorithm, regarded as a treatment) while users in group B are untreated (exposed to the current algorithm). After measuring the conversion rates YA and YB from group A and group B, we can use their difference YA – YB to estimate the effect of the treatment.
There are essentially two facets to our endeavour — detection and attribution. In other words, we are asking ourselves two questions:
Does the observed difference reflect the existence of a real effect?
If there is a real effect, is it caused by the treatment?
Since our estimated difference is based only on a random sample of observations, we have to deal with at least two sources of uncertainty.
The first layer of randomness is introduced by the sampling mechanism. Since we are only using a relatively small subset of the entire population of users, attempting to answer the first question requires the observed difference to be a sufficiently accurate estimator of the unobserved population-wide difference, so that we can distinguish a real effect from a fluke.
The other important layer of randomness comes from the assignment mechanism. Claiming that the effect is caused by the treatment requires groups A and B to be similar in all respects, except for the treatment that each group receives. As an illustrative thought experiment: pretend that we could artificially label each user as either “frequent” or “infrequent” based on how many times they have visited Etsy in the previous month. If, by chance, or rather mischance, a disproportionately large number of “frequent” users were assigned to group A (Figure 1), then it would call into question whether the observed difference in conversion rate is indeed due to an effect from the treatment, or whether it is simply due to the fact that the groups are dissimilar.
One solution to the attribution question is to exploit the randomization of the assignments, which guarantees that — except for the treatments received — groups A and B will become more and more similar in every way, on average, as their sample sizes increase. Going one step further, if we somehow understood how the type of a user (e.g. “frequent” or “infrequent”) informs their buying habit, then we could attempt to proactively adjust for group dissimilarities, and correct our naive difference YA – YB by removing the explainable contribution coming from apparent imbalances between groups A and B. This is where CUPED comes into play.
What is CUPED?
CUPED is an acronym for Controlled experiments Using Pre-Experiment Data . It is a method that aims to estimate treatment effects in A/B tests more accurately than simple differences in means. As reviewed in the previous section, we traditionally use the observed difference between the sample means
YA – YB
of two randomly-formed groups A and B to estimate the effect of a treatment on some metric Y of interest (e.g. conversion rate). As hinted earlier, one of the challenges lies in disentangling and quantifying how much of this observed difference is due to a real treatment effect, as opposed to misleading fluctuations due to comparing two subpopulations made of non-identical users. One way to render these latter fluctuations negligible is to increase the number of users in each group. However, the required sample sizes tend to grow proportionally to the variance of the estimator YA – YB, which may be undesirably large in some cases and lead to prohibitively long experiments.
The key idea behind CUPED is not only to play with sample sizes, but also to explain parts of these fluctuations away with the help of other measurable discrepancies between the groups. The CUPED estimator can be written as
YA – YB – (XA – XB) β
which corrects the traditional estimator with an additional denoising term. This correction involves the respective sample means (XA and XB) of a well-thought-out vector of user attributes (so-called covariates and symbolized by X), and a vector β of coefficients to be specified. Our earlier example (Figure 1) involved a single binary covariate, but CUPED generalizes the reasoning to multidimensional and continuous variables. Intuitively, the correction term aims to account for how much of the difference in Y is not due to any effect of the treatment, but rather due to differences in other observable attributes (X) of the groups.
By choosing X as a vector of pre-experiment variables (collected prior to each user’s entry into the experiment), we can ensure that the correction term added by CUPED does not introduce any bias. Additionally, the coefficient β can be judiciously optimized so that the variance of the CUPED estimator becomes smaller than the variance of the original estimator, by a reduction factor that relates to the correlation between Y and X. In simple terms, the more information on Y we can obtain from X, the more variance reduction we can achieve with CUPED. In the context of A/B testing, smaller variances imply smaller sample size requirements, hence shorter experiments. Correspondingly, if sample sizes are kept fixed, smaller variances enable larger statistical power for detecting effects (Figure 2).
The benefits and accessibility of CUPED (especially its quantifiable improvement over the traditional estimator, its interpretability, and the simplicity of its mathematical derivation) explain its popularity and widespread adoption by other online experimentation platforms [2, 3, 4, 5].
Implementing CUPED at Etsy
The implementation of CUPED at scale required us to construct a brand new pipeline (Figure 3) for data processing and statistical computation. Our pipeline consists of 3 main steps:
Retrieving (or imputing) pre-experiment data for all users.
Computing CUPED estimators for each group.
Performing statistical tests (t-test) using CUPED estimators.
When a user enters into an experiment for the first time, we attempt to fetch the user’s most recent historical data from the preceding few weeks. The window length was chosen to hit a sweet spot between looking far enough back in time for historical data to exist, but not as far as to render such pre-experiment data unpredictive of the in-experiment outcomes. This step involves some careful engineering in order to retrieve (and possibly reconstruct) the historical data from the pre-experiment period at the level of each individual user.
Once the pre-experiment variables are retrieved and formatted, we may proceed with the CUPED adjustments. As it turns out, the optimal choice of coefficient β coincides with the ordinary-least-squares coefficient of a linear regression of Y on X. This relationship enables the efficient computation of CUPED estimators as residuals from linear regressions, which we implemented at scale using Apache Spark’s MLlib . The well-established properties of linear regressions also allowed us to design non-trivial and interpretable simulations for unit testing.
Since CUPED estimators can be expressed as simple differences in means (albeit using adjusted outcomes instead of raw outcomes), we were able to leverage our existing t-testing framework to compute corresponding p-values and confidence intervals. Besides the adjusted difference between the two groups, our pipeline also outputs the group-level estimates YA – (XA – XB) β and YB for further reporting and diagnosis. Note the intentional asymmetry of the expressions, as YA – XA β and YB – XB β would generally be biased estimators of the group-level means, unless the covariates were properly centered.
Impact and food for thought
Overall, CUPED leads to meaningful improvements across our experiments. However, we observe varying degrees of success, e.g. when comparing different types of pages (Figure 4). This can be explained by the fact that different pages may have different amounts of available pre-experiment data, with different degrees of informativeness (e.g. some pages may be more prone to be seen by newer users, on whom we may not have much historical information).
In favorable cases, our out-of-the-box CUPED implementation can reduce variances by up to 20%, thus leading to narrower confidence intervals and shorter experiment durations (Figure 5). In more challenging cases where pre-experiment data is largely missing or uninformative, the correction term from CUPED becomes virtually 0, making CUPED estimators revert to their non-CUPED counterparts and hence yield no reduction in variance — but no substantial increase either.
On the engineering side of things, one of the lessons we learned from implementing CUPED is the importance of producing and storing experiment data at the appropriate granularity level, so that the retrieval of pre-experiment data can be done efficiently and in a replicable fashion. Scalability also becomes a key desideratum as we expand the application of CUPED to more and more metrics.
Finally, from a methodological standpoint, an interesting reflection comes from noticing that CUPED estimators can achieve smaller variances than their non-CUPED counterparts, at essentially no cost in bias. The absence of any bias-variance trade-off may make one feel skeptical of the seemingly one-sided superiority of CUPED, as one may often hear that there is no such thing as a free lunch. However, it is insightful to realize that … this lunch is not free.
In fact, the conceptual dues that we are paying to reap CUPED’s benefits are at least two-fold. First, we are borrowing information from additional data. Although the in-experiment sample size required by CUPED is smaller compared to its non-CUPED rival, the total amount of data effectively used by CUPED (when combining both in-experiment and pre-experiment data) may very well be larger. That cost is somewhat hidden by the fact that we are organically collecting pre-experiment data as a byproduct of our natural experimentation cycle, but it is an important cost to acknowledge nonetheless. Second, CUPED estimators are computationally more expensive than their non-CUPED analogues, since their linear regressions induce additional costs in terms of execution time and algorithmic complexity.
All this to say: the increased accuracy of CUPED is the fruit of sensible efforts that warrant thoughtful considerations (e.g. in the choice of covariates) and realistic expectations (i.e. not every experiment is bound to magically benefit).
We hope that our work on CUPED can serve as an inspiring illustration of the valuable synergy between engineering and statistics.
We would like to give our warmest thanks to Alexandra Pappas, MaryKate Guidry, Ercan Yildiz, and Lushi Li for their help and guidance throughout this project.
 Deng A., Xu Y., Kohavi R., Walker T. (2013). Improving the sensitivity of online controlled experiments by utilizing pre-experiment data.
 Xie H., Aurisset J. (2016). Improving the sensitivity of online controlled experiments: case studies at Netflix.
 Jackson S. (2018). How Booking.com increases the power of online experiments with CUPED.
 Kohavi R., Tang D., Xu Y. (2020). Trustworthy online controlled experiments: a practical guide to A/B testing.
 Li J., Tang Y., Bauman J. (2020). Improving experimental power through control using predictions as covariate.
Etsy Ads is an on-site advertising product that gives Etsy sellers the opportunity to promote their handcrafted goods to buyers across our website and mobile apps.
Sellers who participate in the Etsy Ads program are acting as bidders in an automated, real-time auction that allows their listings to be highlighted alongside organic search results, when the ad system judges that they’re relevant. Sellers only pay for a promoted listing when a buyer actually clicks on it.
In 2019, when we moved all of our Etsy Ads sellers to automated bid pricing, we knew it was important to have a strong return on the money our sellers spent with us. We are aligned with our sellers in our goal to drive as many converting and profitable clicks as possible. Over the past year, we’ve doubled down on improvements to the bid system, developing a neural-network-powered platform that can determine, at request time, when the system should bid higher or lower on a seller’s behalf to promote their item to a buyer. We call it contextual bidding, and it represents a significant improvement in the flexibility and effectiveness of the automation in Etsy Ads.
First, Some Background
In the past, Etsy Ads gave sellers the option to set static bids for their listings, based on their best estimate of value. That bid price would be used every time one of their listings matched for a search. If a seller only ever wanted to pay 5 cents for an ad click, they could tell us and we would bid 5 cents for them every time.
The majority of our sellers didn’t have strong opinions about per-listing bid prices, and for them we had an algorithmic auto-bidding system to bid on their behalf. But even for sellers who self-priced, most adopted a “set it and forget it” strategy. Their prices never changed, so our system made the same bid for them at the highest-converting times during the weekend as it did on, say, a Tuesday at 4 a.m. Savvier sellers adjusted their bids day-to-day to take advantage of weekly traffic trends. No matter how proactive they were, though, sellers were still limited because they lacked request-time information about searches. And they could only set one bid price across all ad requests.
Two scatter plots display a sample of ads shown to prospective buyers over the course of one day. Each dot represents a single ad shown: blue, if the listing went on to be purchased, red if it did not. The plots are differentiated by whether the buyer and seller are in the same region. The data suggest a higher likelihood of purchase when buyers are in the same region as sellers and a higher rate of purchase after 12pm. A context-specific bidder can use information like this at the time of a search request to adjust bids.
Our data shows that conversion rates differ across many axes: hour-of-day, day-of-week, platform, page type. In other words, context. We migrated our Etsy Ads sellers to algorithmic bid pricing because we wanted to ensure that all of them, regardless of their sophistication with online advertising, had a fair chance to benefit from promoting their goods on Etsy. But to realize that goal, we knew we needed to build a system that could set context-aware bids.
How to Auto-bid
Etsy Ads ranks listings based on their bid price, multiplied by their predicted click-through rate, in what’s known as a second-price auction. This is an auction structure where the winning bid only has to pay $0.01 above what would be needed to outrank the second-highest bidder in the auction.
In a second-price auction, it is in the advertiser’s best interest to bid the highest amount they are willing to pay, knowing that they will often end up paying less than that. This is the strategy our auto-bidder adopts: it sets a seller’s bid based on the value they can expect from an ad click.
Expected Value of an Ad Click = Predicted Post-Click Conversion Rate x Expected Order Value
To understand the intuition behind this formula, consider a seller who has a listing priced at $100. If, on average, 1 out of every 10 buyers who click on their listing end up purchasing it, the seller could pay $10 per click and (over time) break even. But of course we do not want our sellers just to break even, we want their ads to generate more money in sales than the advertising costs. So the auto-bidder calculates the expected value of an ad click for each listing and then scales those numbers down to generate bid prices.
Bid = Expected Value of an Ad Click * Scaling Factor
At a high level, we are assigning a value to an ad click based on factors like the amount of sales that click is expected to generate..
Obviously, there’s a lot riding on that predicted conversion rate. Even marginal improvement there makes a difference across the whole system. Our old automated bidder was, in a sense, static: it estimated the post-click conversion rate for each listing only once per day. But a contextual bidder, with access to a rich set of attributes for every request, could set bids that were sensitive to how conversion is affected by time of day, day of week, and across platforms. Most importantly, it would be able to take relevance into account, bidding higher or lower based on factors like how the listing ranked for a buyer’s search (all in a way that respects users’ privacy choices.) And it would do all that in real time, with much higher accuracy, thousands of times a second.
Example of a context-specific bid for a “printed towel” search on an iOS device. This listing matched the query, and given its inputs our model predicts that 5 of 1000 clicks should lead to a purchase. With an expected order value of $20, the bid should be around $0.10. The adaptive bid price helps sellers win high purchase-intent traffic and saves budget on low purchase-intent traffic.
Better PC-CVR through Machine Learning
To power our new contextual bidder, we adopted learning-to-rank, a widely applied information-retrieval methodology that uses machine learning to create predictive rankings.
On each searchad request, our system collects the top-N matched candidate listings and predicts a post-click conversion rate for each of them. These predictions are combined with each listing’s expected order value to generate per-listing bids, and the top k results are sorted and returned. Any clicked ads are fed back into our system as training data, with a target label attached that indicates whether or not those ads led to a purchase.
The first version of our new bidding system to reach production used a gradient-boosted decision tree (GBDT) model to do PC-CVR prediction. This model showed substantial improvements over the baseline non-contextual bidding system.
After we launched with our GBDT-based contextual bidder, we developed an even more accurate model that used a neural network (NN) for PC-CVR prediction. We trained a three-layer fully-connected neural network that used categorical features, numeric features, dense vector embeddings and pre-trained word embeddings fine-tuned to the task. The features are both contextual (e.g. time-of-day) and historical (e.g. 30-day listing purchase rate).
The figure displays our neural network architecture for a contextual post-click conversion rate prediction model. We use pre-trained language models to generate average embeddings over text fields such as query and title. An embedding layer encodes sparse features like shop ids into shop embeddings. Numeric features are z-score normalized and categorical features are one-hot encoded. The sigmoid output produces a probability of conversion given the clicked ad context.
Evaluating Post-Click Conversion Rate
It’s not always easy to evaluate the quality of the predictions of a machine-learning model. The metrics we really care about are “online” metrics such as click-through-rate (CTR) or return-on-ad-spend (ROAS). An increase in CTR indicates that we are showing our buyers listings that are more relevant to them. An increase in ROAS indicates that we are giving our sellers a better return on the money they spend with us.
The only way to measure online metrics is to do an A/B test. But these tests take a long time to run, so it is beneficial to have a strong offline evaluation system that allows us to gain confidence ahead of online experimentation. From a product perspective, offline evaluation metrics can be used as benchmarks to guide roadmaps and ensure each incremental model improvement translates to business objectives.
For our post-click conversion rate model, we chose to monitor Precision-Recall AUC (PR AUC) and ROC AUC. The difference between PR AUC and ROC AUC is that PR AUC is more sensitive to the improvements for the positive class, especially when the dataset is extremely imbalanced. In our case, we use PR AUC as our primary metric since we did not downsample the negative examples in our training data.
Below we’ve summarized the relative improvements of both models over the baseline non-contextual system.
The GBDT model outperformed the non-contextual model on both PR AUC (+105.39%) and ROC AUC (+14.59%). These results show that the contextual features added an important signal to the PC-CVR prediction. The large increase in PR AUC indicates that the improved performance was particularly dramatic for the purchase class.
The online experimental results were also promising. All parties (buyers, sellers and Etsy) benefited from this new bidding system. The buyers saw more relevant results from ads (+6.19% CTR), the buyers made more purchases from ads (+3.08% PCCVR) and the sellers received more sales for every dollar they spent on Etsy Ads (+7.02% ROAS).
The neural network model saw even stronger offline and online results, validating our hypothesis that PR AUC was the most important offline metric to track and reaffirming the benefit of continuing to improve our PC-CVR model.
Transforming Post-Click Conversion Predictions to Bid Prices
We were very pleased with the offline performance we were able to get from our PC-CVR model, but we quickly hit a roadblock when we tried to turn our PC-CVR scores into bid prices.
As the figure on the left shows, the distribution of predicted PC-CVR scores is right skewed and has a long tail. This distribution of scores is not surprising, given that our data is extremely imbalanced and sparse. But the consequence is that we cannot use the raw PC-CVR scores to generate our bid prices because the PC-CVR component would dominate the bidding system. We do not want such gigantic variances in bid prices and we never want to charge our sellers above $2 for a click.
The graph on the left is the distribution of scores from our PC-CVR model. The distribution on the right is the transformed distribution of scores that we used to generate bid prices.
To solve these issues, we refitted the logit of predicted PC-CVR with a smoother logistic regression. Below is the proposed function:
where x is predicted post-click conversion rate, k and b are hyperparameters. Note that if k=1 and b=0, this function is the identity function.
We set the k and b parameters such that the average bid price was unchanged from the old system to the new system. While we did not guarantee that each seller’s bid prices would remain consistent, at least we could guarantee that the average bid prices would not change. The final distribution is shown in the right-hand graph above.
One topic we didn’t dive into in this post is the difficulty of experimenting with our bidding system. For one thing, real money is involved whenever we run an A/B test which means that we need to be very careful about how we adjust the system. Nor is it possible to truly isolate the variants of an A/B test because all of the variants use the same pool of available budget. And once we changed our bidding system to use a contextual model, we moved from having to make batch predictions once a day to having to make online predictions up to 12,000 times a second.
We’re planning a follow-up post on some of the machine-learning infrastructure challenges we faced when scaling this system. But for now we’ll just say that the effort was worth it. We took on a lot of responsibility two years ago when we moved all our sellers to our automated bidding system, but we can confidently say that even the most sophisticated of them (and we have some very sophisticated sellers!) are benefitting from our improved bidding system.
For Etsy, 2020 was a year of unprecedented and volatile growth. (Simply on a human level it was also a profoundly tragic year, but that’s a different article.) Our site traffic leapt up in the second quarter, when lockdowns went into widespread effect, by an amount it normally would have taken several years to achieve.
When we looked ahead to the holiday season, we knew we faced real uncertainty. It was difficult to predict how large-scale economic, social and political changes would affect buyer behavior. Would social distancing accelerate the trends toward online shopping? Even if our traffic maintained a normal month-over-month growth rate from August to December, given the plateau we had reached, that would put us on course deep into uncharted territory.
Graph of weekly traffic for 2016-2020. This was intimidating!
If traffic exceeds our preparation, would we have enough compute resources? Would we hit some hidden scalability bottlenecks? If we over-scaled, there was a risk of wasting money and spoiling environmental resources. If we under-scaled, it could have been much worse for our sellers.
For context about Etsy, as of 2020 Q4 we had 81 million active buyers and over 85 million items for sale.
Modulating Our Pace of Change
When we talk about capacity, we have to recognize that ultimately Etsy’s community of sellers and employees are the source of it. The holiday shopping season is always high-stakes so we adapt our normal working order each year so that we don’t over-burden that capacity.
Many sellers operate their businesses at full stretch for weeks during the season. It’s unfair to ask them to adapt to changes at a time when they’re already maxed out.
Our Member Services support staff are busy when our sellers are busy. If we push changes that increase the demand for support, we undermine our own efforts to provide excellent care.
Our infrastructure teams are on alert to scale our systems against novel peaks and to maintain overall system health. Last-minute changes could easily compromise their efforts.
In short, we discourage, for a few weeks, deploying changes that might be expected to disrupt sellers, support, or infrastructure. Necessary changes can still get pushed, but the standards are higher for preparation and communication. We call this period “Slush” because it’s not a total freeze and we’ve written about it before.
For more than a decade, we have maintained Slush as an Etsy holiday tradition. Each year, however, we iterate and adapt the guidance, trying to strike the right balance of constraints. It’s a living tradition.
Modeling History To Inform Capacity Planning
Even though moving to Google Cloud Platform (GCP) in 2018 vastly streamlined our preparations for the holidays, we still rely on good old-fashioned capacity planning. We look at historical trends and system resource usage and communicate our infrastructure projections to GCP many weeks ahead of time to ensure they will have the right types of machines in the right locations. For this purpose, we share upper-bound estimates, because this may become the maximum available to us later. We err on the side of over-communicating with GCP.
As we approached the final weeks of preparation and the ramp up to Cyber Monday, we wanted to understand whether we were growing faster or slower than a normal year. Because US Thanksgiving is celebrated on the fourth Thursday of November rather than a set date, it moves around in the calendar and it takes a little effort to perform year-to-year comparisons. My teammate Dany Daya built a simple model that looked at daily traffic but normalized the date using “days until Cyber Monday.”
This would come in very handy as a stable benchmark of normal trends when customer patterns shifted unusually.
Adapting Our “Macro” Load Testing
Though we occasionally write synthetic load tests to better understand Etsy’s scalability, I’m generally skeptical about what we can learn using macro (site-wide) synthetic tests. When you take a very complex system and make it safe to test at scale, you’re often looking at something quite different from normal operation in production. We usually get the most learning per effort by testing discrete components of our site, such as we do with search.
Having experienced effectively multiple years of growth during Q2, and knowing consumer patterns could change unexpectedly, performance at scale was now an even bigger concern. We decided to try a multi-team game day with macro load testing and see what we could learn. We wanted an open-ended test that could help expose bottlenecks as early as possible.
We gathered together the 15-plus teams that are primarily responsible for scaling systems. We talked about how much traffic we wanted to simulate, asked about how the proposed load tests might affect their systems and whether there were any significant risks.
The many technical insights of these teams deserve their own articles. Many of their systems can be scaled horizontally without much fuss, but the work is a craft: predicting resource requirements, anticipating bottlenecks, and safely increasing/decreasing capacity without disrupting production. All teams had at least one common deadline: request quota increases from GCP for their projects by early September.
We were ready to practice together as part of a multi-team “scale day” in early October. We asked everyone to increase their capacity to handle 3x the traffic of August and then we ran load tests by replaying requests (our systems served production traffic plus synthetic replays). We gradually ramped up requests, looking for signs of increased latency, errors, or system saturation.
While there are limitations to what we can learn from a general site-wide load test, scale day helped us build confidence.
We confirmed all projects had enough quota from GCP
We confirmed our many scaling tools worked as intended (Terraform, GKE, instance group managers, Chef, etc.)
We exposed bottlenecks in the configuration of some components, for example some Memcache clusters and our StatsD relays, both of which were quickly addressed
But crucially, we confirmed many systems looked like they could handle scale beyond what we expected at the peak of 2020.
Cresting The Peak
Let’s skip ahead to Cyber Monday, which is typically our busiest day of the year. Throughput on our sharded MySQL infrastructure peaked around 1.5 million queries per second. Memcache throughput peaked over 20M requests per second. Our internal http API cluster served over 300k requests per second.
Normally, no one deploys on Cyber Monday. Our focus is on responding to any emergent issues as quickly as possible. But 2020 threw us another curve: postal service interruptions meant that our customers were facing widespread package delivery delays. It only needed a small code change to better inform our buyers about the issue, but we’d be deploying it at the peak hour of our busiest day. And since it would be the first deployment of that day, the entire codebase would need to be compiled from scratch, in production, on more than 1000 hosts.
We debated waiting till morning to push the change, but that wouldn’t have served our customers, and we were confident we could push at any time. Still, as Todd Mazierski and Anu Gulati began the deploy, we started nominating each other for hypothetical Three Armed Sweaters Awards. But the change turned out to be sublimely uneventful. We have been practicing continuous deployment for more than a decade. We have invested in making deployments safe and easy. We know our tools pretty well and we have confidence in them.
We have long maintained a focus on scalability at Etsy, but we all expected to double traffic over a period of years, not just a few months. We certainly did not expect to face these challenges while working entirely distributed during a pandemic.
We made it to Christmas with fewer operational issues than we’ve experienced in recent memory. I think our success in 2020 underscored some important things about Etsy’s culture and technical practices.
We take pride in operational excellence, meaning that every engineer takes responsibility not just for their own code, but for how it actually operates in production for our users. When there is an outage, we always have more than enough experts on hand to mitigate the issue quickly. When we hold a Blameless Postmortem, everyone shares their story candidly. When we discover a technical or organization liability, we try to acknowledge it openly rather than hide it. All of this helps to keep our incidents small.
Our approach to systems architecture values long-term continuity, with a focus on a small number of well-understood tools, and that provided us the ability to scale with confidence. So while 2020 had more than its share of surprising circumstances, we could still count on minimal surprises from our tools.
The Etsy marketplace brings together shoppers and independent sellers from all over the world. Our unconventional inventory presents unique challenges for product search, given that many of our listings fall outside of standard e-commerce categories. With more than 80 million listings and 3.7 million sellers, Etsy relies on machine learning to help users browse creative, handmade goods in their search results. But what if we could make the search experience even better with results tailored to each user? Enter personalized search results.
Search results for “tray”
(above) default results, (below) a user who recently interacted with leather goods
When a user logs into the marketplace and searches for items, they signal their preferences by interacting with listings that pique their interest. In personalization, our algorithms train on these signals and learn to predict, per user, the most relevant listings. The resulting personalized model lets individuals’ taste shine through in their search results without compromising performance. Personalization enhances the underlying search model by customizing the ordering of relevant items according to user preference. Using a combination of historical and contextual features, the search ranking model learns to recognize which items have a greater alignment with an individual’s taste.
In the following sections, we describe the Etsy search architecture and pipeline, the features we use to create personalized search results, and the performance of this new model. Finally, we reflect on the challenges from launching our first personalized search model and look ahead to future iterations of personalization. Please note that some sections of the post are more technical and assume knowledge of machine learning basics from the reader.
Etsy Search Architecture
The search pipeline is separated into two passes: candidate set retrieval and ranking. This ensures that we are returning low-latency results — a crucial component of a good search system. Because the latter ranking step is computationally expensive, we want to use it wisely. So from millions of total listings, the candidate set retrieval step selects the top one thousand items for a query by considering tags, titles, and other seller-provided attributes. This allows us to run the ranking algorithm over less than one percent of all listings. In the end, the ranker will place the most relevant items at the top of the search results page.
Our search ranking algorithm is an ensemble model that uses a gradient boosted decision tree with pairwise formulation. For personalization, we introduce a new set of features that allow us to model user preferences. These features are included in addition to existing features, such as listing price and recency. And as much as we try to avoid impacting latency, the introduction of these personalized features creates new challenges in serving online results. Because the features are specific to each user, the cache utilization rate drops. We address this and other challenges in a later section.
Personalized User Representations
The novelty of personalized search results lies in the new features we pass to the ranker. We categorize personalization features into two groups: historical and contextual features.
Historical features refer to singular data points about a user’s profile that can succinctly describe shopping habits and behaviors. Are they modern consumers of digital goods, or are they hunting and gathering vintage pieces? Do they carefully deliberate on each individual purchase, or do they follow their lightning-strike gut instinct? We can gather these insights from the number of digital or vintage items purchased and average number of site visits. Historical user features help us put the “person” in personalization.
Search results for “lamp”
(above) default results, (below) a user who recently interacted with epoxy resin items
In contrast to these numerical features, data can also be represented as a vector, or a list of numbers. For personalization, we refer to these vectored features as contextual features because the listing vector represents a listing with respect to the context of all other listings. In fact, there are many ways to represent a listing as a vector but we use term frequency–inverse document frequency (Tf-Idf), item-interaction embeddings and interaction-based graph embeddings. If you’re unfamiliar with any of these methods, don’t worry! We’ll be diving deeper into the specific vector generation algorithms.
So how do we capture a user’s preferences from a bunch of listing vectors? One method is to average all the listings a user has clicked on to represent them. In other words, the user contextual vector is simply the average of all the interacted listings’ contextual vectors.
We gather historical and contextual features from across our mobile web, desktop and mobile application platforms. This allows us to maximize the amount of information our model can use to personalize search result rankings.
The Many Ways Users Show Us Love
In addition to clicks on a listing from search results, a user has a few other ways to connect with sellers’ items on the marketplace. After a user searches for an item, they can favorite items in the search results page and save them to their own curated collections, they can add an item to their cart while they continue to browse, and once they are satisfied with their selection they can purchase the item.
Each of these interactions has distinct characteristics which help our model generalize and generate more accurate predictions. Clicks are by far the most popular way for buyers to engage with listings, and through sheer quantity provide for a rich source of material to model user behaviors. On the other end, purchase interactions occur less frequently than clicks but contain stronger indications of relevance of an item to the user’s search query.
The Heart of Personalization
Now, let’s get to the crux of personalization and dig deeper into user contextual features.
Tf-Idf vectors consider listings from a textual standpoint, where words in the seller-provided attributes are weighted according to their importance. These attributes include listing titles, tags, and others. Each word’s importance is derived with respect to its frequency in the immediate listing text, but also the larger corpus of listing texts. This allows us to distinguish a listing from others and capture its unique qualities. When we average the last few months’ worth of listings a user has interacted with, we are averaging the weights of words in those listings to create a single Tf-Idf vector and represent the user. In other words, in Tf-Idf a listing is represented by its most important listing words and a user is represented as an average of those most important words.
Diagram of interaction-based graph embedding
In this example of interaction-based graph embeddings, the queries “dollhouse” and “dolls” resulted in clicks on listing 824770513 on three and five occasions, respectively.
Unlike Tf-Idf, an interaction-based graph embedding can capture the larger interaction context of query and listing journeys. Recall that interactions can be as clicks, favorites, add-to-carts or purchases from a user. Let’s say we have a query and some listings that are often clicked with that query. When we match words within the query to the words in the listing texts and weigh the words common to both of them, we can represent listings and queries with the same vocabulary. A common vocabulary is an important quality in textual representations because we can derive degrees of relatedness between queries to listings despite differences in length and purpose. Therefore, if a few different listings are all clicked as a result of the same query, we expect the embeddings for these listings to be similar.
And similar to Tf-Idf, we can simply average the weights of words in the sparse vectors over some time frame. Whereas graph embeddings weave behavior from interaction logs into the vector representation, Tf-Idf only uses available listing text. Put more plainly, for graph embeddings users tell us which queries and listings are related and we model this information by finding overlaps between their words.
Diagram from Learning Item-Interaction Embeddings for User Recommendations
However, focusing on a single interaction type within an embedding can be limiting. In reality, users can have a combination of different interactions within a single shopping session. Item-interaction vectors can learn multiple interactions in the same space. Created by our very own data scientists here at Etsy, item-interaction vectors build upon the methods of word2vec where words occurring in the same context share a higher vector similarity.
The implementation of item-interaction vectors is simple but elegant: we replace words and sentences with item-interaction token pairs and sequences of interacted items. A token pair is formulated as (item, interaction-type) to represent how a user interacted with a specific item or listing. And an ordered list of these tokens represents the sequence of what and how a user interacted with various listings in a session. As a result, item-interaction token pairs that appear in the same context will be considered similar.
Because these listings embeddings are dense vectors, we can easily find similar listings via distance metrics. To summarize item-interaction vectors, similar to interaction-based graph embeddings we let the users guide us in learning which listings are similar. But rather than deriving listing similarities from query and listing relationships, we infer similarity if listings appear in the same sequences of interactions.
Putting It All Together
Let’s take stock of what we have to work with: recent or lifetime look back windows, three types of contextual features (Tf-Idf, graph embedding, item-interaction), and four types of user behavior interactions (click, favorite, add-to-cart, purchase). Mixing and matching these together, we have a grand total of 24 contextual vectors to represent a single user in order to rank items for personalized search results. For example, we can combine an overall time window, item-interaction method, and “favorite” interactions to generate an item-interaction vector that represents a user’s all-time favorite listings.
Search results for “necklace charms blue”
(above) default results, (below) a user who recently interacted with eye charms
In personalized search ranking, when a user enters a query we still do a coarse sweep of the inventory and grab top-related items to a query in candidate set retrieval. But in the ranking of items, we now include our new features. Recall that decision trees take input features in the form of integers or decimals.
To satisfy this requirement, we can pass user historical features straight through to the tree or create new features by combining them with other features beforehand. To include user contextual features in the ranking, we have to compute similarity metrics between users’ contextual vectors and the candidate listing vectors from the candidate retrieval step. We derive Jaccard similarity and token overlap for sparse vectors and cosine similarity for dense vectors. From these metrics we understand which candidate listings are more similar to listings a user has previously interacted with. However, this metric alone is not sufficient to determine a final ranking.
Decision trees take these inputs and learn how each feature impacts whether an item will be purchased. We feed user historical features, similarity measures, and other non-personalized features into the tree so it can learn to rank listings from most relevant to least. The expectation is that the most relevant listings are the items a user is more likely to purchase.
Personalized Search Performance
In online A/B experiments we compared this personalized model with a control and observed an improvement in ranking performance from a purchase’s normalized discounted cumulative gain (NDCG). NDCG captures the goodness of a ranking. If, on average, users purchase items ranked higher on the page, this ranking would have a high purchase NDCG. In our experiments, we observed that the NDCG for personalization was especially high for users that have recently and/or often interacted with the marketplace.
Search results for “print maxi dress”
(above) default results, (below) a user who recently interacted with African prints
Users also click around less in personalized results, index to fewer pages, and buy more items compared to the control model. This indicates that users are finding what they want faster with the personalized variant.
Overall, personalization features play an important role relative to the existing features for our decision tree. Generally speaking, the importance gain of a feature in a decision tree indicates how much a feature contributes to better predictions. In personalization, contextual features prove to be a strong influence in determining a good final ranking. For users with a richer history of interactions, we provide even better personalized results. This is confirmed by a greater number of purchases online and increased NDCG offline for users that have recently purchased items more than once. Vector representations from recent time windows had greater feature importance gain compared to lifetime vectors. This means that users’ recent interactions with listings give a better indication of what the user wants to purchase.
Out of the three user contextual feature types, the text-based Tf-Idf vectors tend to have higher feature importance gain. This might suggest that ranking items based on seller-provided attributes given a query is the best way to help users find what they are looking for.
We also identify users’ clicked and favorited items as more important signals compared to past cart adds or purchases. This might indicate that if a user purchased an item once, they have less utility for highly similar items later.
Challenges & Considerations
Mentioned in the beginning of the post, serving personalized results online for individual users introduces latency challenges. Typically we rank items in the second pass in real-time and rely on caching to save ordered rankings per query to reduce latency in online serving. But because personalized results are specific to individual users, we have to rank listings for every user and query pair. Cache utilization decreases due to the exploding amount of information we need to store in order to account for each personalized result, which can impact the user experience.
Understanding further the appropriate situations to deploy personalized search results can improve a user’s experience. For example, when shopping for gifts for your grandmother, would we really want to tailor the results to your taste preferences? There are many ways to achieve this, such as learning the degree of personalization with respect to the query but there might be tradeoffs to latency and training time.
For new users who haven’t interacted with many listings on our marketplace, we can have an alternative approach for populating default vector values. One possible method would be to set the default vector for new users as the average of all user vectors. However, it’s been shown that personalized results are better when a user’s individual preferences are very different from the group’s average preference.
Personalization should take into account users’ privacy choices. This can take many forms, such as removing personalization information over time, providing user preferences as to personalization, anonymizing personalization information, or considering the sensitivity of information in how it is used for personalization.
In this post, we have covered how Etsy achieves personalized search ranking results. Our models learn which listings should rank higher for a user based on their own history as well as others’ history. These features are encapsulated with user historical features and contextual features. Since launching personalization, users have been able to find items they liked more easily, and they often come back for more. At Etsy, we’re focused on connecting our vibrant marketplace of 3.7 million sellers with shoppers around the world. With the introduction of personalized ranking, we hope to maintain course in our mission to keep commerce human.
Etsy recently launched Dark Mode in our iOS and Android buyer apps. Since Dark Mode was introduced system-wide last year in iOS 13 and Android 10, it has quickly become a highly requested feature by our users and an industry standard. Benefits of Dark Mode include reduced eye strain, accessibility improvements, and increased battery life. For the Design Systems team at Etsy, it was the perfect opportunity to test the limits of the new design system in our apps.
In 2019, we brought Etsy’s design system, Collage, to our iOS and Android apps. Around the same time, Apple announced Dark Mode in iOS 13. By implementing Dark Mode, the Design Systems team could not only give users the flexibility to customize the appearance of the Etsy app to match their preferences, but also test the new UI components app-wide and increase adoption of Collage in the process.
Without semantic colors, Dark Mode wouldn’t have been possible. Semantic colors are colors that are named relative to their purpose in the UI (e.g. primary button color) instead of how they look (e.g. light orange color). Collage used a semantic naming convention for colors from the beginning. This made it relatively easy to support dynamic colors, which are named colors that can have different values when switching between light and dark modes. For example, a dynamic primary text color might be black in light mode and white in Dark Mode.
Dynamic semantic colors opened up the possibility for Dark Mode, but they also led to a more accessible app for everyone. On iOS, we also added support for the Increased Contrast accessibility feature which increases the contrast between text and backgrounds to improve legibility. Any color in the Etsy app can now have up to four values for light/dark modes and regular/increased contrast.
To streamline the process for adding new colors, we created a script on iOS that generates all of our color assets and files. With the growing complexity of dynamic colors, having a single source of truth for color definitions is important. On iOS, our source of truth is a property list (a key-value store for app data) of color names and values. We created a script that automatically runs when the app is built and generates all the files we need to represent colors: an asset catalog, convenience variables for accessing colors in code, and a source file for the unit tests. Adding a new color is as simple as adding a line to the property list, and all the relevant files are updated for you. This approach has reduced the time it takes to add a new color and eliminated the risk of inconsistencies across the codebase.
Another design change we made for Dark Mode was rethinking how we represent elevation in our UI components. In light mode, it’s common to add a shadow around your view or dim the background to show that one view is layered above another. In Dark Mode, these approaches aren’t as effective and the platform convention is to slightly lighten the background of your view instead. The Etsy app uses shadows and borders extensively to indicate levels of elevation. For Dark Mode, we removed shadows entirely and used borders much more sparingly. Instead, we followed iOS and Android platform conventions and introduced elevated background colors into our design system. Semantic colors came to the rescue again and we were easily able to use our regular background color in light mode while applying a lighter color in Dark Mode on views that needed it, such as listing cards.
Choose your theme
There is no system-level Dark Mode setting available on older versions of Android, but it can be enabled for specific apps that support Themes (an Android feature that allows for UI customization and provides the underlying structure for Dark Mode). This limitation turned into an opportunity for us to provide more customization options for all our users. In both our iOS and Android apps you can personalize the appearance of the Etsy app to your preferences. So if you want to keep your phone in light mode but the Etsy app needs to be easy on the eyes for late night shopping, we’ve got you covered.
Dark Mode in web views
Another obstacle to overcome was our use of web views, a webpage that is displayed within a native app. Web views are used in a handful of places in our iOS and Android apps, and we knew that for a great user experience they needed to work seamlessly in Dark Mode as well. Thankfully, the web engineers on the Design Systems team jumped in to help and devised a solution to this problem. Using the Sass !default syntax for variables, we were able to define default color values for light mode. Then we added Dark Mode variable overrides where we defined our Dark Mode colors. If the webpage is being viewed from within the iOS or Android app with Dark Mode enabled, we load the Dark Mode variables first so the default (light mode) color variables aren’t used because they’ve already been defined for Dark Mode. This approach is easy to maintain and performant, avoiding a long list of style overrides for Dark Mode.
At first this seemed like a simple question. Our legacy test said:
And the migrated Jest test said:
So weren’t they doing the same thing? Maybe. What if one checked shallow equality and the other checked deep equality? And what about assertions that had no corollaries in Jest? Our legacy suite relied on jasmine.ajax and jasmine-jquery, and we would need to propose alternatives for both modules when migrating our tests. All of this opened the door for subtle variations to creep in and make the difference between catching and missing a bug. We could have spent our time poring through the source code of Jasmine and Jest to figure out if these differences really existed, but instead we decided to use Mutation Testing to find them for us.
What is Mutation Testing?
Stryker’s default reporter even displays how it generated the mutants that survived so it’s easy to identify the gaps in the suite. In this case, two Conditional Expression mutants and a Logical Operator mutant survived. All together, Stryker supports roughly thirty possible mutation types, but that list can be whittled down for faster test runs.
Since our hypothesis was that the implementation differences between Jasmine and Jest could affect the Mutation Score of our legacy and new test suites, we began by cataloging every bit of Jasmine-specific syntax in our legacy suite. We then compiled a list of roughly forty test files that we would target for Mutation Testing in order to cover the full syntax catalog. For each file we generated a Mutation Score for its legacy state, converted it to run in our new Jest setup, and generated a Mutation Score again. Our hope was that the new Jest framework would have a Mutation Score as good as or better than our legacy framework.
By limiting the scope of our test to just a few dozen files, we were able to run all mutations Stryker had to offer within a reasonable timeframe. However, the sheer size of our codebase and the sprawling dependency trees in any given feature presented other challenges to this work. As I mentioned before, Stryker copies the source code to be mutated into separate sandbox directories. By default, it copies the entire project into each sandbox, but that was too much for Node.js to handle in our repository:
Stryker allows users to configure an array of files to copy over instead of the entire codebase, but doing so would require us to know the full dependency tree of each file that we hoped to test ahead of time. Instead of figuring that out by hand, we wrote a custom Jest file resolver specifically for our Stryker testing environment. It would attempt to resolve source files from the local directory structure, but it wouldn’t fail immediately if they weren’t found. Instead, our new resolver would reach outside of the Stryker sandbox to find the file in the original directory structure, copy it into the sandbox, and re-initiate the resolution process. This method saved us time for files that had very expansive dependency trees. With that in hand, we pressed forth with our experiment.
Ultimately we found that our new Jest framework had a worse Mutation Score than our legacy framework.
It’s true. On average, tests run by our legacy framework received a 55.28% Mutation Score whereas tests run by our new Jest framework received a 54.35%. In one of our worst cases, the legacy test earned a 35% while the Jest test picked up a measly 16%.
Analyzing The Result
Once we began seeing lower Mutation Scores on a multitude of files, we put a hold on the migration to investigate what sort of mutants were slipping past our new suite. It turned out that most of what our new Jest suite failed to catch were String Literal mutations in our Asynchronous Module Definitions:
We dug into these failures further and discovered that the real culprit was how the different test runners compiled our code. Our legacy test runner was custom built to handle Etsy’s unique codebase and was tightly coupled to the rest of our infrastructure. When we kicked off tests it would locate all relevant source and test files, run them through our actual webpack build process, then load the resulting code into PhantomJS to execute. When webpack encountered empty strings in the dependency definition it would throw an error and halt the test, effectively catching the bug even if there were no tests that actually relied on that dependency.
Jest, on the other hand, was able to bypass our build system using its file resolver and a handful of custom mappings and transformers. This was one of the big draws of the migration in the first place; decoupling the tests from the build process meant they could execute in a fraction of the time. However, the module we used in Jest to manage dependencies was much more lenient than our actual build system, and empty strings were simply ignored. This meant that unless a test actually relied on the dependency, our Jest setup had no way to alert the tester if it was accidentally left out. Ultimately we decided that this sort of bug was acceptable to let slide. While it would no longer be caught during the testing phase, the code would still be rejected by the build phase of our CI pipeline, thereby preventing the bug from reaching Production.
As we proceeded with the migration we encountered a handful of other cases where the Mutation Scores were markedly different, one of which is particularly notable. We happened upon an asynchronous test that used a done() callback to signify when the test should exit. The test was malformed in that there were twodone() callbacks with assertions between them. In Jest this was no big deal; it happily executed the additional assertions before ending the test. Jasmine was much more strict though. It stopped the test immediately when it encountered the first callback. As a result, we saw a significant jump in Mutation Score because mutants were suddenly being caught by the dangling assertions. This validated our suspicion that implementation differences between Jasmine and Jest could affect which bugs were caught and which slipped through.
The Future of Mutation Testing at Etsy
While it may not be feasible for Etsy to test every possible mutant in the codebase, we can still gain insights about our testing practices by applying Mutation Testing at a more granular level. A manageable approach would be to generate a Mutation Score automatically any time a pull request is opened, and focus the testing on only the files that changed. Posting that information on the pull request could help us understand what conditions will cause our unit tests to fail. It’s easy to write an overly-lenient test that will pass no matter what, and in some ways that’s more dangerous than having no test at all. If we only look at Code Coverage, such a test boosts our numbers giving us a false sense of security that bugs will be caught. Mutation Score forces us to confront the limitations of our suite and encourages us to test as effectively as possible.
This article draws on our published paper in KDD 2020 (Oral Presentation, Selection Rate: 5.8%, 44 out of 756)
It is common in the internet industry to develop algorithms that power online products using historical data. An algorithm that improves evaluation metrics from historical data will be tested against one that has been in production to assess the lift in key performance indicators (KPIs) of the business in online A/B tests. We refer to metrics calculated using new predictions from an algorithm and historical ground truth as offline evaluation metrics. In many cases, offline evaluation metrics are different from business KPIs. For example, a ranking algorithm, which powers search pages on Etsy.com, typically optimizes for relevance by predicting purchase or click probabilities of items. It could be tested offline (offline A/B tests) for rank-aware evaluation metrics, for example, normalized discounted cumulative gain (NDCG), mean reciprocal rank (MRR) or mean average precision (MAP), which are calculated using predicted ranks of ranking algorithms on the test set of historical purchase or click-through feedback of users. Most e-commerce platforms, however, deem sitewide gross merchandise sale (GMS) as their business KPI and test for it online. There could be various reasons not to directly optimize for business KPIs offline or use business KPIs as offline evaluation metrics, such as technical difficulty, business reputation, or user loyalty. Nonetheless, the discrepancy between offline evaluation metrics and online business KPIs poses a challenge to product owners because it is not clear which offline evaluation metric, among all available ones, is the north star to guide the development of algorithms in order to optimize business KPIs.
The challenge essentially asks for the causal effects of increasing offline evaluation metrics on business KPIs, for example how business KPIs would change for a 10% increase in an offline evaluation metric with all other conditions remaining the same (ceteris paribus). The north star should be the offline evaluation metric that has the greatest causal effects on business KPIs. Note that business KPIs are impacted by numerous factors, internal and external, especially macroeconomic situations and sociopolitical environments. This means, just from changes in offline evaluation metrics, we have no way to predict future business KPIs. Because we are only able to optimize for offline evaluation metrics to affect business KPIs, we try to infer the change in business KPIs, given all other factors, internal and external, unchanged, from changes in our offline evaluation metrics, based on historical data. Our task here is causal inference rather than prediction.
Our approach is to introduce online evaluation metrics, the online counterparts of offline evaluation metrics, which measure the performance of online products (see Figure 1). This allows us to decompose the problem into two parts: the first part is the consistency between changes of offline and online evaluation metrics, the second part is the causality between online products (assessed by online evaluation metrics) and the business (assessed by online business KPIs). The first part is solved by the offline A/B test literature through counterfactual estimators of offline evaluation metrics. Our work focuses on the second part. The north star should be the offline evaluation metric whose online counterpart has the greatest causal effects on business KPIs. Hence, the question becomes how business KPIs would change for a 10% increase in an online evaluation metric ceteris paribus.
Figure 1: The Causal Path from Algorithm Trained Offline to Online Business Note: Offline algorithms powers online products, and online products contribute to the business.
Why do we focus on causality? Before answering this question, let’s think about another interesting question: thirsty crow vs. talking parrot, which one is more intelligent (see Figure 2)?
In Aesop’s Fables, a thirsty crow found a pitcher with water at the bottom. The water is beyond the reach of its beak. It intentionally dropped pebbles into the pitcher, which caused the water to rise to the top. A talking parrot cannot really talk. After being fed a simple phrase tons of times (big data and machine learning), it can only mimic the speech without understanding its meaning.
The crow is obviously more intelligent than the parrot. The crow understood the causality between dropped pebbles and rising water and thus leveraged the causality to get the water. Beyond big data and machine learning (talking parrot), we want our artificial intelligence (AI) system to be as intelligent as the crow. After understanding the causality between evaluation metric lift and GMS lift, our system can leverage the causality, by lifting the evaluation metric offline, to achieve GMS lift online (see Figure 3). Understanding and leveraging causality are key topics in current AI research (see, e.g., Bergstein, 2020).
Figure 3: Understanding and Leveraging Causality in Artificial Intelligence
Causal Meta-Mediation Analysis
Online A/B tests are popular to measure the causal effects of online product change on business KPIs. Unfortunately, they cannot directly tell us the causal effects of increasing offline evaluation metrics on business KPIs. In online A/B tests, in order to compare the business KPIs caused by different values of an online evaluation metric, we need to fix the metric at its different values for treatment and control groups. Take the ranking algorithm as an example. If we could fix online NDCG of the search page at 0.22 and 0.2 for treatment and control groups respectively, then we would know how sitewide GMS would change for a 10% increase in online NDCG at 0.2 ceteris paribus. However, this experimental design is impossible, because most online evaluation metrics depend on users’ feedback and thus cannot be directly manipulated.
We address the question by developing a novel approach: causal meta-mediation analysis (CMMA). We model the causality between online evaluation metrics and business KPIs by dose-response function (DRF) in potential outcome framework. DRF originates from medicine and describes the magnitude of the response of an organism given different doses of a stimulus. Here we use it to depict the value of a business KPI given different values of an online evaluation metric. Different from doses of stimuli, values of online evaluation metrics cannot be directly manipulated. However, they could differ between treatment and control groups in experiments of treatments other than algorithms: user interface/user experience (UI/UX) design, marketing, and etc. This could be due to the “fat hand” nature of online A/B tests that a single intervention can change many causal variables at once. A change of the tested feature, which is not an algorithm, could induce users to change their engagement with algorithm-powered online products, so that values of online evaluation metrics would change. For instance, in an experiment of UI design, users might change their search behaviors because of the new UI design, so that values of online NDCG, which depend on search interaction, would change even though the ranking algorithm does not change (see Figure 4). The evidence suggests that online evaluation metrics could be mediators that partially transmit causal effects of treatments on business KPIs in experiments where treatments are not necessarily algorithm-related. Hence, we formalize the problem as the identification, estimation, and testing of mediator DRF.
Figure 4: Directed Acyclic Graph of Conceptual Framework
Our novel approach CMMA combines mediation analysis and meta-analysis to solve for mediator DRF. It relaxes common assumptions in causal mediation literature: sequential ignorability (in linear structural equation model) or complete mediation (in instrumental variable approach) and extends meta-analysis to solve causal mediation while the meta-analysis literature only learns the distribution of average treatment effects. We did extensive simulations, which show CMMA’s performance is superior to other methods in the literature in terms of unbiasedness and the coverage of confidence intervals. CMMA uses only experiment-level summary statistics (i.e., meta-data) of many existing experiments, which makes it easy to implement and to scale up. It can be applied to all experimental-unit-level evaluation metrics or any combination of them. Because it solves the causality problem of a product by leveraging experiments of all products, CMMA could be particularly useful in real applications for a new product that has been shipped online but has few A/B tests.
We apply CMMA on the three most popular rank-aware evaluation metrics: NDCG, MRR, and MAP, to show, for ranking algorithms that power search products, which one has the greatest causal effect on sitewide GMS.
User-Level Rank-Aware Metrics
We redefine the three rank-aware metrics (NDCG, MAP, MRR) at the user level. The three metrics are originally defined at the query level in the test collection evaluation of information retrieval (IR) literature. Because the search engine on Etsy.com is an online product for users, the computation could be adapted to the user level. We include search sessions of no interaction or no feedback into the metric calculation in accordance with online business KPI calculation in online A/B tests that always includes visits/users of no interaction or no feedback. Specifically, the three metrics are constructed as follows:
Query-level metrics are computed using rank positions on the search page and user conversion status (binary relevance). Queries of non-conversion have zero values.
User-level metric is the average of query-level metrics across all queries the user issues (including non-conversion associated queries). Users who do not search or convert have zero values.
All three metrics are defined at rank position 48, the lowest position of the first page of search results in Etsy.com.
To demonstrate CMMA, we randomly selected 190 experiments from 2018 and implemented CMMA on summary results of each experiment (e.g., the average user-level NDCG per user in treatment and control groups). Figure 5 shows results from CMMA. The vertical axis indicates elasticity, the percentage change of average GMS per user for a 10% increase in an online rank-aware metric with all other conditions that can affect GMS remaining the same. Lifts in all the three rank-aware metrics have positive causal effects on the average GMS per user. These don’t have the same performance; different values of different rank-aware metrics have different elasticities. For example, suppose current values of NDCG, MAP, and MRR of search page are 0.00210, 0.00156, and 0.00153 respectively, then a 10% increase in MRR will cause higher lifts in GMS than 10% increases in the other two metrics ceteris paribus, which is indicated by red dashed lines, and thus MRR should be the north star to guide our development of algorithms. Because all the three metrics have the same input data, they are highly correlated and thus the differences are small. As new IR evaluation metrics are continuously developed in the literature, we will implement CMMA for more comparison in the future.
Figure 5: Elasticity from Average Mediator DRF Estimated by CMMA Note: Elasticity means the percentage change of average GMS per user for a 10% increase in an online rank-aware metric with all other conditions that can affect GMS remaining the same.
We implement CMMA to identify the north star of rank-aware metrics for search-ranking algorithms. It is easy to implement and to scale up. It has helped product and engineering teams to achieve efficiency and success in terms of business KPIs in online A/B tests. We published CMMA on KDD 2020. We also made a fun video describing the intuition behind CMMA. Interested readers can refer to our paper for more details and our GitHub repo for the analysis code.
Etsy recently released a feature in our buyer-facing iOS app that allows users to visualize wall art within their environments. Getting the context of a personal piece of art within your space can be a meaningful way to determine whether the artwork will look just as good in your room as it does on your screen. The new feature uses augmented reality to bridge that gap, meshing the virtual and real worlds. Read on to learn how we made this possible using aspects of machine learning and computer vision to present the best version of Etsy sellers’ artwork in augmented reality. It didn’t even require a PhD.-level education or an expensive 3rd party vendor – we did it all with tools provided by iOS.
Building a Chain
Using Computers to See
Early in 2019, I put together a quick proof of concept that allowed for wall art to be displayed on a vertical plane, which required a standalone image of the artwork filling the entire image. Oftentimes, though, Etsy sellers upload images that show their item in context, like on a living room wall, to show scale. This complicates the process because these listing images can’t be placed onto vertical planes in augmented reality as-is; they need to be reformatted and cropped.
Two engineers, Chris Morris and Jake Kirshner developed a solution that used computer vision to find a rectangle within an image, perhaps a frame, and crop the image for use. Using the Vision framework in iOS, they were able to pull out the artwork we needed to place in 3D space. We found that trying to detect only one rectangle, as opposed to all, created performance wins and gave us the shape with greatest confidence by the system. Afterwards, we used Core Image in order to crop the image, adjusting for any perspective skew that might be present. Apple has an example using a frame buffer but can be applied to any UIImage.
To Crop or Not to Crop
As I mentioned before, some Etsy sellers upload their artwork as standalone images, while others depict their artwork in different environments. We wanted to present the former as-is, and we needed to crop the latter, but we had no way to automatically categorize the more than 5 million artwork listings available on our marketplace.
To solve this, we used on-device machine learning provided by Core ML. The team sifted through more than 1,200 listings and sorted the images by those that should be cropped and those that should not be cropped. To create the machine learning model, we first used an iOS Playground and, later, a Mac application called Create ML. The process was as easy as dropping a directory with two subdirectories filled with correct images, “no_frames” and “frames”, into the application along with a corresponding smaller set of different images used to test the resulting model. Once this model was created and verified, we used VNCoreMLRequest to check a listing’s image and determine whether we should crop it or present it as-is. This type of model is known as image classification.
We also investigated a different type of mode called object detection, which finds the existence and coordinates of a frame within an image. This technique had two downsides: training the model required laborious manual object marking for each image provided, and the resulting model, which would be included in our app bundle, would be well over 60mb vs. the 15kb model for image classification. That’s right, kilobytes.
Translating Two Dimensions to Three
Once we had the process for determining whether the image needs to be reformatted, we used a combination of iOS’ SceneKit and ARKit to place the artwork as a material on a rudimentary shape. With Apple focusing heavily on this space, we were able to find plenty of great examples and tutorials to get us started with augmented reality on iOS. We started with the easy-to-use RealityKit framework, but the iOS 13-only restriction was a blocker as we supported back to iOS 11 at the time.
The implementation in ARKit was relatively straightforward, technically, but working for the first time in 3D space vs. a flat screen, it was a challenge to develop a vocabulary and way of thinking about the physical space being altered by the virtual. It was difficult putting into words the difference between, for example, moving on a y-axis and how that differed from making the item scale in size. While this was eventually smoothed out with experience, we knew we had to keep this in mind for Etsy buyers, as augmented reality is not a common experience for most people. For example, how would we coach them through the fact that ARKit needs them to use the camera to scan the room to find the edges of the wall in order to discern the vertical plane? What makes it apparent that they can tap on screen? In order to give our users an inclination of how to use this feature successfully, our designer, Kate Matsumoto, product manager, Han Cho, and copywriter, Jed Baker, designed an onboarding flow, based on user-testing, that walks our buyers through this new experience.
Wrapping it All Up
Using machine learning to determine if we should crop an image or not, cropping it based on a strong rectangle, and presenting the artwork on a real wall was only part of the picture here. Assisted by Evan Wolf and Nook Harquail, we also dealt with complex problems including parsing item descriptions to gather dimension, raytraced hit-testing, and color averaging to make this feature feel as seamless and lifelike as possible for Etsy buyers. From here, we have plenty of ideas for continuing to improve this experience but in the meantime, I encourage you to consider the fantastic frameworks you have at your disposal, and how you can link them together to create an experience that seemed impossible just years ago.
Hi! We’re the Etsy Engineering team that supports core IT and AV capabilities for all Etsy employees. Working across geographies has always been part of our company’s DNA; our globally distributed teams use collaboration tools like Google apps, Slack, and video conferencing. As we transitioned to a fully distributed model during COVID-19, we faced both unexpected challenges and opportunities to permanently improve our support infrastructure. In this post we will share some of the actions we took to support our staff as they spread out across the globe.
Digging deeper on our core values
Keeping Support Human
Our team’s core objective is to empower Etsy employees to do their best work. We give them the tools they need and we teach, train, and support them to use those tools as best they can. We also document and share our work in the form of user guides and support run-books. With friendly interactions during support, we strive to embody Etsy’s mission to Keep Commerce Human®.
Despite being further physically distributed, we found ways to increase human connections.
We launched daily tips that are delivered in Slack, inspired by ideas from teams across Etsy, and we prioritized items that were most relevant to helping folks navigate their new work setups.
In addition to our existing monitored #helpdesk Slack channel, we hosted live, virtual helpdesk hours on video.
We added downloadable self-service software in our internal IT tools that reduced the load on our front-line helpdesk team, and enabled employees to quickly resolve some common issues with easily digestible one-page guides.
Before Etsy went fully remote, a common meeting setup included teams in multiple office locations connecting through video-conference-enabled conference rooms with additional remote participants dialing in. To better support the volume of video calls we needed with all employees WFH, we accelerated our planned video conferencing solution migration to Google Meet. We also quickly engineered solutions to integrate Google Meet, including making video conference details the default option in calendar invites and enabling add-ons that improve the video call experience. Within a month we had a 1000% increase in Google Meet usage and a ~60% drop off of the old platform.
We also adapted our large team events, such as department-wide all hands meetings, to support a full remote experience. We created new “ask me anything” formats and shortened some meetings’ length. To make the meetings run smoothly, we added additional behind-the-scenes prep for speakers, gathered Q&A in advance, and created user guides so teams can self-manage complex events.
Continuing Project Progress
We reviewed our committed and potential project list and decided where we could prioritize focus, adapting to the new needs of our employees. Standing up additional monitoring tools allowed us to be even more proactive about the health of our end-point fleet. We also seized opportunities to take advantage of our empty offices to do work that would have otherwise been disruptive. We were able to complete firewall and AV equipment firmware upgrades (remotely, of course) in a fraction of the time it would have taken us with full offices.
In summary, some learnings
Collaboration is key
Our team is very fortunate to have strong partners, buoyed by supportive leadership, operating in an inclusive company culture that highly values innovation. Much of our success in this highly unique situation is a result of multiple departments coming together quickly, sharing information as they received it, and being flexible during rapid change. For example, we partnered with our recruiting, human resources, and people development teams to adjust how we would onboard new employees, contractors, and short term temporary employees, ensuring we properly deployed equipment and smoothly introduced them to Etsy.
Respect the diversity of WFH situations
We’ve dug deeper into ways to help all our employees work effectively at home. We’re constantly learning, but we continue to build a robust “how to work from home” guide and encourage transparency around each employee’s challenges so that we can help find solutions. Home networks can be a major point of friction and we’ve built out guides to help our employees optimize their network and Wi-Fi setups.
Empathy for each other
Perhaps most of all, through this experience we’ve gained an increased level of empathy for our peers. We’ve learned that there are big differences between working from home for one day, being a full-time remote employee, and working in isolation during a global crisis. We’re using this awareness to rethink the future of our meeting behaviors, the technology in our conference rooms, and the way we engage with each other throughout the day, whether we’re in or out of the office.
Etsy has been increasingly enjoying the perks of public cloud infrastructure for a few years, but has been missing a crucial feature: we’ve been unable to measure our progress against one of our key impact goals for 2025 — to reduce our energy intensity by 25%. Cloud providers generally do not disclose to customers how much energy their services consume. To make up for this lack of data, we created a set of conversion factors called Cloud Jewels to help us roughly convert our cloud usage information (like Google Cloud usage data) into approximate energy used. We are publishing this research to begin a conversation and a collaboration that we hope you’ll join, especially if you share our concerns about climate change.
This isn’t meant as a replacement for energy use data or guidance from Google, Amazon or another provider. Nor can we guarantee the accuracy of the rough estimates the tool provides. Instead, it’s meant to give us a sense of energy usage and relative changes over time based on aggregated data on how we use the cloud, in light of publicly-available information.
A little background
In the face of a changing climate, we at Etsy are committed to reducing our ecological footprint. In 2017, we set a goal of reducing the intensity of our energy use by 25% by 2025, meaning we should use less energy in proportion to the size of our business. In order to evaluate ourselves against our 25% energy intensity reduction goal, we have historically measured our energy usage across our footprint, including the energy consumption of servers in our data centers.
In early 2020, we finished our two-year migration from our own physical servers in a colocated data center to Google Cloud. In addition to the massive increase in the power and flexibility of our computing capabilities, the move was a win for our sustainability efforts because of the efficiency of Google’s data centers. Our old data centers had an average PUE (Power Usage Effectiveness) of 1.39 (FY18 average across colocated data centers), whereas Google’s data centers have a combined average PUE of 1.10. PUE is a ratio of the total amount of energy a data center uses to how much energy goes to powering computers. It captures how efficient factors like the building itself and air conditioning are in the data center.
While a lower PUE helps our energy footprint significantly, we need to be able to measure and optimize the amount of power that our servers draw. Knowing how much energy each of our workloads uses helps us make design and code decisions that optimize for sustainability. The Google Cloud team has been a terrific partner to us throughout our migration, but they are unable to provide us with data about our cloud energy consumption. This is a challenge across the industry: neither Amazon Web Services nor Microsoft Azure provide this information to customers. We have heard concerns that range from difficulties attributing energy use to individual customers to sensitivities around proprietary information that could reveal too much about cloud providers’ operations and financial position.
We thought about how we might be able to estimate our energy consumption in Google Cloud using the data we do have: Google provides us with usage data that shows us how many virtual CPU (Central Processing Unit) seconds we used, how much memory we requested for our servers, how many terabytes of data we have stored for how long, and how much networking traffic we were responsible for.
Our supposition was that if we could come up with general estimates for how many watt-hours (Wh) compute, storage and networking draw in a cloud environment, particularly based on public information, then we could apply those coefficients to our usage data to get at least a rough estimate of our cloud computing energy impact.
We are calling this set of estimated conversion factors Cloud Jewels. Other cloud computing consumers can look at this and see how it might work with their own energy usage across providers and usage data. The goal is to help cloud users across the industry to help refine our estimates, and ultimately help us encourage cloud providers to empower their customers with more accurate cloud energy consumption data.
We roughly assumed that we could attribute power used to:
running a virtual server (compute),
Using the resources we found online, we were able to determine what we think are reasonable, conservative estimates for the amount of energy that compute and storage tasks consume. We are aiming for a conservative over-estimate of energy consumed to make sure we are holding ourselves fully accountable for our computing footprint. We have yet to determine a reasonable way to estimate the impact of RAM or network usage, but we welcome contributions to this work! We are open-sourcing a script for others to apply these coefficients to their usage data, and the full methodology is detailed in our repository on Github.
Cloud Jewels coefficients
The following coefficients are our estimates for how many watt-hours (Wh) it takes to run a virtual server and how many watt-hours (Wh) it takes to store a terabyte of data on HDD (hard disk drive) or SSD (solid-state drive) disks in a cloud computing environment:
2.10 Wh per vCPUh [Server]
0.89 Wh/TBh for HDD storage [Storage]
1.52 Wh/TBh for SSD storage [Storage]
As you may note: we are using point estimates without confidence intervals. This is partly intentional and highlights the experimental nature of this work. Our sources also provide single, rough estimates without confidence intervals, so we decided against numerically estimating our confidence so as to not provide false precision. Our work has been reviewed by several industry experts and our energy and carbon metrics for cloud computing have been assured by PricewaterhouseCoopers LLP. That said, we acknowledge that this estimation methodology is only a first step in giving us visibility into the ecological impacts of our cloud computing usage, which may evolve as our understanding improves. Whenever there has been a choice, we have erred on the side of conservative estimates, taking responsibility for more energy consumption than we are likely using to avoid overestimating our savings. While we have limited data, we are using these estimates as a jumping-off point and carrying forth in order to push ourselves and the industry forward. We especially welcome contributions and opinions. Let the conversation begin!
Server wattage estimate
At a high level, to estimate server wattage, we used a general formula for calculating server energy use over time:
W = Min + Util*(Max – Min)
Wattage = Minimum wattage + Average CPU Utilization * (Maximum wattage – minimum wattage)
To determine minimum and maximum CPU wattage, we averaged the values reported by manufacturers of servers that are available in the SPEC power database (filtered to servers that we deemed likely to be similar to Google’s servers), and we used an industry average server utilization estimate (45%) from the US Data Center Energy Usage Report.
Storage wattage estimate
To estimate storage wattage, we used industry-wide estimates from the U.S. Data Center Usage Report. That report contains estimated average capacity of disks as well as average disk wattage. We used both those estimates to get an estimated wattage per terabyte.
The resources we found related to networking energy estimates were for general internet data transfer, as opposed to intra data center traffic between connected servers. Networking also made up a significantly smaller portion of our overall usage cost, so we are assuming it requires less energy than compute and storage. Finally, as far as the research we found indicated, the energy attributable to networking is generally far smaller than that attributable to compute and storage.
Application to usage data
We aggregated and grouped our usage data by SKU then categorized it by which type of service was applicable (“compute”, “storage”, “n/a”), converted the units to hours and terabyte-hours, then applied our coefficients. Since we do not yet have a coefficient for networking or RAM that we feel confident in, we are leaving that data out for now. The experts we have consulted with are confident that our coefficients are conservative enough to account for our overall energy consumption without separate consideration for networking and RAM.
Applying our Cloud Jewels coefficients to our aggregated usage data and comparing the estimates to our former local data center actual kWh totals over the past two years indicates that our energy footprint in Google Cloud is smaller than it was on premises. It’s important to note that we are not taking into account networking or RAM, nor Google-maintained services like BigQuery, BigTable, StackDriver, or App Engine. However, overall, relatively speaking over time (assuming our estimates are even moderately close to accurate and verified to be conservative), we are on track to be using less overall energy to do more computing than we were two years ago (as our business has grown), meaning we are making progress towards our energy intensity reduction goal.
We used historical data to estimate what our energy savings are since moving to Google Cloud.
Our estimated savings over the five year period are roughly equivalent to:
~20,000 sewing machines (running 24/7)
~147,000 light bulbs (running 24/7)
~1,200 dishwashers (running 24/7)
We would next like to find ways to estimate the energy cost of network traffic and memory. There are also minor refinements we could make to our current estimates, though we want to ensure that further detail does not lead to false precision, that we do not overcomplicate the methodology, and that the work we publish is as generally applicable and useful to other companies as possible.
Part of our reasoning for open-sourcing this work is selfish: we want input! We welcome contributions to our estimates and additional resources that we should be using to refine them. We hope that publishing these coefficients will help other companies who use cloud computing providers estimate their energy footprint. And finally we hope that efforts and estimations encourage more public information about cloud energy usage, and particularly help cloud providers find ways to determine and deliver data like this, either as broad coefficients for estimation or actual energy usage metrics collected from their internal monitoring.