Etsy Engineering | The Causal Analysis of Cannibalization in Online Products

By Xuan Yin, Ercan Yildiz

Feb 24, 2020

This article mainly draws on our published paper in KDD 2019 (Oral Presentation, Selection Rate 6%, 45 out of 700).

Introduction

Internet companies today typically offer a wide range of online products to meet customer needs. It is common for users to interact with multiple online products on the same platform at the same time. Consider, for example, Etsy’s marketplace: organic search, recommendation modules (recommendations), and promoted listings all help users find interesting items. Although each of them offers a unique opportunity for users to interact with a portion of the overall inventory, they are functionally similar and compete for users’ limited time, attention, and monetary budgets.

To optimize users’ overall experiences, instead of understanding and improving these products separately, it is important to gain insights into the evidence of cannibalization: an improvement in one product induces users to decrease their engagement with other products. Cannibalization is very difficult to detect in offline evaluation, but it frequently shows up in online A/B tests.

Consider the following example: an A/B test for a recommendation module. Such a test commonly involves a change in the underlying machine learning algorithm, the user interface, or both. In this test, the recommendation change significantly increased users’ clicks on recommendations while significantly decreasing users’ clicks on organic search results.

Table 1: A/B Test Results for Recommendation Module

There is an intuitive explanation for the drop in search clicks: users might not need to search as much as usual because they could find what they were looking for through recommendations. In other words, improved recommendations effectively diverted users’ attention away from search and thus cannibalized user engagement in search.

Note that increased recommendation clicks did not translate into observed gains in key performance indicators: conversion and Gross Merchandise Sales (GMS). Conversion and GMS are typically measured at the sitewide level because the ultimate goal of improving any product on our platform is to facilitate a better user experience on etsy.com. The decision to launch a new algorithm is usually based on a significant gain in conversion/GMS in A/B tests. An insignificant conversion/GMS gain combined with a significant lift in recommendation clicks makes it hard for product owners to decide whether to terminate the new algorithm. They wonder whether the cannibalization in search clicks could, in turn, cannibalize the conversion/GMS gain from recommendations. In other words, it is plausible that the improved recommendations would have brought larger increases in conversion/GMS than the A/B test shows, and that their positive impact is partially offset by the negative impact of the cannibalized user engagement in search. If there is cannibalization in the conversion/GMS gain, then, instead of terminating the new recommendation algorithm, it is advisable to launch it and revise the search algorithm to work better with it; otherwise, the development of recommendation algorithms would be held back.

The challenge is to separate the revenue loss (through search) from the original revenue gain (from the recommendation module change). Unfortunately, from the A/B tests, we can only observe the cannibalization in user engagement (the induced reduction in search clicks).

Flaws of Purchase-Funnel Based Attribution Metrics

Product-specific revenue is commonly attributed based on a purchase funnel or user journey. For example, the purchase funnel of recommendations could be defined as a sequence of user actions: “click A in recommendations → purchase A”. To compute the recommendation-attributed conversion rate, we have to segment all the converted users into two groups: those who follow the pattern of the purchase funnel and those who do not. Only the first segment is used for counting the number of conversions.
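To make the funnel rule concrete, here is a minimal sketch of how a recommendation-attributed conversion rate could be computed from an event log. The `Event` record, its field names, the module labels, and the sample events are all hypothetical and exist only for illustration; they are not Etsy's actual schema or attribution logic.

```python
from dataclasses import dataclass


@dataclass
class Event:
    user: str
    action: str   # "click" or "purchase"
    listing: str
    module: str   # hypothetical labels: "recs", "search", "checkout"


def rec_attributed_converters(events):
    """Users whose journey matches 'click X in recs -> purchase X'.

    Assumes `events` is already sorted by time.
    """
    clicked = set()      # (user, listing) pairs clicked in recommendations so far
    converters = set()
    for e in events:
        if e.action == "click" and e.module == "recs":
            clicked.add((e.user, e.listing))
        elif e.action == "purchase" and (e.user, e.listing) in clicked:
            converters.add(e.user)
    return converters


# Made-up sample journeys.
events = [
    Event("u1", "click", "A", "recs"),
    Event("u1", "purchase", "A", "checkout"),
    Event("u2", "click", "B", "search"),
    Event("u2", "purchase", "B", "checkout"),
    Event("u3", "click", "C", "recs"),
    Event("u3", "click", "D", "search"),
    Event("u3", "purchase", "D", "checkout"),
]

all_users = {e.user for e in events}
attributed = rec_attributed_converters(events)
rate = len(attributed) / len(all_users)
```

In this toy log only `u1` matches the funnel. `u3` engaged with recommendations and then purchased a different listing found through search, so that conversion is not counted, even if the recommendation click is what kept the user browsing.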

However, the validity of the attribution is questionable. In many A/B tests of new recommendation algorithms, it is common for recommendation-attributed revenue change to be over +200% and search-attributed revenue change to be around -1%. It is difficult to see how the conversion lift is cannibalized and dropped from +200% to the observed +0.2%. These peculiar numbers remind us that attribution metrics based on purchase-funnel are unexplainable and unreliable for at least two reasons.

First, users usually take more complicated journeys than a heuristically-defined purchase-funnel can capture. Here are two examples:

If the recommendations make users stay longer on Etsy, and users click listings on other pages and modules to make purchases, then the recommendation-attributed metrics fail to capture the contribution of the recommendations to these conversions. The purchase funnel is based on “click”, and there is no way to incorporate “dwell time” into it.
Suppose the true user journey is “click A in recommendation → search A → click A in search results → click A in many other places → purchase A”. Should the conversion be attributed to recommendation or search? Should all the visited pages and modules share the credit for this conversion? Any answer would be too heuristic to be convincing.

Second, attribution metrics cannot measure any causal effects. The random assignment of users in an A/B test makes treatment and control buckets comparable and thus enables us to calculate the average treatment effect (ATE). The segments of users who follow the pattern of purchase-funnel may not be comparable between the two buckets, because the segmentation criterion (i.e., user journey) happens after random assignment and thus the segments of users are not randomized between the two buckets. In causal inference, factors that cause users to follow the pattern of purchase-funnel would be introduced by the segmentation and thus confound the causality between treatment and outcome. Any post-treatment segmentation could break the ignorability assumption of the causal identification and invalidate the causality in experiment analysis (see, e.g., Montgomery et al., 2018).
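The following simulation is a small sketch of this second point, under made-up assumptions: the treatment has exactly zero effect on purchases, but it makes users more likely to click a recommendation. All probabilities and variable names are invented for illustration. Comparing the full randomized buckets recovers an effect near zero, while comparing only the users who clicked a recommendation (a post-treatment segment) shows a spurious difference.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

treat = rng.integers(0, 2, size=n)   # random assignment to control (0) / treatment (1)
intent = rng.uniform(size=n)         # unobserved purchase intent in [0, 1)

# Assumption: the treatment raises everyone's chance of clicking a
# recommendation, and high-intent users click more in both buckets.
p_click = 0.1 + 0.3 * intent + 0.3 * treat
clicked = rng.uniform(size=n) < p_click

# Assumption: purchase depends only on intent, so the true treatment
# effect on conversion is exactly zero.
purchased = rng.uniform(size=n) < (0.05 + 0.2 * intent)


def conversion_rate(mask):
    return purchased[mask].mean()


# Randomized comparison: all users in each bucket.
ate = conversion_rate(treat == 1) - conversion_rate(treat == 0)

# Post-treatment segmentation: only users who clicked a recommendation.
segment_diff = (conversion_rate((treat == 1) & clicked)
                - conversion_rate((treat == 0) & clicked))

print(f"ATE on all users:         {ate:+.4f}")
print(f"Difference among clickers: {segment_diff:+.4f}")
```

In this setup the clickers in the treatment bucket include more low-intent users than the clickers in the control bucket, so the segment comparison is negative even though the treatment does nothing to purchases. The two segments are simply not comparable populations.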

Causal Mediation Analysis

We exploit search clicks as a mediator in the causal path between recommendation improvement and conversion/GMS, and extend a formal framework of causal inference, causal mediation analysis (CMA), to separate the cannibalized effect from the original effect of the recommendation module change. CMA splits the observed conversion/GMS gains (average treatment effect, ATE) in A/B tests into the gains from the recommendation improvement (direct effect) and the losses due to cannibalized search clicks (indirect effect). In other words, the framework allows us to measure the impacts of recommendation improvement on conversion/GMS directly as well as indirectly through a mediator such as search (Figure 1). The significant drop in search clicks makes it a good candidate for the mediator. In practice, we can try different candidate mediators and use the analysis to confirm which one is the mediator.
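For readers who prefer notation, the split can be written in the standard potential-outcomes form used in the CMA literature (e.g., Imai et al., 2010). This is the textbook single-mediator decomposition, shown here as background rather than as the exact formulation in our paper. Here $Y_i(t, m)$ is the outcome of user $i$ under treatment $t$ and mediator value $m$, and $M_i(t)$ is the mediator value (search clicks) under treatment $t$.

```latex
\begin{align*}
\text{ACME: } \delta(t) &= \mathbb{E}\big[\,Y_i(t, M_i(1)) - Y_i(t, M_i(0))\,\big] \\
\text{ADE: }  \zeta(t)  &= \mathbb{E}\big[\,Y_i(1, M_i(t)) - Y_i(0, M_i(t))\,\big] \\
\text{ATE: }  \tau      &= \mathbb{E}\big[\,Y_i(1, M_i(1)) - Y_i(0, M_i(0))\,\big]
                         = \delta(t) + \zeta(1 - t), \quad t \in \{0, 1\}
\end{align*}
```

The average causal mediation effect (ACME) holds the treatment fixed and changes only the mediator, which corresponds to the indirect effect through search clicks. The average direct effect (ADE) holds the mediator fixed and changes only the treatment.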

Figure 1: Directed Acyclic Graph (DAG)
Note: The graph illustrates the causal mediation in the recommendation A/B test.

However, it is challenging to apply CMA as formulated in the literature directly in practice. An internet platform typically has many online products, and all of them could be mediators on the causal path between the tested product and the final business outcomes. Figure 2 shows that multiple mediators (M1, M0, and M2) are on the causal path between treatment T and the final business outcome Y. In practice, it is very difficult to measure user engagement in all these mediators. Multiple unmeasured causally-dependent mediators in A/B tests break the sequential ignorability assumption in CMA and invalidate CMA (see Imai et al. (2010) for assumptions in CMA).

Figure 2: DAG of Multiple Mediators
Note: M0 and M2 are upstream and downstream mediators of the mediator M1 respectively.

We define the generalized average causal mediation effect (GACME) and the generalized average direct effect (GADE) to analyze the cannibalization in conversion/GMS. GADE captures the average causal effect of the treatment T that goes through all the channels that do not include M1. GACME captures the average causal effect of the treatment T that goes through all the channels that include M1. We proved that, under some assumptions, GADE and GACME are identifiable even when there are numerous unmeasured causally-dependent mediators. If there is no unmeasured mediator, then GADE and GACME collapse to ADE and ACME. If there are unmeasured mediators, then ADE and ACME cannot be identified, while GADE and GACME can.

Table 2 shows the sample results. The recommendation improvement led to a 0.5% conversion lift, but the cannibalized search clicks resulted in a 0.3% conversion loss, and the observed revenue did not change significantly. When the outcome is GMS, we can see the loss through cannibalized search clicks as well. The results confirm the cannibalization in conversion lift and serve as evidence to support the launch of the new recommendation module.

Table 2: Causal Mediation Analysis on Simulated Experiment Data

The implementation follows a causal mediation-based methodology we recently developed and published at KDD 2019. We also made a fun video describing the intuition behind the methodology. It is easy to implement and only requires solving two linear regression equations simultaneously (Section 4.4 of the paper). We simply need the treatment assignment indicator, search clicks, and the observed revenue for each experimental unit. Interested readers can refer to our paper for more details and our GitHub repo for the analysis code.
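To give a feel for what a two-regression mediation analysis looks like, here is a minimal sketch using the textbook linear structural equation approach: one regression of the mediator on the treatment, and one regression of the outcome on the treatment and the mediator. This is an illustration under simplifying assumptions, not the estimator from Section 4.4 of the paper. The data is simulated, and the coefficients in the data-generating process are made up.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# The three per-unit inputs: treatment indicator, search clicks, revenue.
treat = rng.integers(0, 2, size=n).astype(float)

# Made-up ground truth for the simulation:
#   the treatment lowers search clicks by 0.5 on average,
#   each search click adds 2.0 to revenue,
#   the treatment adds 1.5 to revenue directly.
search_clicks = 3.0 - 0.5 * treat + rng.normal(size=n)
revenue = 10.0 + 1.5 * treat + 2.0 * search_clicks + rng.normal(size=n)


def ols(y, *columns):
    """Ordinary least squares with an intercept; returns the coefficients."""
    X = np.column_stack([np.ones_like(y), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Mediator model: M = a0 + a1*T + error
_, a1 = ols(search_clicks, treat)

# Outcome model: Y = b0 + b1*T + b2*M + error
_, b1, b2 = ols(revenue, treat, search_clicks)

direct = b1            # effect of the treatment not going through search clicks
indirect = a1 * b2     # effect transmitted through search clicks
total = direct + indirect

print(f"direct:   {direct:+.3f}")
print(f"indirect: {indirect:+.3f}")
print(f"total:    {total:+.3f}")
```

In this simulation the direct effect is positive and the indirect effect through search clicks is negative, so the total effect that a plain A/B comparison would report is smaller than the direct contribution of the change.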

We have successfully deployed our model to identify products that are prone to cannibalization. In particular, it has helped product and engineering teams understand the tradeoffs between search and recommendations, and focus on the right opportunities. The direct effect on revenue is a more informative key performance indicator than the observed average treatment effect to measure the true contribution of a product change to the marketplace and to guide the decision on the launch of new product features.
