Chaining iOS Machine Learning, Computer Vision, and Augmented Reality to Make the Magical Real

Posted on June 23, 2020
Four screenshots of Etsy's iOS Augmented Reality feature

Etsy recently released a feature in our buyer-facing iOS app that allows users to visualize wall art within their own environments. Seeing a piece of art in the context of your space can be a meaningful way to determine whether the artwork will look just as good in your room as it does on your screen. The new feature uses augmented reality to bridge that gap, meshing the virtual and real worlds. Read on to learn how we made this possible, using machine learning and computer vision to present the best version of Etsy sellers’ artwork in augmented reality. It didn’t even require a PhD-level education or an expensive third-party vendor – we did it all with tools provided by iOS.

Building a Chain

Using Computers to See

Early in 2019, I put together a quick proof of concept that allowed wall art to be displayed on a vertical plane, which required a standalone image of the artwork that filled the entire frame. Oftentimes, though, Etsy sellers upload images that show their item in context, like on a living room wall, to convey scale. This complicates the process because these listing images can’t be placed onto vertical planes in augmented reality as-is; they need to be reformatted and cropped.

An image of a frame on a wall being highlighted to show what should be cropped.

Two engineers, Chris Morris and Jake Kirshner, developed a solution that used computer vision to find a rectangle within an image, perhaps a frame, and crop the image for use. Using the Vision framework in iOS, they were able to pull out the artwork we needed to place in 3D space. We found that trying to detect only one rectangle, as opposed to all of them, created performance wins and gave us the shape the system had the greatest confidence in. Afterwards, we used Core Image to crop the image, adjusting for any perspective skew that might be present. Apple provides an example that works on a frame buffer, but the same approach can be applied to any UIImage.

To Crop or Not to Crop

As I mentioned before, some Etsy sellers upload their artwork as standalone images, while others depict their artwork in different environments. We wanted to present the former as-is, and we needed to crop the latter, but we had no way to automatically categorize the more than 5 million artwork listings available on our marketplace.

Two piles of images sorted by original artwork and being displayed on a wall.

To solve this, we used on-device machine learning provided by Core ML. The team sifted through more than 1,200 listings and sorted the images into those that should be cropped and those that should not. To create the machine learning model, we first used an iOS Playground and, later, a Mac application called Create ML. The process was as easy as dropping a directory with two subdirectories of correctly sorted images, "no_frames" and "frames", into the application, along with a corresponding smaller set of different images used to test the resulting model. Once this model was created and verified, we used VNCoreMLRequest to check a listing’s image and determine whether we should crop it or present it as-is. This type of model is known as image classification.

We also investigated a different type of model called object detection, which finds the existence and coordinates of a frame within an image. This technique had two downsides: training the model required laborious manual object marking for each image provided, and the resulting model, which would be included in our app bundle, would be well over 60MB vs. the 15KB model for image classification. That’s right, kilobytes.

Translating Two Dimensions to Three

Once we had the process for determining whether an image needed to be reformatted, we used a combination of iOS’s SceneKit and ARKit to place the artwork as a material on a rudimentary shape. With Apple focusing heavily on this space, we were able to find plenty of great examples and tutorials to get us started with augmented reality on iOS. We started with the easy-to-use RealityKit framework, but its iOS 13-only restriction was a blocker, as we supported back to iOS 11 at the time.

An animated gif depicting a listing image being presented on the wall.

The implementation in ARKit was relatively straightforward, technically, but working for the first time in 3D space rather than on a flat screen, it was a challenge to develop a vocabulary and a way of thinking about physical space being altered by the virtual. It was difficult to put into words, for example, how moving an item along the y-axis differed from scaling it up in size. While this was eventually smoothed out with experience, we knew we had to keep it in mind for Etsy buyers, as augmented reality is not a common experience for most people. For example, how would we coach them through the fact that ARKit needs them to use the camera to scan the room and find the edges of the wall in order to discern the vertical plane? What makes it apparent that they can tap on the screen? To give our users an inclination of how to use this feature successfully, our designer, Kate Matsumoto, product manager, Han Cho, and copywriter, Jed Baker, designed an onboarding flow, based on user testing, that walks our buyers through this new experience.

Wrapping it All Up

Using machine learning to determine if we should crop an image or not, cropping it based on a strong rectangle, and presenting the artwork on a real wall was only part of the picture here. Assisted by Evan Wolf and Nook Harquail, we also dealt with complex problems including parsing item descriptions to gather dimension, raytraced hit-testing, and color averaging to make this feature feel as seamless and lifelike as possible for Etsy buyers. From here, we have plenty of ideas for continuing to improve this experience but in the meantime, I encourage you to consider the fantastic frameworks you have at your disposal, and how you can link them together to create an experience that seemed impossible just years ago.

Referenced Frameworks: Vision, Core Image, Core ML, Create ML, ARKit, SceneKit, and RealityKit.


Keeping IT Support Human during WFH

Posted on May 6, 2020

Image: Human Connection, KatieWillDesign on Etsy

Hi! We’re the Etsy Engineering team that supports core IT and AV capabilities for all Etsy employees. Working across geographies has always been part of our company’s DNA; our globally distributed teams use collaboration tools like Google apps, Slack, and video conferencing. As we transitioned to a fully distributed model during COVID-19, we faced both unexpected challenges and opportunities to permanently improve our support infrastructure. In this post we will share some of the actions we took to support our staff as they spread out across the globe.  

Digging deeper on our core values

Keeping Support Human

Our team’s core objective is to empower Etsy employees to do their best work. We give them the tools they need, and we teach, train, and support them to use those tools as best they can. We also document and share our work in the form of user guides and support runbooks. With friendly interactions during support, we strive to embody Etsy’s mission to Keep Commerce Human®.

Despite being further physically distributed, we found ways to increase human connections.

Staying Connected

Before Etsy went fully remote, a common meeting setup included teams in multiple office locations connecting through video-conference-enabled conference rooms with additional remote participants dialing in. To better support the volume of video calls we needed with all employees WFH, we accelerated our planned video conferencing migration to Google Meet. We also quickly engineered solutions to integrate Google Meet, including making video conference details the default option in calendar invites and enabling add-ons that improve the video call experience. Within a month we had a 1000% increase in Google Meet usage and a ~60% drop-off from the old platform.

We also adapted our large team events, such as department-wide all hands meetings, to support a full remote experience. We created new “ask me anything” formats and shortened some meetings’ length. To make the meetings run smoothly, we added additional behind-the-scenes prep for speakers, gathered Q&A in advance, and created user guides so teams can self-manage complex events.

Continuing Project Progress

We reviewed our committed and potential project list and decided where we could prioritize focus, adapting to the new needs of our employees. Standing up additional monitoring tools allowed us to be even more proactive about the health of our end-point fleet.  We also seized opportunities to take advantage of our empty offices to do work that would have otherwise been disruptive. We were able to complete firewall and AV equipment firmware upgrades (remotely, of course) in a fraction of the time it would have taken us with full offices.

In summary, some learnings

Collaboration is key 

Our team is very fortunate to have strong partners, buoyed by supportive leadership, operating in an inclusive company culture that highly values innovation. Much of our success in this highly unique situation is a result of multiple departments coming together quickly, sharing information as they received it, and being flexible during rapid change. For example, we partnered with our recruiting, human resources, and people development teams to adjust how we would onboard new employees, contractors, and short term temporary employees, ensuring we properly deployed equipment and smoothly introduced them to Etsy. 

Respect the diversity of WFH situations

We’ve dug deeper into ways to help all our employees work effectively at home. We’re constantly learning, but we continue to build a robust “how to work from home” guide and encourage transparency around each employee’s challenges so that we can help find solutions. Home networks can be a major point of friction and we’ve built out guides to help our employees optimize their network and Wi-Fi setups.

Empathy for each other

Perhaps most of all, through this experience we’ve gained an increased level of empathy for our peers. We’ve learned that there are big differences between working from home for one day, being a full-time remote employee, and working in isolation during a global crisis. We’re using this awareness to rethink the future of our meeting behaviors, the technology in our conference rooms, and the way we engage with each other throughout the day, whether we’re in or out of the office.


Cloud Jewels: Estimating kWh in the Cloud

Blue jewels rain down from golden clouds in these Lightning Storm Earrings by GojoDesign on Etsy.

Image: Lightning Storm Earrings, GojoDesign on Etsy

Etsy has been increasingly enjoying the perks of public cloud infrastructure for a few years, but has been missing a crucial feature: we’ve been unable to measure our progress against one of our key impact goals for 2025 — to reduce our energy intensity by 25%. Cloud providers generally do not disclose to customers how much energy their services consume. To make up for this lack of data, we created a set of conversion factors called Cloud Jewels to help us roughly convert our cloud usage information (like Google Cloud usage data) into approximate energy used. We are publishing this research to begin a conversation and a collaboration that we hope you’ll join, especially if you share our concerns about climate change.

This isn’t meant as a replacement for energy use data or guidance from Google, Amazon or another provider. Nor can we guarantee the accuracy of the rough estimates the tool provides. Instead, it’s meant to give us a sense of energy usage and relative changes over time based on aggregated data on how we use the cloud, in light of publicly-available information.

A little background

In the face of a changing climate, we at Etsy are committed to reducing our ecological footprint. In 2017, we set a goal of reducing the intensity of our energy use by 25% by 2025, meaning we should use less energy in proportion to the size of our business. In order to evaluate ourselves against our 25% energy intensity reduction goal, we have historically measured our energy usage across our footprint, including the energy consumption of servers in our data centers.

Three graphs to illustrate decreased energy intensity (increased efficiency), showing that energy usage would grow less quickly than business size.

In early 2020, we finished our two-year migration from our own physical servers in a colocated data center to Google Cloud. In addition to the massive increase in the power and flexibility of our computing capabilities, the move was a win for our sustainability efforts because of the efficiency of Google’s data centers. Our old data centers had an average PUE (Power Usage Effectiveness) of 1.39 (FY18 average across colocated data centers), whereas Google’s data centers have a combined average PUE of 1.10. PUE is a ratio of the total amount of energy a data center uses to how much energy goes to powering computers. It captures how efficient factors like the building itself and air conditioning are in the data center.

Illustration of PUE (Power Usage Effectiveness) as the ratio of overall power used by a datacenter to the power used by computers in it.

While a lower PUE helps our energy footprint significantly, we need to be able to measure and optimize the amount of power that our servers draw. Knowing how much energy each of our workloads uses helps us make design and code decisions that optimize for sustainability. The Google Cloud team has been a terrific partner to us throughout our migration, but they are unable to provide us with data about our cloud energy consumption. This is a challenge across the industry: neither Amazon Web Services nor Microsoft Azure provide this information to customers. We have heard concerns that range from difficulties attributing energy use to individual customers to sensitivities around proprietary information that could reveal too much about cloud providers’ operations and financial position.

We thought about how we might be able to estimate our energy consumption in Google Cloud using the data we do have: Google provides us with usage data that shows us how many virtual CPU (Central Processing Unit) seconds we used, how much memory we requested for our servers, how many terabytes of data we have stored for how long, and how much networking traffic we were responsible for. 

Our supposition was that if we could come up with general estimates for how many watt-hours (Wh) compute, storage and networking draw in a cloud environment, particularly based on public information, then we could apply those coefficients to our usage data to get at least a rough estimate of our cloud computing energy impact.

We are calling this set of estimated conversion factors Cloud Jewels. Other cloud computing consumers can look at this and see how it might apply to their own usage data across providers. The goal is to engage cloud users across the industry in refining these estimates, and ultimately to encourage cloud providers to empower their customers with more accurate cloud energy consumption data.

Illustration of what Cloud Jewels seeks to quantify: the power used by compute and storage.

Methodology

The sources that most influenced our methodology were the U.S. Data Center Energy Usage Report, The Data Center as a Computer, and the SPEC power report. We also spoke with industry experts Arman Shehabi, Jon Koomey, and Jon Taylor, who suggested additional resources and reviewed our methodology.

We roughly assumed that we could attribute the power we use to four categories of usage: compute (vCPU time), storage (data stored on HDDs and SSDs), memory (RAM), and networking traffic.

Using the resources we found online, we were able to determine what we think are reasonable, conservative estimates for the amount of energy that compute and storage tasks consume. We are aiming for a conservative overestimate of energy consumed to make sure we are holding ourselves fully accountable for our computing footprint. We have yet to determine a reasonable way to estimate the impact of RAM or network usage, but we welcome contributions to this work! We are open-sourcing a script for others to apply these coefficients to their usage data, and the full methodology is detailed in our repository on GitHub.

Cloud Jewels coefficients

The following coefficients are our estimates for how many watt-hours (Wh) it takes to run a virtual server and how many watt-hours (Wh) it takes to store a terabyte of data on HDD (hard disk drive) or SSD (solid-state drive) disks in a cloud computing environment:

2.10 Wh per vCPUh [Server]

0.89 Wh/TBh for HDD storage [Storage]

1.52 Wh/TBh for SSD storage [Storage]

On confidence

As you may note: we are using point estimates without confidence intervals. This is partly intentional and highlights the experimental nature of this work. Our sources also provide single, rough estimates without confidence intervals, so we decided against numerically estimating our confidence so as to not provide false precision. Our work has been reviewed by several industry experts and our energy and carbon metrics for cloud computing have been assured by PricewaterhouseCoopers LLP. That said, we acknowledge that this estimation methodology is only a first step in giving us visibility into the ecological impacts of our cloud computing usage, which may evolve as our understanding improves. Whenever there has been a choice, we have erred on the side of conservative estimates, taking responsibility for more energy consumption than we are likely using to avoid overestimating our savings. While we have limited data, we are using these estimates as a jumping-off point and carrying forth in order to push ourselves and the industry forward. We especially welcome contributions and opinions. Let the conversation begin!

Server wattage estimate

At a high level, to estimate server wattage, we used a general formula for calculating server energy use over time:

W = Min + Util*(Max – Min)

Wattage = Minimum wattage + Average CPU Utilization * (Maximum wattage – minimum wattage)

A graph portrays CPU Utilization increasing and decreasing over time.

To determine minimum and maximum CPU wattage, we averaged the values reported by manufacturers of servers that are available in the SPEC power database (filtered to servers that we deemed likely to be similar to Google’s servers), and we used an industry average server utilization estimate (45%) from the US Data Center Energy Usage Report.
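
As a rough illustration of how that formula turns into a per-vCPU-hour coefficient, here is a sketch in JavaScript; the minimum and maximum wattage values below are placeholders for illustration, not the actual averages we derived from the SPEC power database:

// Sketch: turning average server wattage figures into a Wh-per-vCPU-hour coefficient.
// The wattage values are illustrative placeholders, not our SPEC-derived averages.
const MIN_WATTS_PER_VCPU = 1.0;   // hypothetical wattage per vCPU at idle
const MAX_WATTS_PER_VCPU = 3.5;   // hypothetical wattage per vCPU at full load
const AVG_CPU_UTILIZATION = 0.45; // industry average from the US Data Center Energy Usage Report

// W = Min + Util * (Max - Min); held for one hour, this is Wh per vCPU-hour.
const whPerVcpuHour =
    MIN_WATTS_PER_VCPU +
    AVG_CPU_UTILIZATION * (MAX_WATTS_PER_VCPU - MIN_WATTS_PER_VCPU);

console.log(`~${whPerVcpuHour.toFixed(2)} Wh per vCPU-hour`);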

Storage wattage estimate

To estimate storage wattage, we used industry-wide estimates from the U.S. Data Center Energy Usage Report. That report contains the estimated average capacity of disks as well as average disk wattage, and we used both estimates to derive an estimated wattage per terabyte.

Networking non-estimate

The resources we found related to networking energy estimates were for general internet data transfer, as opposed to intra data center traffic between connected servers. Networking also made up a significantly smaller portion of our overall usage cost, so we are assuming it requires less energy than compute and storage. Finally, as far as the research we found indicated, the energy attributable to networking is generally far smaller than that attributable to compute and storage.

A graph shows trends of US data center electricity use from 2000-2020. Two alternative scenarios begin in 2010; a steeply increasing line portrays the increase in usage if efficiency remained at its 2010 level through 2020. A decreasing line portrays the electricity usage if best practices for energy usage were adopted.
Source: Arman Shehabi, Sarah J Smith, Eric Masanet and Jonathan Koomey; Data center growth in the United States: decoupling the demand for services from electricity use; 2018

Application to usage data

We aggregated and grouped our usage data by SKU, categorized each SKU by the type of service it represents (“compute”, “storage”, or “n/a”), converted the units to vCPU-hours and terabyte-hours, and then applied our coefficients. Since we do not yet have a coefficient for networking or RAM that we feel confident in, we are leaving that data out for now. The experts we consulted are confident that our coefficients are conservative enough to account for our overall energy consumption without separate consideration for networking and RAM.
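
To make that bookkeeping concrete, here is a rough sketch of how the coefficients could be applied to aggregated usage rows; the row shape and category names are assumptions for illustration, not the schema of Google’s usage export:

// Sketch: applying Cloud Jewels coefficients to aggregated usage data.
// The input row shape is hypothetical; real usage data first needs its units
// converted to vCPU-hours and terabyte-hours.
const CLOUD_JEWELS_WH = {
    compute: 2.10,    // Wh per vCPU-hour
    hddStorage: 0.89, // Wh per TB-hour
    ssdStorage: 1.52, // Wh per TB-hour
};

function estimateKwh(usageRows) {
    const totalWh = usageRows.reduce((sum, row) => {
        const coefficient = CLOUD_JEWELS_WH[row.category];
        // Categories without a coefficient yet (networking, RAM, "n/a") are left out.
        return coefficient ? sum + coefficient * row.unitHours : sum;
    }, 0);
    return totalWh / 1000; // Wh -> kWh
}

console.log(estimateKwh([
    { category: "compute", unitHours: 500000 },    // 500,000 vCPU-hours
    { category: "ssdStorage", unitHours: 120000 }, // 120,000 TB-hours
]));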

Results

Applying our Cloud Jewels coefficients to our aggregated usage data and comparing the estimates to the actual kWh totals from our former local data center over the past two years indicates that our energy footprint in Google Cloud is smaller than it was on premises. It’s important to note that we are not taking into account networking or RAM, nor Google-maintained services like BigQuery, Bigtable, Stackdriver, or App Engine. Overall, though, assuming our estimates are even moderately close to accurate and verified to be conservative, we are on track to use less overall energy to do more computing than we were two years ago, even as our business has grown, meaning we are making progress towards our energy intensity reduction goal.

A graph displays monthly total kWh used with Etsy's former datacenter (actual) compared to with Google Cloud (estimated). Google Cloud kWh is significantly lower.

We used historical data to estimate what our energy savings are since moving to Google Cloud.

A graph shows estimated annual consumption in kWh with Etsy's former colocated datacenter (actual) compared to with Google Cloud (estimated). Google Cloud annual consumption is significantly lower.

Assumes ~16% YoY growth in former colocated data centers and actual/expected ~23% YoY growth in cloud usage between 2019-20 and beyond.

Our estimated savings over the five year period are roughly equivalent to: 

Next steps

We would next like to find ways to estimate the energy cost of network traffic and memory. There are also minor refinements we could make to our current estimates, though we want to ensure that further detail does not lead to false precision, that we do not overcomplicate the methodology, and that the work we publish is as generally applicable and useful to other companies as possible.

Part of our reasoning for open-sourcing this work is selfish: we want input! We welcome contributions to our estimates and pointers to additional resources that we should be using to refine them. We hope that publishing these coefficients will help other companies who use cloud computing providers estimate their energy footprint. And finally, we hope that these efforts and estimates encourage more public information about cloud energy usage, and in particular help cloud providers find ways to determine and deliver data like this, either as broad coefficients for estimation or as actual energy usage metrics collected from their internal monitoring.


Developing in a Monorepo While Still Using Webpack

Posted on April 6, 2020

When I talk to friends and relatives about what I do at Etsy, I have to come up with an analogy for what Frontend Infrastructure is. It’s a bit tricky to describe because it’s something that you don’t see as an end user; the web pages that people interact with are several steps removed from the actual work that a frontend infrastructure engineer does. The analogy that I usually fall back on is that of a restaurant: the meal is a fully formed web page, the chefs are product engineers, and the kitchen is the infrastructure. A good kitchen should make it easy to cook a bunch of different meals quickly and deliciously. Recently, my team and I spent over a year swapping out our home-grown, RequireJS-based JavaScript build system for Webpack. Running with this analogy a bit, this project was like trading out our kitchen without customers noticing, and without bothering the chefs too much. Large projects tend to be full of unique problems and unexpected hurdles, and this one was no exception. This post is the second in a short series on all the things that we learned during the migration, and is adapted in part from a talk I gave at JSConf 2019. The first post can be found here.


The state of JavaScript at Etsy last year.

At Etsy, we have a whole lot of JavaScript. This alone doesn’t make us very unique, but we have something that not every other company has: a monorepo. When we deploy our web code, we need to build and deploy over 1,200 different JavaScript assets made up of over twelve thousand different JavaScript files, or modules. Like the rest of the industry, we find ourselves relying more and more on JavaScript, which means that a good bit more of our codebase ends in “.js” this year than last.

When we started adopting Webpack, one of the first places we saw an early win was in our development experience. Up until this point, our engineers had been using a development server that we had written in-house. We ran a copy of it on every developer machine, where it built files as they were requested. This approach meant that you could reliably navigate around Etsy.com in development without needing to think about a build system at all. It also meant that we could start and restart an instance of the development server without worrying about losing state or interrupting developers much. Conceptually, this made things very simple to maintain.

This is a diagram showing the browser requesting an asset, our build system building that asset synchronously, and that asset being served back to the browser.
You truly couldn’t have asked for a simpler diagram.

In practice, however, developers were asking for more from JavaScript and from their build systems. We started adopting React a few years prior using the then-available JSXTransform tool, which we added to our build system with a fair amount of wailing and gnashing of teeth. The result was a server that successfully, yet sluggishly, supported JSX. Because it wasn’t designed with large applications in mind, our development server didn’t do things like cache transpiled JSX between builds. Building some of our weightier JavaScript code often took the better part of a minute, and most of our developers grew increasingly frustrated with the long iteration cycles it produced. Worse yet, because we were using JSXTransform, rather than something like Babel, our developers could use JSX but weren’t able to use any ES6 syntax like arrow functions or classes.

Bending Webpack to our will.

Clearly, there was a lot about our development environment that could be improved. To be worth the effort of adopting, any new build system would at least have to support transpiled syntaxes like JSX while still allowing for fast rebuild times for developers. Webpack seemed like a pretty safe bet: it was widely adopted; it was actively developed and funded; and everyone who had experience with it seemed to like it (in spite of its intimidating configuration).

So, we spent a good bit of time configuring Webpack to work with our codebase (and vice versa). This involved writing some custom loaders for things like templates and translations, and it meant updating some of the older parts of our codebase that relied on the specifics of Require.js to work properly. After a lot of planning, testing, and editing, we were able to get Webpack to fully build our entire codebase. It took half an hour, and that was only when it didn’t fill all 16 gigabytes of our development server’s memory. Clearly, we had a lot more work on our plates.

This is a screenshot of a performance monitoring tool for a server in which 32 processors are maxed out and 20 gigs of ram are used up.
This is one of our beefiest machines maxing out all 32 of its processors and eating up over 20 gigs of memory trying to run Webpack once.

When Webpack runs in development mode, it behaves quite differently from our old development server. It starts by compiling all your code as it would for a production build, leaving out optimizations that don’t make sense in development (like minification and compression). It then switches to “watch mode”, where it listens to your source files for changes and kicks off partial recompilations when any of your source code updates. This keeps it from starting from scratch every time an asset updates, and watching the filesystem lets builds start a few seconds before the assets are requested by the browser. Webpack is very effective at partial rebuilds, which is how it remains fast, even for larger projects.

…and maybe bending our will to Webpack’s.

Although Webpack was designed for large projects, it wasn’t designed for a whole company’s worth of large projects. Our monorepo contains JavaScript code from every part of Etsy. Making Webpack try to build everything at once was a fool’s errand, even after playing with plugins like HardSource, CacheLoader, and HappyPack to either speed up the build time or reduce its resource footprint.

We ended up admitting to ourselves that building everything at once was impossible. If your solution to a problem just barely works today, it’s not going to be very useful when your problem doubles in size in a few years’ time. A pretty straightforward next step would be to split up our codebase into logical regions and make a Webpack config for each one, rather than using one big config to build everything. Splitting things up would allow each individual build to be reasonably sized, cutting back on both build times and resource utilization. Plus, production builds wouldn’t need to change much, since Webpack is perfectly happy accepting either a single configuration or an array of them.
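
For example, rather than one monolithic configuration, the build could export an array of smaller, named configurations; the slice names and paths below are made up for illustration:

// webpack.config.js (sketch): one config per logical slice of the codebase.
// The slice names and entry paths are illustrative, not Etsy's real layout.
const path = require("path");

const searchConfig = {
    name: "search",
    entry: { "search-page": "./modules/search/index.js" },
    output: { path: path.resolve(__dirname, "dist/search"), filename: "[name].js" },
};

const sellerToolsConfig = {
    name: "seller-tools",
    entry: { "seller-dashboard": "./modules/seller/dashboard.js" },
    output: { path: path.resolve(__dirname, "dist/seller"), filename: "[name].js" },
};

// Production builds can process the whole array, while a development server
// can pick and choose which slice to compile.
module.exports = [searchConfig, sellerToolsConfig];

Naming each configuration also pays off later: Kevin uses those names to consistently identify each compiler.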

There was one problem with this approach though: if we only built one slice of the site at a time, we wouldn’t be able to allow developers to easily browse around Etsy.com in development unless they manually started and stopped multiple instances of Webpack. There are a lot of features in Etsy that touch multiple parts of the site; adding a change to how a listing might appear could mean a change for our search page, the seller dashboard, and our internal tools as well. We needed a solution that would both allow us to only build parts of the site that made sense, while maintaining the “it just works!” behavior of our old system.

So, we wrote something we’re calling Kevin.

This is Kevin.

This is a screenshot of an overlay rendered by kevin-middleware. It shows a message that says "Your code is out for delivery" as well as a loading bar.

Kevin (technically “kevin-middleware”) is an express-style middleware that manages multiple instances of Webpack for you. Its job is to make it easier to build a monorepo’s worth of JavaScript while maintaining the resource footprint of something much smaller. It was both inspired by and meant as a replacement for webpack-dev-middleware, which is what Webpack’s own development server uses to manage a single instance of Webpack under the hood. If you happen to be using that, Kevin will probably feel a bit familiar.

Kevin works by reading in a list of Webpack configurations and determining all of the assets that each one could be responsible for. It then listens for requests for those assets, determines the config that is responsible for that asset, and then starts an instance of Webpack with that config. It’ll keep a few instances around in memory based on a simple frecency algorithm, and will monitor your source files in order to eagerly rebuild any changes. When there are more instances than a configured limit, the least used compiler is shut down and cleaned up.

This is a diagram that attempts to visualize the flow that Kevin goes through when a request comes in, as described in the previous paragraph.
While otherwise being a lot cooler in every respect, Kevin has an objectively more complicated diagram.

Webpack’s first build often takes a while. Like I mentioned before, it has to do a first pass of all the assets it needs to build before it’s able to do fast, iterative rebuilds. If a developer requests an asset from a config that isn’t being built by an active compiler, that request might time out before a fresh compiler finishes its first build. Kevin tries to offset this problem by serving some static code that renders an overlay whenever an asset is requested from a compiler that’s still running its first build. The overlay code communicates back with your development server to check on the status of your builds, and automatically reloads the page once everything is complete.

Using Kevin is meant to be really straightforward. If you don’t already have a development server of some sort, creating one with Kevin and Express is maybe a dozen lines of code. Here’s a snippet taken from Kevin’s documentation:

const express = require("express");
const Kevin = require("kevin-middleware");

// This is an array of webpack configs. Each config **must** be named so that we
// can uniquely identify each one consistently. A regular ol' webpack config
// should work just fine as well.
const webpackConfigs = require("path/to/webpack.config.js");

// Setup your server and configure Kevin
const app = express();

const kevin = new Kevin(webpackConfigs, {
    kevinPublicPath: "http://localhost:3000"
});
app.use(kevin.getMiddleware());

// Serve static files as needed. This is required if you generate async chunks;
// Kevin only knows about the entrypoints in your configs, so it has to assume
// that everything else is handled by a different middleware.
app.use("/ac/webpack/js", express.static(webpackConfigs[0].output.path));

// Let 'er rip
app.listen(9275);

We’ve also made a bunch of Kevin’s internals accessible through Webpack’s own tapable plugin system. At Etsy, we use these hooks to integrate with our monitoring system, and to gracefully restart active compilers that have pending updates to their configurations. In this way, we can keep our development server up to date while keeping developer interruptions to a minimum.

Sometimes, a little custom code goes a long way.

In the end, we were able to greatly improve the development experience. Rebuilding our seller tools, which previously took almost a minute on every request, now takes under 30 seconds when we’re starting a fresh compiler, and subsequent requests take only a second or two. Navigating around Etsy.com in development still takes very little interaction with the build system from our engineers. Plus, we can now support all the other things that Webpack enables for us, like ES6, better asset analysis, and even TypeScript.

This is the part where I should mention that Kevin is officially open-source software. Check out the source on GitHub, and install it from npm as kevin-middleware. If you have any feedback about it, we would welcome an issue on GitHub. I really hope you get as much use out of it as we did.


This post is the second in a two-part series on our migration to a modern JavaScript build system. The first part can be found here.


The Causal Analysis of Cannibalization in Online Products

Posted on February 24, 2020

This article mainly draws on our published paper in KDD 2019 (oral presentation, selection rate 6%, 45 out of 700).

Introduction

Nowadays an internet company typically has a wide range of online products to fulfill customer needs.  It is common for users to interact with multiple online products on the same platform at the same time.  Consider, for example, Etsy’s marketplace: organic search, recommendation modules (recommendations), and promoted listings all enable users to find interesting items.  Although each of them offers a unique opportunity for users to interact with a portion of the overall inventory, they are functionally similar and compete for users’ limited time, attention, and monetary budgets.

To optimize users’ overall experiences, instead of understanding and improving these products separately, it is important to gain insights into the evidence of cannibalization: an improvement in one product induces users to decrease their engagement with other products.  Cannibalization is very difficult to detect in offline evaluation, yet it frequently shows up in online A/B tests.

Consider the following example, an A/B test for a recommendation module.  A typical A/B test of a recommendation module commonly involves a change in the underlying machine learning algorithm, its user interface, or both.  In this example, the recommendation change significantly increased users’ clicks on recommendations while significantly decreasing users’ clicks on organic search results.

Table 1: A/B Test Results for Recommendation Module
(Simulated Experiment Data to Imitate the Real A/B Test)

% Change = Effect / Mean of Control

Recommendation Clicks    +28%***
Search Clicks            -1%***
Conversion               +0.2%
GMS                      -0.3%

Note: ‘***’ p<0.001, ‘**’ p<0.01, ‘*’ p<0.05, ‘.’ p<0.1.  The two-tailed p-value is derived from the z-test for H0: the effect is zero, which is based on asymptotic normality.

There is an intuitive explanation for the drop in search clicks: users might not need to search as much as usual because they could find what they were looking for through recommendations.  In other words, improved recommendations effectively diverted users’ attention away from search and thus cannibalized user engagement in search.

Note that increased recommendation clicks did not translate into observed gains in the key performance indicators: conversion and Gross Merchandise Sales (GMS).  Conversion and GMS are typically measured at the sitewide level because the ultimate goal of improving any product on our platform is to facilitate a better user experience on etsy.com.  The decision to launch a new algorithm is usually based on a significant gain in conversion/GMS from A/B tests.  The insignificant conversion/GMS gain, combined with the significant lift in recommendation clicks, makes it hard for product owners to decide whether to terminate the new algorithm.  They wonder whether the cannibalization in search clicks could, in turn, cannibalize the conversion/GMS gain from recommendations.  In other words, it is plausible that the improved recommendations should have brought a larger increase in conversion/GMS than the A/B test shows, and that their positive impact is partially offset by the negative impact of the cannibalized user engagement in search.  If there is cannibalization in the conversion/GMS gain, then, instead of terminating the new recommendation algorithm, it is advisable to launch it and revise the search algorithm to work better alongside it; otherwise, the development of recommendation algorithms would be hurt.

The challenge asks for separating the revenue loss (through search) from the original revenue gain (from the recommendation module change).  Unfortunately, from the A/B tests, we can only observe the cannibalization in user engagement (the induced reduction in search clicks).

Flaws of Purchase-Funnel Based Attribution Metrics

Specific product revenue is commonly attributed based on purchase-funnel/user-journey.  For example, the purchase-funnel of recommendations could be defined as a sequence of user actions: “click A in recommendations → purchase A”.  To compute recommendation-attributed conversion rate, we have to segment all the converted users into two groups: those who follow the pattern of the purchase-funnel and those who do not.  Only the first segment is used for counting the number of conversions.

However, the validity of the attribution is questionable.  In many A/B tests of new recommendation algorithms, it is common for the recommendation-attributed revenue change to be over +200% and the search-attributed revenue change to be around -1%.  It is difficult to see how the conversion lift is cannibalized and drops from +200% to the observed +0.2%.  These peculiar numbers remind us that attribution metrics based on purchase-funnels are hard to explain and unreliable for at least two reasons.

First, users usually take more complicated journeys than a heuristically-defined purchase-funnel can capture.  Here are two examples:

  1. If the recommendations make users stay longer on Etsy, and users click listings on other pages and modules to make purchases, then the recommendation-attributed metrics fail to capture the contribution of the recommendations to these conversions.  The purchase-funnel is based on “click”, and there is no way to incorporate “dwell time” to the purchase-funnel.
  2. Suppose the true user journey is “click A in recommendation → search A → click A in search results → click A in many other places → purchase A”.  Shall the conversion be attributed to recommendation or search? Shall all the visited pages and modules share the credit of this conversion? Any answer would be too heuristic to be convincing.

Second, attribution metrics cannot measure any causal effects.  The random assignment of users in an A/B test makes treatment and control buckets comparable and thus enables us to calculate the average treatment effect (ATE).  The segments of users who follow the pattern of purchase-funnel may not be comparable between the two buckets, because the segmentation criterion (i.e., user journey) happens after random assignment and thus the segments of users are not randomized between the two buckets.  In causal inference, factors that cause users to follow the pattern of purchase-funnel would be introduced by the segmentation and thus confound the causality between treatment and outcome. Any post-treatment segmentation could break the ignorability assumption of the causal identification and invalidate the causality in experiment analysis (see, e.g., Montgomery et al., 2018).

Causal Mediation Analysis

We exploit search clicks as a mediator in the causal path between recommendation improvement and conversion/GMS, and extend a formal framework of causal inference, causal mediation analysis (CMA), to separate the cannibalized effect from the original effect of the recommendation module change.  CMA splits the observed conversion/GMS gains (average treatment effect, ATE) in A/B tests into the gains from the recommendation improvement (direct effect) and the losses due to cannibalized search clicks (indirect effect). In other words, the framework allows us to measure the impacts of recommendation improvement on conversion/GMS directly as well as indirectly through a mediator such as search (Figure 1).  The significant drop in search clicks makes it a good candidate for the mediator. In practice, we can try different candidate mediators and use the analysis to confirm which one is the mediator.

Figure 1: Directed Acyclic Graph (DAG) to illustrate the causal mediation in recommendation A/B test.

However, it is challenging to implement CMA from the literature directly in practice.  An internet platform typically has tons of online products, and all of them could be mediators on the causal path between the tested product and the final business outcomes.  Figure 2 shows multiple mediators (M1, M0, and M2) on the causal path between treatment T and the final business outcome Y.  In practice, it is very difficult to measure user engagement in all these mediators.  Multiple unmeasured, causally-dependent mediators in A/B tests break the sequential ignorability assumption in CMA and invalidate it (see Imai et al. (2010) for the assumptions in CMA).

Figure 2: DAG of Multiple Mediators
Note: M0 and M2 are upstream and downstream mediators of the mediator M1 respectively.

We define the generalized average causal mediation effect (GACME) and the generalized average direct effect (GADE) to analyze this second form of cannibalization, the cannibalization of the conversion/GMS gain.  GADE captures the average causal effect of the treatment T that goes through all the channels that do not involve M1.  GACME captures the average causal effect of the treatment T that goes through all the channels that do involve M1.  We proved that, under some assumptions, GADE and GACME are identifiable even when there are numerous unmeasured, causally-dependent mediators.  If there is no unmeasured mediator, GADE and GACME collapse to ADE and ACME.  If there is, ADE and ACME cannot be identified, while GADE and GACME can.

Table 2 shows the sample results.  The recommendation improvement led to a 0.5% conversion lift, but the cannibalized search clicks resulted in a 0.3% conversion loss, and the observed overall conversion gain was not statistically significant.  When the outcome is GMS, we can see the loss through cannibalized search clicks as well.  The results confirm the cannibalization of the conversion lift, and serve as evidence to support the launch of the new recommendation module.

Table 2: Causal Mediation Analysis on Simulated Experiment Data

% Change = Effect / Mean of Control

Cannibalization in Gain                  Causal Mediation                 Conversion   GMS
The Original Gain from Recommendation    GADE(0) (Direct Component)       0.5%*        0.2%
The Loss Through Search                  GACME(1) (Indirect Component)    -0.3%***     -0.4%***
The Observed Gain                        ATE (Total Effect)               0.2%         -0.3%

Note: ‘***’ p<0.001, ‘**’ p<0.01, ‘*’ p<0.05, ‘.’ p<0.1.  The two-tailed p-value is derived from the z-test for H0: the effect is zero, which is based on asymptotic normality.

The implementation follows a causal mediation-based methodology we recently developed and published at KDD 2019.  We also made a fun video describing the intuition behind the methodology.  It is easy to implement and only requires solving two linear regression equations simultaneously (Section 4.4).  We simply need the treatment assignment indicator, search clicks, and the observed revenue for each experimental unit.  Interested readers can refer to our paper for more details and our GitHub repo for the analysis code.
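
For intuition, the estimation is in the spirit of classic linear mediation analysis; a simplified sketch (not the exact notation or estimator from the paper) is:

M_i = \alpha_0 + \alpha_1 T_i + e_i
Y_i = \beta_0 + \beta_1 T_i + \beta_2 M_i + u_i

Here T_i is the treatment assignment indicator, M_i is search clicks, and Y_i is the observed conversion or GMS outcome for experimental unit i.  Roughly speaking, \beta_1 corresponds to the direct component (the gain from the recommendation change itself), the product \alpha_1 \beta_2 corresponds to the indirect component flowing through search clicks, and the two sum to the total effect, consistent with Table 2, where GADE(0) plus GACME(1) approximately equals the ATE.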

We have successfully deployed our model to identify products that are prone to cannibalization.  In particular, it has helped product and engineering teams understand the tradeoffs between search and recommendations, and focus on the right opportunities.  The direct effect on revenue is a more informative key performance indicator than the observed average treatment effect to measure the true contribution of a product change to the marketplace and to guide the decision on the launch of new product features.


The journey to fast production asset builds with Webpack

Posted on February 3, 2020

Etsy has switched from using a RequireJS-based JavaScript build system to using Webpack. This has been a crucial cornerstone in the process of modernizing Etsy’s JavaScript ecosystem. We’ve learned a lot from addressing our multiple use-cases during the migration, and this post is the first of two parts documenting our learnings. Here, we specifically cover the production use-case — how we set up Webpack to build production assets in a way that meets our needs. The second post can be found here.

We’re proud to say that our Webpack-powered build system, responsible for over 13,200 assets and their source maps, finishes in four minutes on average. This fast build time is the result of countless hours of optimizing. What follows is our journey to achieving such speed, and what we’ve discovered along the way.

Production Expectations

One of the biggest challenges of migrating to Webpack was achieving production parity with our pre-existing JavaScript build system, named Builda. It was built on top of the RequireJS optimizer, a build tool predating Webpack, with extensive customization to speed up builds and support then-nascent standards like JSX. Supporting Builda became more and more untenable, though, with each custom patch we added to support JavaScript’s evolving standards. By early 2018, we consequently decided to switch to Webpack; its community support and tooling offered a sustainable means to keep up with JavaScript modernization. However, we had spent many years optimizing Builda to accommodate our production needs. Assets built by Webpack would need to have 100% functional parity with assets built by Builda in order for us to have confidence around switching.

Our primary expectation for any new build system is that it takes less than five minutes on average to build our production assets. Etsy is one of the earliest and biggest proponents of continuous deployment, where our fast build/deploy pipeline allows us to ship changes up to 30 times a day. Builda was already meeting this expectation, and we would negatively impact our deployment speed/frequency if a new system took longer to build our JavaScript. Build times, however, tend to increase as a codebase grows, productionization becomes more complex, and available computing resources are maxed out.

At Etsy, our frontend consists of over 12,000 modules that eventually get bundled into over 1,200 static assets. Each asset needs to be localized and minified, both of which are time-consuming tasks. Furthermore, our production asset builds were limited to 32 CPU cores and 64GB of RAM. Etsy had not yet moved to the cloud when we started migrating to Webpack, and these were the specs of the beefiest on-premise hosts available. This meant we couldn’t just add more CPU/RAM to achieve faster builds.

So, to recap: we needed to bundle over 12,000 modules into over 1,200 assets, localize each of them into eleven locales, minify the results, and do it all in under five minutes on a single 32-core, 64GB host.

We got this.

Localization

From the start, we knew that localization would be a major obstacle to achieving sub-five-minute build times. Localization strings are embedded in our JavaScript assets, and at Etsy we officially support eleven locales. This means we need to produce eleven copies of each asset, where each copy contains localization strings of a specific locale. Suddenly, building over 1,200 assets balloons into building over 1,200 × 11 = 13,200 assets.

General caching solutions help reduce build times, independent of localization’s multiplicative factor. After we solved the essential problems of resolving our module dependencies and loading our custom code with Webpack, we incorporated community solutions like cache-loader and babel-loader’s caching options. These solutions cache intermediary artifacts of the build process, which can be time-consuming to calculate. As a result, asset builds after the initial one finish much faster. Still, though, we needed more than caching to build localized assets quickly.
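
As a rough illustration, that kind of caching can be layered into the module rules along these lines (a sketch, not our actual configuration):

// Sketch: caching intermediary build artifacts in a webpack configuration.
// The entry, output, and cache paths are illustrative.
module.exports = {
    entry: { example: "./src/index.js" },
    output: { filename: "[name].js" },
    module: {
        rules: [
            {
                test: /\.jsx?$/,
                use: [
                    // cache-loader writes the results of the loaders below it to disk.
                    { loader: "cache-loader", options: { cacheDirectory: ".build-cache" } },
                    // babel-loader keeps its own cache of transpiled output as well.
                    { loader: "babel-loader", options: { cacheDirectory: true } },
                ],
            },
        ],
    },
};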

One of the first search results for Webpack localization was the now-deprecated i18n-webpack-plugin. It expects a separate Webpack configuration for each locale, leading to a separate production asset build per locale. Even though Webpack supports multiple configurations via its MultiCompiler mode, the documentation crucially points out that “each configuration is only processed after the previous one has finished processing.” At this stage in our process, we measured that a single production asset build without minification was taking ~3.75 minutes with no change to modules and a hot cache (a no-op build). It would take us ~3.75 × 11 = ~41.25 minutes to process all localized configurations for a no-op build.

We also ruled out using this plugin with a common solution like parallel-webpack to process configurations in parallel. Each parallel production asset build requires additional CPU and RAM, and the sum far exceeded the 32 CPU cores and 64GB of RAM available. Even when we limited the parallelism to stay under our resource limits, we were met with overall build times of ~15 minutes for a no-op build. It was clear we needed to approach localization differently.

Localization inlining

To localize our assets, we took advantage of two characteristics about our localization. First, the way we localize our JavaScript code is through a module abstraction. An engineer defines a module that contains only key-value pairs. The value is the US-English version of the text that needs to be localized, and the key is a succinct description of the text. To use the localized strings, the engineer imports the module in their source code. They then have access to a function that, when passed a string corresponding to one of the keys, returns the localized value of the text.

example of how we include localizations in our JavaScript
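
The abstraction looks roughly like the following sketch; the keys, strings, and helper function are illustrative, not our real API:

// Sketch of the localization module abstraction. A message catalog module
// contains nothing but key/value pairs, where the values are the US-English
// strings that need localizing.
const messages = {
    favorite_shop: "Favorite this shop",
    item_ships_free: "This item ships free!",
};

// Importing the catalog gives application code a function that maps a key to
// the localized value for the current locale (US English by default).
function createTranslator(catalog) {
    return (key) => catalog[key];
}

const t = createTranslator(messages);
console.log(t("item_ships_free")); // "This item ships free!"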

For a different locale, the message catalog contains analogous localization strings for the locale. We programmatically handle generating analogous message catalogs with a custom Webpack loader that applies whenever Webpack encounters an import for localizations. If we wanted to build Spanish assets, for example, the loader would look something like this:

example of how we would load Spanish localizations into our assets
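
A sketch of what such a loader could look like for Spanish follows; how the translated catalogs are located on disk is an assumption for illustration:

// es-localization-loader.js (sketch): whenever webpack encounters an import
// for a message catalog, return the Spanish catalog for that module instead.
// The on-disk layout of translated catalogs is an illustrative assumption.
const fs = require("fs");
const path = require("path");

module.exports = function esLocalizationLoader() {
    const catalogPath = path.join(
        path.dirname(this.resourcePath),
        "translations",
        "es.json"
    );
    // Tell webpack to watch the catalog so edits trigger rebuilds.
    this.addDependency(catalogPath);

    const spanishCatalog = fs.readFileSync(catalogPath, "utf8");
    return `module.exports = ${spanishCatalog};`;
};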

Second, once we build the localized code and output localized assets, the only differing lines between copies of the same asset from different locales are the lines with localization strings; the rest are identical. When we build the above example with English and Spanish localizations, the diff of the resulting assets confirms this:

diff of the localized copies of an asset

Even when caching intermediary artifacts, our Webpack configuration would spend over 50% of the overall build time constructing the bundled code of an asset. If we provided separate Webpack configurations for each locale, we would repeat this expensive asset construction process eleven times.

diagram of running Webpack for each locale

We could never finish this amount of work within our build-time constraints, and as we saw before, the resulting localized variants of each asset would be identical except for the few lines with localizations. What if, rather than locking ourselves into loading a specific locale’s localization and repeating an asset build for each locale, we returned a placeholder where the localizations should go?

code to load placeholders in place of localizations
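
A minimal sketch of that idea; the placeholder format here is made up, and the only requirement is that it stays syntactically valid and easy to find later:

// placeholder-localization-loader.js (sketch): instead of inlining one locale,
// emit a syntactically valid placeholder recording which catalog was imported.
// A later step swaps the placeholder for each locale's real catalog.
module.exports = function placeholderLocalizationLoader() {
    const sentinel = "__LOCALIZATION_SENTINEL__:" + this.resourcePath;
    return `module.exports = ${JSON.stringify(sentinel)};`;
};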

We tried this placeholder loader approach, and as long as it returned syntactically valid JavaScript, Webpack could continue with no issue and generate assets containing these placeholders, which we call “sentinel assets”. Later on in the build process, a custom plugin takes each sentinel asset, finds the placeholders, and replaces them with the corresponding message catalogs to generate a localized asset.
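
The replacement step could look roughly like this as a webpack plugin; this sketch uses webpack 4’s emit hook, and the catalog lookup and placeholder pattern are assumptions carried over from the loader sketch above:

// LocalizationInliningPlugin (sketch): for every sentinel asset webpack emits,
// add one localized copy per locale with placeholders replaced by catalogs.
// getCatalog() and the placeholder pattern are illustrative assumptions.
class LocalizationInliningPlugin {
    constructor(locales, getCatalog) {
        this.locales = locales;       // e.g. ["en-US", "es", "fr"]
        this.getCatalog = getCatalog; // (locale, catalogPath) => key/value object
    }

    apply(compiler) {
        compiler.hooks.emit.tap("LocalizationInliningPlugin", (compilation) => {
            for (const [name, asset] of Object.entries(compilation.assets)) {
                const sentinelSource = asset.source().toString();
                for (const locale of this.locales) {
                    const localized = sentinelSource.replace(
                        /"__LOCALIZATION_SENTINEL__:([^"]+)"/g,
                        (match, catalogPath) =>
                            JSON.stringify(this.getCatalog(locale, catalogPath))
                    );
                    // Emit the localized copy alongside the sentinel asset.
                    compilation.assets[`${locale}/${name}`] = {
                        source: () => localized,
                        size: () => localized.length,
                    };
                }
            }
        });
    }
}

module.exports = LocalizationInliningPlugin;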

diagram of our build process with localization inlining

We call this approach “localization inlining”, and it was actually how Builda localized its assets too. Although our production asset builds write these sentinel assets to disk, we do not serve them to users. They are only used to derive the localized assets.

With localization inlining, we were able to generate all of our localized assets from one production asset build. This allowed us to stay within our resource limits; most of Webpack’s CPU and RAM usage is tied to calculating and generating assets from the modules it has loaded. Adding additional files to be written to disk does not increase resource utilization as much as running an additional production asset build does.

Now that a single production asset build was responsible for over 13,200 assets, though, we noticed that simply writing this many assets to disk substantially increased build times. It turns out, Webpack only uses a single thread to write a build’s assets to disk. To address this bottleneck, we included logic to write a new localized asset only if the localizations or the sentinel asset have changed — if neither have changed, then the localized asset hasn’t changed either. This optimization greatly reduced the amount of disk writing after the initial production asset build, allowing subsequent builds with a hot cache to finish up to 1.35 minutes faster. A no-op build without minification consistently finished in ~2.4 minutes. With a comprehensive solution for localization in place, we then focused on adding minification.

Minification

Out of the box, Webpack includes the terser-webpack-plugin for asset minification. Initially, this plugin seemed to perfectly address our needs. It offered the ability to parallelize minification, cache minified results to speed up subsequent builds, and even extract license comments into a separate file.

When we added this plugin to our Webpack configuration, though, our initial asset build suddenly took over 40 minutes and used up to 57GB of RAM at its peak. We expected the initial build to take longer than subsequent builds and that minification would be costly, but this was alarming. Enabling any form of production source maps also dramatically increased the initial build time. Without the terser-webpack-plugin, the initial production asset build with localizations would finish in ~6 minutes. It seemed like the plugin was adding an unknown bottleneck to our builds, and ad hoc monitoring with htop during the initial production asset build seemed to confirm our suspicions:

htop during minification

At some points during the minification phase, we appeared to only use a single CPU core. This was surprising to us because we had enabled parallelization in terser-webpack-plugin’s options. To get a better understanding of what was happening, we tried running strace on the main thread to profile the minification phase:

strace during minification

At the start of minification, the main thread spent a lot of time making memory syscalls (mmap and munmap). Upon closer inspection of terser-webpack-plugin’s source code, we found that the main thread needed to load the contents of every asset to generate parallelizable minification jobs for its worker threads. If source maps were enabled, the main thread also needed to calculate each asset’s corresponding source map. These lines explained the flood of memory syscalls we noticed at the start.

Further into minification, the main thread started making recvmsg and write syscalls to communicate between threads. These syscalls made sense once we found that the main thread needed to serialize the contents of each asset (and its source map, if source maps were enabled) to send it to a worker thread to be minified. After receiving and deserializing a minification result from a worker thread, the main thread was also solely responsible for caching the result to disk. This explained the stat, open, and other write syscalls we observed, because the Node.js code writes the contents using promises. The underlying epoll_wait syscalls then poll to check when the writing has finished so that the promise can be resolved.

The main thread can become a bottleneck when it has to perform these tasks for a lot of assets, and considering our production asset build could produce over 13,200 assets, it was no wonder we hit this bottleneck. To minify our assets, we would have to think of a different way.

Post-processing

We opted to minify our production assets outside of Webpack, in what we call “post-processing”. We split our production asset build into two stages, a Webpack stage and a post-processing stage. The former is responsible for generating and writing localized assets to disk, and the latter is responsible for performing additional processing on these assets, like minification:

running Webpack with a post-processing stage
diagram of our build process with localization inlining and post-processing

For minification, we use the same terser library that the terser-webpack-plugin uses. We also baked parallelization and caching into the post-processing stage, albeit in a different way than the plugin. Where Webpack’s plugin reads the file contents on the main thread and sends the whole contents to the worker threads, our parallel-processing jobs send just the file path to the workers. A worker is then responsible for reading the file, minifying it, and writing it to disk. This reduces memory usage and makes the parallel processing more efficient. To implement caching, the Webpack stage passes along the list of assets written by the current build to tell the post-processing stage which files are new. Sentinel assets are also excluded from post-processing because they aren’t served to users.
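A rough sketch of what such a path-based worker might look like, using Node’s worker_threads and terser directly (the file naming and message shape are assumptions, not our exact code):

```typescript
// minify-worker.ts — sketch of a post-processing worker. The main thread sends
// only file paths; the worker reads, minifies, and writes the asset itself.
import { parentPort } from 'worker_threads';
import { readFile, writeFile } from 'fs/promises';
import { minify } from 'terser';

parentPort!.on('message', async (assetPath: string) => {
  const source = await readFile(assetPath, 'utf8');
  const result = await minify(source);
  // Write the minified output next to the original (illustrative naming).
  await writeFile(assetPath.replace(/\.js$/, '.min.js'), result.code ?? '');
  // Report completion so the main thread can track progress and caching.
  parentPort!.postMessage({ assetPath });
});
```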

Splitting our production asset builds into two stages does have a potential downside: our Webpack configuration is now expected to output un-minified text for assets. Consequently, we need to audit any third-party plugins to ensure they do not transform the output assets into a format that breaks post-processing. Nevertheless, post-processing is well worth it because it allows us to achieve the fast build times we expect for production asset builds.

Bonus: source maps

We don’t just generate assets in under five minutes on average; we also generate corresponding source maps for all of them. Source maps allow engineers to reconstruct the original source code that went into an asset. They do so by maintaining a mapping from the output lines of a transformation, like minification or Webpack bundling, to the input lines. Maintaining this mapping during the transformation process, though, inherently adds time.

Coincidentally, the same localization characteristics that enable localization inlining also enable faster source map generation. As we saw earlier, the only differences between localized assets are the lines containing localization strings. Consequently, those lines are the only differences between the source maps for these localized assets. For every other line, the source map for one localized asset is just as accurate for another, because each line sits at the same line number in every localized asset.

If we were to generate source maps for each localized asset, we would end up repeating resource-intensive work only to produce nearly identical source maps across locales. Instead, we only generate source maps for the sentinel assets the localized assets are derived from. We then use the sentinel asset’s source map for each localized asset derived from it, and accept that the mapping for the lines with localization strings will be incorrect. This greatly speeds up source map generation because we are able to reuse a single source map that applies to many assets.

For the minification transformation that occurs during post-processing, terser accepts a source map alongside the input to be minified. This input source map allows terser to account for prior transformations when generating source mappings for its minification. As a result, the source map for its minified results still maps back to the original source code before Webpack bundling. In our case, we pass terser the sentinel asset’s source map for each localized asset derived from it. This is only possible because we aren’t using terser-webpack-plugin, which (understandably) doesn’t allow mismatched asset/source map pairings.
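Conceptually, the hand-off to terser looks something like this sketch (the paths and option choices are illustrative):

```typescript
import { readFile } from 'fs/promises';
import { minify } from 'terser';

// Sketch: minify a localized asset while reusing the sentinel asset's source
// map as terser's input map, so the final map still points at original source.
async function minifyLocalizedAsset(localizedPath: string, sentinelMapPath: string) {
  const [code, sentinelMap] = await Promise.all([
    readFile(localizedPath, 'utf8'),
    readFile(sentinelMapPath, 'utf8'),
  ]);
  return minify(code, {
    sourceMap: {
      content: JSON.parse(sentinelMap), // input map taken from the sentinel asset
      url: `${localizedPath}.map`,      // where the final map will be referenced from
    },
  });
}
```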

diagram of our complete build process with localization inlining, post-processing, and source map optimizations

Through these source map optimizations, we are able to maintain source maps for all assets while adding only ~1.7 minutes to our average build time. Our approach can yield up to a 70% speedup in source map generation compared to the out-of-the-box options offered by Webpack.

Conclusion

Our journey to achieving fast production builds can be summed up into three principles: reduce, reuse, recycle.

While some implementation details may become obsolete as Webpack and the frontend evolve, these principles will continue to guide us towards faster production builds.

This post is the first in a two-part series on our migration to a modern JavaScript build system. The second part can be found here.


G-Scout Enterprise and Cloud Security at Etsy

Posted by on November 18, 2019

As companies are moving to the cloud, they are finding a need for security tooling to audit and analyze their cloud environments. Over the last few years, various tools have been developed for this purpose. We’ll look at some of them and consider the uses for them. Specifically, we’ll take a close look at G-Scout, a tool I developed while working at NCC Group to look for security misconfigurations in Google Cloud Platform (GCP); and G-Scout Enterprise, a new tool with the same purpose, but tailored to the needs of security engineers at Etsy. We’ll also consider G-Scout Enterprise’s role within an ecosystem of other cloud logging and monitoring tools used at Etsy.

Cloud environments have a convenient feature which you won’t get from on-premises servers: they have APIs. It’s similar for all the major cloud providers. They have a REST API which provides information on what services are being used, what resources exist, and how they are configured. An authorized user can call these APIs through a command line tool, or programmatically through a client library.

Those APIs provide information which is useful for security purposes. A classic example is a storage bucket (S3, GCS, etc.) which has been made public. It could be publicly readable, or publicly writable. Since we can use the API to see the permissions on any bucket we own, we can look for misconfigured permissions. So we go through all the API data we have for all our storage buckets, and look for permissions assigned to allUsers, or allAuthenticatedUsers.
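A check like that is just a filter over the policy data the API returns. Here is a sketch in which the policy shape follows the GCS JSON API’s getIamPolicy response, while the helper itself is purely illustrative:

```typescript
// Flag IAM bindings that make a bucket publicly readable or writable.
interface IamBinding {
  role: string;
  members: string[];
}
interface IamPolicy {
  bindings?: IamBinding[];
}

function findPublicBindings(policy: IamPolicy): IamBinding[] {
  return (policy.bindings ?? []).filter((binding) =>
    binding.members.some(
      (member) => member === 'allUsers' || member === 'allAuthenticatedUsers'
    )
  );
}
```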

Here are some other common examples:

Configuration Scanning Tools

Rather than making API calls and processing the data ad hoc, you can create a framework: a tool that allows you, with a single command, to run various API calls to gather data on diverse resources, and then programmatically look for misconfigurations in that data. And in the end, you can have the tool place the results into a human-readable HTML report which you can browse according to your whims.

Scout 2 does all of the above for Amazon Web Services (AWS). G-Scout was created with a similar solution in mind, but for GCP. Plenty of other examples have followed Scout 2. Some, like G-Scout, are open source, and others are available for purchase.

These tools continue to evolve. It is becoming increasingly common for companies to use more than one cloud provider. With this trend we’ve seen the creation of multi-cloud tools. Scout Suite has replaced Scout 2. Inspec supports AWS, Azure, and GCP.

And some of them have added features. Forseti Inventory stores the data collected in a SQL database (I’ve moved G-Scout in a similar direction, as we’ll see later). Forseti Enforcer will actually make changes to match policies. 

These features are useful, but not so much to a consultant, since a consultant shouldn’t want any permissions aside from viewer permissions. Scout 2 was designed for consulting. The user can get viewer permissions, run the tool, and leave no trace. Forseti, on the other hand, requires Organization Admin permissions, and creates a database and other resources within the organization that is being audited.

Difficulties With G-Scout

But the same basic functionality remains at the core of each of these tools. When it came to G-Scout, that core functionality worked well for smaller companies, or those less committed to GCP. But when there are hundreds of projects, thousands of instances, and many other resources, it becomes difficult to go through the results. 

Adding to this difficulty is the presence of false positives. Any automated tool is going to turn up false positives. Context can turn something that looks like a finding at first glance into something acceptable. To return to our public storage bucket example, there are some cases where the content in the bucket is intended to be public. You can even serve a simple HTML website from a storage bucket. So it tends to fall to a human to go through and figure out which findings are false positives. Since it takes time to fix real findings, and the false positives don’t go away, running the tool frequently to see what’s new becomes untenable.

Finally, at Etsy, many of the findings G-Scout would turn up had already been found by other means, which we will explore a bit below.

We have a tool called Reactor. There is a stackdriver log sink for the organization, and those logs (with filters applied) go to a PubSub topic. A cloud function subscribes to that topic, and when it finds logs that match any of a further set of filters (the alerting rules), it triggers an alert.

So for example, if someone makes a storage bucket public, an alert will trigger as soon as the corresponding stackdriver log is generated, rather than waiting for someone to run G-Scout at some point.

Here’s a partial example of a stackdriver log. As an API call to check IAM permissions would, it has all the information we need to trigger an alert. We see the user that granted the permission (in this case a service account). And below the fold we would see which role was assigned and which user it was assigned to.
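In spirit, that cloud function boils down to something like the sketch below. The rule and the alert hand-off are made up for illustration; only the Pub/Sub message shape and the audit log method name are standard GCP:

```typescript
// Sketch of a Pub/Sub-triggered function that applies alerting rules to logs.
interface AlertRule {
  name: string;
  matches: (logEntry: any) => boolean;
}

const rules: AlertRule[] = [
  {
    // Example rule: a GCS bucket IAM change that grants access to allUsers.
    name: 'bucket-made-public',
    matches: (logEntry) =>
      logEntry?.protoPayload?.methodName === 'storage.setIamPermissions' &&
      JSON.stringify(logEntry?.protoPayload ?? {}).includes('allUsers'),
  },
];

// Background function for Pub/Sub: the log entry arrives base64-encoded.
export function reactor(pubSubMessage: { data: string }): void {
  const logEntry = JSON.parse(Buffer.from(pubSubMessage.data, 'base64').toString());
  for (const rule of rules) {
    if (rule.matches(logEntry)) {
      // Hand off to the alerting pipeline; logging stands in for that here.
      console.log(`ALERT: ${rule.name}`, logEntry.insertId);
    }
  }
}
```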

Another point where we alert on misconfigurations is resource creation. We use Terraform for infrastructure as code. Before a Terraform apply is run, the pipeline runs a series of unit tests. The unit tester checks for many of the same events which we alert on with the stackdriver logs. This includes the common example of a bucket being made public.

This is another process that is not so useful for a security consultant. But it’s better to catch misconfigurations in this way than in the way Scout 2 or G-Scout would catch them, since this will prevent them from ever being implemented!

So we have what I’ll call a three-pronged approach to catching misconfigurations in GCP. These are the three prongs:

  1. Terraform unit tests, which catch misconfigurations before they are ever applied.
  2. Reactor alerts, which fire as soon as a matching stackdriver log is generated.
  3. G-Scout scans of the collected API data, which catch anything the first two miss.

In summary, G-Scout’s traditional purpose was proving minimally useful. It was difficult to make good use of the G-Scout reports. And as we’ve seen, the first two prongs will usually catch misconfigurations first. So I moved away from G-Scout, and toward a new creation: G-Scout Enterprise.

G-Scout Enterprise

The fundamental change is to replace the HTML report with a BigQuery data collection. In fact, at its core, G-Scout Enterprise is very simple. It’s mostly just something that takes API data and puts it into BigQuery. Then other systems can do with that data as they please. The rules that will trigger alerts can be written in our alerting system like any other alerts we have (though they can also easily be written in Python within G-Scout Enterprise). We are now putting all of our other data into BigQuery as well, so it’s all connected.

Users can query any of the tables, each of which corresponds to one GCP API endpoint. G-Scout Enterprise tables can be joined – and they can be joined to our other data sources as well. And we can be very specific: like looking for all roles where amellos@etsy.com is a member, without enshrining it in our ruleset, because we can run queries through the BigQuery console. Or we can run queries in the command line, with helper functions that allow us to query with Python rather than SQL.
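The underlying queries are ordinary BigQuery SQL. As a sketch of running one programmatically (the client choice, dataset, table, and column names here are all invented for illustration):

```typescript
import { BigQuery } from '@google-cloud/bigquery';

// Find every role binding that includes a given member across the collected IAM data.
async function rolesForMember(member: string) {
  const bigquery = new BigQuery();
  const [rows] = await bigquery.query({
    query: `
      SELECT resource, binding.role AS role
      FROM \`gscout_enterprise.iam_policies\`,
        UNNEST(bindings) AS binding,
        UNNEST(binding.members) AS bound_member
      WHERE bound_member = @member
    `,
    params: { member },
  });
  return rows;
}

// Usage: rolesForMember('user:amellos@etsy.com');
```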

We can make comparisons and track changes over time. It can also provide context to other alerts. For example, if we have an IP address from an SSH alert, we can get information about the instance which owns that IP address, such as what service account it has, or what Chef role it has. 

Or for instance, the following, more complicated scenario:

We run Nessus, an automated vulnerability scanner. It has a library of vulnerabilities it looks for by making network requests: you give it a list of IPs and it goes through them all. We now have it running daily. With a network of any size, the volume of findings quickly becomes overwhelming. Many of them are informational or can safely be ignored. But the rest need to be triaged and addressed in a systematic way.

Not all Nessus findings are created equal. The same vulnerability on two different instances may be much more concerning on one than the other: if one is exposed to the internet and the other is not; if one is working with PII and the other is not; if one is in development and the other in production, and so on. Most of the information which determines how concerned we are with a vulnerability can be found among the collection of data contained in G-Scout Enterprise. This has simplified our scanning workflow. Since we can do network analysis with the data in G-Scout Enterprise, we can identify which instances are accessible from where. That means we don’t have to scan from different perspectives. And it has improved the precision of our vulnerability triaging, since there is so much contextual data available.

So we go through the following process:

  1. Enumerate all instances in our GCP account.
  2. Discard duplicate instances (instances from the same template, e.g. our many identical web server instances).
  3. Run the Nessus scan and place the results into BigQuery.
  4. Create a joined table of firewall rules and instances which they apply to (matching tags).
  5. Take various network ranges (0.0.0.0/0, our corporate range, etc.), and for each firewall rule see if it allows traffic from that source.
  6. For instances with firewall rules that allow ingress from 0.0.0.0/0, see if the instance has a NatIP or is behind an external load balancer.
  7. Check whether the project the instance lives in is one of the projects classified as sensitive.
  8. Compute and assign scores according to the previous steps.

And then we save the results into BigQuery. That gives us historical data. We can see if we are getting better or worse. We can see if we have certain troublemaker projects. We can empower our patch management strategy with a wealth of data.
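Schematically, the scoring step at the end of that pipeline might look like the following sketch. The weights and field names are purely illustrative; the real scoring takes more context into account:

```typescript
// Toy sketch of context-aware scoring for a Nessus finding.
interface Finding {
  pluginName: string;
  severity: number;   // e.g. Nessus severity 0-4
  instanceId: string;
}

interface InstanceContext {
  internetExposed: boolean;   // firewall allows 0.0.0.0/0 and instance has a NatIP or external LB
  sensitiveProject: boolean;  // project classified as sensitive
}

function scoreFinding(finding: Finding, context: InstanceContext): number {
  let score = finding.severity;
  if (context.internetExposed) score *= 2;    // hypothetical weighting
  if (context.sensitiveProject) score *= 1.5; // hypothetical weighting
  return score;
}
```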

Conclusion

That leaves us with a few main lessons gained from adapting G-Scout to Etsy:

One last note is that we have plans to open source G-Scout Enterprise in the coming months.


Engineering Career Development at Etsy

Posted by on October 2, 2019 / No Responses

In late May of 2018, Etsy internally released an Engineering Career Ladder. Today, we’re sharing that ladder publicly and detailing why we decided to build it, why the content is what it is, and how it’s been put into use since its release.

Take a look

Defining a Career Ladder

A career ladder is a tool to help outline an engineer’s path for growth within a company. It should provide guidance to engineers on how to best take on new responsibilities, and allow their managers to assess and monitor performance and behavior. A successful career ladder should align career progression with a company’s culture, business goals, and guiding principles and act as a resource to guide recruiting, training, and performance assessments.

Etsy has had several forms of a career ladder before this iteration. The prior career ladders applied to all Etsy employees, and had a set of expectations for every employee at the same level across all disciplines. Overall, these previous ladders worked well for Etsy as a smaller company, but as the engineering team continued to grow we found the content needed updating to meet practical expectations, as it had started to feel too broad and unactionable.

As a result, we developed this career ladder, specific to engineering, to allow us to be more explicit with those expectations and create a unified understanding of what it means to be an engineer at a certain level at Etsy. This ladder has been in place for over a year now, and in that time we’ve gone through performance reviews, promotion cycles, lots of hiring, and one-on-one career development conversations. We’re confident that we’ve made a meaningful improvement to engineering career development at Etsy and hope that releasing this career ladder publicly can help other companies support engineering career growth as well.

Designing the Etsy Engineering Career Ladder

We formed a working group focused on creating a new iteration of the career ladder, composed of engineers and engineering managers of various levels. The working group included Miriam Lauter, Dan Auerbach, Jason Wain, and me. We started by exploring our current company-wide career ladder, discussing its merits and limitations, and the impact it had on engineering career development. We knew that any new version needed to be unique to Etsy, but we spent time exploring publicly available ladders of companies who had gone through a similar process in an effort to understand both tactical approaches and possible formats. Many thanks specifically to Spotify, Kickstarter, Riot Games, and Rent the Runway for providing insight into their processes and outcomes. Reviewing their materials was invaluable.

We decided our first step was to get on the same page as to what our goals were, and went through a few exercises resulting in a set of tenets that we felt would drive our drafting process and provide a meaningful way to evaluate the efficacy of the content. These tenets provided the foundation to our approach for developing the ladder.

The Tenets

Support meaningful career growth for engineers

Our career ladder should be clear enough, and flexible enough, to provide direction for any engineer at the company. We intended this document to provide actionable steps to advance your career in a way that is demonstrably impactful. Ideally, engineers would use this ladder to reflect on their time at Etsy and say “I’ve developed skills here I’ll use my entire career.”

Unify expectations across engineering

We needed to build alignment across the entire engineering department about what was required to meet the expectations of a specific level. If our career ladder were too open to interpretation it would cause confusion, particularly as it relates to the promotion process. We wanted to ensure that everyone had a succinct, memorable way to describe our levels, and understand exactly how promotions happen and what is expected of themselves and their peers.

Recognize a variety of valid career paths

Whether you’re building machine learning models or localizing our products, engineering requires skills across a range of competencies, and every team and project takes individuals with strengths in each. We wanted to be explicit about what we believe about the discipline, that valid and meaningful career paths exist at all levels for engineers who bring differences of perspectives and capabilities, and that not everyone progresses as an engineer in the same way. We intended to codify that we value growth across a range of competencies, and that we don’t expect every person to have the same set of strengths at specific points in their career.

Limit room for bias in how we recognize success

A career ladder is one in a set of tools that can help an organization mitigate potential bias. We needed to be thoughtful about our language, ensuring that it is inclusive, objective, and action oriented. We knew the career ladder would be used as basis for key career advancement moments, such as hiring and promotions, so developing a clear and consistent ladder was critical for mitigating potential bias in these processes.

Developing the Etsy Engineering Career Ladder

With these tenets in place, we had the first step towards knowing what was necessary for success. In addition to creating draft ladder formats, we set about determining how we could quantify the improvements that we were making. We outlined key areas where we’d need to directly involve our stakeholders, including engineering leadership, HR, Employee Resource Groups, and of course engineers. We made sure to define multiple perspectives for which the ladder should be a utility; e.g. an engineer looking to get promoted, a manager looking to help guide an engineer to promotion, or a manager who needed to give constructive performance feedback.

Implicit biases can be notoriously difficult to acknowledge and remove from these processes, and we knew that in order to do this as best as possible we’d need to directly incorporate feedback from many individuals, both internal and external, across domains and disciplines, and with a range of perspectives, to assure that we were building those perspectives into the ladder.

Our tactics for measuring our progress included fielding surveys and requests for open feedback, as well as direct 1:1 in-depth feedback sessions and third party audits to ensure our language was growth-oriented and non-idiomatic. We got feedback on structure and organization of content, comprehension of the details within the ladder, the ladder’s utility when it came to guiding career discussions, and alignment with our tenets.

The feedback received was critical in shaping the ladder. It helped us remove duplicative, unnecessary, or confusing content and create a format that we thought best aligned with our stated tenets and conveyed our intent. 

And finally, the Etsy Engineering Career Ladder

You can find our final version of the Etsy Engineering Career Ladder here.

The Etsy Engineering Career Ladder is split into two parts: level progression and competency matrix. This structure explicitly allows us to convey how Etsy supports a variety of career paths while maintaining an engineering-wide definition of each level. The level progression is the foundation of the career ladder. For each level, the ladder lays out all requirements including expectations, track record, and competency guidelines. The competency matrix lays out the behaviors and skills that are essential to meeting the goals of one’s role, function, or organization.

Level Progression

Each section within the level progression provides a succinct definition of the requirements for an engineer with that title. It details a number of factors, including the types of problems an engineer is solving, the impact of their work on organizational goals and priorities, and how they influence the others they work with. For levels beyond Engineer I, we outline an expected track record, detailing achievements over a period of time in both scale and complexity. And to set expectations for growth of competencies, we broadly outline what levels of mastery an engineer needs to achieve in order to be successful.

Competencies

If the level progression details what is required of an engineer at a certain level, competencies detail how we expect they can meet those expectations. We’ve outlined five core competency areas:

For each of these five competency areas, the competency matrix provides a list of examples that illustrate what it means to have achieved various levels of mastery. Mastery of a competency is cumulative — someone who is “advanced” in problem solving is expected to retain the skills and characteristics required for an “intermediate” or “beginner” problem solver.

Evaluating our Success

We internally released this new ladder in May of 2018. We did not immediately make any changes to our performance review processes, as it was critical to not change how we were evaluating success in the middle of a cycle. We merely released it as a reference for engineers and their managers to utilize when discussing career development going forward. When our next performance cycle kicked off, we began incorporating details from the ladder into our documentation and communications, making sure that we were using it to set the standards for evaluation.

Today, this career ladder is one of the primary tools we use for guiding engineer career growth at Etsy. Utilizing data from company-wide surveys, we’ve seen meaningful improvement in how engineers see their career opportunities as well as growing capabilities for managers to guide that growth.

Reflecting on the tenets outlined at the beginning of the process allows us to look back at the past year and a half and recognize the change that has occurred for engineers at Etsy and evaluate the ladder against the goals we believed would make it a success. Let’s look back through each tenet and see how we accomplished it.

Support meaningful career growth for engineers

While the content is guided by our culture and Guiding Principles, generally none of the competencies are Etsy-specific. The expectations, track record, and path from “beginner” to “leading expert” in a competency category are designed to show the growth of an engineer’s impact and recognize accomplishments that they can carry throughout their career, agnostic of their role, team, or even company.

The competency matrix also allows us to guide engineer career development within a level. While a promotion to a new level is a key milestone that requires demonstration of meeting expectations over time, advancing your level of mastery by focusing on a few key competencies allows engineers to demonstrate continual growth, even within the same level. This encourages engineers and their managers to escape the often insurmountable task of developing a plan to achieve the broader set of requirements for the next promotion, and instead create goals that help them get there incrementally.

Compared to our previous ladder, the path to Staff Engineer is no longer gated by the necessity to increase one’s breadth. We recognized that every domain has significantly complex, unscoped problems that need to be solved, and that we were limiting engineer growth by requiring those who were highly successful in their domain to expand beyond it. Having expectations outlined as they are now allows engineers the opportunity to grow by diving more deeply into their current domains.

Unify expectations across engineering

The definition for each level consists of only a few expectations, a track record, and guidelines for level of mastery of competencies. It is easy to parse and easy to refer back to for a quick understanding of the requirements. With a little reflection, it should be easy to describe how any engineer meets the three to five expectations of their level.

Prior to release, we got buy-in from every organizational leader in engineering that these definitions aligned with the reality of the expectations of engineers in their org. Since release we’ve aligned our promotion process to the content in the ladder. We require managers to outline how a candidate has met the expectations over the requisite period stated in the track record for their new level, and qualify examples of how they demonstrate the suggested level of mastery for competencies.

Recognize a variety of valid career paths

We ask managers to utilize the competencies document with their reports’ specific roles in mind when talking about career progression. Individual examples within the competency matrix may feel more or less applicable to individual roles, such as a Product Engineer or a Security Engineer, and this adaptability allows per-discipline growth while still aligning with the behaviors and outcomes we agree define a level of mastery. A small set of example skills is provided for each competency category that can help to better contextualize the application of the competencies in various domains. Additionally, we intentionally do not detail any competencies for which success is reliant on your team or organization.

Allowing managers to embrace the flexibility inherent in the competency matrix and its level of mastery system has allowed us to universally recognize engineer growth as it comes in various forms, building teams that embrace differences and value success in all its shapes. Managers can grow more diverse teams, for instance, by being able to recognize engineering leaders who are skilled domain experts, driving forward technical initiatives, and other engineering leaders who are skilled communicators, doing the glue work and keeping the team aligned on solving the right problems. We recognize that leadership takes many forms, and that is reflected in our competency matrix.

Limit room for bias in how we recognize success

The career ladder is only a piece of how we can mitigate potential bias as an organization. There are checks and balances built into other parts of Etsy’s human resources processes and career development programs, but since a career ladder plays such a key role in shaping the other processes, we approached this tenet very deliberately.

The competencies are not personality based, as we worked to remove anything that could be based on subjective perception of qualities or behaviors, such as “being friendly.” All content is non-idiomatic, in an effort to reduce differences in how individuals will absorb or comprehend the content. We also ensured that the language was consistent between levels by defining categories for each expectation. For instance, defining the expected complexity of the problems engineers solve per level allowed us to make sure we weren’t introducing any leaps in responsibility between levels that couldn’t be tied back to growth in the previous level. 

We also explicitly avoided any language that reads as quantifiable (e.g. “you’ve spoken at two or more conferences”) as opportunities to achieve a specific quantity of anything can be severely limited by your role, team, or personal situation, and can lead to career advice that doesn’t get at the real intent behind the competency. Additionally, evaluation of an individual against the ladder, for instance as part of a promotion, is not summarized in numbers. There is no score calculation or graphing an individual on a chart, nor is there an explicit number of years in role or projects completed as an expectation. While reducing subjectivity is key to mitigating potential bias, rigid numerical guidelines such as these can actually work against our other tenets by not allowing sufficient flexibility given an individual’s role.

Most importantly, the ladder was shaped directly through feedback from Etsy engineers, who could draw on direct personal experience of how their individual situations had helped or hindered their careers.

We’re really passionate about supporting ongoing engineer career growth at Etsy, and doing it in a way that truly supports our mission. We believe there’s a path to Principal Engineer for every intern and that this ladder goes a long way in making that path clear and actionable. We hope this ladder can serve as an example, in addition to those we took guidance from, to help guide the careers of engineers everywhere.

If you’re interested in growing your career with us, we’d love to talk. Just click here to learn more.


Apotheosis: A GCP Privilege Escalation Tool

Posted by on September 25, 2019 / No Responses

The Principle of Least Privilege

One of the most fundamental principles of information security is the principle of least privilege. This principle states that users should only be given the minimal permissions necessary to do their job. A corollary of the principle of least privilege is that users should only have those privileges while they are actively using them. For especially sensitive actions, users should be able to elevate their privileges within established policies, take sensitive actions, and then return their privilege level to normal to resume normal usage patterns. This is sometimes called privilege bracketing when applied to software, but it’s also useful for human users.

Following this principle reduces the chance of accidental destructive actions due to typos or misunderstandings. It may also provide some protection in case the user’s credentials are stolen, or if the user is tricked into running malicious code. Furthermore, it can be used as a notice to perform additional logging or monitoring of user actions.

In Unix this takes the form of the su command, which allows authorized users to elevate their privileges, take some sensitive actions, and then reduce their permissions. The sudo command is an even more fine-grained approach with the same purpose, as it will elevate privileges for a single command. 

Some cloud providers have features that allow for temporary escalation of privileges. Authorized users can take actions with a role other than the one which is normally assigned to them. The credentials used to assume a role are temporary, so they will expire after a specified amount of time. However, we did not find a built-in solution to achieve the same functionality in Google Cloud Platform (GCP).

Enter Apotheosis

Apotheosis is a tool that is meant to address the issues above. The word apotheosis means the elevation of someone to divine status. It’s possible, and convenient, to give users permanent “godlike” permissions, but this is a violation of the principle of least privilege. This tool will allow us to “apotheosize” users, and then return them to a “mortal” level of privilege when their job duties no longer require additional privileges.

Users or groups can be given “actual permissions” and “eligible permissions”. For example, a user who currently has the owner role may instead be given only the viewer role, and we will call that their “actual permissions”. Then we can give them “eligible permissions” of owner, which will come in the form of the service account token creator role on a service account with the editor or organization admin role.

For this user to elevate their privileges, the Apotheosis command line program will use their GCP credentials to call the REST API to create a short-lived service account token. Then, using that token, Apotheosis will make another REST API call which grants the requested permissions to the user. Alternatively, the permissions may be granted to a specified third party, allowing the Apotheosis user to leverage their eligible permissions to grant actual permissions to another entity. The program will wait for a specified amount of time, remove the requested permissions, and then delete the short-lived service account token.
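At a high level, that flow might be sketched as follows. The two REST endpoints are GCP’s public IAM Credentials and Resource Manager APIs; the helper structure, names, and (absent) error handling are illustrative, not the Apotheosis code itself:

```typescript
// Sketch of an elevate-then-revert flow using GCP REST APIs (Node 18+ global fetch).
const CRM = 'https://cloudresourcemanager.googleapis.com/v1';

async function gcpPost(url: string, token: string, body: unknown = {}): Promise<any> {
  const response = await fetch(url, {
    method: 'POST',
    headers: { Authorization: `Bearer ${token}`, 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  });
  return response.json();
}

async function elevate(
  userToken: string,  // the requesting user's own credentials
  saEmail: string,    // privileged service account the user is "eligible" for
  projectId: string,
  member: string,     // e.g. "user:someone@example.com"
  role: string,       // e.g. "roles/editor"
  ttlSeconds: number,
): Promise<void> {
  // 1. Mint a short-lived token for the privileged service account. Give it a
  //    little extra lifetime so it is still valid when we revert the grant.
  const { accessToken } = await gcpPost(
    `https://iamcredentials.googleapis.com/v1/projects/-/serviceAccounts/${saEmail}:generateAccessToken`,
    userToken,
    { scope: ['https://www.googleapis.com/auth/cloud-platform'], lifetime: `${ttlSeconds + 300}s` },
  );

  // 2. Grant the requested role using the service account's token.
  const policy = await gcpPost(`${CRM}/projects/${projectId}:getIamPolicy`, accessToken);
  policy.bindings = [...(policy.bindings ?? []), { role, members: [member] }];
  await gcpPost(`${CRM}/projects/${projectId}:setIamPolicy`, accessToken, { policy });

  // 3. Wait out the elevation window, then remove the grant again.
  await new Promise((resolve) => setTimeout(resolve, ttlSeconds * 1000));
  const current = await gcpPost(`${CRM}/projects/${projectId}:getIamPolicy`, accessToken);
  current.bindings = (current.bindings ?? [])
    .map((b: { role: string; members: string[] }) =>
      b.role === role ? { ...b, members: b.members.filter((m: string) => m !== member) } : b)
    .filter((b: { role: string; members: string[] }) => b.members.length > 0);
  await gcpPost(`${CRM}/projects/${projectId}:setIamPolicy`, accessToken, { policy: current });
}
```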

This process has the following advantages:

Future Additions

Some additional features which may be added to Apotheosis are contingent on the launch of other features, such as conditional IAM. Conditional IAM will allow the use of temporal restrictions on IAM grants, which will make Apotheosis more reliable. With conditional IAM, if Apotheosis is interrupted and does not revoke the granted permissions, they will expire anyway.

The ability to allow restricted permissions granting will be a useful IAM feature as well. Right now a user or service account can be given a role like editor or organization admin, and then can grant any other role in existence. But if it were possible to allow granting a predefined list of roles, that would make Apotheosis useful for a larger set of users. As it is now, Apotheosis is useful for users who have the highest level of eligible privilege, since their access to the Apotheosis service account gives them all the privileges of that service account. That is, the scope of those privileges can be limited to a particular project, folder, or organization, but cannot be restricted to a limited set of actions. At the moment that service account must have one of the few permissions which grant the ability to assign any role to any user. 

Requiring two-factor authentication when using the short-lived service account token feature on a particular service account would be another useful feature. This would require an Apotheosis user to re-authenticate with another factor when escalating privileges.

Open Source

Apotheosis is open source and can be found on GitHub.


Code as Craft: Understand the role of Style in e-commerce shopping

Posted by and on August 2, 2019 / No Responses

Aesthetic style is key to many purchasing decisions. When considering an item for purchase, buyers need to be aligned not only with the functional aspects of an item (e.g. description, category, ratings) but also with its aesthetic aspects (e.g. modern, classical, retro). Style matters at Etsy, where we have more than 60 million items, hundreds of thousands of which differ by style and aesthetic. At Etsy, we strive to understand the style preferences of our buyers in order to surface content that best fits their tastes.

Our chosen approach to encoding the aesthetic aspects of an item is to label the item with one of a discrete set of “styles”, of which “rustic”, “farmhouse”, and “boho” are examples. Manually labeling millions of listings with a style class is not feasible, especially in a marketplace that is ever changing, so we wanted to implement a machine learning model that best predicts and captures listings’ styles. Furthermore, in order to serve style-inspired listings to our users, we leveraged the style predictor to develop a mechanism to forecast user style preferences.

Style Model Implementation

Merchandising experts identified style categories.

For this task, the style labels are drawn from a set of classes identified by our merchandising experts. Our style model is a machine learning model which, given a listing and its features (text and images), can output a style label. The style model was designed to output not only these discrete style labels but also a multidimensional vector representing the general style aspects of a listing. Unlike a discrete label (“naval”, “art-deco”, “inspirational”), which can only be one class, the style vector encodes how a listing can be represented by all of these style classes in varying proportions. While the discrete style labels can be used in predictive tasks to recommend items to users from particular style classes (say, filtering recommended listings to just “art-deco”), the style vector is meant to serve as a machine learning signal for our other recommendation models. For example, on a listing page on Etsy, we recommend similar items. This model can now surface items that are not only functionally the same (a “couch” for another “couch”) but potentially items of the same style (a “mid-century couch” for a “mid-century dining table”).

The first step in building our listing style prediction model was preparing a training data set. For this, we worked with Etsy’s in-house merchandising experts to identify a list of 43 style classes. We further leveraged search visit logs to construct a “ground truth” dataset of items using these style classes. For example, listings that get a click, add to cart or purchase event for the search query “boho” are assigned the “boho” class label. This gave us a large enough labeled dataset to train a style predictor model.

Style Deep Neural Network

Once we had a ground truth dataset, our task was to build a listing style predictor model that could classify any listing into one of 43 styles (it is actually 42 styles plus an ‘everything else’ catch-all). For this task, we used a two-layer neural network to combine the image and text features in a non-linear fashion. The image features are extracted from the primary image of a listing using a retrained ResNet model. The text features are the TF-IDF values computed on the titles and tags of the items. The image and text vectors are then concatenated and fed as input into the neural network model. This neural network model learns non-linear relationships between text and image features that best predict a listing’s style. The network was trained on a GPU machine on Google Cloud, and we experimented with the architecture and different learning parameters until we got the best validation / test accuracy.
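Schematically (the hidden-layer nonlinearity and layer sizes were tuned experimentally, so treat this as a reading of the description above rather than the exact architecture), the forward pass of such a network is:

\hat{y} = \operatorname{softmax}\left( W_2 \, \sigma\left( W_1 \, [\, x_{\text{image}} ;\, x_{\text{text}} \,] + b_1 \right) + b_2 \right)

where [x_image ; x_text] is the concatenation of the ResNet image vector and the TF-IDF text vector, σ is the hidden-layer nonlinearity, and ŷ is the predicted distribution over the 43 style classes.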


By explicitly taking style into account, the nearest neighbors are more style-aligned

User Style

As described above, the style model helps us extract low-dimension embedding vectors that capture this stylistic information for a listing, using the penultimate layer of the neural network. We computed the style embedding vector using the style model for all the listings in Etsy’s corpus.

Given these listing style embeddings, we wanted to understand users’ long-term style preferences and represent them as a weighted average of the 42 articulated style labels. For every user, subject to their privacy preferences, we first gathered the history of “purchased”, “favorited”, “clicked” and “add to cart” listings over the past three months. From all the listings that a user interacted with, we combined the corresponding style vectors (by averaging them) to come up with a final style representation for each user.
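A stripped-down sketch of that aggregation, using uniform averaging (a weighted variant would weight by interaction type or recency):

```typescript
// Average the style vectors of a user's interacted listings into one user
// style embedding. Each input vector is a listing's embedding from the
// model's penultimate layer.
function userStyleEmbedding(listingStyleVectors: number[][]): number[] {
  if (listingStyleVectors.length === 0) return [];
  const dims = listingStyleVectors[0].length;
  const sum = new Array<number>(dims).fill(0);
  for (const vector of listingStyleVectors) {
    for (let i = 0; i < dims; i++) sum[i] += vector[i];
  }
  return sum.map((total) => total / listingStyleVectors.length);
}
```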

Building Style-aware User Recommendations

There are different recommendation modules on Etsy, some of which are personalized for each user. We wanted to leverage user style embeddings in order to provide more personalized recommendations to our users. For recommendation modules, we have a two-stage system: we first generate a candidate set, which is a probable set of listings that are most relevant to a user. Then, we apply a personalized ranker to obtain a final personalized list of recommendations.  Recommendations may be provided at varying levels of personalization to a user based on a number of factors, including their privacy settings.

In this very first iteration of user-style-aware recommendations, we apply user style understanding to generate a candidate set based on user style embeddings and their latest interacted taxonomies. This candidate set is used for the Our Picks For You module on the homepage. The idea is to combine an understanding of a user’s long-term style preferences with their recent interest in certain taxonomies.

This work can be broken down into three steps:

Given the user style embeddings, we take the top 3 styles with the highest probability as the “predicted user style”. The latest taxonomies are useful because they indicate users’ recent interests and shopping missions.

Given a taxonomy, we sort all the listings in that taxonomy by their style prediction scores for the different classes, high to low. We take the top 100 listings out of these.

“Minimal” listings in “Home & Living”

“Floral” listings in “Home & Living”

Taxonomy-style validation checks whether a style makes sense for a certain taxonomy, e.g. Hygge is not a valid style for jewelry.

These become the style-based recommendations for a user.

1-4: boho + bags_and_purses.backpacks
5-7: boho + weddings.clothing
8,13,16: minimal + bags_and_purses.backpacks

Style Analysis 

We were extremely interested in using our style model to answer questions about users’ sense of style. Our questions ranged from “How are style and taxonomy related? Do they have a lot in common?” and “Do users care about style while buying items?” to “How do style trends change across the year?”. Our style model enables us to answer at least some of these questions and helps us better understand our users. To answer them and dig further, we leveraged our style model and the generated embeddings to perform an analysis of transaction data.

Next, we looked at the seasonality effect behind shopping for different styles on Etsy. We began by looking at unit sales and purchase rates of different styles across the year. We observed that most of our styles are definitely influenced by seasonality. For example, the “Romantic” style peaks in February because of Valentine’s Day, and the “Inspirational” style peaks during graduation season. We ran a statistical stationarity test on the unit sales time series of each style and found that the majority of the styles were non-stationary. This signifies that most styles show different shopping trends throughout the year and don’t have constant unit sales. This provided further evidence that users’ tastes follow different trends across the year.

Using the style embeddings to study user purchase patterns not only provided strong evidence that users care about style, but also inspired us to further incorporate style into our machine learning products in the future.

Etsy is a marketplace for millions of unique and creative goods. Thus, our mission as machine learning practitioners is to build pathways that connect the curiosity of our buyers with the creativity of our sellers. Understanding both listing and user styles is another one of our novel building blocks to achieve this goal.

For further details into our work you can read our paper published in KDD 2019.

Authors: Aakash Sabharwal, Jingyuan (Julia) Zhou & Diane Hu

