How Etsy Uses Thermodynamics to Help You Search for “Geeky”

Posted on August 31, 2015

Etsy shoppers love the large and diverse selection of our marketplace. But, for those who don’t know exactly what they’re looking for, the sheer number and variety of items available can be more frustrating than delightful. In July, we introduced a new user interface which surfaces the top categories for a search request to help users explore the results for queries like “gift.” Searchers who issue broad queries like this often don’t have a specific type of item in mind, and are especially likely to finish their visit empty-handed. Our team lead, Gio, described our motivations and process in an (excellent) blog post last month, which gives more background on the project. In this post, I’ll focus on how we developed and iterated on our heuristic for classifying queries as “broad.”

Our navigable interface, shown for a query for “geeky gift”

Quantifying “broadness”

When I describe what I’m working on to people outside the team, they often jump in with a guess about the machine learning techniques we must be using to determine which queries are broad. While we could have used complex, offline signals like click or purchase behavior to learn which queries should trigger the category interface, we actually base the decision on a single calculation, evaluated entirely at runtime, using very basic statistics about the search result set.

There have been several advantages to sticking with a simpler metric. By avoiding query-specific behavioral signals, our approach works for all languages and long-tail queries out of the gate. It’s performant and relatively easy to debug. It’s also (knock on wood) stable and easy to maintain, with very few external dependencies or moving parts. I’ll explain how we do it, and arguably justify the title of this post in the process.

Let’s take “geeky” as an example of a broad query, one that tells us very little about what type of item the user is looking for. Jewelry is the top category for “geeky,” but there are many items in all of the top-level categories.

Top Categories for "Geeky" by Result Count

Compare this to the distribution of results for “geeky mug,” which is predictably concentrated in the Home & Living category.

Top Categories for "Geeky Mug" by Result Count

In plainspeak, the calculation we use measures how spread out across the marketplace the items returned for the query are. The distribution of results for “geeky” suggests that the user might benefit from seeing the top categories, which demonstrate the breadth of geeky paraphernalia available on the site, from a periodic table-patterned bow tie to a “cutie pi” mug. The distribution for “geeky mug” is dominated by one category, and shouldn’t trigger the category interface.

The categories shown for a query for “geeky”

Doing the math

In order to quantify how “spread out” items are, we start by taking the number of results returned for the query in each of the top-level categories and deriving the probability that an item is in each category. Since 20% of the items returned for “geeky” are in the Jewelry category and 15% are in the Accessories category, the probability values for Jewelry and Accessories would be 0.2 and 0.15, respectively. We use these values as the inputs to the Shannon entropy formula:

$$H = -\sum_{i} p_i \log p_i$$

This formula is a measure of the disorder of a probability distribution. It’s essentially equivalent to the formula used to calculate the thermodynamic entropy of a physical system, which models a similar concept.

For our purposes, let $r_t$ be the total number of results and $r_i$ be the number of results for category $i$. Then the probability value in the above equation is $r_i/r_t$, and the entropy of the distribution of a search result set across its categories can be expressed as:

$$H = -\sum_{i} \frac{r_i}{r_t} \log\left(\frac{r_i}{r_t}\right)$$
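To make the calculation concrete, here’s a minimal Python sketch of the result-count entropy (illustrative only, not Etsy’s production code; the example counts are invented):

```python
import math

def result_count_entropy(category_counts):
    """Shannon entropy of a result set's spread across categories.

    category_counts maps each category to the number of results the
    query returned in it.
    """
    total = sum(category_counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in category_counts.values():
        if count > 0:
            p = count / total  # probability a result falls in this category
            entropy -= p * math.log(p)
    return entropy

# A spread-out result set scores higher than a concentrated one
# (counts invented for illustration):
result_count_entropy({"Jewelry": 200, "Accessories": 150, "Home & Living": 140})  # ≈ 1.09
result_count_entropy({"Home & Living": 450, "Jewelry": 30, "Accessories": 20})    # ≈ 0.39
```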

In this way, we can determine when to show categories without using any offline signals. This is not to say that we didn’t use data in our development process at all. To determine the entropy threshold above which we should show categories, we looked at the entropies for a large sample of queries and made a fairly liberal judgment call on a good dividing line (i.e. a low threshold). Once we had results from an A/B experiment which showed the new interface to real users, we looked at how it affected user behavior for queries with lower entropy levels, and refined the cut-off based on the numbers. But this was a one-off analysis; we expect the threshold to remain static over time, since the distribution of our marketplace across categories changes slowly.

Taking it to the next level

A broad query may not necessarily have high entropy at the top level of our taxonomy. Results for “geeky jewelry” are unsurprisingly concentrated in our Jewelry category, but there are still many types of items that are returned. We’d like to guide users into more specific subcategories, like Earrings and Necklaces, so we introduced a secondary round of entropy calculations for queries that don’t qualify as broad at the top level. It works like this: if the result set does not have sufficiently high entropy to trigger the categories at the top level, we determine the entropy within the most populous category (i.e. the entropy of its subcategories) and show those subcategories if that value exceeds our entropy cut-off.
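Here’s a sketch of that two-level decision, reusing result_count_entropy from above (ENTROPY_THRESHOLD and the input shapes are hypothetical, since the post doesn’t publish the real cut-off):

```python
ENTROPY_THRESHOLD = 1.0  # hypothetical value; the real cut-off was tuned empirically

def groups_to_show(top_counts, sub_counts_by_category):
    """Decide which category groups, if any, to surface for a query.

    top_counts: top-level category -> result count for the query.
    sub_counts_by_category: top-level category -> {subcategory: count}.
    """
    if result_count_entropy(top_counts) >= ENTROPY_THRESHOLD:
        return top_counts  # broad query: show the top-level categories

    # Not broad overall: check the spread inside the most populous category.
    biggest = max(top_counts, key=top_counts.get)
    sub_counts = sub_counts_by_category.get(biggest, {})
    if result_count_entropy(sub_counts) >= ENTROPY_THRESHOLD:
        return sub_counts  # broad within one category: show its subcategories

    return None  # specific query: no category interface
```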

Top Subcategories for "Jewelry" by Count

The graph above demonstrates the level of spread of results for the query “jewelry” across subcategories of the top-level Jewelry category. This method allows us to dive into top-level categories in cases like this, while sticking to a simple runtime decision based on category counts.

Showing subcategories for a query for “geek jewelry”

Showing subcategories for a query for “geek jewelry”

Iterating on entropy

While we were testing this approach, we noticed that a query like “shoes,” which we hoped would be high entropy within the Shoes category, was actually high entropy at the top level.

Top-level categories for the “shoes” query… doesn’t seem quite right

Items returned for “shoes” are apparently spread widely enough across the whole marketplace to trigger the top-level groups, even though an unusually high number of them are in the Shoes category.

Top Categories for "Shoes" by Result Count

More generally, items in our marketplace tend to be concentrated in the most popular categories. A result set is likely to have many more Accessories items than Shoes items, because the former category is an order of magnitude larger than the latter. We want to be able to compensate for this uneven global distribution of items when we calculate the probabilities that we use in our entropy calculation.

By dividing the number of items in each category that are returned for the active search by the total number of items in that category, we get a number we can think of as the affinity between the category and the search query. Although fewer than 50% of the results that come back for a query for “shoes” are in the Shoes category, 100% of items in the Shoes category are returned for a query for “shoes,” so its category affinity is much higher than its raw share of the result set.

Top Categories for "Shoes" by Affinity

Normalizing the affinity values so they sum to one, we use these measurements as the inputs to the same Shannon entropy formula that we used in the first iteration. The normalization step ensures that we can compare entropy values across search result sets of different sizes. Letting $r_i$ represent the number of items in category $i$ returned for the active search query, and $t_i$ the total number of items in that category, the affinity value for category $i$ is simply $a_i = r_i / t_i$. Taking $s$ as the sum of the affinity values over all categories, the affinity-based entropy is:

$$H = -\sum_{i} \frac{a_i}{s} \log\left(\frac{a_i}{s}\right)$$
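In the same illustrative Python, the affinity variant only changes the weights fed into the entropy calculation (category_sizes is an assumed input holding marketplace-wide item counts per category):

```python
import math

def affinity_entropy(result_counts, category_sizes):
    """Entropy over normalized category affinities.

    result_counts: category -> results for the query in that category.
    category_sizes: category -> total items in that category site-wide.
    """
    # Affinity: the fraction of an entire category that the query returns.
    affinities = [
        result_counts[c] / category_sizes[c]
        for c in result_counts
        if category_sizes.get(c)
    ]
    s = sum(affinities)
    if s == 0:
        return 0.0
    entropy = 0.0
    for a in affinities:
        if a > 0:
            p = a / s  # normalize so the affinities sum to one
            entropy -= p * math.log(p)
    return entropy
```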

From a Bayesian perspective, both the original result-count weights and the affinity weights estimate the probability that a listing is in a category given that it is returned for the search query. The difference is that the affinity formulation corresponds to a flat prior distribution over categories, whereas the original formulation corresponds to the observed category distribution of items in our marketplace. By controlling for the uneven distribution of items across categories on Etsy, affinity-based entropy fixed our “shoes” problem, and improved the quality of our system in general.
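To make that interpretation explicit (a short derivation, using the $r_i$ and $t_i$ defined above): by Bayes’ rule,

$$P(\text{category}_i \mid \text{match}) \;\propto\; P(\text{match} \mid \text{category}_i)\, P(\text{category}_i), \qquad P(\text{match} \mid \text{category}_i) = \frac{r_i}{t_i}.$$

Under the observed marketplace prior $P(\text{category}_i) \propto t_i$, the weights are proportional to $(r_i/t_i)\,t_i = r_i$, the raw result counts; under a flat prior, they are proportional to $r_i/t_i = a_i$, the affinities.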

Refining by recipient on a query for “geeky shoes”

Keeping it simple

Although our iterations on entropy have introduced more complexity than we had at the outset, we still reap the benefits of avoiding opaque offline computations and dependencies on external infrastructure. Big data signals can be incredibly powerful, but they introduce architectural costs that, it turns out, aren’t necessary for a functional broad-query classifier.

On the user-facing level, making Etsy easier to explore is something I’ve wanted to work on since before I started working here many years ago. It’s very frustrating for searchers to navigate through the millions of items of all types that we return for many popular queries. If you’ll indulge my thermodynamics metaphor once more, by helping to guide users out of high-entropy result sets, we’re battling the heat death of Etsy search—and that’s literally pretty cool.

 

Couldn’t stomach that “heat death” joke? Leave a comment or let me know on Twitter.

Huge thanks due to Giovanni Fernandez-Kincade, Stan Rozenraukh, Jaime Delanghe and Rob Hall.

Category: data, engineering, search

7 Comments

Never knew so much mathematics was involved behind the scene!

I never realised thermodynamics could have such an application, this is brilliant! I love that you’re displaying 2nd-level categories rather than just going “oh, there’s not enough top-level categories to bother displaying the suggested categories”. I’m sure it will really help, because I think a lot of people would just dive straight in to looking through the mass of jumbled results rather than using the categories on the left to refine the results. I mean, I *know* those categories are there, yet half the time I forget to use them when I’m buying.

Cool article!

1 – I’ve seen other people use the Herfindahl index to test for concentration, which gives a nice number between 0 and 1. Have you considered using this?
2 – In the affinity based model, assume that there are two categories, X and Y. Category X has 1M items while category Y has 10 items. 1K items from category X are included and 9 items from category Y are included. It seems that the affinity-entropy (a = [0.001, 0.9]) would be low in this case. Is not showing categories desired?

    Thanks so much for reading! Super thoughtful questions.

    “1 – I’ve seen other people use the Herfindahl index to test for concentration, which gives a nice number between 0 and 1. Have you considered using this?”

    Huh, no, we didn’t look into the Herfindahl index. It does appear to measure a very similar quality. The advantage I can see of using entropy is that it has a clean statistical interpretation in the affinity context which, at first glance, I can’t map to the Herfindahl index. We’re in the process of building out tooling for iterating on the formula & thresholds offline, so I’m excited to try it out and see how it works in our context, because it may improve on our current model. Thanks for the tip!

    “2 – In the affinity based model, assume that there are two categories, X and Y. Category X has 1M items while category Y has 10 items. 1K items from category X are included and 9 items from category Y are included. It seems that the affinity-entropy (a = [0.001, 0.9]) would be low in this case. Is not showing categories desired?”

    Your example is a very interesting one… I see it as an extreme version of the “shoes” case. Because we can trust that no top-level category on our site is several orders of magnitude smaller than another, we’re able to use affinity pretty safely in that context. As we expand the experience and descend deeper into the taxonomy, though, it becomes more of a realistic concern… we may end up using a combo signal, or adding explicit safeguards against very small categories. It’s another reason that we need the offline tooling: to build confidence that extreme cases like that are vanishingly rare, and to tune a fix if they aren’t.

Hi Fiona, Etsy has cool shoes and it’s time they got a better display 🙂 I believe this is a useful improvement. Thanks for sharing. Inspiring 🙂

Wow, great to see a simple solution that doesn’t use offline signals. That seems to be everyone’s go-to solution, which only complicates and slows things down. Thanks for the great post!