An Introduction to Structured Data at Etsy

Posted by on July 31, 2019

Etsy has an uncontrolled inventory; unlike many marketplaces, we offer an unlimited array of one-of-a-kind items, rather than a defined set of uniform goods. Etsy sellers are free to list any policy-compliant item that falls within the three broad buckets of craft supplies, handmade, and vintage. Our lack of standardization, of course, is what makes Etsy special, but it also makes learning about our inventory challenging. That’s where structured data comes in.

Structured vs. Unstructured Data

Structured data is data that exists in a defined relationship to other data. The relation can be articulated through a tree, graph, hierarchy, or other standardized schema and vocabulary. Conversely, unstructured data does not exist within a standardized framework and has no formal relationship to other data in a given space.

For the purposes of structured data at Etsy, the data are the product listings, and they are structured according to our conception of where in the marketplace they belong. That understanding is expressed through the taxonomy.

Etsy’s taxonomy is a collection of hierarchies comprising 6,000+ categories (e.g., Boots), 400+ attributes (e.g., Women’s shoe size), 3,500+ values (e.g., 7.5), and 90+ scales (e.g., US/Canada). These hierarchies form the foundation of 3,500+ filters and countless category-specific shopping experiences on the site. The taxonomy imposes a more controlled view of the uncontrolled inventory, one that engineers can use to help buyers find what they are looking for.

Building the Taxonomy

The Etsy taxonomy is represented in JSON files, with each category’s JSON containing information about its place in the hierarchy and the attributes, values, and scales for items in that category. Together, these determine what questions will be asked of the seller for listings in that category (Figure A, Box 1), and what filters will be shown to buyers for searches in that category (Figure A, Box 2).

Figure A 
A snippet of the JSON representation of the Jewelry > Rings > Bands category
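
The figure itself is not reproduced here, so the following is only a minimal sketch of what such a category record might look like, written as a Python dictionary that serializes to JSON. Every field name (`path`, `seller_attributes`, `buyer_filters`) and every sample value is an assumption made for illustration, not Etsy's actual schema.

```python
import json

# Hypothetical category record; field names and values are illustrative only.
bands_category = {
    "name": "Bands",
    "path": ["Jewelry", "Rings", "Bands"],  # place in the hierarchy
    # Questions asked of the seller (compare Figure A, Box 1).
    "seller_attributes": [
        {
            "attribute": "Occasion",
            "required": False,
            "values": ["Anniversary", "Engagement", "Wedding"],
        },
    ],
    # Filters shown to buyers (compare Figure A, Box 2).
    "buyer_filters": [
        {"attribute": "Occasion", "display_name": "Occasion"},
    ],
}


def seller_questions(category):
    """Names of the attributes a seller would be asked about."""
    return [a["attribute"] for a in category["seller_attributes"]]


def buyer_filter_names(category):
    """Display names of the filters a buyer would see."""
    return [f["display_name"] for f in category["buyer_filters"]]


if __name__ == "__main__":
    # Serialize the record the way it might be stored on disk.
    print(json.dumps(bands_category, indent=2))
```

Keeping the seller-facing questions and the buyer-facing filters in the same record is what lets a single category definition drive both sides of the marketplace.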

The taxonomists at Etsy alter the taxonomy hierarchies using an internal tool. This tool supports some behaviors specific to our taxonomy, such as inheritance: if a category has a particular filter, then all of its subcategories inherit that filter as well.

Figure B
Sections of the Jewelry > Rings > Bands category as it appears in our internal taxonomy tool 
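
Filter inheritance can be sketched as a walk up the category tree. The structure below (a `parent` pointer and a `filters` list per category) and the filter names are hypothetical; the internal tool's real data model is not described in this post.

```python
# Hypothetical taxonomy: each category names its parent and the filters
# defined directly on it. Names and filters are illustrative only.
TAXONOMY = {
    "Jewelry": {"parent": None, "filters": ["Material"]},
    "Rings": {"parent": "Jewelry", "filters": ["Ring size"]},
    "Bands": {"parent": "Rings", "filters": ["Band width"]},
}


def effective_filters(category, taxonomy=TAXONOMY):
    """Return a category's own filters plus those inherited from ancestors.

    Ancestors' filters come first, and duplicates are dropped.
    """
    # Collect the chain from the category up to the root.
    chain = []
    node = category
    while node is not None:
        chain.append(node)
        node = taxonomy[node]["parent"]

    # Walk from the root down so inherited filters are listed first.
    filters = []
    for name in reversed(chain):
        for f in taxonomy[name]["filters"]:
            if f not in filters:
                filters.append(f)
    return filters
```

Under these assumptions, `effective_filters("Bands")` returns `["Material", "Ring size", "Band width"]`: the filter defined on Jewelry reaches Bands without being repeated in each subcategory.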

Gathering Structured Data: The Seller Perspective

One of the primary ways that we currently collect structured data is through the listing creation process, since that is our best opportunity to learn about each listing from the person who is most familiar with it: the seller!

Sellers create new listings using the Shop Manager. The first step in the listing process is to choose a category for the listing from within the taxonomy. Using auto-complete suggestions, sellers can select the most appropriate category from all of the categories available. 

Figure C 
Category suggestions for “ring”

Once a category has been selected, optional attribute fields appear in the Shop Manager. These fields are also driven by the taxonomy JSON: they correspond to the category selected by the seller (see Figure A, Box 1). This behavior ensures that we collect only relevant attribute data for each category, and it simplifies the process for sellers. Promoting this use of standardized data also reduces the need for overloaded listing titles and descriptions by giving sellers a designated space to tell buyers about the details of their products. Data collected during the listing creation process appears on the listing page, highlighting for the buyer some of the key, standardized details of the listing.

Figure D
Some of the attribute fields that appear for listings in Jewelry > Rings > Bands (see Figure A, Box 1 for the JSON that powers the Occasion attribute)
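
One way to picture category-specific attribute fields is as a lookup from the selected category to its attributes and their allowed values. The sketch below assumes a hypothetical `CATEGORY_ATTRIBUTES` mapping with made-up values; it is not the Shop Manager's actual implementation.

```python
# Hypothetical attribute definitions keyed by category path.
CATEGORY_ATTRIBUTES = {
    "Jewelry > Rings > Bands": {
        "Occasion": {"Anniversary", "Engagement", "Wedding"},
        "Material": {"Gold", "Silver", "Wood"},
    },
    "Shoes > Boots": {
        "Women's shoe size": {"7", "7.5", "8"},
    },
}


def attribute_fields(category):
    """Names of the optional attribute fields to show for a category."""
    return sorted(CATEGORY_ATTRIBUTES.get(category, {}))


def validate_attributes(category, submitted):
    """Split a seller's submission into accepted and rejected entries.

    An entry is accepted only if the attribute belongs to the category
    and the value is one of that attribute's standardized values.
    """
    allowed = CATEGORY_ATTRIBUTES.get(category, {})
    accepted, rejected = {}, {}
    for name, value in submitted.items():
        if name in allowed and value in allowed[name]:
            accepted[name] = value
        else:
            rejected[name] = value
    return accepted, rejected
```

In this sketch a shoe size submitted for a ring is rejected, because that attribute is not defined for the Bands category. This mirrors the idea of collecting only relevant attribute data per category.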

Making Use of Structured Data: The Buyer Perspective

Much of the buyer experience is a product of the structured data that has been provided by our sellers. For instance, a given Etsy search yields category-specific filters on the left-hand navigation of the search results page. 

Figure E
Some of the filters that appear upon searching for “Rings”

Those filters should look familiar (see Figure D)! They are functions of the taxonomy. The search query is classified to a taxonomy category through a big data job, and the filters affiliated with that category are displayed to the user (see Figure F below). These filters allow the buyer to narrow down their search more easily and make sense of the listings displayed.

Figure F
The code that displays category-specific filters upon checking that the classified category has buyer filters defined in its JSON (see Figure A, Box 2 for a sample filter JSON)
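
The code in Figure F is not reproduced here. As a rough sketch of the check the caption describes, the example below looks up the category a query was classified to and returns filters only if that category's JSON defines any. The `QUERY_TO_CATEGORY` table stands in for the output of the classification job, and all names and values are assumptions.

```python
# Hypothetical per-category JSON, reduced to the part that matters here.
CATEGORY_JSON = {
    "Jewelry > Rings": {"buyer_filters": ["Ring size", "Occasion"]},
    "Art > Prints": {},  # no buyer filters defined for this category
}

# Stand-in for the output of the query classification job.
QUERY_TO_CATEGORY = {
    "rings": "Jewelry > Rings",
    "prints": "Art > Prints",
}


def filters_for_query(query):
    """Return the category-specific filters to display for a search query.

    Returns an empty list when the query has no classified category or
    the category defines no buyer filters.
    """
    category = QUERY_TO_CATEGORY.get(query.strip().lower())
    if category is None:
        return []
    return list(CATEGORY_JSON.get(category, {}).get("buyer_filters", []))
```

Returning an empty list in both fallback cases lets the calling page simply render whatever comes back, with no category-specific filters shown when none apply.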

Structuring Unstructured Data

There are many ways of deriving structured data that go beyond seller input. First, we can convert unstructured data that has already been provided, like listing titles or descriptions, into structured data. Second, we can use machine learning to learn about our listings and reduce our dependence on seller input. We can, for example, learn about the color of a listing through the image provided; we can also infer more nuanced data about a listing, like its seasonality or occasion.
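
As a deliberately simple illustration of turning a listing title into structured data, the sketch below matches title words against a small, hypothetical vocabulary of color values. This keyword lookup is a stand-in chosen for brevity; it is not the machine learning approach mentioned above, and the vocabulary is invented for the example.

```python
import re

# Hypothetical controlled vocabulary for a color attribute.
KNOWN_COLORS = {"red", "blue", "green", "gold", "silver"}


def extract_colors(title):
    """Return known color values mentioned in a title, in order of first use."""
    words = re.findall(r"[a-z]+", title.lower())
    found = []
    for word in words:
        if word in KNOWN_COLORS and word not in found:
            found.append(word)
    return found
```

For a title such as "Handmade Silver Band Ring, blue stone", this returns `["silver", "blue"]`. A real system would also need to handle synonyms, misspellings, and words like "silver" that can name a material as well as a color.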

We can continue to measure the relevance of our structured data through metrics like the depth of our inventory categorization within our taxonomy hierarchies and the completeness of our inventory’s attribution.
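
The two metrics named above could be computed along the following lines. The definitions used here (average depth of a listing's category path, and the share of applicable attribute fields that have a value) are assumptions for the sake of the sketch, as are the sample listings.

```python
# Hypothetical listings: a category path plus the attributes that apply to
# the category, with None marking an attribute the seller left blank.
SAMPLE_LISTINGS = [
    {
        "category_path": ["Jewelry", "Rings", "Bands"],
        "attributes": {"Occasion": "Wedding", "Material": None},
    },
    {
        "category_path": ["Jewelry"],
        "attributes": {"Occasion": None, "Material": None},
    },
]


def average_category_depth(listings):
    """Mean number of levels in each listing's category path."""
    if not listings:
        return 0.0
    return sum(len(item["category_path"]) for item in listings) / len(listings)


def attribute_completeness(listings):
    """Fraction of applicable attribute fields that have a value."""
    total = sum(len(item["attributes"]) for item in listings)
    if total == 0:
        return 0.0
    filled = sum(
        1
        for item in listings
        for value in item["attributes"].values()
        if value is not None
    )
    return filled / total
```

On the sample data the average depth is 2.0 and the completeness is 0.25, since only one of the four applicable attribute fields has been filled in.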

All of these efforts allow us to continue to build deeper category-specific shopping experiences powered by structured data. By investing in better understanding our inventory, we create deeper connections between our sellers and our buyers.

Category: engineering

1 Comment

Always hated JSON data, why can’t we all use CSV files for importing data? In the previous example I analyzed, this was probably possible, no need to complicate things with JSON. Etsy is completely different, every product may or may not have values in hundreds of data categories that other products just don’t use. This example is textbook for why some applications need JSON and can’t use CSV! I learned something today!