boundary-layer : Declarative Airflow Workflows

Posted on November 14, 2018

When Etsy decided last year to migrate our operations to Google Cloud Platform (GCP), one of our primary motivations was to enable our machine learning teams with scalable resources and the latest big-data and ML technologies. Early in the cloud migration process, we convened a cross-functional team between the Data Engineering and Machine Learning Infrastructure groups in order to design and build a new data platform focused on this goal.

One of the first choices our team faced was how to coordinate and schedule jobs across a menagerie of new technologies. Apache Airflow (incubating) was the obvious choice due to its existing integrations with GCP, its customizability, and its strong open-source community; however, we faced a number of open questions that had to be addressed in order to give us confidence in Airflow as a long-term solution.

First, Etsy had well over 100 existing Hadoop workflows, all written for the Apache Oozie scheduler. How would we migrate these to Airflow? Furthermore, how would we maintain equivalent copies of Oozie and Airflow workflows in parallel during the development and validation phases of the migration, without requiring our data scientists to pause their development work?

Second, writing workflows in Airflow (expressed in python as directed acyclic graphs, or DAGs) is non-trivial, requiring new and specialized knowledge. How would we train our dozens of internal data platform users to write Airflow DAGs? How would we provide automated testing capabilities to ensure that DAGs are valid before pushing them to our Airflow instances? How would we ensure that common best-practices are used by all team members? And how would we maintain and update those DAGs as new practices are adopted and new features made available?

Today we are pleased to introduce boundary-layer, the tool that we conceived and built to address these challenges, and that we have released to open-source to share with the Airflow community.

Introduction: Declarative Workflows

Boundary-layer is a tool that enables data scientists and engineers to write Airflow workflows in a declarative fashion, as YAML files rather than as python. Boundary-layer validates workflows by checking that all of the operators are properly parameterized, all of the parameters have the proper names and types, there are no cyclic dependencies, etc. It then translates the workflows into DAGs in python, for native consumption by Airflow.

Here is an example of a very simple boundary-layer workflow:

name: my-dag-1

default_task_args:
  start_date: '2018-10-01'

operators:
- name: print-hello
  type: bash
  properties:
    bash_command: "echo hello"
- name: print-world
  type: bash
  upstream_dependencies:
  - print-hello
  properties:
    bash_command: "echo world"

Boundary-layer translates this into python as a DAG with 2 nodes, each consisting of a BashOperator configured with the provided properties, as well as some auto-inserted parameters:

# Auto-generated by boundary-layer

import os
from airflow import DAG

import datetime

from airflow.operators.bash_operator import BashOperator

DEFAULT_TASK_ARGS = {
        'start_date': '2018-10-01',
    }

dag = DAG(
        dag_id = 'my_dag_1',
        default_args = DEFAULT_TASK_ARGS,
    )

print_hello = BashOperator(
        dag = (dag),
        bash_command = 'echo hello',
        start_date = (datetime.datetime(2018, 10, 1, 0, 0)),
        task_id = 'print_hello',
    )


print_world = BashOperator(
        dag = (dag),
        bash_command = 'echo world',
        start_date = (datetime.datetime(2018, 10, 1, 0, 0)),
        task_id = 'print_world',
    )

print_world.set_upstream(print_hello)

Note that boundary-layer inserted all of the boilerplate of python class imports and basic DAG and operator configuration. Additionally, it validated parameter names and types according to schemas, and applied type conversions when applicable (in this case, it converted date strings to datetime objects).

Generators

Moving from python-based to configuration-based workflows naturally imposes a functionality penalty. One particularly valuable feature of python-based DAGs is the ability to construct them dynamically: for example, nodes can be added and customized by iterating over a list of values. We make extensive use of this functionality ourselves, so it was important to build a mechanism into boundary-layer to enable it.

Boundary-layer generators are the mechanism we designed for dynamic workflow construction. Generators are complete, distinct sub-workflows that take a single, flexibly-typed parameter as input. Each generator must prescribe a mechanism for generating a list of values: for example, lists of items can be retrieved from an API via an HTTP GET request. The python code written by boundary-layer will iterate over the list of generator parameter values and create one instance of the generator sub-workflow for each value. Below is an example of a workflow that incorporates a generator:

name: my-dag-2

default_task_args:
  start_date: '2018-10-01'

generators:
- name: retrieve-and-copy-items
  type: requests_json_generator
  target: sense-and-run
  properties:
    url: http://my-url.com/my/file/list.json
    list_json_key: items

operators:
- name: print-message
  type: bash
  upstream_dependencies:
  - retrieve-and-copy-items
  properties:
    bash_command: echo "all done"
---
name: sense-and-run

operators:
- name: sensor
  type: gcs_object_sensor
  properties:
    bucket: <<item['bucket']>>
    object: <<item['name']>>
- name: my-job
  type: dataproc_hadoop
  properties:
    cluster_name: my-cluster
    region: us-central1
    main_class: com.etsy.jobs.MyJob
    arguments:
    - <<item['name']>>

This workflow retrieves the content of the specified JSON file, extracts the items field from it, and then iterates over the objects in that list, creating one instance of all of the operators in the sense-and-run sub-graph per object.

Note the inclusion of several strings of the form  << ... >>.  These are boundary-layer verbatim strings, which allow us to insert inline snippets of python into the rendered DAG. The item value is the sub-workflow’s parameter, which is automatically supplied by boundary-layer to each instance of the sub-workflow.

Also note that generators can be used in dependency specifications, as indicated by the print-message operator’s upstream_dependencies block. Generators can even be set to depend on other generators, which boundary-layer will encode efficiently, without creating a combinatorially-exploding set of edges in the DAG.
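To make the mechanics concrete, here is a rough sketch of the kind of python a generator expands to. This is illustrative only, not boundary-layer's literal output, and it uses BashOperators as stand-ins for the gcs_object_sensor and dataproc_hadoop operators in the YAML above:

# Rough sketch of the code a generator expands to; illustrative only, not
# boundary-layer's literal output. BashOperators stand in here for the
# gcs_object_sensor and dataproc_hadoop operators used in the YAML example.
import datetime
import requests
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id='my_dag_2',
    default_args={'start_date': datetime.datetime(2018, 10, 1)},
)

# The generator retrieves its list of parameter values with an HTTP GET
items = requests.get('http://my-url.com/my/file/list.json').json()['items']

print_message = BashOperator(
    dag=dag, task_id='print_message', bash_command='echo "all done"')

for index, item in enumerate(items):
    # One copy of the sense-and-run sub-workflow is created per item; verbatim
    # strings such as <<item['name']>> become python expressions like item['name']
    sensor = BashOperator(
        dag=dag,
        task_id='sensor_{}'.format(index),
        bash_command='gsutil -q stat gs://{}/{}'.format(
            item['bucket'], item['name']),
    )
    my_job = BashOperator(
        dag=dag,
        task_id='my_job_{}'.format(index),
        bash_command='echo "run MyJob on {}"'.format(item['name']),
    )
    my_job.set_upstream(sensor)
    # the downstream print-message operator waits on the whole generator
    print_message.set_upstream(my_job)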

Advanced features

Under the hood, boundary-layer represents its workflows using the powerful networkx library, and this enables a variety of features that require making computational modifications to the graph, adding usability enhancements that go well beyond the core functionality of Airflow itself.

A few of the simpler features that modify the graph include before and after sections of the workflow, which allow us to specify a set of operators that should always be run upstream or downstream of the primary list of operators. For example, one of our most common patterns in workflow construction is to put various sensors in the before block, so that it is not necessary to specify and maintain explicit upstream dependencies between the sensors and the primary operators. Boundary-layer automatically attaches these sensors and adds the necessary dependency rules to make sure that no primary operators execute until all of the sensors have completed.

Another feature of boundary-layer is the ability to prune nodes out of workflows while maintaining all dependency relationships between the nodes that remain. This was especially useful during the migration of our Oozie workflows: it allowed us to isolate portions of those workflows to run in Airflow and gradually add more portions in stages, until the workflows were fully migrated, without ever having to maintain those partial workflows as separate entities.
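The pruning itself is a standard graph operation; a minimal networkx sketch of the idea (not boundary-layer's actual implementation) reconnects each removed node's upstream neighbors to its downstream neighbors before dropping it:

# Minimal sketch of pruning a node while preserving the dependency
# relationships between the remaining nodes (not boundary-layer's actual code)
import networkx as nx

def prune_node(graph, node):
    # reconnect every upstream neighbor to every downstream neighbor ...
    for upstream in list(graph.predecessors(node)):
        for downstream in list(graph.successors(node)):
            graph.add_edge(upstream, downstream)
    # ... then drop the node itself
    graph.remove_node(node)

workflow = nx.DiGraph([('sensor', 'job-1'), ('job-1', 'job-2')])
prune_node(workflow, 'job-1')
print(list(workflow.edges()))  # [('sensor', 'job-2')]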

One of the most useful advanced features of boundary-layer is its treatment of managed resources. We make extensive use of ephemeral, workflow-scoped Dataproc clusters on the Etsy data platform. These clusters are created by Airflow, shared by various jobs that Airflow schedules, and then deleted by Airflow once those jobs are complete. Airflow itself provides no first-class support for managed resources, which can be tricky to configure properly: we must make sure that the resources are not created before they are needed, and that they are deleted as soon as they are not needed anymore, in order to avoid accruing costs for idle clusters. Boundary-layer handles this automatically, computing the appropriate places in the DAG into which to splice the resource-create and resource-destroy operations. This makes it simple to add new jobs or remove old ones, without having to worry about keeping the cluster-create and cluster-destroy steps always installed in the proper locations in the workflow.

Below is an example of a boundary-layer workflow that uses Dataproc resources:

name: my-dag-3

default_task_args:
  start_date: '2018-10-01'
  project_id: my-gcp-project

resources:
- name: dataproc-cluster
  type: dataproc_cluster
  properties:
    cluster_name: my-cluster
    region: us-east1
    num_workers: 128

before:
- name: sensor
  type: gcs_object_sensor
  properties:
    bucket: my-bucket
    object: my-object

operators:
- name: my-job-1
  type: dataproc_hadoop
  requires_resources:
  - dataproc-cluster
  properties:
    main_class: com.etsy.foo.FooJob
- name: my-job-2
  type: dataproc_hadoop
  requires_resources:
  - dataproc-cluster
  upstream_dependencies:
  - my-job-1
  properties:
    main_class: com.etsy.bar.BarJob
- name: copy-data
  type: gcs_to_gcs
  upstream_dependencies:
  - my-job-2
  properties:
    source_bucket: my-bucket
    source_object: my-object
    dest_bucket: your-bucket

In this DAG, the gcs_object_sensor runs first, then the cluster is created, then the two hadoop jobs run in sequence, and then the job’s output is copied while the cluster is simultaneously deleted.

Of course, this is just a simple example; we have some complex workflows that manage multiple ephemeral clusters, with rich dependency relationships, all of which are automatically configured by boundary-layer. For example, see the figure below: this is a real workflow that runs some hadoop jobs on one cluster while running some ML training jobs in parallel on an external service, and then finally runs more hadoop jobs on a second cluster. The complexity of the dependencies between the training jobs and downstream jobs required boundary-layer to insert several flow-control operators in order to ensure that the downstream jobs start only once all of the upstream dependencies are met.

Conversion from Oozie

One of our primary initial concerns was the need to be able to migrate our Oozie workflows to Airflow. This had to be an automated process, because we knew we would have to repeatedly convert workflows in order to keep them in-sync between our on-premise cluster and our GCP resources while we developed and built confidence in the new platform. The boundary-layer workflow format is not difficult to reconcile with Oozie’s native configuration formats, so boundary-layer is distributed with a parser that does this conversion automatically. We built tooling to incorporate the converter into our CI/CD processes, and for the duration of our cloud validation and migration period, we maintained perfect fidelity between on-premise Oozie and cloud-based Airflow DAGs.

Extensibility

A final requirement that we targeted in the development of boundary-layer is that it must be easy to add new types of operators, generators, or resources. It must not be difficult to modify or add to the operator schemas or the configuration settings for the resource and generator abstractions. After all, Airflow’s huge open-source community (including several Etsy engineers!) ensures that its list of supported operators is growing practically every day. In addition, we have our own proprietary set of operators for Etsy-specific purposes, and we must keep the configurations for these out of the public boundary-layer distribution. We satisfied these requirements via two design choices.

First, every operator, generator, or resource is represented by a single configuration file, and these files get packaged up with boundary-layer. Adding a new operator/generator/resource is accomplished simply by adding a new configuration file. Here is an example configuration, in this case for the Airflow BashOperator:

name: bash
operator_class: BashOperator
operator_class_module: airflow.operators.bash_operator
schema_extends: base

parameters_jsonschema:
  properties:
    bash_command:
      type: string
    
    xcom_push:
      type: boolean
    
    env:
      type: object
      additionalProperties:
        type: string
    
    output_encoding:
      type: string
  
  required:
  - bash_command
  
  additionalProperties: false

We use standard JSON Schemas to specify the parameters to the operator, and we use a basic single-inheritance model to centralize the specification of common parameters in the BaseOperator, as is done in the Airflow code itself.
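Because the parameter specifications are plain JSON Schemas, validating an operator's properties amounts to ordinary schema validation. Here is a simplified sketch using the jsonschema library (boundary-layer's real validation also resolves the schema_extends inheritance chain and applies the type conversions mentioned earlier):

# Simplified sketch of schema-based parameter validation using the jsonschema
# library; boundary-layer's real validation also resolves the schema_extends
# inheritance chain and applies type conversions.
import jsonschema

bash_schema = {
    'type': 'object',
    'properties': {
        'bash_command': {'type': 'string'},
        'xcom_push': {'type': 'boolean'},
        'env': {'type': 'object', 'additionalProperties': {'type': 'string'}},
        'output_encoding': {'type': 'string'},
    },
    'required': ['bash_command'],
    'additionalProperties': False,
}

# Valid properties pass silently...
jsonschema.validate({'bash_command': 'echo hello'}, bash_schema)

# ...while a misspelled parameter raises a ValidationError at build time,
# long before the DAG ever reaches an Airflow instance.
try:
    jsonschema.validate({'bash_comand': 'echo hello'}, bash_schema)
except jsonschema.ValidationError as error:
    print(error.message)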

Second, we implemented a plugin mechanism based on python's setuptools entry points. All of our internal configurations are integrated into boundary-layer via plugins. We package a single default plugin with boundary-layer that contains configurations for common open-source Airflow operators. Other plugins can be added by packaging them into separate python packages, as we have done internally with our Etsy-customized plugin. The plugin mechanism has grown to enable quite extensive workflow customizations, which we use at Etsy in order to enable the full suite of proprietary modifications used on our platform.
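To give a sense of what a plugin package looks like, here is a sketch of a hypothetical plugin's setup.py. The entry-point group name and plugin class below are placeholders for illustration, not the names boundary-layer actually uses:

# setup.py for a hypothetical boundary-layer plugin package. The entry-point
# group name and plugin class below are placeholders for illustration, not
# the names boundary-layer actually uses.
from setuptools import setup, find_packages

setup(
    name='my-company-boundary-layer-plugin',
    version='0.1.0',
    packages=find_packages(),
    # configuration YAML files for proprietary operators ship inside the package
    package_data={'my_company_plugin': ['config/operators/*.yaml']},
    entry_points={
        'boundary_layer_plugins': [
            'my_company = my_company_plugin.plugin:MyCompanyPlugin',
        ],
    },
)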

Conclusion

The boundary-layer project has been a big success for us. All of the nearly-100 workflows that we deploy to our production Airflow instances are written as boundary-layer configurations, and our deployment tools no longer even support python-based DAGs. Boundary-layer’s ability to validate workflow configurations and abstract away implementation details has enabled us to provide a self-service Airflow solution to our data scientists and engineers, without requiring much specialized knowledge of Airflow itself. Over 30 people have contributed to our internal Airflow workflow repository, with minimal process overhead (Jenkins is the only “person” who must approve pull requests), and without having deployed a single invalid DAG.

We are excited to release boundary-layer to the public, in hopes that other teams find it similarly useful. We are committed to supporting it and continuing to add new functionality, so drop us a github issue if you have any requests. And of course, we welcome community contributions as well!


Double-bucketing in A/B Testing

Posted on November 7, 2018

Previously, we’ve posted about the importance we put in Etsy’s experimentation systems for our decision-making process. In a continuation of that theme, this post will dive deep into an interesting edge case we discovered.

We ran an A/B test which required a 5% control variant and 95% treatment variant rather than the typical split of 50% for control and treatment variants.  Based on the nature of this particular A/B test, we expected a positive change for conversion rate, which is the percent of users that make a purchase.

At the conclusion of the A/B test, we had some unexpected results. Our A/B testing tool, Catapult, showed the treatment variant “losing” to the control variant.  Catapult was showing a negative change in conversion rate when we’d expect a positive rate of change.

Due to these unexpected negative results, the Data Analyst team investigated why this was happening. This quote summarizes their findings:

The control variant “benefited” from double-bucketing because given its small size (5% of traffic), receiving an infusion of highly engaged browsers from the treatment provided an outsized lift on its aggregate performance.

With the double-bucketed browsers excluded, the true change in conversion rate is positive, which is the result we expected from the A/B test. Just 0.02% of the total browsers in the A/B test were double-bucketed, yet this small percentage of the total browsers had a large impact on the A/B test results. This post will cover the details of why that occurred.

Definition of Double-bucketing

So what exactly is double-bucketing?

In an A/B test, a user is shown either the control or treatment experience. The process to determine which variant the user falls into is called ‘bucketing’. Normally, a user experiences only the control or only the treatment; however in this A/B test, there was a tiny percentage of users who experienced both variants. We call this error in bucketing ‘double-bucketing’.

Typical user 50/50 bucketing for an A/B test puts ½ of the users into the control variant and ½ into the treatment variant. Those users stay in their bucketed variant. We calculate metrics and run statistical tests by summing all the data for the users in each variant.

However, the double-bucketing error we discovered would place the last 2 users in both control and treatment variants, as shown below. Now those users’ data is counted in both variants for statistics on all metrics in the experiment.

How browsers are bucketed

Before discussing the cases of double-bucketing that we found, it helps to have a high-level understanding of how A/B test bucketing works at Etsy.

For etsy.com web requests, we use a unique identifier from the user's browser cookie, which we refer to as the "browser id". Using the string value from the cookie, our clickstream data logic, EventPipe, sets the browser id property on each event.

Bucketing is determined by a hash. First we concatenate the name of the A/B test and the browser id.  The name of the A/B test is referred to as the “configuration flag”. That string is hashed using SHA-256 and then converted to an integer between 0 and 99. For a 50% A/B test, if the value is < 50, the browser is bucketed into the treatment variant. Otherwise, the browser is in the control variant.  Because the hashing function is deterministic, the user should be bucketed into the same variant of an experiment as long as the browser cookie remains the same.
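A small sketch of the scheme described above (illustrative only; the production implementation differs in its details):

# Sketch of the bucketing scheme described above: hash the configuration flag
# concatenated with the browser id, map the digest to an integer between 0 and
# 99, and compare against the treatment percentage. Illustrative only; the
# production implementation differs in its details.
import hashlib

def bucket(config_flag, browser_id, treatment_percentage=50):
    digest = hashlib.sha256((config_flag + browser_id).encode('utf-8')).hexdigest()
    value = int(digest, 16) % 100
    return 'treatment' if value < treatment_percentage else 'control'

# Deterministic: the same cookie always lands in the same variant for a given
# experiment, as long as the cookie does not change.
print(bucket('my_experiment_flag', 'browser-abc123'))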

EventPipe adds the configuration flag and bucketed variant information to the “ab” property on events.

For an A/B test’s statistics in Catapult, we filter by the configuration flag and then group by the variant.

This bucketing logic is consistent and has worked well for our A/B testing for years.  Although occasionally some experiments wound up with small numbers of double-bucketed users, we didn’t detect a significant impact until this particular A/B test with a 5% control.

Some Example Numbers (fuzzy math)

We'll use some example numbers with some fuzzy math to understand how the conversion rate was affected so much by only 0.02% double-bucketed browsers.

For most A/B tests, we do 50/50 bucketing between the control variant and treatment variants. For this A/B test, we did a 5% control which puts 95% in the treatment.

If we start with 1M browsers, our 50% A/B test has 500K browsers in both control and treatment variants. Our 5% control A/B test has 50K browsers in the control variant and 950K in the treatment variant.

Let’s assume a 10% conversion rate for easy math. For the 50% A/B test, we have 50K converted browsers in both the control and treatment variant. Our 5% control A/B test has 5K converted browsers in the control variant and 95K in the treatment variant.

For the next step, let’s assume 1% of the converting browsers are double-bucketed. When we add the double-bucketed browsers from the opposite variant to both the numerator and denominator, we get a new conversion rate. For our 50% A/B test, that is 50,500 converted browsers in both the control and treatment variants. The new conversion rate is slightly off from the expected conversion rate but only by 0.1%.

For our 5% control A/B test, the treatment variant’s number of converted browsers only increased by 50 browsers from 95,000 to 95,050. The treatment variant’s new conversion rate still rounds to the expected 10%.

But for our 5% control A/B test, the control variant’s number of converted browsers jumps from 5000 to 5950 browsers. This causes a huge change in the control variant’s conversion rate – from 10% to 12% – while the treatment variant’s conversion rate was unchanged.
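The fuzzy math is easy to check directly; a quick sketch reproducing the numbers above:

# Reproducing the fuzzy math above: 1% of each variant's converting browsers
# is double-bucketed into the opposite variant of the 5%/95% experiment.
control_browsers, treatment_browsers = 50000, 950000
control_converted, treatment_converted = 5000, 95000

leaked_into_control = int(0.01 * treatment_converted)    # 950 browsers
leaked_into_treatment = int(0.01 * control_converted)    # 50 browsers

control_rate = (control_converted + leaked_into_control) / \
    float(control_browsers + leaked_into_control)
treatment_rate = (treatment_converted + leaked_into_treatment) / \
    float(treatment_browsers + leaked_into_treatment)

print(round(control_rate, 3))    # ~0.117: control jumps from 10% to roughly 12%
print(round(treatment_rate, 3))  # ~0.100: treatment is essentially unchanged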

Cases of Double-bucketing

Once we understood that double-bucketing was causing these unexpected results, we started digging into what cases led to double-bucketing of individual browsers. We found two main cases. Since conversion rates were being affected, unsurprisingly both cases involved checkout.

Checkout from new device

When browsing etsy.com while signed out, you can add listings to your cart.

Once you click the “Proceed to checkout” button, you are prompted to sign in. You get a sign in screen similar to this.

After you sign in, if we have never seen your browser before, then we email you a security alert that you’ve been signed in from a new device. This is a wise security practice and pretty standard across the internet.

Many years ago, we were doing A/B testing on emails which were all sent from offline jobs. Gearman is our framework for running offline jobs based on http://gearman.org. In Gearman, we have no access to cookies and thus cannot get the browser id, but we do have the email address. So override logic was added deep in email template logic to bucket by email address rather than by browser id.

This worked perfectly for A/B testing in emails sent by Gearman, but the override applied to all emails, not just those sent by Gearman. The security email isn't sent from Gearman; it comes from the sign-in request. Because the override changed the bucketing ID to the user's email address rather than the browser id, the same browser could end up bucketed into two different variants: once using the browser id and once using the email address.

Since we are no longer using that email system for A/B testing, we were able to simply remove the override call.

Pattern Checkout

Pattern is Etsy’s tool that sellers use to create personalized, separate websites for their businesses.  Pattern shops allow listings to be added to your cart while on the shop’s patternbyetsy.com domain.

The checkout occurs on etsy.com domain instead of the patternbyetsy.com domain. Since the value from the user’s browser cookie is what we bucket on and we cannot share cookies across domains, we have two different hashes used for bucketing.

In order to attribute conversions to Pattern, we have logic to override the browser id with the value from the patternbyetsy.com cookie during the checkout process on etsy.com. This override logic works for attributing conversions; however, during sign-in some bucketing happens before the controllers execute the override logic.

For this case, we chose to remove bucketing data for Pattern visits as this override caused the bucketing logic to put the same user into both the control and treatment variants.

Conclusions

Here is a dashboard of double-bucketed browsers per day that helped us track our fixes of double-bucketing.


Capacity planning for Etsy’s web and API clusters

Posted on October 23, 2018

Capacity planning for the web and API clusters powering etsy.com has historically been a once-per-year event for us. The purpose is to gain an understanding of the capacity of the heterogeneous mix of hardware in our datacenters that makes up the clusters. This was usually done a couple of weeks before the time we call Slush. Slush (a word play on code freeze) is the time range of approximately 2 months around the holidays where we deliberately decide to slow down our rate of change on the site but not actually stop all development. We do this to recognize the fact that the time leading up to the holidays is the busiest and most important time for a lot of our sellers, and any breakage results in higher than usual impact on their revenue. This also means it's the most important time for us to get capacity estimates right and make sure we are in a good place to serve traffic throughout the busy holiday season.

During this exercise of forecasting and planning capacity someone would collect all relevant core metrics (the most important one being requests per second on our Apache httpd infrastructure) from our Ganglia instances and export them to csv. Those timeseries data would then be imported into something that would give us a forecasting model. Excel, R, and python scripts are examples of tools that have been used in previous years for this exercise.

After a reorg of our systems engineering department in 2017, Slush that year was the first time the newly formed Compute team was tasked with capacity planning for our web and api tiers. And as we set out to do this, we had three goals:

First we started with a spreadsheet to track everything that we would be capacity planning for. Then we got an overview of what we had in terms of hardware serving those tiers. We got this from running a knife search like this:

knife search node "roles:WebBase" -a cpu.0.model_name -a cpu.cores -F json

and turning it into CSV via a ruby script, so we could have it in the spreadsheet as well. Now that we had the hardware distribution of our clusters, we gave each model a score so we could rank them and derive performance differences and load balancer weighting scores if needed. These performance scores are a rough heuristic to allow relative comparison of different CPUs. The heuristic takes into account core count, clock speed and generational improvements (assuming a 20% improvement between processors for the same clock speed and core count). It's not an exact science at this point, but a good enough measure to give us a useful idea of how to compare different hardware generations against each other. Then we assigned each server a performance score based on that heuristic.
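A sketch of that heuristic (the CPU models and generation numbers here are made up for illustration; our actual scoring table differs):

# Sketch of the performance-score heuristic: cores times clock speed, with a
# 20% multiplier per processor generation. The CPU models and generation
# numbers below are made up for illustration, not our actual scoring table.
GENERATION = {'E5-2620 v2': 2, 'E5-2630 v3': 3, 'E5-2640 v4': 4}

def perf_score(cpu_model, cores, clock_ghz):
    generational_factor = 1.2 ** GENERATION.get(cpu_model, 1)
    return round(cores * clock_ghz * generational_factor, 1)

# A newer 16-core box scores well above an older 12-core box at the same clock
print(perf_score('E5-2640 v4', 16, 2.4))
print(perf_score('E5-2620 v2', 12, 2.4))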

Next up was the so-called "squeeze testing". The performance scores weren't particularly helpful without knowing how much actual work a server with a given score can do on each cluster type. Request work on our frontend web servers is very different from the work on our component API tier, for example. So a performance score of 50 means something very different depending on which cluster we are talking about.

Squeeze testing is the capacity planning exercise of trying to see how much performance you can squeeze out of a service, usually by gradually increasing the amount of traffic it receives and observing how much it can handle before exhausting its resources. In the scenario of an established cluster this is often hard to do, as we can't arbitrarily add more traffic to the site. That's why we turned the opposite dial and removed resources (i.e. servers) from a cluster until the cluster (almost) started to not serve in an appropriate manner anymore.

So for our web and API clusters this meant removing nodes from the serving pools until they dropped to about 25% idle CPU, and noting the number of requests per second they were serving at that point. 20% idle CPU is a threshold on those tiers where we start to see performance decrease, because the remaining CPU time is consumed by tasks like context switching and other non-application workloads. That means stopping at 25% gave us headroom for some variance in this type of testing and also meant we weren't hurting actual site performance while doing the squeeze testing.

Now that we knew how many requests per second we could process for a given performance score, the only thing we were missing was how much traffic to expect in the coming months. In the past – as mentioned above – this meant that we would download timeseries data from Ganglia for requests per second for each cluster, for every datacenter we had nodes in. That data then needed to be combined to get the total sum of requests we had been serving. Then we would stick the data into Excel, try a couple of Excel's curve fitting algorithms, see which looked best, and take the forecasting results based on fit. We have also used R or python for that task in previous years. But it was always a very handcrafted and manual process.

So this time around we wrote a tool to do all this work for us called “Ausblick”. It’s based on Facebook’s prophet and automatically pulls in data from Ganglia based on host and metric regexes, combines the data for each datacenter and then runs forecasting on the timeseries and shows us a nice plot for it. We can also give it a base value and list of hosts with perfscores and ausblick will draw the current capacity of the cluster into the plot as a horizontal red line. Ausblick runs in our Kubernetes cluster and all interactions with the tool are happening through its REST API and an example request looks like this:

% cat conapi.json
{ "title": "conapi cluster",
  "hostregex": "^conapi*",
  "metricsregex": "^apache_requests_per_second",
  "datacenters": ["dc1","dc2"],
  "rpsscore": 9.5,
  "hosts": [
    ["conapi-server01.dc1.etsy.com", 46.4],
    ["conapi-server02.dc1.etsy.com", 46.4],
    ["conapi-server03.dc2.etsy.com", 46.4],
    ["conapi-server04.dc1.etsy.com", 27.6],
    ["conapi-server05.dc2.etsy.com", 46.4],
    ["conapi-server06.dc2.etsy.com", 27.6],
    ["conapi-server06.dc1.etsy.com", 46.4]
  ]
}
% curl -X POST http://ausblick.etsycorp.com/plan -d @conapi.json --header "Content-Type: application/json"
{"plot_url": "/static/conapi_cluster.png"}%

In addition to this API we wrote an integration for our Slack bot to easily generate a new forecast based on current data.

Ausblick Slack integration

And to finish this off with a bunch of graphs, here is what the current forecasting looks like for some of the internal API tiers backing etsy.com:

Ausblick forecast for conapi cluster

Ausblick forecast for compapi cluster

Ausblick has allowed us to democratize the process of capacity forecasting to a large extent and given us the ability to redo forecasting estimates at will. We used this process successfully for last year’s Slush and are in the process of adapting it to our cloud infrastructure after our recent migration of the main etsy.com components to GCP.


Etsy’s experiment with immutable documentation

Posted on October 10, 2018

Introduction

Writing documentation is like trying to hit a moving target. The way a system works changes constantly, so as soon as you write a piece of documentation for it, it starts to get stale. And the systems that need docs the most are the ones being actively used and worked on, which are changing the fastest. So the most important docs go stale the fastest! 1

Etsy has been experimenting with a radical new approach: immutable documentation.

Woah, you just got finished talking about how documentation goes stale! So doesn’t that mean you have to update it all the time? How could you make documentation read-only?

How docs go stale

Let's back up for a sec. When a bit of a documentation page becomes outdated or incorrect, it typically doesn't invalidate the entire doc (unless the system itself is deprecated). It's usually just one part of the doc, say a code snippet that uses outdated syntax for an API.

For example, we have a command-line tool called dbconnect that lets us query the dev and prod databases from our VMs. Our internal wiki has a doc page that discusses various tools that we use to query the dbs. The part that discusses dbconnect goes something like:

 

Querying the database via dbconnect ...

((section 1))
dbconnect is a script to connect to our databases and query them. [...]

((section 2))
The syntax is:

% dbconnect <shard>

 

Section 1 gives context about dbconnect and why it exists, and section 2 gives tactical details of how to use it.

Now say a switch is added so that dbconnect --dev <shard> queries the dev db, and dbconnect --prod <shard> queries the prod db. Section 2 above now needs to be updated, because it’s using outdated syntax for the dbconnect command. But the contextual description in section 1 is still completely valid. So this doc page is now technically stale as a whole because of section 2, but the narrative in section 1 is still very helpful!

In other words, the parts of the doc that are most likely to go stale are the tactical, operational details of the system. How to use the system is constantly changing. But the narrative of why the system exists and the context around it is less likely to change quite so quickly.

 


 

Docs can be separated into how-docs and why-docs

Put another way: ‘code tells how, docs tell why’  2. Code is constantly changing, so the more code you put into your docs, the faster they’ll go stale. To codify this further, let’s use the term “how-doc” for operational details like code snippets, and “why-doc” for narrative, contextual descriptions  3. We can mitigate staleness by limiting the amount we mix the how-docs with the why-docs.

 


 

Documenting a command using Etsy’s FYI system

At Etsy we’ve developed a system for adding how-docs directly from Slack. It’s called “FYI”. The purpose of FYI is to make documenting tactical details — commands to run, syntax details, little helpful tidbits — as frictionless as possible.

 


 

Here’s how we’d approach documenting dbconnect using FYIs 4:

Kaley was searching the wiki for how to connect to the dbs from her VM, to no avail. So she asks about it in a Slack channel:

hey @here anyone remember how to connect to the dbs in dev? I forget how. It’s something like dbconnect etsy_shard_001A but that’s not working

When she finds the answer, she adds an FYI using the ?fyi command (using our irccat integration in Slack 5):

?fyi connect to dbs with `dbconnect etsy_shard_000_A` (replace `000` with the shard number). `A` or `B` is the side

Jason sees Kaley add the FYI and mentions you can also use dbconnect to list the databases:

you can also do `dbconnect -l` to get a list of all DBs/shards/etc, and it works for dev-proxy on or off

Kaley then adds the :fyi: Slack reaction (reacji) to his comment to save it as an FYI:

you can also do `dbconnect -l` to get a list of all DBs/shards/etc, and it works for dev-proxy on or off

A few weeks later, Paul-Jean uses the FYI query command ?how to search for info on connecting to the databases, and finds Kaley’s FYI 6:

?how database connect

He then looks up FYIs mentioning dbconnect specifically to discover Jason’s follow-up comment:

?how dbconnect

But he notices that the dbconnect command has been changed since Jason’s FYI was added: there is now a switch to specify whether you want dev or prod databases. So he adds another FYI to supplement Jason’s:

?fyi to get a list of all DBs/shards/etc in dev, use `dbconnect --dev`, and to list prod DBs, use `dbconnect --prod` (default)

Now ?how dbconnect returns Paul-Jean’s FYI first, and Jason’s second:

?how dbconnect

FYIs trade completeness for freshness

Whenever you do a ?how query, matching FYIs are always returned most recent first. So you can always update how-docs for dbconnect by adding an FYI with the keyword “dbconnect” in it. This is crucial, because it means the freshest docs always rise to the top of search results.

FYIs are immutable, so Paul-Jean doesn’t have to worry about changing any FYIs created by Jason. He just adds them as he thinks of them, and the timestamps determine the priority of the results. How-docs change so quickly, it’s easier to just replace them than try to edit them. So they might as well be immutable.
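To make the storage model concrete, here is a toy sketch in the spirit of the FYI store. Per the footnotes below, the real system is a PHP script behind irccat writing to SQLite, so this python version is purely illustrative:

# Toy sketch in the spirit of the FYI store; the real system, per the
# footnotes, is a PHP script behind irccat backed by SQLite full-text search.
import sqlite3
import time

db = sqlite3.connect('fyi.db')
db.execute('CREATE VIRTUAL TABLE IF NOT EXISTS fyi USING fts4(text, added_at)')

def add_fyi(text):
    # FYIs are immutable: inserting a new row is the only write operation
    db.execute('INSERT INTO fyi (text, added_at) VALUES (?, ?)',
               (text, int(time.time())))
    db.commit()

def how(query):
    # full-text match, freshest first, so newer FYIs supersede older ones
    rows = db.execute(
        'SELECT text FROM fyi WHERE text MATCH ? ORDER BY rowid DESC', (query,))
    return [text for (text,) in rows]

add_fyi('connect to dbs with `dbconnect etsy_shard_000_A`')
add_fyi('to list DBs in dev use `dbconnect --dev`; prod is `dbconnect --prod`')
print(how('dbconnect'))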

 


 

Since every FYI has an explicit timestamp, it’s easy to gauge how current they are relative to API versions, OS updates, and other internal milestones. How-docs are inherently stale, so they might as well have a timestamp showing exactly how stale they are.

 


 

The tradeoff is that FYIs are just short snippets. There’s no room in an FYI to add much context. In other words, FYIs mitigate staleness by trading completeness for freshness.

 


 

Since FYIs lack context, there's still a need for why-docs (e.g. a wiki page) about connecting to dev/prod dbs, which mentions the dbconnect command along with other relevant resources. But if the how-docs are largely left in FYIs, those why-docs are less likely to go stale.

So FYIs allow us to decouple how-docs from why-docs. The tactical details are probably what you want in a hurry. The narrative around them is something you sit back and read on a wiki page.

 


What FYIs are

To summarize, FYIs are:

  1. Short snippets of tactical detail (how-docs): commands to run, syntax details, little helpful tidbits.
  2. Added and searched directly from Slack, via the ?fyi command (or the :fyi: reacji) and the ?how command.
  3. Immutable and timestamped, with the freshest matches returned first.

What FYIs are NOT

Similarly, FYIs are NOT:

  1. Why-docs: there is no room in an FYI for much narrative or context.
  2. A replacement for wiki pages that explain why a system exists and the context around it.

Conclusions

Etsy has recognized that technical documentation is a mixture of two distinct types: a narrative that explains why a system exists (“why-docs”), and operational details that describe how to use the system (“how-docs”). In trying to overcome the problem of staleness, the crucial observation is that how-docs typically change faster than why-docs do. Therefore the more how-docs are mixed in with why-docs in a doc page, the more likely the page is to go stale.

We’ve leveraged this observation by creating an entirely separate system to hold our how-docs. The FYI system simply allows us to save Slack messages to a persistent data store. When someone posts a useful bit of documentation in a Slack channel, we tag it with the :fyi: reacji to save it as a how-doc. We then search our how-docs directly from Slack using a bot command called ?how.

FYIs are immutable: to update them, we simply add another FYI that is more timely and correct. Since FYIs don’t need to contain narrative, they’re easy to add, and easy to update. The ?how command always returns more recent FYIs first, so fresher matches always have higher priority. In this way, the FYI system combats documentation staleness by trading completeness for freshness.

We believe the separation of operational details from contextual narrative is a useful idea that can be used for documenting all kinds of systems. We’d love to hear how you feel about it! And we’re excited to hear about what tooling you’ve built to make documentation better in your organization. Please get in touch and share what you’ve learned. Documentation is hard! Let’s make it better!

Acknowledgements

The FYI system was designed and implemented by Etsy’s FYI Working Group: Paul-Jean Letourneau, Brad Greenlee, Eleonora Zorzi, Rachel Hsiung, Keyur Govande, and Alec Malstrom. Special thanks to Mike Lang, Rafe Colburn, Sarah Marx, Doug Hudson, and Allison McKnight for their valuable feedback on this post.

References

  1. From “The Golden Rules of Code Documentation”: “It is almost impossible without an extreme amount of discipline, to keep external documentation in-sync with the actual code and/or API.”
  2. Derived from “code tells what, docs tell why” in this HackerNoon post.
  3. The similarity of the terms “how-doc” and “why-doc” to the term here-doc is intentional. For any given command, a here-doc is used to send data into the command in-place, how-docs are a way to document how to use the command, and why-docs are a description of why the command exists to begin with.
  4. You can replicate the FYI system with any method that allows you save Slack messages to a predefined, searchable location. So for example, one could simply install the Reacji Channeler bot, which lets you assign a Slack reacji of your choosing to cause the message to be copied to a given channel. So you could assign an “fyi” reacji to a new channel called “#fyi”, for example. Then to search your FYIs, you would simply go to the #fyi channel and search the messages there using the Slack search box.
  5. When the :fyi: reacji is added to a Slack message (or the ?fyi irccat command is used), an outgoing webhook sends a POST request to irccat.etsy.com with the message details. This triggers a PHP script to save the message text to a SQLite database, and sends an acknowledgement back to the Slack incoming webhook endpoint. The acknowledgement says “OK! Added your FYI”, so the user knows their FYI has been successfully added to the database.
  6. Searching FYIs using the ?how command uses the same architecture as for adding an FYI, except the PHP script queries the SQLite table, which supports full-text search via the FTS plugin.


How Etsy Handles Peeking in A/B Testing

Posted on October 3, 2018

Etsy relies heavily on experimentation to improve our decision-making process. We leverage our internal A/B testing tool when we launch new features, polish the look and feel of our site, or even make changes to our search and recommendation algorithms. For years, we at Etsy have prided ourselves on our culture of continuous experimentation. However, as our experimentation platform scales and the velocity of experimentation increases rapidly across the company, we also face a number of new challenges. In this post, we investigate one of these challenges: how to peek at experimental results early in order to increase the velocity of our decision-making without sacrificing the integrity of our results.

The Peeking Problem

In A/B testing, we’re looking to determine if a metric we care about (i.e. percentage of visitors who make a purchase) is different between the control and treatment groups. But when we detect a change in the metric, how do we know if it is real or due to random chance? We can look at the p-value of our statistical test, which indicates the probability we would see the detected difference between groups assuming there is no true difference. When the p-value falls below the significance level threshold we say that the result is statistically significant and we reject the hypothesis that the control and treatment are the same.

So we can just stop the experiment when the hypothesis test for the metric we care about has a p-value of less than 0.05, right? Wrong. In order to draw the strongest conclusions from the p-value in the context of an A/B test, we have to fix the sample size of an experiment in advance, and only make a decision on the p-value once. Peeking at data regularly and stopping an experiment as soon as the p-value dips below 0.05 increases the rate of Type I errors, or false positives, because the false positive probability of each test compounds, increasing the overall probability that you'll see a false result.

Let's look at an example to gain a more concrete view of the problem. Suppose we run an experiment where there is no true change between the control and experimental variant and both have a baseline target metric of 50%. If we are using a significance level of 0.1 and there is no peeking (in other words, the sample size needed before a decision is made is determined in advance), then the rate of false positives is 10%. However, if we do peek and we check the significance level at every observation, then after 500 observations there is over a 50% chance of incorrectly stating that the treatment is different from the control (Figure 1).

Figure 1: Chances for accepting that A and B are different, with A and B both converting at 50%.
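This behavior is easy to reproduce in a quick simulation (a sketch, not our analysis code): with no true difference between A and B, checking a two-proportion z-test after every observation and stopping at the first "significant" result yields far more than the nominal 10% false positive rate.

# Quick simulation of the peeking problem (a sketch, not our analysis code).
# Both variants convert at 50%, yet stopping at the first peek where a
# two-proportion z-test is "significant" at the 0.1 level produces false
# positives far more often than 10% of the time.
import numpy as np
from scipy import stats

def peeking_false_positive(observations=500, rate=0.5, alpha=0.1, seed=0):
    rng = np.random.RandomState(seed)
    a = rng.binomial(1, rate, observations).cumsum()
    b = rng.binomial(1, rate, observations).cumsum()
    z_critical = stats.norm.ppf(1 - alpha / 2.0)
    for n in range(10, observations + 1):
        pooled = (a[n - 1] + b[n - 1]) / (2.0 * n)
        se = np.sqrt(2.0 * pooled * (1.0 - pooled) / n)
        if se == 0:
            continue
        z = ((a[n - 1] - b[n - 1]) / float(n)) / se
        if abs(z) > z_critical:  # peek says "significant": stop, declare a winner
            return True
    return False

trials = 1000
rate = sum(peeking_false_positive(seed=i) for i in range(trials)) / float(trials)
print(rate)  # well above the nominal 10% false positive rate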

At this point, you might already have figured that the simplest way to solve the problem would be to fix a sample size in advance and run an experiment until the end before checking the significance level. However, this requires strictly enforced separation between the design and analysis of experiments which can have large repercussions throughout the experimental process.  In early stages of an experiment, we may miss a bug in the set up or with the feature being tested that will invalidate our results later. If we don’t catch these early, it slows down our experimental process unnecessarily, leaving less time for iterations and real site changes. Another issue involved in set up is that it can be difficult to predict the effect size product teams would like to obtain prior to the experiment, which can make it hard to optimize the sample size in advance.  Even assuming we set up our experiment perfectly, there are down the line implications. If an experiment is impacting a metric in a negative way, we want to be aware as soon as possible so we don’t negatively affect our users’ experience. These considerations become even more pronounced when we’re running an experiment on a small population, or in a less trafficked part of the site and it can take months to reach the target sample size.  Across teams, we want to be able to iterate quickly without sacrificing the integrity of our results.

With this in mind, we need to come up with statistical methodology that will give reliable inference while still providing product teams the ability to continuously monitor experiments, especially for our long-running experiments. At Etsy, we tackle this challenge from two sides, user interface and statistical procedures. We made a few user interface changes to our A/B testing tool to prevent our stakeholders from drawing false conclusions, and we implemented a flexible p-value stopping-point in our platform, which takes inspiration from the sequential testing concept in statistics.

It is worth noting that the peeking problem has been studied by many, including industry veterans1, 2, developers of large-scale commercial A/B testing platforms3, 4 and academic researchers5. Moreover, it is hardly a challenge exclusive to A/B testing on the web. The peeking problem has troubled the medical field for a long time; for example, medical scientists could peek at the results and stop a clinical trial early because of initial positive results, leading to flawed interpretations of the data6, 7.

Our Approach

In this section, we dive into the approach that we have designed and adapted to address the peeking problem: transitioning from traditional, fixed-horizon testing to sequential testing, and preventing peeking behaviors through user interface changes.

Sequential Testing with Difference in Converting Visits

Sequential testing, which has been widely used in clinical trials8, 9 and gained recent popularity for web experimentation10 , guarantees that if we end the test when the p-value is below a predefined threshold α , the false positive rate will be no more than α. It does so by computing the probabilities of false-positives at each potential stopping point using dynamic programming, assuming that our test statistic is normally distributed. Since we can compute these probabilities, we can then adjust the test’s p-value threshold, which in turn changes the false-positive chance, at every step so that the total false positive rate is below the threshold that we desire. Therefore, sequential testing enables concluding experiments as soon as the data justifies it, while also keeping our false positive rate in check.

We investigated a few methods, including O'Brien-Fleming, Pocock, and sequential testing using the difference in successful observations. We ultimately settled on the last approach. Using the difference in successful observations, we look at the raw difference in converting visits and stop an experiment when this difference becomes large enough. The difference threshold is only valid until we reach a total number of converted visits. This method is good for detecting small changes and does so quickly, which makes it most suitable for our needs. Nevertheless, we did consider some cons this method presented as well. Traditional power and significance calculations use the proportion of successes, whereas looking at the difference in converted visits does not take into account total population size. Because of this, for experiments with high-baseline target metrics we are more likely to reach the total number of converted visits before we see a large enough difference in converted visits, which means we are more likely to miss a true change in these cases. Furthermore, it requires extra setup when an experiment is not evenly split across variants. We chose to use this method with a few adjustments for these shortcomings so we could increase our speed of detecting real changes between experimental groups.

Our implementation of this method is influenced by the approach Evan Miller described here. This method sets a threshold for difference between the control and treatment converted visits based on minimal detected effect and target false positive and negative rates.  If the experiment reaches or passes the threshold, we allow the experiment to end early. If this difference is not reached, we assess our results using the standard approach of a power analysis.  The combination of these methods creates a continuous p-value threshold for which we can safely stop an experiment when the p-value is under the curve. This threshold is lower near the beginning of an experiment and converges to our significance level as the experiment reaches our targeted power. This allows us to detect changes quicker with low baselines while not missing smaller changes for experiments with high baseline target metrics.

Figure 2: Example of a p-value threshold curve.
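The core stopping rule is simple. Here is a sketch of the Evan Miller-style rule our method draws on, in which N is the pre-chosen total number of converting visits and the experiment stops early once the difference in conversions reaches 2√N (per the derivation in that post, this keeps the false positive rate near 5%); our production thresholds and haircut adjustment are more involved than this:

# Sketch of an Evan Miller-style sequential stopping rule (see "Simple
# Sequential A/B Testing"). total_target is the pre-chosen number of
# converting visits; stopping once the difference in conversions reaches
# 2*sqrt(total_target) bounds the false positive rate, per that post.
# Miller's rule is one-sided; the control branch here is a symmetric
# extension for illustration.
import math

def sequential_decision(treatment_conversions, control_conversions, total_target):
    threshold = 2 * math.sqrt(total_target)
    if treatment_conversions - control_conversions >= threshold:
        return 'stop: treatment wins'
    if control_conversions - treatment_conversions >= threshold:
        return 'stop: control wins'
    if treatment_conversions + control_conversions >= total_target:
        return 'stop: no detectable difference'
    return 'keep collecting data'

# e.g. aiming for 10,000 converting visits, checked as data accumulates
print(sequential_decision(treatment_conversions=2700,
                          control_conversions=2450,
                          total_target=10000))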

To validate this approach, we tested it on results from experimental simulations with various baselines and effect sizes using mock experimental conditions. Before implementing, we wanted to understand:

  1. What effect will this have on false positive rates?
  2. What effect does early stopping have on reported effect size and confidence intervals?
  3. How much faster will we get a signal for experiments with true changes between groups?

We found that when using a p-value curve tuned for a 5% false positive rate, our early stopping threshold does not materially increase the false positive rate and we can be confident of a directional change.  

One of the downsides of stopping experiments early, however, is that with an effect size under ~5%, we tend to overestimate the impact and widen the confidence interval. To accurately attribute increases in metrics to experimental wins, we developed a haircut formula to apply to the effect size of metrics for experiments that we decide to end early. Furthermore, we offset some of these concerns by setting a standard of running experiments for at least 7 days to account for different weekend and weekday trends.

Figure 3: Reported Vs. True Effect Size

We tested this method with a series of simulations and saw that for experiments which would take 3 weeks to run assuming a standard power analysis, we could save at least a week in most cases where there was a real change between variants. This helped us feel confident that even with a slight overestimation of effect size, it was worth the time savings for teams with low-baseline target metrics, who typically struggle with long experimental run times.

Figure 4: Day Savings From Sequential Testing

UI Improvements

In our experimental testing tool, we wanted stakeholders to have access to the metrics and calculations we measure throughout the duration of the experiment. In addition to the p-value, we care about power and the confidence interval. First, power: teams at Etsy often have to coordinate experiments on the same page, so it is important for teams to have an idea of how long an experiment will have to run assuming no early stopping. We do this by running an experiment until we reach a set power.

Second, the confidence interval (CI) is the range of values within which we are confident the true value of a particular metric falls. In the context of A/B testing, for example, if we ran the experiment millions of times, 90% of the time the true value of some effect size would fall within the 90% CI. There are three things that we care most about in relation to the confidence interval of an effect in an experiment:

  1. Whether the CI includes zero, because this maps exactly to the decision we would make with the p-value; if the 90% CI includes zero, then the p-value is greater than 0.1. Conversely, if it doesn’t include zero, then the p-value is less than 0.1;
  2. The smaller the CI, the better estimate of the parameter we have;
  3. The farther away from zero the CI is, the more confident we can be that there is a true difference.

Previously in our A/B testing tool UI, we displayed statistical data as shown in the table below on the left. The “observed” column indicates results for the control and there is a “% Change” column for each treatment variant. When hovering over a number in the “% Change” column, a popover table appears, showing the observed and actual effect size, confidence level, p-value, and number of days we could expect to have enough data to power the experiment based on our expected effect size. 

Figure 5: User interface before changes.

However, always displaying numerical results in the “% Change” column could lead to stakeholders peeking at data and making an incorrect inference about the success of the experiment. Therefore, we added a row in the hover table to show the power of the test (assuming some fixed effect size), and made the following changes to our user interface:

  1. Show a visualization of the C.I. and color the bar red when the C.I. is entirely negative to indicate a significant decrease, green when the C.I. is entirely positive to indicate a significant increase, and grey when the C.I. spans 0.
  2. Display different messages in the “% Change” column and hover table to indicate different stages the experiment metric is currently in, depending on its power, p-value and calculated flexible p-value threshold. In the “% Change” column, possible messages include “Waiting on data”, “Not enough data”, “No change” and “+/- X %” (to show significant increase/ decrease). In the hover table, possible headers include “metric is not powered”, “there is no detectable change”, “we’re confident we detected a change”, and “directional change is correct but magnitude might be inflated” when early stopping is reached but the metric is not powered yet.   

Figure 6: User interface after changes.

Even after making these UI changes, making a decision on when to stop an experiment and whether or not to launch it is not always simple. Generally some things we advise our stakeholders to consider are:

  1. Do we have statistically significant results that support our hypothesis?
  2. Do we have statistically significant results that are positive but aren’t what we anticipated?
  3. If we don’t have enough data yet, can we just keep it running or is it blocking other experiments?
  4. Is there anything broken in the product experience that we want to correct, even if the metrics don’t show anything negative?
  5. If we have enough information on the main metrics overall, do we have enough information to iterate? For example, if we want to look at impact on a particular segment, which could be 50% of the traffic, then we’ll need to run the experiment twice as long as we had to in order to look at the overall impact.

We hope that these UI changes will help our stakeholders make better informed decisions while still letting them uncover cases where they have changed something more dramatically than expected and thus can stop the experiment sooner.

Further Discussion

In this section, we discuss a few more issues we examined while designing Etsy’s solutions to peeking.

Trade-off Between Power and Significance

There is a trade-off between Type I (false positive) and Type II (false negative) errors – if we decrease the probability of one of the errors, the probability of the other will increase – for a more detailed explanation, please see this short post. This translates into a trade-off between p-value and power, because if we require stronger evidence to reject the null hypothesis (i.e. a smaller p-value threshold), then there is a smaller chance that we will be able to correctly reject a false null hypothesis, i.e. decreased power. The different messages we display on the user interface balance this issue to some degree. In the end, it is just a choice that we have to make based on our priorities and focus in experimentation.

Weekend vs. Weekday Data Sample Size

At Etsy, the volume of traffic and intent of visitors varies from weekdays to weekends. This is not a concern for the sequential testing approach that we ultimately chose. However, it would be an issue for some other methods that require equal daily data sample size. During our research, we looked into ways to handle the inconsistency in our daily data sample size. We found that the GroupSeq package in R, which enables the construction of group sequential designs and has various alpha spending functions available to choose among, is a good way to account for this.

Other Types of Designs

The sequential sampling method that we have designed is a straightforward form of a stopping rule, modified to best suit our needs and circumstances. However, there are other types of sequential approaches that are more formally defined, such as the Sequential Probability Ratio Test (SPRT), which is utilized by Optimizely’s New Stats Engine [4], and the Sequential Generalized Likelihood Ratio test, which has been used in clinical trials [11]. There has also been debate in both academia and industry about the effectiveness of Bayesian A/B testing in solving the peeking problem [2, 5]. It is indeed a very interesting problem!

Final Thoughts

Accurate interpretation of statistical data is crucial in making informed decisions about product development. When online experiments have to be run efficiently to save time and cost, we inevitably run into dilemmas unique to our context, and peeking is just one of them. In researching and designing solutions to this problem, we examined some more rigorous theoretical work; however, the characteristics and priorities of online experimentation make it difficult to apply directly. Our approach outlined in this post, though simple, addresses the root cause of the peeking problem effectively. Looking forward, we think the balance between statistical rigor and practical constraints is what makes online experimentation intriguing and fun to work on, and we at Etsy are very excited about tackling the interesting problems awaiting us.

This work is a collaboration between Callie McRee and Kelly Shen from the Analytics and Analytics Engineering teams. We would like to thank Gerald van den Berg, Emily Robinson, Evan D’Agostini, Anastasia Erbe, Mossab Alsadig, Lushi Li, Allison McKnight, Alexandra Pappas, David Schott and Robert Xu for helpful discussions and feedback.

References

  1. How Not to Run an A/B Test by Evan Miller
  2. Is Bayesian A/B Testing Immune to Peeking? Not Exactly by David Robinson
  3. Peeking at A/B tests: why it matters, and what to do about it by Johari et al., KDD’17
  4. The New Stats Engine by Pekelis, et al., Optimizely
  5. Continuous monitoring of A/B tests without pain: optional stopping in Bayesian testing by Deng, Lu, et al., CEUR’17
  6. Trial sans Error: How Pharma-Funded Research Cherry-Picks Positive Results by Ben Goldacre of Scientific American, February 13, 2013
  7. False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant by Simmons, Simonsohn, et al. (2011), Psychological Science, 22
  8. Interim Analyses and Sequential Testing in Clinical Trials by Nicole Solomon, BIOS 790, Duke University
  9. A Pocock approach to sequential meta-analysis of clinical trials by Shuster, J. J., & Neu, J. (2013), Research Synthesis Methods, 4(3), 10.1002/jrsm.1088
  10. Simple Sequential A/B Testing by Evan Miller
  11. Sequential Generalized Likelihood Ratio Tests for Vaccine Safety Evaluation by Shih, M.-C., Lai, T. L., Heyse, J. F. and Chen, J. (2010), Statistics in Medicine, 29: 2698-2708


How Etsy Localizes Addresses

Posted by on September 26, 2018 / 2 Comments

Imagine you’re browsing the web from your overpriced California apartment one day and you find a neat new website with some really cool stuff. You pick out a few items, add them to your cart, and start the checkout process. You get to the part where they ask for your shipping address and this is the form you see:

 

 

It starts off easy – you fill in your name, your street, your apartment number, and then you reach the field labelled “Post Town”. What is a “Post Town”? Huh. Next you see “County”. Well, you know what a county is, but since when do you list it in your address? Then there’s “Postal code”. You might recognize it as what the US calls a “zip code”, but it’s still confusing to see.

So now you don’t really know what to do, right? Where do you put your city, or your state? Do you just cram your address into the form however you can and hope that you get your order? Or do you abandon your cart and decide not to buy anything from this site?

 

This is in fact a fake form I put together for this exercise, but it demonstrates exactly how a lot of our members outside the United States felt when we showed them  a US-centric address form.

 

Etsy does business in more than 200 countries, and it’s important that our members feel comfortable and confident when providing their shipping address and placing orders. They need to know that their orders are going to reach them. Furthermore, we ask for several addresses when new sellers join Etsy, like billing and banking addresses, and we want them to feel comfortable providing this information, and confident that we will know how to support them and their needs.

 

In this post I’ll cover:

  1. Where we started: generic, US-centric address forms
  2. What non-US addresses are supposed to look like, and the address formatting data we gathered
  3. How we used that data to build a localized address experience, for both address input and address display

 

Where we started: Generic Address Forms

When we first started this project, our address forms were designed in a variety of technologies, with minimal localization. Most address forms worked well for US members, and we displayed the appropriate forms for Canadian and Australian members, but members in every other country were shown a generic, unlocalized address form.

United States

Our address form for US members looks just fine – all the fields we expect to see are there, and there aren’t any unexpected fields.

Germany (et al)

 

This is the form we showed for Germany (and most other countries). We asked for a postal code, and a state, province, or region. For someone unfamiliar with German addresses, this might seem fine at first. But what if I told you that German addresses don’t have states? Even worse, in this form, state is a required field!

 

This form confused a lot of our German members, and they ended up putting any number of things in that field, just to be able to move forward. This led us to saved addresses like:

Ets Y. Crafter

123 Main Street

Berlin, Berlin 12435

Germany

In this case, the member just entered the city in the state field. This wasn’t the worst situation, and anything shipped to this address would probably arrive just fine.

But what about this address?

Ets Y. Crafter

123 Main Street

Berlin, California 12435

Germany

Sometimes the member entered a US state in the state field. This confused sellers and postal workers alike – we had real life examples of packages being shipped to the US because a state was included, even though the country listed was something totally different!

Ets Y. Crafter

123 Main Street

Berlin, dasdsaklklg

Germany

Members could even enter gibberish in the state field. Again, this was a bit confusing for sellers and the postal service.

 

What are non-US addresses supposed to look like?

Here’s an example of a German address:

Ets Y. Crafter

123 Main Street

12435 BERLIN

Germany

If you wanted to mail something to this address, you’d need to specify the recipient, the street name and number, the postal code and city, and the country. We could have used this address to determine an address format for Germany, but what about the almost 200 other countries Etsy supports? We didn’t really want to look up sample addresses for each country and guess at what the address format should be.

Thankfully, we didn’t have to do that.

We drew on 3 different sources when compiling a list of address formats for different countries.

So what kind of data did we get?

The most important piece of information we got is a format string that tells us which fields an address in that country uses, what order they appear in, and how they are separated (spaces, commas, or line breaks).

We also got other formatting data, including which fields are required, which fields should be uppercased, the local names for each field (state vs. province, zip vs. postal code), a validation pattern for the postal code, and, where applicable, the list of administrative areas (states, provinces, and so on).

Here’s what the formatting data looks like for a couple of different countries, and how that data is used to assemble the localized address form for that country.

United States

$format = [
  209 => [
      'format' => '%name\n%first_line\n%second_line\n%city, %state %zip\n%country_name',
      'required_fields' => [
          'name',
          'first_line',
          'city',
          'state',
          'zip',
      ],
      'uppercase_fields' => [
          'city',
          'state',
      ],
      'name' => 'UNITED STATES',
      'administrative_area_type' => 'state',
      'locality_type' => 'city',
      'postal_code_type' => 'zip',
      'postal_code_pattern' => '(\\d{5})(?:[ \\-](\\d{4}))?',
      'administrative_areas' => [
          'AL' => 'Alabama',
          'AK' => 'Alaska',
          ...
          'WI' => 'Wisconsin',
          'WY' => 'Wyoming',
      ],
      'iso_code' => 'US',
  ]
];

 

Germany

$format = [
  91 => [
    'format' => '%name\n%first_line\n%second_line\n%zip %city\n%country_name',
    'required_fields' => [
        'name',
        'first_line',
        'city',
        'zip',
    ],
    'uppercase_fields' => [
        'city',
    ],
    'name' => 'GERMANY',
    'locality_type' => 'city',
    'postal_code_type' => 'postal',
    'postal_code_pattern' => '\\d{5}',
    'input_format' => '%name\\n%first_line\\n%second_line\\n%zip\\n%city\\n%country_name',
    'iso_code' => 'DE',
  ],
];

 

So, now we had all this great information on what addresses were supposed to look like for almost 200 countries. How did we take this data and turn it into a localized address experience?

 

Building a Localized Address Experience

A complete localized address experience requires two components: address input and address display. In other words, our members need to be able to add and edit their addresses using a form that makes sense to them, and they need to see their address displayed in a format that they’re familiar with.

Address Input

You’ve already seen what our unlocalized address form looked like, but here’s a quick reminder of what German members were seeing when they were entering their addresses.

 

This was a static form, meaning we had a big template with a bunch of <input> tags, and a little bit of JavaScript to handle interactions with the form. For a few select countries, like Canada and Australia, we added conditional statements to the template, swapping in different state or province lists as necessary. It made for a pretty messy template.

When deciding how we wanted to handle address forms, we knew that we didn’t want to have a template with enough conditional statements to handle hundreds of different address formats. Instead, we decided on a compositional approach.

Every address form starts with a country <select> input. This prompts the member to select their country first, so we can render the localized form before they start entering their address. We identified all the possible fields that could be in an address form: first_line, second_line, city, state, zip, and country, and recognized that all these fields could be rendered using just a few generic templates. These templates would allow us to specify custom labels, indicate whether or not the field was required, display validation errors, and render other custom content by providing different data when we render the template for each field.   

Text Input

A pretty basic text input can be used for the first line, second line, city, and zip address fields, as well as the state field, depending on the country. Here’s what our text input template looks like:

 

State Select Input

For the countries for which we have state (aka administrative area) data, we created a select input template:

With these templates, and the appropriate address formatting data, we can generate address input forms for almost 200 countries.

 

Address Display

Displaying localized addresses was also handled by a static template before this project. It was based on the US address format, and was written with the assumption that all addresses had the same fields as US addresses. It looked something like this:

<p>{{name}}</p>
<p>{{first_line}}</p>
<p>{{second_line}}</p>
<p>{{city}}, {{state}} {{zip}}</p>
<p>{{country_name}}</p>

While this wasn’t as problematic as the way we were handling address input, it was still not ideal. Addresses for international members would be displayed incorrectly, causing varying levels of confusion.

For German members, the difference wasn’t too bad:

 

But for members in countries like Japan, the difference was pretty significant:

 

To localize address display, we went with a compositional approach again, treating each field as a separate piece, and then combining them in the order specified, and using the delimiters indicated by the format string.

<span class="name">Ets Y. Crafter</span><br>
<span class="first-line">123 Main Street</span><br/>
<span class="zip">12345</span> <span class="city">BERLIN</span><br/>
<span class="country-name">Germany</span><br/>

We further enhanced our address display library by creating a PHP class that could render a localized address in plaintext, or fully customizable HTML, to support the numerous ways addresses are displayed throughout Etsy and our internal tools.

Conclusion

No more confusing address forms! While we’re nowhere near finished with localized addresses, we’ve made really great progress so far. We’re hopeful that our members will enjoy their experience just a little bit more now that they have fewer concerns when it comes to addresses. There is a lot more that we learned from this project (like how we replaced the unlocalized address forms with the localized address form on the entire site!), so keep an eye out for future blog posts. Thanks for reading!


Modeling User Journeys via Semantic Embeddings

Posted by on July 12, 2018 / No Responses

Etsy is a global marketplace for unique goods. This means that as soon as an item becomes popular, it runs the risk of selling out. Machine learning solutions that simply memorize popular items are therefore not as effective, and crafting features that generalize well across the items in our inventory is important. In addition, some content features, such as titles, are not always informative for us, since they are seller-provided and can be noisy.

In this blog post, I will cover a machine learning technique we are using at Etsy that allows us to extract meaning from our data without using content features like titles, by modeling only the user journeys across the site. This post assumes an understanding of machine learning concepts, specifically word2vec.

What are embeddings?

Word2vec is a popular method in natural language processing for learning word representations from an unlabelled body of text and discovering similarities across the words in a corpus. It does this by relating the co-occurrence of words, and it relies on the assumption that words that appear together are more related than words that are far apart.

This same method can be used to model user interactions on Etsy by treating aggregate user journeys as sequences of user actions. Each user session is analogous to a sentence, and each user action (clicking on an item, visiting a shop’s home page, issuing a search query) is analogous to a word, in NLP word2vec parlance. Modeling interactions this way allows us to represent items and other entities (shops, locations, users, queries) as low-dimensional continuous vectors (semantic embeddings), where the similarity between two vectors represents their co-relatedness. The method can be used without knowing anything about any particular user.

Semantic embeddings are agnostic to the content of items such as their titles, tags, descriptions, and allow us to leverage aggregate user interactions on the site to extract items that are semantically similar. In addition, they give us the ability to embed our search queries, items, shops, categories, and locations in the same vector space. This leads to better featurization and candidate selection across multiple machine learning problems, and provides compression, which drastically improves inference speeds compared to representing them as one-hot encodings. Modeling user journeys as a sequence of actions gives us information that is different from content-based methods that leverage descriptions and titles of items, and so these methods can be used in conjunction.

We have already productionized these embeddings for product recommendations and guided search experiences, and they show great promise for our ranking algorithms as well. Outside of Etsy, similar semantic embeddings have been used at Yahoo to learn representations for delivering ads as product recommendations via email and for matching relevant ads to queries, and at Airbnb to improve search ranking and to derive similar listings for recommendations.

Approach

Etsy has over 50 million active items listed on the site from over 2 million sellers, and tens of millions of unique search queries each month. This amounts to billions of tokens (items or user actions – the equivalent of words in NLP word2vec) for training. We were able to train embeddings on a single box, but we quickly ran into limitations when modeling a sequence of user interactions with a naive word2vec model: the resulting embeddings did not give us satisfactory performance. This convinced us that some extensions to the standard word2vec implementation were necessary, so we extended the model with the additional signals discussed below.

Skip-gram model and extensions

We initially started training the embeddings as a Skip-gram model with negative sampling (NEG, as outlined in the original word2vec paper). The Skip-gram model performs better than the Continuous Bag Of Words (CBOW) model for larger vocabularies. It models the context given a target token, attempting to maximize the average likelihood of seeing any of the context tokens given that target token. Negative sampling draws negative tokens from the entire corpus with a probability that is directly proportional to how frequently each token appears in the corpus.

Training a Skip-gram model on only randomly selected negatives, however, ignores implicit contextual signals that we have found to be indicative of user preference in other contexts. For example, if a user clicks on the second item for a search query, the user most likely saw, but did not like, the first item that showed up in the search results. We extend the Skip-gram loss function by appending these implicit negative signals to the Skip-gram loss directly.

Similarly, we consider the purchased item in a particular session to be a global contextual token that applies to the entire sequence of user interactions. The intuition behind this is that there are many touch points on the user’s journey that help them come to the final purchase decision, and so we want to share the purchase intent across all the different actions that they took. This is also referred to as the linear multi-touch attribution model.

In addition, we want to be able to give a user journey that ended in a purchase more importance in the model. We define an importance weight per user interaction (click, dwell, add to cart, and purchase) and incorporate this into our loss function as well.
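Schematically, the contribution of a single (target, context) pair to our loss looks something like the standard skip-gram negative-sampling objective with the extensions above bolted on (this is a sketch; see the paper for the exact formulation):

-w_s \left[ \log\sigma(v_c \cdot v_t) + \sum_{n \in N_{rand}} \log\sigma(-v_n \cdot v_t) + \sum_{m \in N_{impl}} \log\sigma(-v_m \cdot v_t) + \log\sigma(v_p \cdot v_t) \right]

where v_t is the target token’s vector, v_c a context token, N_rand the randomly drawn negatives, N_impl the implicit negatives (e.g. impressed-but-skipped search results), v_p the purchased item treated as a global context token, and w_s the importance weight of the interaction.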

The details of how we extended Skip-gram are outside the scope of this post but can be found in detail in the Scalable Semantic Matching paper.

Training

We aim to learn a vector representation for each unique token, where a token can be a listing id, shop id, query, category, or anything else that is part of a user’s interaction. We were able to train embeddings of up to 100 dimensions on a single box. Our final models take in billions of tokens and produce embeddings for tens of millions of unique tokens.

A user action can be broadly defined as any sort of explicit or implicit engagement of the user with the product. We extract user interactions from multiple sources, such as the search, category, market, and shop home pages; these interactions are aggregated and not tied to a particular user.

The model performed significantly better when we thresholded tokens based on their type. For example, the frequency count and distribution for queries tend to be very different from those of items or shops. User queries are unbounded and have a very long tail, and there are orders of magnitude more of them than there are shops. So we want to capture all of the shops in the embedding vector space, whereas we limit queries and items based on a frequency cutoff.

We also found a significant improvement in performance by training the model on the past year’s data for the current and upcoming month, which adds some forecasting capability. For example, for a model serving production in December, we also added the previous year’s December and January data, so the model would see more interactions related to Christmas during that time.

Training application-specific models gave us better performance. For example, if we are interested in capturing shop-level embeddings, training directly on the shop behind each item yields better performance than averaging the embeddings of all the items from a particular shop. We are actively experimenting with these models and plan to incorporate user- and session-specific data in the future.

Results

These are some interesting highlights of what the semantic embeddings are able to capture:

Note that all of these relations are captured without the model being fed any content features. The results below are the embeddings filtered to just search queries and projected in TensorBoard.

This first set of query similarities captures many different animals for the query jaguar. The second set shows that the model is also able to relate terms across different languages: montre is watch in French, armbanduhr is wristwatch in German, horloge is clock in French, orologio da polso is wristwatch in Italian, uhren is again watches in German, and reloj is watch in Spanish.


Estate pipe signifies tobacco pipes that have been previously owned. Here, we find the model able to identify the different materials pipes are made from (briar, corn cob, meerschaum) and different manufacturers (Dunhill and Peterson), and it surfaces accessories that are relevant to this particular type of pipe (pipe tamper) while showing no correlation with glass pipes, which are not valid in this context. Content-based methods have not been very effective at dealing with this. The embeddings are also able to capture different styles, with boho, gypsy, hippie, and gypsysoul all being styles related to bohemian.

 

We found semantic embeddings to also provide better similar items to a particular item compared to a candidate set generation model that is based on content. This example comes from a model we released recently to generate similar items across shops.

For an item that is a cookie with a steer-and-cactus design, we see the previous method latch onto the term ‘steer’ and ignore ‘cactus’, whereas the semantic embeddings place significance on cookies. We find that this has the advantage of not having to guess the importance of a particular term, and lets us rely on user engagement to guide us.

These candidates are generated by a k-nn search across the semantic representations of items. We were able to run state-of-the-art recall algorithms, unconstrained by memory, on our training boxes themselves.
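As a toy illustration of the idea (this is not our production recall system, and it is in Kotlin purely for illustration), a brute-force cosine-similarity k-NN over an embedding table looks like this:

import kotlin.math.sqrt

fun cosine(a: FloatArray, b: FloatArray): Float {
    var dot = 0f; var normA = 0f; var normB = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (sqrt(normA) * sqrt(normB))
}

// embeddings maps a token (listing id, query, shop id, ...) to its trained vector
fun nearestNeighbors(query: FloatArray, embeddings: Map<String, FloatArray>, k: Int): List<Pair<String, Float>> =
    embeddings
        .map { (token, vector) -> token to cosine(query, vector) }
        .sortedByDescending { it.second }
        .take(k)

At tens of millions of tokens you would shard or approximate this search, but the candidate-generation idea is the same.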

We are excited about the variety of different applications of this model ranging from personalization to ranking to candidate set selection. Stay tuned!

This work is a collaboration between Xiaoting Zhao and Nishan Subedi from the Search Ranking team. We would like to thank our manager, Liangjie Hong, for insightful discussions and support; the Recommendation Systems and Search Ranking teams for their input during the project, especially Raphael Louca and Adam Henderson for launching products based on these models; Stan Rozenraukh, Allison McKnight and Mohit Nayyar for reviewing this post; and Mihajlo Grbovic, lead author of the semantic embeddings paper, for detailed responses to our questions.


Deploying to Google Kubernetes Engine

Posted by on June 5, 2018 / 2 Comments

Late last year, Etsy announced that we’ll be migrating our services out of self-managed data centers and into the cloud. We selected Google Cloud Platform (GCP) as our cloud provider and have been working diligently to migrate our services. Safely and securely migrating services to the cloud requires them to live in two places at once (on-premises and in the cloud) for some period of time.

In this article, I’ll describe our strategy specifically for deploying to a pair of Kubernetes clusters: one running in the Google Kubernetes Engine (GKE) and the other on-premises in our data center. We’ll see how Etsy uses Jenkins to do secure Kubernetes deploys using authentication tokens and GCP service accounts. We’ll learn about the challenge of granting fine-grained GKE access to your service accounts and how Etsy solves this problem using Terraform and Helm.

Deploying to On-Premises Kubernetes

Etsy, while new to the Google Cloud Platform, is no stranger to Kubernetes. We have been running our own Kubernetes cluster inside our data center for well over a year now, so we already have a partial solution for deploying to GKE, given that we have a system for deploying to our on-premises Kubernetes.

Our existing deployment system is quite simple from the perspective of the developer currently trying to deploy: simply open up Deployinator and press a series of buttons! Each button is labeled with its associated deploy action, such as “build and test” or “deploy to staging environment.”

Under the hood, each button is performing some action, such as calling out to a bash script or kicking off a Jenkins integration test, or some combination of several such actions.

For example, the Kubernetes portion of a Search deploy calls out to a Jenkins pipeline, which subsequently calls out to a bash script to perform a series of “docker build”, “docker tag”, “docker push”, and “kubectl apply” steps.

Why Jenkins, then? Couldn’t we perform the docker/kubectl actions directly from Deployinator?

The key is in… the keys! In order to deploy to our on-premises Kubernetes cluster, we need a secret access token. We load the token into Jenkins as a “credential” such that it is stored securely (not visible to Jenkins users), but we can easily access it from inside Jenkins code.

Now, deploying to Kubernetes is a simple matter of looking up our secret token via Jenkins credentials and overriding the “kubectl” command to always use the token.

Our Jenkinsfile for deploying search services looks something like this:

All of the deploy.sh scripts above use environment variable $KUBECTL in place of standard calls to kubectl, and so by wrapping everything in our withKubernetesEnvs closure, we have ensured that all kubectl actions are using our secret token to authenticate with Kubernetes.

Declarative Infrastructure via Terraform

Deploying to GKE is a little different from deploying to our on-premises Kubernetes cluster, and one of the major reasons is our requirement that everything in GCP be provisioned via Terraform. We want to be able to declare each GCP project and all its resources in one place so that our setup is automatable and reproducible. We want it to be easy—almost trivial—to recreate our entire GCP setup again from scratch. Terraform allows us to do just that.

We use Terraform to declare every possible aspect of our GCP infrastructure. Keyword: possible. While Terraform can create our GKE clusters for us, it cannot (currently) create certain types of resources inside of those clusters. This includes Kubernetes resources which might be considered fundamental parts of the cluster’s infrastructure, such as roles and rolebindings.

Access Control via Service Accounts

Among the objects that are currently Terraformable: GCP service accounts! A service account is a special type of Google account which can be granted permissions like any other user, but is not mapped to an actual user. We typically use these “robot accounts” to grant permissions to a service so that it doesn’t have to run as any particular user (or as root!).

At Etsy, we already have “robot deployer” user accounts for building and deploying services to our data center. Now we need a GCP service account which can act in the same capacity.

Unfortunately, GCP service accounts only (currently) provide us with the ability to grant complete read/write access to all GKE clusters within the same project. We’d like to avoid that! We want to grant our deployer only the permissions that it needs to perform the deploy to a single cluster. For example, a deployer doesn’t need the ability to delete Kubernetes services—only to create or update them.

Kubernetes provides the ability to grant more fine-grained permissions via role-based access control (RBAC). But how do we grant that kind of permission to a GCP service account?

We start by giving the service account very minimal read-only access to the cluster. The service account section of the Terraform configuration for the search cluster looks like this:

We have now created a service account with read-only access to the GKE cluster. Now how do we associate it with the more advanced RBAC inside GKE? We need some way to grant additional permissions to our deployer by using a RoleBinding to associate the service account with a specific Role or ClusterRole.

Solving RBAC with Helm

While Terraform can’t (yet) create the RBAC Kubernetes objects inside our GKE cluster, it can be configured to call a script (either locally or remotely) after a resource is created.

Problem solved! We can have Terraform create our GKE cluster and the minimal deployer service account, then simply call a bash script which creates all the Namespaces, ClusterRoles, and RoleBindings we need inside that cluster. We can bind a role using the service account’s email address, thus mapping the service account to the desired GKE RBAC role.

However, as Etsy has multiple GKE clusters which all require very similar sets of objects to be created, I think we can do better. In particular, each cluster will require service accounts with various types of roles, such as “cluster admin” or “deployer”. If we want to add or remove a permission from the deployer accounts across all clusters, we’d prefer to do so by making the change in one place, rather than modifying multiple scripts for each cluster.

Good news: there is already a powerful open source tool for templating Kubernetes objects! Helm is a project designed to manage configured packages of Kubernetes resources called “charts”.

We created a Helm chart and wrote templates for all of the common resources that we need inside GKE. For each GKE cluster, we have a yaml file which declares the specific configuration for that cluster using the Helm chart’s templates.

For example, here is the yaml configuration file for our production search cluster:

And here are the templates for some of the resources used by the search cluster, as declared in the yaml file above (or by nested references inside other templates)…

When we are ready to apply a change to the Helm chart—or Terraform is applying the chart to an updated GKE cluster—the script which applies the configuration to the GKE cluster does a simple “helm upgrade” to apply the new template values (and only the new values! Helm won’t do anything where it detects that no changes are needed).

Integrating our New System into the Pipeline

Now that we have created a service account which has exactly the permissions we require to deploy to GKE, we only have to make a few simple changes to our Jenkinsfile in order to put our new system to use.

Recall that we had previously wrapped all our on-premises Kubernetes deployment scripts in a closure which ensured that all kubectl commands use our on-premises cluster token. For GKE, we use the same closure-wrapping style, but instead of overriding kubectl to use a token, we give it a special kube config which has been authenticated with the GKE cluster using our new deployer service account. As with our secret on-premises cluster token, we can store our GCP service account key in Jenkins as a credential and then access it using Jenkins’ withCredentials function.

Here is our modified Jenkinsfile for deploying search services:

And there you have it, folks! A Jenkins deployment pipeline which can simultaneously deploy services to our on-premises Kubernetes cluster and to our new GKE cluster by associating a GCP service account with GKE RBAC roles.

Migrating a service from on-premises Kubernetes to GKE is now (in simple cases) as easy as shuffling a few lines in the Jenkinsfile. Typically we would deploy the service to both clusters for a period of time and send a percentage of traffic to the new GKE version of the service under an A/B test. After concluding that the new service is good and stable, we can stop deploying it on-premises, although it’s trivial to switch back in an emergency.

Best of all: absolutely nothing has changed from the perspective of the average developer looking to deploy their code. The new logic for deploying to GKE remains hidden behind the Deployinator UI and they press the same series of buttons as always.

Thanks to Ben Burry, Jim Gedarovich, and Mike Adler who formulated and developed the Helm-RBAC solution with me.


The EventHorizon Saga

Posted by on May 29, 2018 / No Responses

This is an epic tale of EventHorizon, and how we finally got it to a happy place.

EventHorizon is a tool we use to watch events streaming into our system. Events (also known as beacons) are basically clickstream data—a record of actions visitors take on our site, what content they saw, what experiments they were bucketed into, etc.

Events are sent primarily from our web & API servers (backend events) and web browsers (frontend events), and logged in Kafka. EventHorizon is primarily used as a debugging tool, to make sure that a new event you’ve added is working as expected, but also serves to monitor the health of our event system.

Screenshot of the EventHorizon web page, showing events streaming in

EventHorizon UI

EventHorizon is pretty simple; it’s only around 200 lines of Go code. It consumes messages from our main event topic on Kafka (“beacon-main”) and forwards them via WebSockets to any connected browsers. Ideally, the time between when an event is fired on the web site and when it appears in the EventHorizon UI is on the order of milliseconds.

EventHorizon has been around for years, and in the early days, everything was hunky-dory. But then, the lagging started.

Nobody Likes a Laggard

As with most services at Etsy, we have lots of metrics we monitor for EventHorizon. One of the critical ones is consumer lag, which is the age of the last message we’ve read from the beacon-main Kafka topic. This would normally be milliseconds, but occasionally it would start lagging into minutes or even hours.

Graph showing EventHorizon consumer lag increasing from zero to over 1.7 hours, then back down again

EventHorizon Consumer Lag

Sometimes it would recover on its own, but if not, restarting EventHorizon would often fix the problem, but only temporarily. Within anywhere from a few hours to a few weeks we’d notice lag time beginning to grow again.

We spent a lot of time poring over EventHorizon’s code, thinking we had a bug somewhere. It makes use of Go’s channels—perhaps there was a subtle concurrency bug that was causing a deadlock? We fiddled with that, but it didn’t help.

We noticed that we could sometimes trigger the lag if two users connected to EventHorizon at the same time. This clue led us to think that there was a bug somewhere in the code that sent events to the browsers. Something with Websockets? We considered rewriting it to use Server-sent Events, but never got around to that.

We also wondered if the sheer quantity of events we were sending to browsers was causing the problem. We updated EventHorizon to only send events from Etsy employees to browsers in production. Alas, the lag didn’t go away—although it seemed to have gotten a little better.

We eventually moved onto other things. We set up a Nagios alert for when EventHorizon started lagging, and just restarted it when it got bad. Since it would often be fine for 2-3 weeks before lagging, spending more time trying to fix it wasn’t a top priority.

Orloj Joins the Fray

In September 2017 EventHorizon lag had gotten really bad. We would restart it and it would just start lagging again immediately. At some point we even turned off the Nagios alert.

However, another system, Orloj (pronounced “OR-loy”, named after the Prague astronomical clock), had started experiencing lag as well. Orloj is another Kafka consumer, responsible for updating the Recently Viewed Listings that are shown on the homepage when you are signed in. As Orloj is a production system, figuring out what was happening became much more urgent.

Orloj’s lag was a little different: lag would spike once an hour, whenever the Hadoop job that pulls beacons down from Kafka into HDFS ran, and at certain times of the day it would be quite significant.

Graph showing the Orloj service lagging periodically

Orloj Periodic Lag

It turned out that due to a misconfiguration, KafkaPullJob, which was only supposed to launch one mapper per Kafka partition (of which we have 144 for beacon-main), was actually launching 400 mappers, which was swamping the network. We fixed this, and Orloj was happy again.

For about a week.

Trouble with NICs

Orloj continued to have issues with lag. While digging into this, I realized that the machines in the Kafka clusters only had 1G network interfaces (NICs), whereas 10G NICs were standard in most of our infrastructure. I talked to our networking operations team to ask about upgrading the cluster and one of the engineers asked what was going on with one particular machine, kafkautil01. The network graph showed that its bandwidth was pegged at 100%, and had been for a while. kafkautil01 also had a 1G NIC. And that’s where EventHorizon ran.

A light bulb exploded over my head.

When I relayed this info to Kevin Gessner, the engineer who wrote Orloj, he said “Oh yeah, consuming beacon-main requires at least 1.5 Gbps.” Suddenly it all made sense.

Beacon traffic fluctuates in proportion to Etsy site traffic, which is cyclical. Parts of the day were under 1 Gbps, parts over, and when it went over, EventHorizon couldn’t keep up and would start lagging. And we were going over more and more often as Etsy grew.

And remember the bit about two browsers connecting at once triggering lag? With EventHorizon forwarding the firehose of events to each browser, that was also a good way to push the network bandwidth over 1 Gbps, triggering lag.

We upgraded the Kafka clusters and the KafkaUtil boxes to 10G NICs and everything was fixed. No more lag!

Ha, just kidding.

Exploding Events

We did think it was fixed for a while, but EventHorizon and Orloj would both occasionally lag a bit, and it seemed to be happening more frequently.

While digging into the continuing lag, we discovered that the size of events had grown considerably. Looking at a graph of event sizes, there was a noticeable uptick around the end of August.

Graph showing the average size of event beacons increasing from around 5K to 7K in the period of a couple months

Event Beacon Size Increase

This tied into problems we were having with capacity in our Hadoop cluster—larger events mean longer processing times for nearly every job.

Inspecting event sizes showed some clear standouts. Four search events were responsible for a significant portion of all event bandwidth. The events were on the order of 50KB each, about 10x the size of a “normal” event. The culprit was some debugging information that had been added to the events.

The problem was compounded by something that has been part of our event pipeline since the beginning: we generate complementary frontend events for each backend primary event (a “primary event” is akin to a page view) to capture browser-specific data that is only available on the frontend, and we do it by first making a copy of the entire event and then adding the frontend attributes. Later, when we added events for tracking page performance metrics, we did the same thing. These complementary events don’t need all the custom attributes of the original, so this is a lot of wasted bandwidth. So we stopped doing that.

Between slimming down the search events, not copying attributes unnecessarily, and finding a few more events that could be trimmed down, we managed to bring down the average event size, as well as event volume, considerably.

Graph showing the event beacon message size dropping from 7K to 3.5K in the space of a week

Event Beacon Size Decrease

Nevertheless, the lag persisted.

The Mysterious Partition 20

Orloj was still having problems, but this time it was a little different. The lag seemed to be happening only on a single partition, 20. We looked to see if the broker that was the leader for that partition was having any problems, and couldn’t see anything. We did see that it was serving a bit more traffic than other brokers, though.

The first thing that came to mind was a hot key. Beacons are partitioned by a randomly-generated string called a “browser_id” that is unique to a client (browser, native device, etc.) hitting our site. If there’s no browser_id, as is the case with internal API calls, it gets assigned to a random partition.

I used a command-line Kafka consumer to try to diagnose. It has an option for only reading from a single partition. Here I sampled 100,000 events from partitions 20 and 19:

Partition 20

$ go run cmds/consumer/consumer.go -ini-files config.ini,config-prod.ini -topic beacon-main -partition 20 -value-only -max 100000 | jq -r '[.browser_id[0:6],.user_agent] | @tsv' | sort | uniq -c | sort -rn | head -5
    558 orcIq5  Dalvik/2.1.0 (Linux; U; Android 7.0; SAMSUNG-SM-G935A Build/NRD90M) Mobile/1 EtsyInc/4.77.0 Android/1
    540 null    Api_Client_V3/Bespoke_Member_Neu_Orchestrator
    400 ArDkKf  Dalvik/2.1.0 (Linux; U; Android 8.0.0; Pixel XL Build/OPR3.170623.008) Mobile/1 EtsyInc/4.78.1 Android/1
    367 hK8GHc  Dalvik/2.1.0 (Linux; U; Android 7.0; SM-G950U Build/NRD90M) Mobile/1 EtsyInc/4.75.0 Android/1
    366 EYuogd  Dalvik/2.1.0 (Linux; U; Android 7.0; SM-G930V Build/NRD90M) Mobile/1 EtsyInc/4.77.0 Android/1

Partition 19

$ go run cmds/consumer/consumer.go -ini-files config.ini,config-prod.ini -topic beacon-main -partition 19 -value-only -max 100000 | jq -r '[.browser_id[0:6],.user_agent] | @tsv' | sort | uniq -c | sort -rn | head -5
    570 null    Api_Client_V3/Bespoke_Member_Neu_Orchestrator
    506 SkHj7N  Dalvik/2.1.0 (Linux; U; Android 7.0; LG-LS993 Build/NRD90U) Mobile/1 EtsyInc/4.78.1 Android/1
    421 Jc36zw  Dalvik/2.1.0 (Linux; U; Android 7.0; SM-G930V Build/NRD90M) Mobile/1 EtsyInc/4.78.1 Android/1
    390 A586SI  Dalvik/2.1.0 (Linux; U; Android 8.0.0; Pixel Build/OPR3.170623.008) Mobile/1 EtsyInc/4.78.1 Android/1
    385 _rD1Uj  Dalvik/2.1.0 (Linux; U; Android 7.0; SM-G935P Build/NRD90M) Mobile/1 EtsyInc/4.77.0 Android/1

I couldn’t see any pattern, but did notice we were getting a lot of events from the API with a null browser_id. These appeared to be distributed evenly across partitions, though.

We were seeing odd drops and spikes in the number of events going to partition 20, so I thought I’d see if I could dump events from around those times, and I started digging into our beacon consumer command-line tool to do that. In the process, I came across the big discovery: the -partition flag I had been relying on wasn’t actually hooked up to anything. So I was never consuming from a particular partition, but from all partitions. Once I fixed this, the problem was obvious:

Partition 20

$ go run cmds/consumer/consumer.go -ini-files config.ini,config-prod.ini -topic beacon-main -q -value-only -max 10000 -partition 20 | jq -r '[.browser_id[0:6],.user_agent] | @tsv' | sort | uniq -c | sort -nr | head -5
   8268 null    Api_Client_V3/Bespoke_Member_Neu_Orchestrator
    335 null    Api_Client_V3/BespokeEtsyApps_Public_Listings_Offerings_FindByVariations
    137 B70AD9  Mozilla/5.0 (iPhone; CPU iPhone OS 11_1_1 like Mac OS X) AppleWebKit/604.3.5 (KHTML, like Gecko) Mobile/15B150 EtsyInc/4.78 rv:47800.37.0
     95 C23BB0  Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_3 like Mac OS X) AppleWebKit/603.3.8 (KHTML, like Gecko) Mobile/14G60 EtsyInc/4.78 rv:47800.37.0
     83 null    Api_Client_V3/Member_Carts_ApplyBestPromotions

and another partition for comparison:

$ go run cmds/consumer/consumer.go -ini-files config.ini,config-prod.ini -topic beacon-main -q -value-only -max 10000 -partition 0 | jq -r '[.browser_id[0:6],.user_agent] | @tsv' | sort | uniq -c | sort -nr | head -5
   1074 dtdTyz  Dalvik/2.1.0 (Linux; U; Android 7.0; VS987 Build/NRD90U) Mobile/1 EtsyInc/4.78.1 Android/1
    858 gFXUcb  Dalvik/2.1.0 (Linux; U; Android 7.0; XT1585 Build/NCK25.118-10.2) Mobile/1 EtsyInc/4.78.1 Android/1
    281 C380E3  Mozilla/5.0 (iPhone; CPU iPhone OS 11_0_3 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Mobile/15A432 EtsyInc/4.77 rv:47700.64.0
    245 E0464A  Mozilla/5.0 (iPhone; CPU iPhone OS 11_0_3 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Mobile/15A432 EtsyInc/4.78 rv:47800.37.0
    235 BAA599  Mozilla/5.0 (iPhone; CPU iPhone OS 11_0_3 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Mobile/15A432 EtsyInc/4.78 rv:47800.37.0

All the null browser_ids were going to partition 20. But how could this be? They’re supposed to be random.

I bet a number of you are slapping your foreheads right now, just like I did. I wrote this test of the hashing algorithm: https://play.golang.org/p/MJpmoPATvO
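The gist of that test, rendered here in Kotlin rather than Go, and with a stand-in hash function (Kafka’s real partitioner uses murmur2, but the principle is the same):

import kotlin.random.Random

// A missing browser_id gets a random partition; a present one is hashed deterministically.
fun partitionFor(browserId: String?, numPartitions: Int): Int =
    if (browserId == null) Random.nextInt(numPartitions)
    else (browserId.hashCode() and Int.MAX_VALUE) % numPartitions  // mask keeps the hash non-negative

fun main() {
    repeat(3) { println(partitionFor(null, 144)) }    // spread across random partitions
    repeat(3) { println(partitionFor("null", 144)) }  // the string "null" lands on the same partition every time
}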

Yes, the browser_ids I was thinking were null were actually the string “null”, which is what was getting sent in the JSON event. I put in a fix for this, and:

Graph showing the number of events per second being processed by Orloj by partition, with partition 20 dropping significantly

Orloj Partition 20 Baumgartner

Lessons

I’m not going to attempt to draw any deep conclusions from this sordid affair, but I’ll close with some advice for when you find yourself with a persistent confounding problem like this one.

Graph everything. Then add some more graphs. Think about what new metrics might be helpful in diagnosing your issue. Kevin added the above graph showing Orloj events/sec by partition, which was critical to realizing there was a hot key issue.

If something makes no sense, think about what assumptions you’ve been making in diagnosing it, and check those assumptions. Kevin’s graph didn’t line up with what I was seeing, so I dug deeper into the command-line consumer and found the problem with the -partition flag.

Talk to people. Almost every small victory along the way came after talking with someone about the problem, and getting some crucial insight and perspective I was missing.

Keep at it. As much as it can seem otherwise at times, computers are deterministic, and with persistence, smart colleagues, and maybe a bit of luck, you’ll figure it out.

Epilogue

On November 13 Cloudflare published a blog post on tuning garbage collection in Go. EventHorizon and Orloj both spent a considerable percentage of time (nearly 10%) doing garbage collection. By upping the GC threshold for both, we saw a massive performance improvement:

Graph showing the GC pause time per sec dropping from 75+ ms to 592 µs

Graph showing the message processing time dropping significantly

Graph showing the EventHorizon consumer lag dropping from 25-125ms to near zero

Except for a couple brief spikes during our last Kafka upgrade, EventHorizon hasn’t lagged for more than a second since that change, and the current average lag is 2 ms.

Thanks to Doug Hudson, Kevin Gessner, Patrick Cousins, Rebecca Sliter, Russ Taylor, and Sarah Marx for their feedback on this post. You can follow me on Twitter at @bgreenlee.


Sealed classes opened my mind

Posted by on April 12, 2018 / 3 Comments

How we use Kotlin to tame state at Etsy

Bubbly the Baby Seal by SmartappleCreations

Bubbly the Baby Seal by SmartappleCreations

Etsy’s apps have a lot of state.  Listings, tags, attributes, materials, variations, production partners, shipping profiles, payment methods, conversations, messages, snippets, lions, tigers, and even stuffed bears.  

All of these data points form a matrix of possibilities that we need to track.  If we get things wrong, then our buyers might fail to find a cat bed they wanted for Whiskers.  And that’s a world I don’t want to live in.

There are numerous techniques out there that help manage state.  Some are quite good, but none of them fundamentally changed our world quite like Kotlin’s sealed classes.  As a bonus, almost any state management architecture can leverage the safety of Kotlin’s sealed classes.

What are sealed classes?

The simplest definition for sealed class is that they are “like enums with actual types.”  Inspired by Scala’s sealed types, Kotlin’s official documentation describes them thus:

Sealed classes are used for representing restricted class hierarchies, when a value can have one of the types from a limited set, but cannot have any other type. They are, in a sense, an extension of enum classes: the set of values for an enum type is also restricted, but each enum constant exists only as a single instance, whereas a subclass of a sealed class can have multiple instances which can contain state.

A precise, if a bit wordy, definition.  Most importantly: they are restricted hierarchies, they represent a set, they are types, and they can contain values.  

Let’s say we have a class Result to represent the value returned from a network call to an API.  In our simplified example, we have two possibilities: a list of strings, or an error.
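A sketch of what that hierarchy might look like (the property names here are illustrative):

sealed class Result {
    data class Success(val items: List<String>) : Result()
    data class Error(val exception: Throwable) : Result()
}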

 

Now we can use Kotlin’s when expression to handle the result. The when expression is a bit like pattern matching. While not as powerful as pattern matching in some other languages, Kotlin’s when expression is one of the language’s most important features.  

Let’s try parsing the result based on whether it was a success or an error.
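Roughly like this, where showItems(items: List<String>) stands in for whatever renders the list:

fun showItems(items: List<String>) = items.forEach { println(it) }

fun handle(result: Result) {
    // Assigning the when expression to a val is what makes the compiler demand exhaustiveness.
    val outcome = when (result) {
        is Result.Success -> showItems(result.items)          // result is smart-cast to Result.Success
        is Result.Error -> result.exception.printStackTrace() // just log the failure
    }
}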

 

We’ve done a lot here in just a few lines, so let’s go over each aspect.  First, we were able to check the type of result using Kotlin’s is operator, which is equivalent to Java’s instanceof operator.  By checking the type, Kotlin was able to smart-cast the value of result for us in each case.  So if result is a success, we can access the value as if it is typed Result.Success, and we can pass items without any type casting to showItems(items: List<String>).  If the result was an error, we just print the stack trace to the console.

The next thing we did with the when expression is to exhaust all the possibilities for the Result sealed class type.  Typically a when expression must have an else clause.  However, in the above example, there are no other possible types for Result, so the compiler, and IDE, know we don’t need an else clause.  This is important as it is exactly the kind of safety we’ve been longing for.  Look at what happens when we add a Cancelled type to the sealed class:
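Sketching in the new type:

sealed class Result {
    data class Success(val items: List<String>) : Result()
    data class Error(val exception: Throwable) : Result()
    object Cancelled : Result()
}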

 

Now the IDE would show an error on our when statement because we haven’t handled the Cancelled branch of Result.

‘when’ expression must be exhaustive, add necessary ‘is Cancelled’ branch or ‘else’ branch instead.

 

The IDE knows we didn’t cover all our bases here.  It even knows which possible types result could be based on the Result sealed class.  Helpfully, the IDE offers us a quickfix to add the missing branches.  

There is one subtle difference here though.  Notice we assigned the when expression to a val. The requirement that a when expression must be exhaustive only kicks in if you use when as a value, a return type, or call a function on it.  This is important to watch out for as it can be a bit of a gotcha.  If you want to utilize the power of sealed classes and when, be sure to use the when expression in some way.

So that’s the basic power of sealed classes, but let’s dig a little deeper and see what else they can do.  

Weaving state

Let’s take a new example of a sealed class Yarn.
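Something of this shape will do; the second branch (Cotton here) is a stand-in, since the point is only that Yarn has exactly two direct subclasses:

sealed class Yarn {
    open class Wool : Yarn()  // open, so we can subclass it below
    class Cotton : Yarn()     // hypothetical second branch
}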

 

Then let’s create a new class called Merino and have it extend from Wool:
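Continuing the sketch:

class Merino : Yarn.Wool()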

 

What would be the effect of this on our when expression if we were branching on Yarn?  Have we actually created a new possibility that we must have a branch to accommodate?  In this case we have not, because a Merino is still considered part of Wool.  There are still only two branches for Yarn:
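For instance, continuing the sketch (the knitting projects are made up):

fun projectFor(yarn: Yarn): String = when (yarn) {
    is Yarn.Wool -> "a warm sweater"   // a Merino is still a Wool, so it lands in this branch
    is Yarn.Cotton -> "summer socks"
}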

 

So that didn’t really help us represent the hierarchy of wool properly.  But there is something else we can do instead. Let’s expand the example of Wool to include Merino and Alpaca types of wool and make Wool into a sealed class.
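Again as a sketch, keeping the hypothetical Cotton branch:

sealed class Yarn {
    sealed class Wool : Yarn() {
        class Merino : Wool()
        class Alpaca : Wool()
    }
    class Cotton : Yarn()
}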

 

Now our when expression will see each Wool subclass as unique branches that must be exhausted.
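For example, the earlier sketch now has to name every leaf:

fun projectFor(yarn: Yarn): String = when (yarn) {
    is Yarn.Wool.Merino -> "a soft merino sweater"
    is Yarn.Wool.Alpaca -> "an alpaca scarf"
    is Yarn.Cotton -> "summer socks"
}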

 

Only wool socks please

As our state grows in complexity, however, there are times when we may only be concerned with a subset of that state.  In Android it is common to have a custom view that perhaps only cares about the loading state. What can we do about that? If we use the when expression on Yarn, don’t we need to handle all cases?  Luckily, Kotlin’s stdlib comes to the rescue with a helpful extension function.

Say we are processing a sequence of events. Let’s create a fake sequence of all our possible states.
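For instance, something like:

val events: List<Yarn> = listOf(
    Yarn.Cotton(),
    Yarn.Wool.Merino(),
    Yarn.Cotton(),
    Yarn.Wool.Alpaca()
)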

 

Now let’s say that we’re getting ready to knit a really warm sweater. Maybe our KnitView is only interested in the Wool.Merino and Wool.Alpaca states. How do we handle only those branches in our when expression?  Kotlin’s Iterable extension function filterIsInstance to the rescue!  We can filter the sequence with a single line of code.
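Continuing the sketch:

val woolEvents = events.filterIsInstance<Yarn.Wool>()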

 

And now, like magic, our when expression only needs to handle Wool states.  So now if we want to iterate through the sequence we can simply write the following:
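For example:

woolEvents.forEach { wool ->
    when (wool) {
        is Yarn.Wool.Merino -> println("knit with merino")
        is Yarn.Wool.Alpaca -> println("knit with alpaca")
    }
}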

 

Meanwhile at Etsy

Like a lot of apps these days, we let our buyers and sellers log in via Facebook or Google in addition to email and password.  To help protect the security of our buyers and sellers, we also offer the option of Two Factor Authentication. Adding additional complexity, a lot of these steps have to pause and wait for user input.  Let’s look at the diagram of the flow:

So how do we model this with sealed classes?  There are 3 places where we make a call out to the network/API and await a result.  This is a logical place to start. We can model the responses as sealed classes.

First, we reach out to the server for the social sign in request itself:
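A sketch (the branch names here are purely illustrative; the real class mirrors our API’s actual responses):

sealed class SocialSignInResult {
    class Success : SocialSignInResult()
    class Error : SocialSignInResult()
}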

 

Next is the request to actually sign in:
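Again illustrative:

sealed class SignInResult {
    class Success : SignInResult()
    class TwoFactorRequired : SignInResult()  // hypothetical branch name
    class Error : SignInResult()
}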

 

Finally, we have the two factor authentication for those that have enabled it:
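And, sketched the same way:

sealed class TwoFactorResult {
    class Success : TwoFactorResult()
    class Error : TwoFactorResult()
}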

 

This is a good start but let’s look at what we have here.  The first thing we notice is that some of the states are the same.  This is when we must resist our urges and instincts to combine them.  While they look the same, they represent discrete states that a user can be in at different times. It’s safer, and more correct to think of each state as different.  We also want to prefer duplication over the wrong abstraction.

The next thing we notice here is that these states are actually all connected. In some ways we’re starting to approach a Finite State Machine (FSM).  While an FSM here might make sense, what we’ve really done is to define a very readable, safe way of modeling the state and that model could be used with any type of architecture we want. 

The above sealed classes are logical and match our 3 distinct API requests — ultimately, however, this is an imperfect representation.  In reality, there should be multiple instances of SignInResult for each branch of SocialSignInResult, and multiple instances of TwoFactorResult for each branch of SignInResult.

For us, three steps proved to be the right level of abstraction for the refactoring effort we were undertaking at the time.  In the future we may very well connect each and every branch in a single sealed class hierarchy: instead of three separate classes, we’d use a single class that represents the complete set of possible states.

 

Let’s take a look at what that would have looked like:

Note: to keep things simple I omitted the parameters  and used only regular subclasses
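A sketch of that single hierarchy (the state names are illustrative, and the nested grouping is just one way to arrange the steps):

sealed class SignInState {
    sealed class Social : SignInState() {
        class Success : Social()
        class Error : Social()
    }
    sealed class SignIn : SignInState() {
        class Success : SignIn()
        class TwoFactorRequired : SignIn()
        class Error : SignIn()
    }
    sealed class TwoFactor : SignInState() {
        class Success : TwoFactor()
        class Error : TwoFactor()
    }
}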

 

Our sealed class is now beginning to look a bit busier, but we have successfully modeled all the possible states for an Etsy buyer or seller during login.  And now we have a discrete set of types to represent each state, and a type-safe way to filter them using filterIsInstance<Type>().

This is pretty powerful stuff.  As you can imagine we could connect a custom view to a stream of events that filtered only on 2FA states.  Or we could connect a progress “spinner” that would hide itself on some other subset of the states.

These ways of representing state opened up a clean way to react with business logic or changes on the UI.  Moreover, we’ve done it in a type-safe, expressive, and fluent way.

So with a little luck, now you can login and get that cat bed for Whiskers!

Special thanks to Cameron Ketcham who wrote most of this code!
