Quality Matters: The Benefits of QA-Focused Retros

Posted on February 8, 2016

Retrospectives (retros) are meetings that take place after every major project milestone to help product teams at Etsy improve processes and outcomes in the next iteration of a project. The insights gained from project retros are invaluable to proactively mitigating problems in future projects while promoting continuous learning.

I am one of the managers on the Product Quality (PQ) team, which is Etsy’s centralized resource for manual QA. When manual QA was first introduced at Etsy, testers joined a team for the duration of the project but had limited opportunity to get objective feedback. Consequently, testers rarely saw the importance of their contributions or came to view themselves as true members of the product team they worked with. This lack of communication and continuous learning left our testers feeling less satisfied and less empowered than their peers in product engineering.

We decided to try QA-focused retros to surface feedback that would help us identify repeatable behaviors that contribute to consistently successful product launches. We were also interested in empowering Product Quality team members to understand how their specific contributions impact product launches and to take more ownership of their responsibilities.

Regularly scheduled QA retros have helped to promote mindfulness and accountability on the PQ team. Over time, they have solidified relationships with product owners, designers, and engineers and reinforced the sense that we are all working toward the same goal.

Here’s some information to help you run a QA-focused retro:

Meeting Goals

Sample Agenda

How Does It Work?

The retro should be scheduled after the initial product launch or any milestone that required a significant QA effort. Participants should receive the agenda ahead of time and be asked to come prepared with their thoughts on the main questions.  The QA engineer who led the testing effort should facilitate the retro by prompting attendees, in a round robin fashion, to weigh in on each agenda item.

The facilitator also participates by giving insights to the product team they partnered with for launch. This 30-minute activity is best if scheduled near the end of the day and can be augmented with snacks and beverages to create a relaxed atmosphere. The facilitator should record insights and report any interesting findings back to the QA team and the product team they worked with on the project.

Who Should Attend?

Participants should be limited to those who directly interacted with QA during development. These are usually:

What Happens Next?

Everyone on the Product Quality team reviews the takeaways from the retro and clarifies any questions with those who participated in the retro. We then make the applicable adjustments to processes prior to the next iteration of a project. All changes made to the PQ process are communicated to team members and product teams as needed.

Conclusions

QA-focused retros have empowered our team to approach testing as a craft that is constantly honed and developed. The meetings help product managers learn how to get the most out of working with the QA team and provide opportunities for product teams to participate in streamlining the QA process.


Leveling Up Your Organization With System Reviews

Posted on December 21, 2015

One of Etsy’s core engineering philosophies has been to push decisions to the edges wherever possible. Rather than making decisions dictatorially, we enable people to make their own choices given the feedback of a group. We use the system review format to tackle the organizational challenges that surface in a quickly growing company. A system review is a model that enables a large organization to take action on organizational and cultural issues before they become larger problems. System reviews are modeled after shared governance meetings – a disciplined yet inclusive format that enables organizational progress and a high level of transparency. This form of leadership values continued hard work, open communication, trust and respect.

This idea was introduced by our Learning and Development team, who among other things run our manager training programs, our feedback system, and our dens program (a vehicle for confidential, small-group discussions about leadership and challenges at work). A few years ago, we started bringing the engineering leadership group together on a recurring basis, but the agenda and outcomes of these meetings were unclear. We always had something to talk about and discuss, but it was difficult to move forward and address issues. We were looking for something better facilitated, and for ways to bring our engineering leadership team together with the clear outcome of helping solve some of our organizational challenges. L&D provided us with facilitation training and an overview of different meeting formats to use in different situations. Over time we’ve tested some of these new meeting types, and the system review is one of the many formats we’ve learned to apply. L&D coached us through facilitating the first series of these new formats in our meetings over the first several months. We’re extremely fortunate to have a team of smart, focused individuals who have a background in providing solutions to these types of problems. We are sharing these insights here for the benefit of anyone interested in the topic.

These meetings can work well for anything from small groups of 20 up to large groups of 300 people. They may take a few tries to get the hang of, but once you get into a rhythm it becomes an efficient format for surveying a large group and taking action on important issues. They should be held at a regular cadence, such as monthly or quarterly, so that they create a feedback loop: proactively raising new problems to address while reporting findings and potential solutions for previously discussed topics.

When we are reviewing system issues we are looking into the following things:

System review meetings are based around a specific prompt taken from these areas, for example “In what area of the engineering org do you feel there is currently a high degree of confusion or frustration?”.

What does the agenda look like?

This type of meeting needs to be timed and facilitated. The agenda is straightforward, but it’s important that the group observe the timer the facilitator maintains and respect that they are moving the meeting along within the confines of a one-hour timeframe. Facilitation is an art in itself, and there are a lot of resources out there to help with improving the technique. Generally the facilitator should not be emotionally invested in the topic, so they can remove themselves from the conversation and focus on moving the meeting forward.

The facilitator should set and communicate the agenda, make sure the room is set up to support the format, and ensure the appropriate materials are provided. They should set the stage for the meeting: let people know why they are there, give an overview of what they’ll be doing, and explain why it’s important. They should also manage the agenda and timing. The timing is somewhat flexible within the greater time frame and should be adjusted as necessary based on the discussions taking place. A conversation may run deeper than its time slot allows, and the facilitator may decide on the fly that it is important enough to cut time from another part of the meeting. However, every time the timer is ignored, the group slides away from healthy order and toward bad meeting anarchy, so it’s the facilitator’s job to keep this in check. To do this effectively, the facilitator needs to be comfortable interrupting the group to keep the meeting on track. Lastly, the facilitator should close the meeting on time and summarize any key actions, agreements, and next steps.

Below is the agenda for the high level format of the meeting. After presenting the prompt chosen for the meeting, the facilitator should divide the attendees into groups of approximately five people. Each member of the group individually generates ideas about it (sticky notes work great to collect these, one idea per sticky). Within these small groups, everyone shares their top two issues and the impact they feel each has. After everyone has shared theirs, the group should vote and tally the top three issues.  

The facilitator will then have everyone come back together as the larger group. Each of the subgroups shares the three issues they upvoted. After each round, the larger group can ask clarifying questions. It’s a good idea for the facilitator to maintain a spreadsheet with all of the ideas so that everyone can refer back to it; this comes in handy because the next phase is for everyone to vote on the issues. Take the top three to five issues forward for investigation.

Sample Agenda

Prompt:  In what area of the engineering org do you feel there is currently a high degree of confusion or frustration?

Small Groups (20 mins timed by facilitator)

  1. Solo brainstorm (2 mins full group)
  2. Round-robin: share top 2 issues + their impact (2 mins per person)
  3. Group vote: vote and tally top three (5 mins)

Full Group (30 mins)

  1. Each group shares their three upvoted issues (2 mins per group)
  2. Clarifying questions (3 mins per round)
  3. Full vote: Write 3 votes on post-its (3 mins)
  4. Drivers volunteer

Next Steps

After we have settled on the top issues, we need people in the group who are interested in investigating them and bringing information back to the group at a later date. Hopefully at least one person is passionate enough about each topic to look into it further; otherwise it would not have been voted a top issue. Create a spreadsheet to track each topic, its driver, and the date by which they propose to bring information back to the group.

Each driver should report back on these questions to help the organization begin to understand the issue, share the answers they’ve acquired, and decide on next steps. This follow-up can happen at the beginning of future meetings.

Conclusion

System reviews are just one format we can use to build communication, respect and trust across our team and organization. The purpose can be to surface possible glitches in the system, but also to align on which problems are most important for the group to spend its energy solving and to reach clarity around them. They can also be used to feed a topic into another format called the Decisions Roundtable, a similar type of meeting used to drive forward a proposal to make a change. Similar to post mortems and retrospectives, system reviews can be used to level up the organization and foster a learning culture. Some topics we’ve explored in the past include how we think about hiring managers, why we drifted into using different tools to plan work, how we document things and where that information should live, clarity around career paths, and how we can address diversity and inclusivity in tech management. System reviews help explore topics that may be difficult for some of the people in the group, but as long as the process is handled sensitively, we all come out with better understanding, more empathy for the experiences of others, and a stronger organization as a whole.


Introducing Arbiter: A Utility for Generating Oozie Workflows

Posted on December 16, 2015

At Etsy we have been using Apache Oozie for managing our production workflows on Hadoop for several years. We’ve even recently started using Oozie for managing our ad hoc Hadoop jobs as well. Oozie has worked very well for us and we currently have several dozen distinct workflows running in production. However, writing these workflows by hand has been a pain point for us. To address this, we have created Arbiter, a utility for generating Oozie workflows.

The Problem

Oozie workflows are specified in XML. The Oozie documentation has an extensive overview of writing workflows, but there are a few things that are helpful to know. A workflow begins with the start node:

<start to="fork-2"/>

Each job or other task in the workflow is an action node within a workflow. There are some built-in actions for running MapReduce jobs, standard Java main classes, etc. and you can also define custom action types. This is an example action:

<action name="transactional_lifecycle_email_stats">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapreduce.job.queuename</name>
          <value>rollups</value>
        </property>
      </configuration>
      <main-class>com.etsy.db.VerticaRollupRunner</main-class>
      <arg>--file</arg>
      <arg>transactional_lifecycle_email_stats.sql</arg>
      <arg>--frequency</arg>
      <arg>daily</arg>
      <arg>--category</arg>
      <arg>regular</arg>
      <arg>--env</arg>
      <arg>${cluster_env}</arg>
    </java>
    <ok to="join-2"/>
    <error to="join-2"/>
</action>

Each action defines a transition to take upon success and a (possibly) different transition to take upon failure:

<ok to="join-2"/>
<error to="join-2"/>

To have actions run in parallel, a fork node can be used. All the actions specified in the fork will be run in parallel:

<fork name="fork-2">
    <path start="fork-0"/>
    <path start="transactional_lifecycle_email_stats"/>
</fork>

After these actions there must be a join node to wait for all the forked actions to finish:

<join name="join-2" to="screamapillar"/>

Finally, a workflow ends by transitioning to either the end or kill nodes, for a successful or unsuccessful result, respectively:

<kill name="kill">
    <message>Workflow email-rollups has failed with msg: [${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>

Here is a complete example of one of our shorter workflows:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<workflow-app xmlns="uri:oozie:workflow:0.2" name="email-rollups">
  <start to="fork-2"/>
  <fork name="fork-2">
    <path start="fork-0"/>
    <path start="transactional_lifecycle_email_stats"/>
  </fork>
  <action name="transactional_lifecycle_email_stats">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapreduce.job.queuename</name>
          <value>rollups</value>
        </property>
      </configuration>
      <main-class>com.etsy.db.VerticaRollupRunner</main-class>
      <arg>--file</arg>
      <arg>transactional_lifecycle_email_stats.sql</arg>
      <arg>--frequency</arg>
      <arg>daily</arg>
      <arg>--category</arg>
      <arg>regular</arg>
      <arg>--env</arg>
      <arg>${cluster_env}</arg>
    </java>
    <ok to="join-2"/>
    <error to="join-2"/>
  </action>
  <join name="join-2" to="screamapillar"/>
  <action name="screamapillar">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapreduce.job.queuename</name>
          <value>${queueName}</value>
        </property>
        <property>
          <name>mapreduce.map.output.compress</name>
          <value>true</value>
        </property>
      </configuration>
      <main-class>com.etsy.oozie.Screamapillar</main-class>
      <arg>--workflow-id</arg>
      <arg>${wf:id()}</arg>
      <arg>--recipient</arg>
      <arg>fake_email</arg>
      <arg>--sender</arg>
      <arg>fake_email</arg>
      <arg>--env</arg>
      <arg>${cluster_env}</arg>
    </java>
    <ok to="end"/>
    <error to="kill"/>
  </action>
  <fork name="fork-0">
    <path start="email_campaign_stats"/>
    <path start="user_language"/>
  </fork>
  <action name="user_language">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapreduce.job.queuename</name>
          <value>rollups</value>
        </property>
      </configuration>
      <main-class>com.etsy.db.VerticaRollupRunner</main-class>
      <arg>--file</arg>
      <arg>user_language.sql</arg>
      <arg>--frequency</arg>
      <arg>daily</arg>
      <arg>--category</arg>
      <arg>regular</arg>
      <arg>--env</arg>
      <arg>${cluster_env}</arg>
    </java>
    <ok to="join-0"/>
    <error to="join-0"/>
  </action>
  <join name="join-0" to="fork-1"/>
  <fork name="fork-1">
    <path start="email_overview"/>
    <path start="trans_email_overview"/>
  </fork>
  <action name="trans_email_overview">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapreduce.job.queuename</name>
          <value>rollups</value>
        </property>
      </configuration>
      <main-class>com.etsy.db.VerticaRollupRunner</main-class>
      <arg>--file</arg>
      <arg>trans_email_overview.sql</arg>
      <arg>--frequency</arg>
      <arg>daily</arg>
      <arg>--category</arg>
      <arg>regular</arg>
      <arg>--env</arg>
      <arg>${cluster_env}</arg>
    </java>
    <ok to="join-1"/>
    <error to="join-1"/>
  </action>
  <join name="join-1" to="join-2"/>
  <action name="email_overview">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapreduce.job.queuename</name>
          <value>rollups</value>
        </property>
      </configuration>
      <main-class>com.etsy.db.VerticaRollupRunner</main-class>
      <arg>--file</arg>
      <arg>zz_email_overview.sql</arg>
      <arg>--frequency</arg>
      <arg>daily</arg>
      <arg>--category</arg>
      <arg>regular</arg>
      <arg>--env</arg>
      <arg>${cluster_env}</arg>
    </java>
    <ok to="join-1"/>
    <error to="join-1"/>
  </action>
  <action name="email_campaign_stats">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapreduce.job.queuename</name>
          <value>rollups</value>
        </property>
      </configuration>
      <main-class>com.etsy.db.VerticaRollupRunner</main-class>
      <arg>--file</arg>
      <arg>zz_email_campaign_stats.sql</arg>
      <arg>--frequency</arg>
      <arg>daily</arg>
      <arg>--category</arg>
      <arg>regular</arg>
      <arg>--env</arg>
      <arg>${cluster_env}</arg>
    </java>
    <ok to="join-0"/>
    <error to="join-0"/>
  </action>
  <kill name="kill">
    <message>Workflow email-rollups has failed with msg: [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>

Having the workflows defined in XML has been very helpful. We have several validation and visualization tools in multiple languages that can parse the XML and produce useful results without being tightly coupled to Oozie itself. However, the XML is not as useful for the people who work with it. First, it is very verbose. Each new action adds about 20 lines of XML to the workflow, much of which is boilerplate. As a result, our workflows average around 200 lines and the largest is almost 1800 lines long. This also makes it hard to read a workflow and understand what it does and how execution flows through it.

Next, defining the flow of execution can be tricky. It is natural to think about the dependencies between actions. Oozie workflows, however, are not specified in terms of these dependencies. The workflow author must satisfy them by configuring the workflow to run the actions in the proper order. For simple workflows this may not be a problem, but it can quickly become complex. Moreover, the author must manually manage parallelism by inserting forks and joins. This makes modifying the workflow more complex. We have found that it’s easy to miss adding an action to a fork, resulting in an orphaned action that never runs. Another common problem we’ve had with forks is that Oozie considers a single-action fork invalid, which means removing the second-to-last action from a fork requires removing the fork and join entirely.

Introducing Arbiter

Arbiter was created to solve these problems. XML is very amenable to being produced automatically, so there is the opportunity to write the workflows in another format and produce the final workflow definition. We considered several options, but ultimately settled on YAML. There are robust YAML parsers in many languages and we considered it easier for people to read than JSON. We also considered a Scala-based DSL, but we wanted to stick with a markup language for language-agnostic parsing.

Writing Workflows

Here is the same example workflow from above written in Arbiter’s YAML format:

---
name: email-rollups
errorHandler:
  name: screamapillar
  type: screamapillar
  recipients: fake_email
  sender: fake_email
actions:
  - name: email_campaign_stats
    type: rollup
    rollup_file: zz_email_campaign_stats.sql
    category: regular
    dependencies: []
  - name: trans_email_overview
    type: rollup
    rollup_file: trans_email_overview.sql
    category: regular
    dependencies: [email_campaign_stats, user_language]
  - name: email_overview
    type: rollup
    rollup_file: zz_email_overview.sql
    category: regular
    dependencies: [email_campaign_stats, user_language]
  - name: user_language
    type: rollup
    rollup_file: user_language.sql
    category: regular
    dependencies: []
  - name: transactional_lifecycle_email_stats
    type: rollup
    rollup_file: transactional_lifecycle_email_stats.sql
    category: regular
    dependencies: []

The translation of the YAML to XML is highly dependent on the configuration given to Arbiter, which we will cover in the next section. However, there are several points to consider now. First, the YAML definition is only about 20% of the length of the XML. Since the workflow definition is much shorter, it’s easier for someone to read it and understand what the workflow does. In addition, none of the flow control nodes need to be manually specified. Arbiter will insert the start, end, and kill nodes in the correct locations. Forks and joins will also be inserted when actions can be run in parallel.

Most importantly, however, the workflow author can directly specify the dependencies between actions instead of the order of execution. Arbiter will order the actions so that all the given dependencies are satisfied.

In addition to the standard workflow actions, Arbiter allows you to define an “error handler” action. It will automatically insert this action before any transitions to the end or kill nodes in the workflow. We use this to send an email alert with details about the success or failure of the workflow’s actions. If no error handler is defined, the workflow will transition directly to the end or kill nodes as appropriate.

Configuration

The mapping between a YAML workflow definition and the final XML is controlled by configuration files. These are also specified in YAML. Here is an example configuration file to accompany the example workflow given above:

---
killName: kill
killMessage: "Workflow $$name$$ has failed with msg: [${wf:errorMessage(wf:lastErrorNode())}]"
actionTypes:
  - tag: java
    name: rollup
    configurationPosition: 2
    properties: {"mapreduce.job.queuename": "rollups"}
    defaultArgs: {
      job-tracker: ["${jobTracker}"],
      name-node: ["${nameNode}"],
      main-class: ["com.etsy.db.VerticaRollupRunner"],
      arg: ["--file", "$$rollup_file$$", "--frequency", "daily", "--category", "$$category$$", "--env", "${cluster_env}"]
    }
  - tag: sub-workflow
    name: sub-workflow
    defaultArgs: {
      app-path: ["$$workflowPath$$"],
      propagate-configuration: []
    }
  - tag: java
    name: screamapillar
    configurationPosition: 2
    properties: {"mapreduce.job.queuename": "${queueName}", "mapreduce.map.output.compress": "true"}
    defaultArgs: {
      job-tracker: ["${jobTracker}"],
      name-node: ["${nameNode}"],
      main-class: ["com.etsy.oozie.Screamapillar"],
      arg: ["--workflow-id", "${wf:id()}", "--recipient", "$$recipients$$", "--sender", "$$sender$$", "--env", "${cluster_env}"]
    }

The key part of the configuration file is the actionTypes setting. Each Arbiter action type maps to an action type in the XML workflow. However, multiple Arbiter action types can map to the same Oozie action type; for example, the screamapillar and rollup action types both map to the Oozie java action type. This allows you to have meaningful action types in the YAML workflow definitions without the overhead of actually creating custom Oozie action types. Let’s review the parts of an action type definition:

  - tag: java
    name: rollup
    configurationPosition: 2
    properties: {"mapreduce.job.queuename": "rollups"}
    defaultArgs: {
      job-tracker: ["${jobTracker}"],
      name-node: ["${nameNode}"],
      main-class: ["com.etsy.db.VerticaRollupRunner"],
      arg: ["--file", "$$rollup_file$$", "--frequency", "daily", "--category", "$$category$$", "--env", "${cluster_env}"]
    }

The tag key defines the action type tag in the workflow XML. This can be one of the built-in action types like java, or a custom Oozie action type. Arbiter does not need to be made aware of custom Oozie action types. The name key defines the name of this action type, which will be used to set the type of actions in the workflow definition. If the Oozie action type accepts configuration properties from the workflow XML, these are controlled by the configurationPosition and properties keys. properties defines the actual configuration properties that will be applied to every action of this type, and configurationPosition defines where in the generated XML for the action the configuration tag should be placed. The defaultArgs key defines the default elements of the generated XML for actions of this type. The keys are the names of the XML tags, and the values are lists of the values for that tag. Even tags that can appear only once must be specified as a list.

You can also define properties to be populated from values set in the workflow definition. Any string surrounded by $$ will be interpolated in this way. $$rollup_file$$ and $$category$$ are examples of doing so in this configuration file. These will be populated with the values of the rollup_file and category keys from a rollup action in the workflow definition.
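
As a concrete illustration of that substitution, here is a minimal Python sketch of the $$key$$ interpolation behavior. It is an illustrative sketch only, not Arbiter’s actual implementation (Arbiter is a Java utility); note that Oozie EL expressions like ${cluster_env} are deliberately left untouched.

import re

def interpolate(template, action):
    """Replace $$key$$ placeholders with values from an action definition.

    Illustrative sketch only; Arbiter's real (Java) implementation differs.
    """
    def lookup(match):
        key = match.group(1)
        if key not in action:
            raise KeyError("action %r has no value for $$%s$$" % (action.get("name"), key))
        return str(action[key])
    # Oozie EL expressions such as ${cluster_env} are not matched and are left
    # for Oozie to resolve at runtime.
    return re.sub(r"\$\$(\w+)\$\$", lookup, template)

action = {"name": "email_campaign_stats", "type": "rollup",
          "rollup_file": "zz_email_campaign_stats.sql", "category": "regular"}
args = ["--file", "$$rollup_file$$", "--frequency", "daily",
        "--category", "$$category$$", "--env", "${cluster_env}"]
print([interpolate(a, action) for a in args])
# ['--file', 'zz_email_campaign_stats.sql', '--frequency', 'daily',
#  '--category', 'regular', '--env', '${cluster_env}']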

Using this configuration file, we could write an action like the following in the YAML workflow definition:

  - name: email_campaign_stats
    type: rollup
    rollup_file: zz_email_campaign_stats.sql
    category: regular
    dependencies: []

Arbiter would then translate this action to the following XML:

<action name="email_campaign_stats">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapreduce.job.queuename</name>
          <value>rollups</value>
        </property>
      </configuration>
      <main-class>com.etsy.db.VerticaRollupRunner</main-class>
      <arg>--file</arg>
      <arg>zz_email_campaign_stats.sql</arg>
      <arg>--frequency</arg>
      <arg>daily</arg>
      <arg>--category</arg>
      <arg>regular</arg>
      <arg>--env</arg>
      <arg>${cluster_env}</arg>
    </java>
    <ok to="join-0"/>
    <error to="join-0"/>
</action>

Arbiter also allows you to specify the name of the kill node and the message it logs with the killName and killMessage properties.

How Arbiter Generates Workflows

Arbiter builds a directed graph of all the actions from the workflow definition it is processing. The vertices of the graph are the actions and the edges are dependencies. The direction of the edge is from the dependency to the dependent action to represent the desired flow of execution. Oozie workflows are required to be acyclic, so if a cycle is detected Arbiter will throw an exception.

The directed graph that Arbiter builds will be made up of one or more weakly connected components. This is the graph from the example workflow above, which has two such components:
An example of a graph that is input to Arbiter
Each of these components is processed independently. First, any vertices with no incoming edges are removed from the graph and inserted into a new result graph. If more than one vertex is removed, Arbiter will also insert a fork/join pair to run them in parallel. Having removed those vertices, the original component will have been split into one or more new weakly connected components. Each of these components is then recursively processed in the same way.

Once every component has been processed, Arbiter combines these independent components until it has produced a complete graph. Since these components were initially not connected, they can be run in parallel. If there is more than one component, Arbiter will insert a fork/join pair. This results in the following graph for the example workflow, showing the ok transitions between nodes:

An example workflow graph produced by Arbiter
This algorithm biases Arbiter towards satisfying the dependencies between actions over achieving optimal parallelism. In general this algorithm still produces good parallelism, but in certain cases (such as a workflow with one action that depends on every other action), it can degenerate to a fairly linear flow. While it is a conservative choice, this algorithm has still worked out well for most of our workflows and has the advantage of being straightforward to follow in case the generated workflow is incorrect or unusual.
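
To make the approach more concrete, here is a simplified Python sketch of the idea: build the dependency graph, reject cycles, then repeatedly peel off the actions whose dependencies have already been satisfied, treating any batch of more than one action as a conceptual fork/join. It leans on networkx for the graph plumbing and is only an approximation of Arbiter’s actual Java implementation.

import networkx as nx  # assumed helper for this sketch; not what Arbiter itself uses

def linearize(component):
    """Turn one weakly connected component into an ordered list of stages.

    Each stage is a list of action names that can run in parallel, i.e. a batch
    that would sit under an Oozie fork/join pair if it has more than one action.
    """
    stages = []
    graph = component.copy()
    while len(graph) > 0:
        # Actions whose dependencies have all been satisfied by earlier stages.
        ready = sorted(n for n, deg in graph.in_degree() if deg == 0)
        stages.append(ready)
        graph.remove_nodes_from(ready)
    return stages

def build_stages(actions):
    """actions maps an action name to the list of action names it depends on."""
    graph = nx.DiGraph()
    for name, deps in actions.items():
        graph.add_node(name)
        for dep in deps:
            graph.add_edge(dep, name)  # edge points from dependency to dependent
    if not nx.is_directed_acyclic_graph(graph):
        raise ValueError("workflow contains a dependency cycle")
    # Weakly connected components are independent and could run in parallel
    # under a top-level fork/join.
    return [linearize(graph.subgraph(c).copy())
            for c in nx.weakly_connected_components(graph)]

# The email-rollups example from above.
print(build_stages({
    "email_campaign_stats": [],
    "user_language": [],
    "email_overview": ["email_campaign_stats", "user_language"],
    "trans_email_overview": ["email_campaign_stats", "user_language"],
    "transactional_lifecycle_email_stats": [],
}))
# [[['email_campaign_stats', 'user_language'], ['email_overview', 'trans_email_overview']],
#  [['transactional_lifecycle_email_stats']]]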

Once this process has finished all the flow control nodes will be present in the workflow graph. Arbiter can then translate this into the XML using the provided configuration files.

Get Arbiter

Arbiter is now available on Github! We’ve been using Arbiter internally already and it’s been very useful for us. If you’re using Oozie we hope Arbiter will be similarly useful for you and welcome any feedback or contributions you have!


Crunching Apple Pay tokens in PHP

Posted on November 20, 2015

Etsy is always trying to make it easier for members to buy and sell unique goods. And with 60% of our traffic now coming from mobile devices, making it easier to buy things on phones and tablets is a top priority. So when Apple Pay launched last year, we knew right away we wanted to offer it for our iOS users, and shipped it in April. Today we’re open sourcing part of our server-side solution, applepay-php, a PHP extension that verifies and decrypts Apple Pay payment tokens.

Integrating with Apple Pay comes down to two main areas of development: device side and payment-processing side. On the device side, at a high level, your app uses the PassKit framework to obtain an encrypted payment token which represents a user’s credit card info. On the payment-processing side, the goal is to make funds move between bank accounts. The first step here is to decrypt the payment token.

Many payment processors offer APIs to decrypt Apple Pay tokens on your behalf, but in our case, we wanted the flexibility of reading the tokens in-house. It turns out that doing this properly is pretty involved (to get an idea of the complexity, our solution defines 63 unique error codes), so we set out to find a pre-existing solution. Our search yielded a couple of open source projects, but none that fully complied with Apple’s spec. Notably, we couldn’t find any examples of verifying the chain of trust between Apple’s root CA and the payment signature, a critical component in guarding against forged payment tokens. We also couldn’t find any examples written in PHP (our primary language) or C (which could serve as the basis for a PHP extension). To meet our needs, we wrote a custom PHP extension on top of OpenSSL that exposes just two functions: applepay_verify_and_decrypt and applepay_last_error. This solution has worked really well for us over the past six months, so we figured we’d share it to make life easier for anyone else in a similar position.

Before releasing the code, we asked Syndis, a security consultancy based out of Iceland, to perform an external code review in addition to our everyday in-house code reviews. Syndis surveyed the code for both design flaws and implementation flaws. They found a few minor bugs but no actual vulnerabilities. Knowing that we wouldn’t be exposing users to undue risk gave us greater confidence to publish the code.

We’ve committed to using the open source version internally to avoid divergence, so expect to see future development on Github. Future work includes splitting off a generalized libapplepay (making it easier to write wrapper libraries for other languages), PHP7 compatibility, and an HHVM port. (By the way, if any of this sounds fun to you, we’d love for you to come work with us.)

We hope this release provides merchants with a solid solution for handling Apple Pay tokens. We also hope it inspires other organizations to consider open sourcing parts of their payment infrastructure.

You can follow Adam on Github @adsr.

Special thanks to Stephen Buckley, Keyur Govande, and Rasmus Lerdorf.


Q3 2015 Site Performance Report

Posted on November 10, 2015

Sadly, the summer has come to an end here in Brooklyn, but the changing of the leaves signifies one thing—it’s time to release our Q3 site performance report! For this report, we’ve collected data from a full week in September that we will be comparing to a full week of data from May. Similar to last quarter’s report, we will be using box plots to better visualize the data and the changes we’ve seen.

While we love to share stories of our wins, we find it equally important to report on the challenges we face. The prevailing pattern you will notice across all sections of this report is increased latency. Kristyn Reith will provide an update on backend server-side performance, and Mike Adler, one of the newest members of the Performance team, will report on the synthetic frontend and real user monitoring sections of this report.

Server-Side Performance

The server-side data below reflects the time seen by real users, both signed-in and signed-out. As a reminder, we are randomly sampling our data for all pages during the specified weeks in each quarter.

You can see that, with the exception of the homepage, all of our pages have gotten slower on the backend. The performance team kicked off this quarter by hosting a post mortem for a site-wide performance degradation that occurred at the end of Q2. At that time, we had migrated a portion of our web servers to new, faster hardware; however, the way the workload was initially distributed overworked the old hardware, leading to poor performance at the 95th percentile. Increasing the weighting of the new hardware in the load balancer helped mitigate this. While medians did not see a significant impact over the course of the hardware change, the 95th percentile saw higher highs and lower lows. As a heavier page, the signed-in homepage saw the greatest improvement once the weights were adjusted, which contributed to its overall improvement this quarter. Other significant changes seen on the server side can be attributed to two new initiatives launched this quarter: Project Arizona and Category Navigation.

Arizona is a read-only key/value system that serves product recommendations and other generated datasets on a massive scale. It replaces a previous system we had outgrown, which stored all data in memory; Arizona instead uses SSDs to allow for more and varied datasets. This quarter we launched the first phase of the project, which resulted in some expected performance regressions compared with the previous memory-backed system. The first phase focused on correctness, ensuring data remained consistent between the two systems. Future phases will focus on optimizing lookup speed to be comparable to the previous system while offering much greater scalability and availability.

In the beginning of August, our checkout team noticed two separate regressions on the cart page that had occurred over the course of the prior month. We had not been alerted to these slowdowns because, at the end of Q2, the checkout team had launched cart pagination, which improved the performance of the cart page by limiting the number of items loaded, and we had not adjusted our alerting thresholds to match this new normal. Luckily, the checkout team noticed the change in performance and we were able to trace the cause back to testing for Arizona.

While in the midst of testing for Arizona, we also launched a new site navigation bar that appears under the search bar on every page and features eight of the main shopping categories. Not only does the navigation bar make it easier for shoppers to find items on the site, but we also believe that the new navigation will positively affect Search Engine Optimization, driving more traffic to shops. We noticed some performance impacts while testing the feature, so when it launched at the end of August we watched closely, expecting a performance degradation due to the amount of HTML being generated. The performance impact was felt across the majority of our pages, though it was more noticeable on some pages than others depending on the weight of the page. For example, lighter pages such as baseline appear harder hit because the navigation bar accounts for a significant amount of the page’s overall weight.

In an awesome win, in response to the anticipated performance hit, the buyer experience engineering team ramped up client-side rendering for this new feature, which cut down rendering time on buyer-side pages by caching the HTML output and shipping less to the client.

In addition to the hardware change, Project Arizona and the new site navigation feature, we also have been investigating a slow, gradual regression we noticed across several pages that began in the first half of Q3. Extensive investigation and testing revealed that the regression was the result of limited CPU resources. We are currently adding additional CPU capacity and anticipate the affected pages will get faster in this current quarter.

Synthetic Start Render

Let’s move on to our synthetic tests, in which instrumented browsers load pages automatically every 10 minutes from several locations. This expands the scope of analysis to include browser-side measurements along with server-side ones. The strength of synthetic measurements is that we get consistent, highly detailed metrics about typical browser scenarios. We can look at “start render” to estimate when most people first see our pages loading.

The predominant observation is that our median render-start times across most pages have increased by about 300ms compared to last quarter. You might expect a performance team to feel bummed out about a distinctly slower result, but we actually care more about the overall user experience than about page speed measurements in any given week. The goal of our Performance team is not just to make a fast site, but to encourage discussions that accurately consider performance as one important concern among several.

This particular slowdown was caused by broader use of our new css toolkit, which adds 35k of CSS to every page. We expect the toolkit to be a net-win eventually, but we have to pay a temporary penalty while we work on eliminating non-standard styles. Several teams gathered together to discuss the impact of this change, which gave us confidence that Etsy’s culture of performance is continuing to mature, despite this particular batch of measurements.

The median render-start time for our search page appears to have increased by 800ms, following a similar degradation last quarter, but we found this to be misleading. We isolated the problem to IE versions 10 and older, which represent a tiny fraction of Etsy users. The search page renders much faster (around 1100ms) in the far more popular Chrome, which is consistent with all our other pages across IE and Chrome.

Synthetic checks are vulnerable to this type of misleading measurement because it’s really difficult to build comprehensive labs that match the true diversity of browsers in the wild. RUM measurements are better suited to that task. We are currently discussing how to improve the browsers we use in our synthetic tests.

A metric that was once convenient for estimating experience can become less meaningful as the way a site loads fundamentally changes. We feel it is important to adapt our monitoring to the new realities of our product. We always want to be aligned with our product teams, helping them build the best experience, rather than spending precious time optimizing for metrics that were more useful in the past.

As it happens, we recently made a few product improvements around site navigation (mentioned in the section above). As we optimized the new version, we focused on end-user experience, and it became clear that “Webpage Response” (WR) was becoming less and less connected to it. WR includes the time for all assets loaded on the page, even requests that are hidden from the end user, such as deferred beacons.

We are evaluating alternative ways to estimate end-user experience in the future.

Real User Page Load Time

Real user monitoring gives us insight into actual page loads experienced by end users. Notably, it accounts for real-world diversity in network conditions, browser versions, and internationalization.

We can see across-the-board increases, which is in line with our other types of measurements. By looking at the daily summaries of these numbers, we confirmed that the RUM metrics regressed when we launched our revamped site navigation (first mentioned in the server-side section). Engineers at Etsy worked to optimize this feature over the next couple of weeks and made progress, though one optimization ended up causing a regression on some browsers that was visible only in our RUM data. We have a plan to speed this up during the fourth quarter.

Conclusion

In the third quarter, we had our ups and downs with site performance, due to both product and infrastructure changes. It is important to remember that performance cannot be reduced merely to page speed; it is a balancing act of many factors. Performance is a piece of the overall user experience and we are constantly improving our ability to evaluate performance and make wiser trade-offs to build the best experience. The slowdowns we saw this quarter have only reinforced our commitment to helping our engineering teams monitor and understand the impact of the new features and infrastructure changes they implement. We have several great optimizations and tools in the pipeline and we look forward to sharing the impact of these in the next report.


Managing Hadoop Job Submission to Multiple Clusters

Posted on September 24, 2015

At Etsy we have been running a Hadoop cluster in our datacenter since 2012.  This cluster handled both our scheduled production jobs as well as all ad hoc jobs.  After several years of running our entire workload on this one production Hadoop cluster, we recently built a second.  This has greatly expanded our capacity and ability to manage production and ad hoc workloads, and we got to have fun coming up with names for them (we settled on Pug and Basset!).  However, having more than one cluster has brought new challenges.  One of the more interesting issues that came up was how to manage the submission of ad hoc jobs with multiple clusters.

The Problem

As part of building out our second cluster we decided to split our current workload between the two clusters.  Our initial plan was to divide the Hadoop workload by having all scheduled production jobs run on one cluster and all ad hoc jobs on the other.  However, we recognized that those roles would change over time.  First, if there were an outage or we were performing maintenance on one of the clusters, we may shift all the workload to the other.  Also, as our workload changes or we introduce new technology, we may balance the workload differently between the two clusters.

When we had only one Hadoop cluster, users of Hadoop would not have to think about where to run their jobs.  Our goal was to keep it easy to run an ad hoc job without users needing to continually keep abreast of changes in which cluster to use.  The major obstacle for this goal is that all Hadoop users submit jobs from their developer VMs.  This means we would have to ensure that the changes necessary to switch which cluster should be used for ad hoc jobs propagate to all of the VMs in a timely fashion.  Otherwise some users would still be submitting their jobs to the wrong cluster, which could mean those jobs would fail or otherwise be disrupted. To simplify this and avoid such issues, we wanted a centralized mechanism for determining which cluster to use.

Other Issues

There were two related issues that we decided to address at the same time as managing the submission of ad hoc jobs to the correct cluster.  First, we wanted the cluster administrators to have the ability to disable ad hoc job submission entirely.  Previously we had relied on asking users via email and IRC to not submit jobs, which is only effective if everyone checks and sees that request before launching a job.  We wanted a more robust mechanism that would truly prevent running ad hoc jobs.  Also, we wanted a centralized location to view the client-side logs from running ad hoc jobs.  These would normally only be available in the user’s terminal, which complicates sharing these logs when getting help with debugging a problem.  We wanted both of these features regardless of having the second Hadoop cluster.  However, as we considered various approaches for managing ad hoc job submission to multiple clusters, we found that we could solve these problems at the same time.

Our Approach

We chose to use Apache Oozie to manage ad hoc job submission.  Using Oozie had several significant advantages for us.  First, we already were using Oozie for all of our scheduled production workflows.  As such we already understood it well and had it properly operationalized.  It also allowed us to reuse existing infrastructure rather than setting up something new, which greatly reduced the time and effort necessary to complete this project. Next, using Oozie let us distribute the load from the job client processes across the Hadoop cluster.  When ad hoc job submission occurred on users’ VMs, this load was naturally distributed.  Distributing this load across the Hadoop cluster allows this approach to grow with the cluster.  Moreover, using Oozie automatically provided a central location for viewing the client logs from job submission.  Since the clients run on the Hadoop cluster, their logs are available just like the logs from any other Hadoop job.  As such they can be shared and examined without needing to retrieve them from the user’s terminal.

There was one downside to using Oozie: it did not support automatically directing ad hoc jobs to the appropriate cluster or disabling the submission of ad hoc jobs.  We had to build this ourselves, but as Oozie was handling everything else it was very lightweight.  To minimize the amount of new infrastructure for this component, we used our existing internal API framework to manage this state.  We call this component the “state service”.

The Job Submission Process

Previously, the process of submitting an ad hoc job looked like this:

Original Job Submission Sequence Diagram

Now submitting an ad hoc job looks like this instead:

Job Submission Server Sequence Diagram

From the perspective of users nothing had changed; they would still launch jobs using our run_scalding script on their VM.  Internally, it would request the active ad hoc cluster using the API for the state service.  This API call would also indicate if ad hoc job submission was disabled, allowing the script to terminate.  Administrators can also set a message that would be displayed to users when this happens, which we use to provide information about why ad hoc jobs were disabled and the ETA on re-enabling them.

Once the script determined the cluster on which the job should run, it would generate an Oozie workflow from a template that would run the user’s job.  This occurs transparently to the user so that they do not have to be concerned about the details of the workflow definition.  The script then submits this generated workflow to Oozie, and the job runs.  The change most visible to users in this process is that the client logs no longer appear in their terminal as the job executes.  We considered trying to stream them from the cluster during execution, but to minimize complexity the script prints a link to the logs on the cluster after the job completes.
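
To sketch that flow in code (this is illustrative only: the state service endpoint, its response fields, and the property-file handling below are assumptions, not our actual run_scalding internals), the client-side logic amounts to something like the following, shelling out to the standard Oozie CLI:

import subprocess
import sys

import requests  # assumed available for calling the internal state service

# Hypothetical endpoint; the real state service is an internal Etsy API.
STATE_SERVICE_URL = "https://state-service.example.com/hadoop/adhoc"

def submit_adhoc_job(properties_path):
    # 1. Ask the state service which cluster (if any) is accepting ad hoc jobs.
    state = requests.get(STATE_SERVICE_URL, timeout=5).json()
    if not state.get("enabled", False):
        # Administrators can attach a message explaining why submission is off
        # and when it is expected to be re-enabled.
        sys.exit("Ad hoc job submission is disabled: " + state.get("message", ""))

    # 2. Submit the workflow (already generated from a template) to the active
    #    cluster's Oozie server.
    result = subprocess.run(
        ["oozie", "job", "-oozie", state["oozie_url"],
         "-config", properties_path, "-run"],
        capture_output=True, text=True, check=True)
    job_id = result.stdout.strip()
    print("Submitted {}; client logs will be available on the cluster.".format(job_id))
    return job_id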

Other Options

While using Oozie ended up being the best choice for us, there were several other approaches we considered.

Apache Knox

Apache Knox is a gateway for Hadoop REST APIs.  The project primarily focuses on security, so it’s not an immediate drop-in solution for this problem.  However, it provides a gateway, similar to a reverse proxy, that maps externally exposed URLs to the URLs exposed by the actual Hadoop clusters.  We could have used this functionality to define URLs for an “ad hoc” cluster and change the Knox configuration to point that to the appropriate cluster.

Nevertheless, we felt Knox was not a good choice for this problem.  Knox is a complex project with a lot of features, but we would have been using only a small subset of these.  Furthermore, we would be using it outside of its intended use case, which could complicate applying it to solve our problem.  Since we did not have experience operating Knox at scale, we felt it would be better to stick with Oozie, which we already understood and would not have to shoehorn into our use case.

Custom Job Submission Server

We also considered implementing our own custom process to both manage the state of which cluster was currently active for ad hoc jobs as well as handling centralized job submission.  While this would have provided the most flexibility, it also meant building a lot of new infrastructure.  We would have essentially been reimplementing Oozie, but without any of the community testing or support.  Since we were already using Oozie and it met all our requirements, there was no need to build something custom.

Gateway Server

The final approach we considered was having a “gateway server” and requiring users to SSH to that server and launch jobs from there instead of from their VM.  This would have simplified the infrastructure components for job submission.  The Hadoop configuration changes to point ad hoc job submissions to the appropriate cluster or disable job submission entirely would only need to be deployed there.  By its very nature it would provide a central location for the client logs.  However, we would have to manage scaling and load balancing for this approach ourselves.  Furthermore, it would represent a significant departure from how development is normally done at Etsy.  Allowing users to write and run Hadoop jobs from their VM is important for keeping Hadoop as accessible as possible.  Adding the additional step of moving changes and SSH-ing to a gateway server compromises that goal.

Conclusion

Using Oozie to manage ad hoc job submission in this way has worked well for us.  Reusing the Oozie infrastructure we already had let us quickly build this out, and having this new process for running jobs made the transition to having two Hadoop clusters much easier.  Moreover, we were able to keep the process of submitting an ad hoc job almost identical to the previous process, which minimized the disruption for users of Hadoop.

As we were developing this, we found that there was only minimal discussion online about how other organizations have managed ad hoc job submission with multiple clusters.  Our hope is that this review of our approach as well as other options we considered is helpful if you are in the same situation and are looking for ideas for your own process of ad hoc job submission.


Assisted Serendipity – Fostering Peer to Peer Connections in Your Organization

Posted on September 15, 2015

It happens at every growing company – one day you pass someone in the hallway of your office and have no idea whether they work with you or are just visiting. You used to know just about everyone at your company, but you’re growing so fast and hiring so quickly that it’s hard to keep up.  Even the most extroverted of us have a hard time learning everyone’s name when offices start expanding to different floors, different states, and even different countries.

One way to combat this problem is to give employees a means of being randomly introduced to each other.  We’ve already written a bit about culture hacking using a staff database, and the tool we’re open sourcing today takes advantage of this employee data that we make available within the company. The tool that we’re releasing is called Mixer. It’s a simple web app that allows people to join a group and then get randomly paired with another member of that group. It then prompts you to meet each other for a coffee, lunch, or a drink after work.  If the person you get paired up with is working remotely, that’s not a problem — just hop on a video chat.  This encourages people who may not work in the same place to stay in touch and find out what’s going on in each other’s day to day.  The tool keeps a history of the pairings and attempts to match you with someone unique each week; it’s possible to opt in or out of the program at any time.


A lot of managers believe in the value of regular one-on-one meetings with their reports, but it is less common to do so with peers. At Etsy, these meetings between peers have resulted in cross-departmental partnerships that might not have otherwise surfaced, on top of providing an avenue of support for folks to work through difficult situations. These conversations also generally strengthen our culture by introducing people to their co-workers. Other benefits include learning more about what others are working on, brainstorming new collaborative projects that utilize strengths from a diverse set of core skillsets, and getting help with a challenge from someone who is distanced from the situation. Mixer meetings both introduce people who have never met and give folks who know each other a chance to connect in a way they might not have otherwise made time for.

As your company grows, it’s important to facilitate the person-to-person connections that happened naturally when everyone fit in the same small room. These interactions create the fabric of your company’s community and are crucial opportunities for building culture and fostering innovation. Our hope is that the Mixer tool can help you scale those genuine connections as you continue to see new faces in the hallway.

Find the Mixer code on Github


How Etsy Uses Thermodynamics to Help You Search for “Geeky”

Posted on August 31, 2015

Etsy shoppers love the large and diverse selection of our marketplace. But, for those who don’t know exactly what they’re looking for, the sheer number and variety of items available can be more frustrating than delightful. In July, we introduced a new user interface which surfaces the top categories for a search request to help users explore the results for queries like “gift.” Searchers who issue broad queries like this often don’t have a specific type of item in mind, and are especially likely to finish their visit empty-handed. Our team lead, Gio, described our motivations and process in an (excellent) blog post last month, which gives more background on the project. In this post, I’ll focus on how we developed and iterated on our heuristic for classifying queries as “broad.”

Our navigable interface, shown for a query for “geeky gift”

Quantifying “broadness”

When I describe what I’m working on to people outside the team, they often jump in with a guess about the machine learning techniques we must use to determine, in code, which queries are broad. While we could have used complex, offline signals like click or purchasing behavior to learn which queries should trigger the category interface, we actually base the decision on a single calculation, evaluated entirely at runtime, which uses very basic statistics about the search result set.

There have been several advantages to sticking with a simpler metric. By avoiding query-specific behavioral signals, our approach works for all languages and long-tail queries out of the gate. It’s performant and relatively easy to debug. It’s also (knock on wood) stable and easy to maintain, with very few external dependencies or moving parts. I’ll explain how we do it, and arguably justify the title of this post in the process.

Let’s take “geeky” as an example of a broad query, one that tells us very little about what type of item the user is looking for. Jewelry is the top category for “geeky,” but there are many items in all of the top-level categories.

Top Categories for "Geeky" by Result Count

Compare to the distribution of results for “geeky mug,” which are predictably concentrated in the Home & Living category.

Top Categories for "Geeky Mug" by Result Count

In plain English, the calculation we use measures how widely the items returned for a query are spread across the marketplace. The distribution of results for “geeky” suggests that the user might benefit from seeing the top categories, which demonstrate the breadth of geeky paraphernalia available on the site, from a periodic table-patterned bow tie to a “cutie pi” mug. The distribution for “geeky mug” is dominated by one category and shouldn’t trigger the category interface.

The categories shown for a query for “geeky”

Doing the math

In order to quantify how “spread out” items are, we start by taking the number of results returned for the query in each of the top-level categories and deriving the probability that an item is in each category. Since 20% of the items returned are in the Jewelry category and 15% of items are in the Accessories category, the probability values for Jewelry and Accessories would be .2 and .15 respectively. We use these values as the inputs to the Shannon entropy formula:

$$H(p) = -\sum_{i} p_i \log p_i$$

Shannon entropy formula

This formula is a measure of the disorder of a probability distribution. It’s essentially equivalent to the formula used to calculate the thermodynamic entropy of a physical system, which models a similar concept.

For our purposes, let $r_t$ be the total number of results and $r_i$ be the number of results in category $i$. Then the probability value in the above equation would be $r_i / r_t$, and the entropy of the distribution of a search result set across its categories can be expressed as:

$$H = -\sum_{i} \frac{r_i}{r_t}\,\log\!\left(\frac{r_i}{r_t}\right)$$

Entropy of a search result set
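
As a concrete illustration, here is a minimal Python sketch of that runtime calculation. The function name and threshold value are hypothetical placeholders rather than Etsy’s production code; the next paragraph describes how the real cut-off was actually chosen.

```python
import math

def category_entropy(counts):
    """Shannon entropy of a result set's distribution across categories.

    `counts` maps category -> number of results (r_i); values are normalized
    to probabilities (r_i / r_t) before the entropy is computed.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        if count > 0:
            p = count / total  # p_i = r_i / r_t
            entropy -= p * math.log(p)
    return entropy

# Hypothetical threshold -- the real cut-off was picked by inspecting the
# entropies of a large sample of queries and refined with A/B test results.
ENTROPY_THRESHOLD = 1.5

def should_show_categories(counts):
    return category_entropy(counts) > ENTROPY_THRESHOLD
```

Spread-out result sets like the one for “geeky” produce high entropy and trigger the category interface; concentrated ones like “geeky mug” stay below the threshold and keep the default results view.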

In this way, we can determine when to show categories without using any offline signals. This is not to say that we didn’t use data in our development process at all. To determine the entropy threshold above which we should show categories, we looked at the entropies for a large sample of queries and made a fairly liberal judgement call on a good dividing line (i.e. a low threshold). Once we had results from an AB experiment which showed the new interface to real users, we looked to see how it affected user behavior for queries with lower entropy levels, and refined the cut-off based on the numbers. But this was a one-off analysis; we expect the threshold to be static over time, since the distribution of our marketplace across categories changes slowly.

Taking it to the next level

A broad query may not necessarily have high entropy at the top level of our taxonomy. Results for “geeky jewelry” are unsurprisingly concentrated in our Jewelry category, but there are still many types of items that are returned. We’d like to guide users into more specific subcategories, like Earrings and Necklaces, so we introduced a secondary round of entropy calculations for queries that don’t qualify as broad at the top level. It works like this: if the result set does not have sufficiently high entropy to trigger the categories at the top level, we determine the entropy within the most populous category (i.e. the entropy of its subcategories) and show those subcategories if that value exceeds our entropy cut-off.
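
Continuing the sketch above (and reusing its hypothetical `category_entropy` helper and `ENTROPY_THRESHOLD`), the two-level decision might look something like this; the data structures are assumptions for illustration:

```python
def choose_groups(top_level_counts, subcounts_by_category):
    """Return the category or subcategory groups to display, or None.

    `top_level_counts` maps top-level category -> result count;
    `subcounts_by_category` maps a top-level category -> its subcategory counts.
    """
    if not top_level_counts:
        return None
    if category_entropy(top_level_counts) > ENTROPY_THRESHOLD:
        # Broad at the top level: show the top-level categories.
        return sorted(top_level_counts, key=top_level_counts.get, reverse=True)

    # Not broad at the top level: look inside the most populous category.
    top_category = max(top_level_counts, key=top_level_counts.get)
    sub_counts = subcounts_by_category.get(top_category, {})
    if category_entropy(sub_counts) > ENTROPY_THRESHOLD:
        return sorted(sub_counts, key=sub_counts.get, reverse=True)

    return None
```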

Top Subcategories for "Jewelry" By Count

The graph above demonstrates the level of spread of results for the query “jewelry” across subcategories of the top-level Jewelry category. This method allows us to dive into top-level categories in cases like this, while sticking to a simple runtime decision based on category counts.

Showing subcategories for a query for “geek jewelry”

Iterating on entropy

While we were testing this approach, we noticed that a query like “shoes,” which we hoped would be high entropy within the Shoes category, was actually high entropy at the top level.

Top-level categories for the “shoes” query… doesn’t seem quite right

Items returned for “shoes” are apparently spread widely enough across the whole marketplace to trigger top-level groups, even though there is an unusually high number of items in the Shoes category.

Top Categories for "Shoes" by Result Count

More generally, items in our marketplace tend to be concentrated in the most popular categories. A result set is likely to have many more Accessories items than Shoes items, because the former category is an order of magnitude larger than the latter. We want to be able to compensate for this uneven global distribution of items when we calculate the probabilities that we use in our entropy calculation.

By dividing the number of items in each category that are returned for the active search by the total number of items in that category, we get a number we can think of as the affinity between the category and the search query. Although fewer than 50% of the results that come back for a query for “shoes” are in the Shoes category, 100% of items in the Shoes category are returned for a query for “shoes,” so its category affinity is much higher than its raw share of the result set.

Top Categories for "Shoes" by Affinity

Normalizing the affinity values so they sum to one, we use these measurements as the inputs to the same Shannon entropy formula that we used in the first iteration. The normalization step ensures that we can compare entropy values across search result sets of different sizes. Letting $r_i$ represent the number of items in category $i$ returned for the active search query, and $t_i$ the total number of items in that category, the affinity value for category $i$ is simply $a_i = r_i / t_i$. Taking $s$ as the sum of all the affinity values, the affinity-based entropy is:

$$H = -\sum_{i} \frac{a_i}{s}\,\log\!\left(\frac{a_i}{s}\right), \qquad a_i = \frac{r_i}{t_i}, \qquad s = \sum_{i} a_i$$

Affinity-based entropy of a search result set
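
A rough sketch of the affinity variant, again reusing the hypothetical `category_entropy` helper from earlier (which already normalizes its inputs by their sum, i.e. by $s$):

```python
def affinity_entropy(result_counts, category_totals):
    """Entropy over category affinities a_i = r_i / t_i.

    `result_counts` holds r_i, the number of results for the active query in
    each category; `category_totals` holds t_i, the total listings in each
    category. Both argument names are illustrative.
    """
    affinities = {
        category: r_i / category_totals[category]
        for category, r_i in result_counts.items()
        if category_totals.get(category)
    }
    return category_entropy(affinities)
```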

From a Bayesian perspective, both the original result count-based values and the affinity values calculate the probability that a listing is in a category given that it is returned for the search query. The difference is that the affinity formulation corresponds to a flat prior distribution of categories whereas the original formulation corresponds to the observed category distribution of items in our marketplace. By controlling for the uneven distribution of items across categories on Etsy, affinity-based entropy fixed our “shoes” problem, and improved the quality of our system in general.

Refining by recipient on a query for “geeky shoes”

Keeping it simple

Although our iterations on entropy have introduced more complexity than we had at the outset, we still reap the benefits of avoiding opaque offline computations and dependencies on external infrastructure. Big data signals can be incredibly powerful, but they introduce architectural costs that, it turns out, aren’t necessary for a functional broad-query classifier.

On the user-facing level, making Etsy easier to explore is something I’ve wanted to work on since before I started working here many years ago. It’s very frustrating for searchers to navigate through the millions of items of all types that we return for many popular queries. If you’ll indulge my thermodynamics metaphor once more, by helping to guide users out of high-entropy result sets, we’re battling the heat death of Etsy search—and that’s literally pretty cool.


Couldn’t stomach that “heat death” joke? Leave a comment or let me know on Twitter.

Huge thanks due to Giovanni Fernandez-Kincade, Stan Rozenraukh, Jaime Delanghe and Rob Hall.

7 Comments

Targeting Broad Queries in Search

Posted by on July 29, 2015 / 5 Comments

We’ve just launched some big improvements to the treatment of broad queries like “father’s day,” “upcycled,” or “boho chic” on Etsy. This is the most dramatic change to the search experience since our switch to relevance by default in 2011. In this post we’d like to give you an introduction to the product and its development process. We think it’s a great example of the values that are at the heart of product engineering at Etsy: leveraging simple techniques, building iteratively, and understanding impact.

Motivations

Before we make a big investment in an idea, we like to spend some time investigating whether or not that idea represents a reasonable opportunity. The opportunity at the heart of this project is exploratory queries like “silver jewelry” where users don’t have something particular in mind. There are 2.7 MM results for “silver jewelry” on Etsy today. No matter how good we get at ranking results, the universe of silver jewelry is simply so vast that the chances that we will show you something you like are pretty slim.

How big of an opportunity is improving the experience for broad queries? How do we even define a broad query?

That’s a really difficult question. Going through this exercise can easily turn into doing the hardest parts of the “real work.” Instead of doing something clever, we time-boxed our analysis and looked at a handful of heuristics for different levels of user intent. Here’s a sample:

  1. Number of Tokens
  2. Result Set Size
  3. Number of Distinct Categories Represented in the Results

For each heuristic, we looked at the distribution across a week’s worth of search queries, and chose a threshold that generally separated the broad from the specific queries.
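
As a hypothetical illustration of how lightweight these signals are (the function, field names, and record structure below are ours, not Etsy’s):

```python
def broadness_signals(query, results):
    """Per-query heuristics to compare against hand-picked thresholds.

    `results` is assumed to be a list of result records that each carry a
    `category` field.
    """
    return {
        "num_tokens": len(query.split()),
        "result_set_size": len(results),
        "distinct_categories": len({r["category"] for r in results}),
    }
```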

Classifier

We looked at the size of that population and their engagement rates (the green arrow is our target audience):

Click Rate and Population by Search Tokens

None of the heuristics were independently sufficient, but by looking at several we were able to generate a rough estimate: it turns out that a sizable portion of searches on Etsy are broad queries. That matches our intuitions. Etsy is a marketplace of unique goods so it’s hard for consumers to know precisely what to look for.

Having some evidence that this was a worthwhile endeavor, we packed our bags and set off to meet the wizard.

Crafting an Experience

What can we do to improve the experience for users that issue a broad query? What about grouping the results into discrete buckets so users can get a better sense of what types of things are present? Grouping items into their respective categories seemed like an obvious starting place, but we could also group the items by any number of dimensions like style, color, and material.

We started with a few quick-and-dirty iterations of design and user-testing. Our designer fashioned a ton of static mocks that he turned into clickable prototypes using Flinto:

Mocks

We followed this up with an unshippable prototype of result grouping on mobile web. We did the simplest possible thing: always show result groupings, regardless of how specific the query is. We even simulated a native version using JPEG technology:

Jpeg Tech

People responded really well to these treatments. Many even expressed a desire for the feature before they saw it: “I wish I could just see what types of jewelry there are.”

But the user tests also made it painfully clear how problematic false positives (showing groups when a search is clearly not broad) were. There were moments of frustration when users just wanted to see some results and the groups were getting in the way.

On the other hand, showing too many groups didn’t seem as costly. If random or questionably relevant groups appeared towards the end of the list, users often thought they were interesting  or highlighted what made Etsy unique (“I didn’t know you had those!”), adding a serendipitous flavor to the experience.

What’s a broad query?

Armed with a binder full of reasonable UX treatments, it was time to start tackling the algorithmic challenge. The heuristics we used at the beginning of this journey were sufficient for ballpark estimation, but they were fairly imprecise and it was clear that minimizing false positives was a priority.

We quickly settled on using entropy, which you can think of as a measure of the uncertainty in a probability distribution. In this case, we’re looking at the probability that a result belongs to a particular category.

Probability of Jewelry

As the probabilities get more concentrated around a handful of categories, the entropy approaches zero. For example, this is the probability distribution for the query “shoes” amidst the top-level categories:

Shoes

As the distribution gets more dispersed, entropy increases. Here is the same distribution for “father’s day”:

Father's Day
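
To make the contrast concrete with illustrative numbers (these are not Etsy’s actual distributions, and we use the natural logarithm here), a result set concentrated in one category has low entropy, while one spread evenly across ten categories has much higher entropy:

$$H(0.9,\,0.05,\,0.05) = -\bigl(0.9\ln 0.9 + 2 \times 0.05\ln 0.05\bigr) \approx 0.39$$

$$H(\underbrace{0.1,\,\dots,\,0.1}_{10\text{ categories}}) = -10 \times 0.1\ln 0.1 = \ln 10 \approx 2.30$$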

We looked at samples of queries at different entropy levels to manually decide on a reasonable threshold.

Brilliant.

Could we have trained a more sophisticated model with some supervised learning algorithms? Probably, but there are a host of challenges with that approach: getting hand-labeled data or dealing with the noise of using behavioral signals for training data, data sparsity/coverage, etc. Ultimately, we already had what we thought was the most discriminating factor, the resulting algorithm had an intuitive explanation that was easy to reason about, and we felt confident that it would scale to cover the long tail.

Conclusions and Coming Next

After a series of A/B experiments, we’re happy to report that result grouping has resulted in a dramatic increase in user engagement and we’re launching it. But this is only the beginning for this feature and for this story.

Henceforth, result grouping will be another lever in the search product toolbox. The work that we’ve been doing for the past year has really been about building a foundation. We’re going to be aggressively iterating on offline evaluation, new treatments, new grouping dimensions,  classification algorithms, and group ordering strategies. We’re in this for the long haul and we’re excited about the many doors this work has opened for us.

I hope this post gave you a taste for what went into this effort. In the coming months, we’re going to have many members of the Etsy Search family diving deeper into some of the meatier details on subjects like result grouping performance, iterating on the entropy-based algorithm, and how our new product categories laid the groundwork for these improvements.

Oh yeah, and we’re hiring.

5 Comments

Q2 2015 Site Performance Report

Posted by on July 13, 2015 / 4 Comments

We are kicking off the third quarter of 2015, which means it’s time to update you on how Etsy’s performance changed in Q2. As in our last report, we’ve taken data from across an entire week in May and are comparing it with data from an entire week in March. We’ve also mixed things up in this report to better visualize our data and the changes in site speed, most notably by using box plots to show the distribution, range, and outliers for each page.

As in the past, we’ve split up the sections of this report among members of our performance team. Allison McKnight will be reporting on the server-side portion, Kristyn Reith will be covering the synthetic front-end section and Natalya Hoota will be providing an update on the real user monitoring section. We have to give a special shout out to our bootcamper Emily Smith, who spent a week working with us and digging into the synthetic changes that we saw. So without further ado, let’s take a look at the numbers.

Server-Side Performance

Serverside_HomeListingShopBaseline

Taking a look at our backend performance, we see that the quartile boundaries for home, listing, shop, and baseline pages haven’t changed much between Q1 and Q2. We see a change in the outliers for the shop and baseline pages – the outliers are more spread out (and the largest outlier is higher) in this quarter compared to the last quarter. For this report, we are going to focus on analyzing only changes in the quartile boundaries while we work on honing our outlier analysis skills and tools for future reports.

Serverside_cart

On the cart page, we see the top whisker and outliers move down. During the week in May when we pulled this data, we were running an experiment that added pagination to the cart. Some users have many items in their carts, and these items take a long time to load on the backend. By limiting the number of items that we load on each cart page, we especially improve the backend load time for these users. If we were to look at the visit data in another format, we might see a bimodal distribution where users exposed to this experiment have clearly different performance from users who didn’t see the experiment. Unfortunately, box plots make it hard to tell whether the user experience actually splits into distinct groups (i.e., a multimodal distribution). We’re happy to say that we launched this feature in full earlier this week!

Serverside_search

This quarter, the Search team experimented with new infrastructure that should make the desktop and mobile experiences more streamlined. On the backend, this translated into a slightly higher median time but an improvement for users at the slower end: the top whisker moved down from 511 ms to 447 ms, and the outliers moved down with it. The bottom whisker and the third quartile also moved down slightly, while the first quartile moved up.

Taking a look at our timeseries record of search performance across the quarter, we see that a change was made that greatly impacted slower loads and had a smaller impact on median loads:

Serverside_search2

Synthetic Start Render and Webpage Response

Most things look very stable quarter over quarter for synthetic measurements of our site’s performance.

Synthetic_baselinehomecart

As we only started our synthetic measurements for the cart page in May, we do not have quarter-over-quarter data.

Synthetic_search

You can see that the start render time of the search page has gotten slower this quarter but that the webpage response time for search sped up. The regression in start render was caused by experiments being run by our search team, while the improvement in the webpage response time for search resulted from the implementation of the Etsy styleguide toolkit. The toolkit is a set of fully responsive components and utility classes that make layout fast and consistent. Switching to the new toolkit decreased the amount of custom CSS that we deliver on search pages by 85%.

Synthetic_listingshop

As noted above, we are using a slightly different date range for the listing and shop data so that we can compare apples to apples. Taking a look at the webpage response time box plots, we see improvements to both the listing and shop pages. The faster webpage response time for the listing page can be attributed to an experiment that reduced the page weight by altering font weights. The improvement to the shop page’s webpage response time is the result of migrating to a new tag manager used to track the performance of outside advertising campaigns. This migration allowed us to fully integrate third-party platforms into new master tags, which reduced the number of JS files loaded for campaigns.

Real User Page Load Time

The software we use for our real user measurements, mPulse, was updated in the middle of this quarter, leading to a number of improvements in timer calculation, data collection, and validation. As expected, we saw a much more comprehensive pattern in data outliers (i.e., values falling far above and below the average) on all pages, and we are excited about this cleaner data set.

RealUserQ2

Since the Q1 and Q2 data were collected with different versions of the real user monitoring software, it would not be scientifically sound to draw firm conclusions about this quarter’s user experience relative to the previous one. That said, the numbers suggest a slight overall improvement sitewide, a trend we hope to continue through next quarter.

Conclusion

Although we saw a few noteworthy changes to individual pages, things remained fairly stable in Q2. Using box plots for this report helped us provide a more holistic representation of the data distribution, range and quality by looking at the quartile ranges and the outliers. For next quarter’s report we are really excited about the opportunity to continue exploring new, more efficient ways to visualize the quarterly data.

4 Comments