Crunching Apple Pay tokens in PHP

Posted on November 20, 2015

Etsy is always trying to make it easier for members to buy and sell unique goods. And with 60% of our traffic now coming from mobile devices, making it easier to buy things on phones and tablets is a top priority. So when Apple Pay launched last year, we knew right away we wanted to offer it for our iOS users, and shipped it in April. Today we’re open sourcing part of our server-side solution, applepay-php, a PHP extension that verifies and decrypts Apple Pay payment tokens.

Integrating with Apple Pay comes down to two main areas of development: device side and payment-processing side. On the device side, at a high level, your app uses the PassKit framework to obtain an encrypted payment token which represents a user’s credit card info. On the payment-processing side, the goal is to make funds move between bank accounts. The first step here is to decrypt the payment token.

Many payment processors offer APIs to decrypt Apple Pay tokens on your behalf, but in our case, we wanted the flexibility of reading the tokens in-house. It turns out that doing this properly is pretty involved (to get an idea of the complexity, our solution defines 63 unique error codes), so we set out to find a pre-existing solution. Our search yielded a couple of open source projects, but none that fully complied with Apple’s spec. Notably, we couldn’t find any examples of verifying the chain of trust between Apple’s root CA and the payment signature, a critical component in guarding against forged payment tokens. We also couldn’t find any examples written in PHP (our primary language) or C (which could serve as the basis for a PHP extension). To meet our needs, we wrote a custom PHP extension on top of OpenSSL that exposes just two functions: applepay_verify_and_decrypt and applepay_last_error. This solution has worked really well for us over the past six months, so we figured we’d share it to make life easier for anyone else in a similar position.

Before releasing the code, we asked Syndis, a security consultancy based out of Iceland, to perform an external code review in addition to our everyday in-house code reviews. Syndis surveyed the code for both design flaws and implementation flaws. They found a few minor bugs but no actual vulnerabilities. Knowing that we wouldn’t be exposing users to undue risk gave us greater confidence to publish the code.

We’ve committed to using the open source version internally to avoid divergence, so expect to see future development on GitHub. Future work includes splitting off a generalized libapplepay (making it easier to write wrapper libraries for other languages), PHP7 compatibility, and an HHVM port. (By the way, if any of this sounds fun to you, we’d love for you to come work with us.)

We hope this release provides merchants with a solid solution for handling Apple Pay tokens. We also hope it inspires other organizations to consider open sourcing parts of their payment infrastructure.

You can follow Adam on GitHub @adsr.

Special thanks to Stephen Buckley, Keyur Govande, and Rasmus Lerdorf.


Q3 2015 Site Performance Report

Posted on November 10, 2015

Sadly, the summer has come to an end here in Brooklyn, but the changing of the leaves signifies one thing: it’s time to release our Q3 site performance report! For this report, we’ve collected data from a full week in September that we will be comparing to a full week of data from May. Similar to last quarter’s report, we will be using box plots to better visualize the data and the changes we’ve seen.

While we love to share stories of our wins, we find it equally important to report on the challenges we face. The prevailing pattern you will notice across all sections of this report is increased latency. Kristyn Reith will provide an update on backend server-side performance, and Mike Adler, one of the newest members of the Performance team, will report on the synthetic frontend and real user monitoring sections of this report.

Server-Side Performance

The server-side data below reflects the time seen by real users, both signed-in and signed-out. As a reminder, we are randomly sampling our data for all pages during the specified weeks in each quarter.

You can see that with the exception of the homepage, all of our pages have gotten slower on the backend. The performance team kicked off this quarter by hosting a post mortem for a site-wide performance degradation that occurred at the end of Q2. At that time, we had migrated a portion of our web servers to new, faster hardware; however, the way the workload was initially distributed overworked the old hardware, leading to poor performance at the 95th percentile. Increasing the weighting of the new hardware in the load balancer helped mitigate this. While medians did not see a significant impact over the course of the hardware change, it caused higher highs and lower lows for the 95th percentile. As a heavier page, the signed-in homepage saw the greatest improvement once the weights were adjusted, which contributed to its overall improvement this quarter. The other significant changes seen on the server side can be attributed to two new initiatives launched this quarter: Project Arizona and Category Navigation.

Arizona is a read-only key / value system to serve product recommendations and other generated datasets on a massive scale. It replaces a previous system that we had outgrown that stored all data in-memory; Arizona instead uses SSDs to allow for more and varied datasets. This quarter we launched the first phase of the project that resulted in some expected performance regressions compared with the previous memory-backed system. The first phase focused on correctness, ensuring data remained consistent between the two systems. Future phases will focus on optimizing speed of lookups to be comparable to the previous system while offering much greater scalability and availability.

In the beginning of August, our checkout team noticed two separate regressions on the cart page that had occurred over the course of the prior month. We had not been alerted to these slowdowns because at the end of Q2 the checkout team had launched cart pagination, which improved the performance of the cart page by limiting the number of items loaded, and we had not adjusted our alerting thresholds to match this new normal. Luckily, the checkout team noticed the change in performance and we were able to trace the cause back to testing for Arizona.

While in the midst of testing for Arizona, we also launched a new site navigation bar that is included under the search bar on every page and features eight of the main shopping categories. Not only does the navigation bar make it easier for shoppers to find items on the site, but we also believe that the new navigation will positively affect Search Engine Optimization, driving more traffic to shops. We noticed some performance impacts while testing the feature, so when it launched at the end of August we were watching closely, expecting a degradation due to the amount of HTML being generated. The performance impact was felt across the majority of our pages, though it was more noticeable on some pages than others depending on the weight of the page. For example, lighter pages such as baseline appear harder hit because the navigation bar accounts for a significant amount of the page’s overall weight.

In an awesome win, the buyer experience engineering team responded to the anticipated performance hit by ramping up client-side rendering for the new feature, which cut rendering time on buyer-side pages by caching the HTML output and shipping less to the client.

In addition to the hardware change, Project Arizona and the new site navigation feature, we also have been investigating a slow, gradual regression we noticed across several pages that began in the first half of Q3. Extensive investigation and testing revealed that the regression was the result of limited CPU resources. We are currently adding additional CPU capacity and anticipate the affected pages will get faster in this current quarter.

Synthetic Start Render

Let’s move on to our synthetic tests, in which instrumented browsers load pages automatically every 10 minutes from several locations. This expands the scope of analysis to include browser-side measurements along with server-side ones. The strength of synthetic measurements is that we can get consistent, highly detailed metrics about typical browser scenarios. We can look at “start render” to estimate when most people first see our pages loading.

The predominant observation is that our median render-start times across most pages have increased about 300ms compared to last quarter. You might expect a performance team to feel bummed out about a distinctly slower result, but we actually care more about the overall user experience than about page speed measurements in any given week. The goal of our Performance team is not just to make a fast site, but to encourage discussions that accurately consider performance as one important concern among several.

This particular slowdown was caused by broader use of our new CSS toolkit, which adds 35KB of CSS to every page. We expect the toolkit to be a net win eventually, but we have to pay a temporary penalty while we work on eliminating non-standard styles. Several teams gathered to discuss the impact of this change, which gave us confidence that Etsy’s culture of performance is continuing to mature, despite this particular batch of measurements.

The median render-start time for our search page appears to have increased by 800ms, following a similar degradation in the last quarter, but we found this to be misleading. We isolated the problem to IE browser versions 10 and older, which represent a tiny fraction of Etsy users. The search page renders much faster (around 1100ms) in Chrome (far more popular), which is consistent with all our other pages across IE and Chrome.

Synthetic checks are vulnerable to this type of misleading measurement because it’s really difficult to build comprehensive labs that match the true diversity of browsers in the wild. RUM measurements are better suited to that task. We are currently discussing how to improve the browsers we use in our synthetic tests.

What was once a convenient metric for estimating experience may eventually become less meaningful as one fundamentally changes the way a site is loaded. We feel it is important to adapt our monitoring to the new realities of our product. We always want to be aligned with our product teams, helping them build the best experience, rather than spending precious time optimizing for metrics that were more useful in the past.

As it happens, we recently made a few product improvements around site navigation (mentioned in the above section). As we optimized the new version, we focused on end-user experience and it became clear that ‘Webpage Response’ was becoming less and less connected to end-user experience. WR includes the time for ALL assets loaded on the page, even if these requests are hidden from the end-user, such as deferred beacons.

We are evaluating alternative ways to estimate end-user experience in the future.

Real User Page Load Time

Real user monitoring gives us insight into actual page loads experienced by end-users. Notably, it accounts for the real-world diversity of network conditions, browser versions, and internationalization.

We can see across-the-board increases, which is in line with our other types of measurements. By looking at the daily summaries of these numbers, we confirmed that the RUM metrics regressed when we launched our revamped site navigation (first mentioned in the server-side section). Engineers at Etsy worked to optimize this feature over the next couple of weeks and made progress, though one optimization ended up causing a regression on some browsers that was visible only in our RUM data. We have a plan to speed this up during the fourth quarter.


In the third quarter, we had our ups and downs with site performance, due to both product and infrastructure changes. It is important to remember that performance cannot be reduced merely to page speed; it is a balancing act of many factors. Performance is a piece of the overall user experience and we are constantly improving our ability to evaluate performance and make wiser trade-offs to build the best experience. The slowdowns we saw this quarter have only reinforced our commitment to helping our engineering teams monitor and understand the impact of the new features and infrastructure changes they implement. We have several great optimizations and tools in the pipeline and we look forward to sharing the impact of these in the next report.


Managing Hadoop Job Submission to Multiple Clusters

Posted on September 24, 2015

At Etsy we have been running a Hadoop cluster in our datacenter since 2012.  This cluster handled both our scheduled production jobs as well as all ad hoc jobs.  After several years of running our entire workload on this one production Hadoop cluster, we recently built a second.  This has greatly expanded our capacity and ability to manage production and ad hoc workloads, and we got to have fun coming up with names for them (we settled on Pug and Basset!).  However, having more than one cluster has brought new challenges.  One of the more interesting issues that came up was how to manage the submission of ad hoc jobs with multiple clusters.

The Problem

As part of building out our second cluster we decided to split our current workload between the two clusters.  Our initial plan was to divide the Hadoop workload by having all scheduled production jobs run on one cluster and all ad hoc jobs on the other.  However, we recognized that those roles would change over time.  First, if there were an outage or we were performing maintenance on one of the clusters, we may shift all the workload to the other.  Also, as our workload changes or we introduce new technology, we may balance the workload differently between the two clusters.

When we had only one Hadoop cluster, users of Hadoop would not have to think about where to run their jobs.  Our goal was to keep it easy to run an ad hoc job without users needing to continually keep abreast of changes in which cluster to use.  The major obstacle for this goal is that all Hadoop users submit jobs from their developer VMs.  This means we would have to ensure that the changes necessary to switch which cluster should be used for ad hoc jobs propagate to all of the VMs in a timely fashion.  Otherwise some users would still be submitting their jobs to the wrong cluster, which could mean those jobs would fail or otherwise be disrupted. To simplify this and avoid such issues, we wanted a centralized mechanism for determining which cluster to use.

Other Issues

There were two related issues that we decided to address at the same time as managing the submission of ad hoc jobs to the correct cluster.  First, we wanted the cluster administrators to have the ability to disable ad hoc job submission entirely.  Previously we had relied on asking users via email and IRC to not submit jobs, which is only effective if everyone checks and sees that request before launching a job.  We wanted a more robust mechanism that would truly prevent running ad hoc jobs.  Also, we wanted a centralized location to view the client-side logs from running ad hoc jobs.  These would normally only be available in the user’s terminal, which complicates sharing these logs when getting help with debugging a problem.  We wanted both of these features regardless of having the second Hadoop cluster.  However, as we considered various approaches for managing ad hoc job submission to multiple clusters, we found that we could solve these problems at the same time.

Our Approach

We chose to use Apache Oozie to manage ad hoc job submission.  Using Oozie had several significant advantages for us.  First, we already were using Oozie for all of our scheduled production workflows.  As such we already understood it well and had it properly operationalized.  It also allowed us to reuse existing infrastructure rather than setting up something new, which greatly reduced the time and effort necessary to complete this project. Next, using Oozie let us distribute the load from the job client processes across the Hadoop cluster.  When ad hoc job submission occurred on users’ VMs, this load was naturally distributed.  Distributing this load across the Hadoop cluster allows this approach to grow with the cluster.  Moreover, using Oozie automatically provided a central location for viewing the client logs from job submission.  Since the clients run on the Hadoop cluster, their logs are available just like the logs from any other Hadoop job.  As such they can be shared and examined without needing to retrieve them from the user’s terminal.

There was one downside to using Oozie: it did not support automatically directing ad hoc jobs to the appropriate cluster or disabling the submission of ad hoc jobs.  We had to build this ourselves, but as Oozie was handling everything else it was very lightweight.  To minimize the amount of new infrastructure for this component, we used our existing internal API framework to manage this state.  We call this component the “state service”.

The Job Submission Process

Previously, the process of submitting an ad hoc job looked like this:

Original Job Submission Sequence Diagram
Now submitting an ad hoc job looks like this instead:


Job Submission Server Sequence Diagram


From the perspective of users nothing has changed; they still launch jobs using our run_scalding script on their VM. Internally, the script requests the active ad hoc cluster using the state service API. This API call also indicates whether ad hoc job submission is disabled, in which case the script terminates. Administrators can also set a message to be displayed to users when this happens, which we use to explain why ad hoc jobs are disabled and provide an ETA for re-enabling them.

Once the script has determined the cluster on which the job should run, it generates an Oozie workflow from a template that runs the user’s job. This happens transparently, so the user does not have to be concerned with the details of the workflow definition. The script then submits the generated workflow to Oozie, and the job runs. The change most visible to users is that the client logs no longer appear in their terminal as the job executes. We considered streaming them from the cluster during execution, but to minimize complexity the script simply prints a link to the logs on the cluster after the job completes.
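To make the flow concrete, here is a minimal sketch of the client side of this process. The state-service response fields (adhoc_disabled, active_adhoc_cluster, message) and the skeletal workflow template are hypothetical stand-ins for illustration, not our actual internal API:

```python
# Sketch of the ad hoc submission flow. The state-service response shape
# and the Oozie workflow template below are hypothetical.
from string import Template

WORKFLOW_TEMPLATE = Template("""
<workflow-app name="adhoc-${user}" xmlns="uri:oozie:workflow:0.5">
  <start to="run-job"/>
  <action name="run-job">
    <java>...</java>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>Job failed</message></kill>
  <end name="end"/>
</workflow-app>
""")

def choose_cluster(state):
    """Decide where to submit, given a state-service response dict."""
    if state.get("adhoc_disabled"):
        # Surface the admin-set message explaining why submission is off.
        raise RuntimeError(state.get("message", "Ad hoc jobs are disabled"))
    return state["active_adhoc_cluster"]

def build_workflow(user):
    """Fill in the workflow template for this user's job."""
    return WORKFLOW_TEMPLATE.substitute(user=user)

cluster = choose_cluster({"adhoc_disabled": False, "active_adhoc_cluster": "pug"})
workflow_xml = build_workflow("alice")
```

The generated XML would then be handed to the Oozie client for the chosen cluster; the real template, of course, also carries the job's arguments and configuration.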

Other Options

While using Oozie ended up being the best choice for us, there were several other approaches we considered.

Apache Knox

Apache Knox is a gateway for Hadoop REST APIs.  The project primarily focuses on security, so it’s not an immediate drop-in solution for this problem.  However, it provides a gateway, similar to a reverse proxy, that maps externally exposed URLs to the URLs exposed by the actual Hadoop clusters.  We could have used this functionality to define URLs for an “ad hoc” cluster and change the Knox configuration to point that to the appropriate cluster.

Nevertheless, we felt Knox was not a good choice for this problem.  Knox is a complex project with a lot of features, but we would have been using only a small subset of these.  Furthermore, we would be using it outside of its intended use case, which could complicate applying it to solve our problem.  Since we did not have experience operating Knox at scale, we felt it would be better to stick with Oozie, which we already understood and would not have to shoehorn into our use case.

Custom Job Submission Server

We also considered implementing our own custom process to both manage the state of which cluster was currently active for ad hoc jobs as well as handling centralized job submission.  While this would have provided the most flexibility, it also meant building a lot of new infrastructure.  We would have essentially been reimplementing Oozie, but without any of the community testing or support.  Since we were already using Oozie and it met all our requirements, there was no need to build something custom.

Gateway Server

The final approach we considered was having a “gateway server” and requiring users to SSH to that server and launch jobs from there instead of from their VM.  This would have simplified the infrastructure components for job submission.  The Hadoop configuration changes to point ad hoc job submissions to the appropriate cluster or disable job submission entirely would only need to be deployed there.  By its very nature it would provide a central location for the client logs.  However, we would have to manage scaling and load balancing for this approach ourselves.  Furthermore, it would represent a significant departure from how development is normally done at Etsy.  Allowing users to write and run Hadoop jobs from their VM is important for keeping Hadoop as accessible as possible.  Adding the additional step of moving changes and SSH-ing to a gateway server compromises that goal.


Using Oozie to manage ad hoc job submission in this way has worked well for us.  Reusing the Oozie infrastructure we already had let us quickly build this out, and having this new process for running jobs made the transition to having two Hadoop clusters much easier.  Moreover, we were able to keep the process of submitting an ad hoc job almost identical to the previous process, which minimized the disruption for users of Hadoop.

As we were developing this, we found that there was only minimal discussion online about how other organizations have managed ad hoc job submission with multiple clusters.  Our hope is that this review of our approach as well as other options we considered is helpful if you are in the same situation and are looking for ideas for your own process of ad hoc job submission.


Assisted Serendipity – Fostering Peer to Peer Connections in Your Organization

Posted on September 15, 2015

It happens at every growing company – one day you pass someone in the hallway of your office and have no idea whether they work with you, or if they’re just visiting your office. You used to know just about everyone at your company, but you’re growing so fast and hiring so quickly that it’s hard to keep up.  Even the most extroverted of us have a hard time learning everyone’s name when offices start expanding to different floors, different states, and even different countries.

One way to combat this problem is to give employees a means of being randomly introduced to each other.  We’ve already written a bit about culture hacking using a staff database, and the tool we’re open sourcing today takes advantage of this employee data that we make available within the company. The tool that we’re releasing is called Mixer. It’s a simple web app that allows people to join a group and then get randomly paired with another member of that group. It then prompts you to meet each other for a coffee, lunch, or a drink after work.  If the person you get paired up with is working remotely, that’s not a problem — just hop on a video chat.  This encourages people who may not work in the same place to stay in touch and find out what’s going on in each other’s day to day.  The tool keeps a history of the pairings and attempts to match you with someone unique each week; it’s possible to opt in or out of the program at any time.
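The pairing logic described above can be sketched in a few lines. This is an illustrative sketch, not Mixer's actual code; the data shapes (a member list and a set of past pairs) are assumptions:

```python
import random

def pair_members(members, history, seed=None):
    """Randomly pair up group members, preferring partners they have not
    been matched with before. `history` is a set of frozensets of past pairs."""
    rng = random.Random(seed)
    pool = list(members)
    rng.shuffle(pool)
    pairs = []
    while len(pool) > 1:
        a = pool.pop()
        # Prefer someone this member has never been paired with.
        fresh = [b for b in pool if frozenset((a, b)) not in history]
        b = fresh[0] if fresh else pool[0]
        pool.remove(b)
        pairs.append((a, b))
    return pairs  # with an odd group, one member sits the round out

history = {frozenset(("ana", "bo"))}  # this pair already met in a past round
pairs = pair_members(["ana", "bo", "cy", "di"], history, seed=7)
```

After each round, the new pairs would be added to the history so the next week's matching keeps favoring fresh introductions.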


A lot of managers believe in the value of regular one-on-one meetings with their reports, but it is less common to do so with peers. At Etsy, these meetings between peers have resulted in cross-departmental partnerships that might not have otherwise surfaced, on top of providing an avenue of support for folks to work through difficult situations. These conversations also generally strengthen our culture by introducing people to their co-workers. Other benefits include learning more about what others are working on, brainstorming new collaborative projects that utilize strengths from a diverse set of core skillsets, and getting help with a challenge from someone who is distanced from the situation. Mixer meetings both introduce people who have never met and give folks who know each other a chance to connect in a way they might not have otherwise made time for.

As your company grows, it’s important to facilitate the person-to-person connections that happened naturally when everyone fit in the same small room. These interactions create the fabric of your company’s community and are crucial opportunities for building culture and fostering innovation. Our hope is that the Mixer tool can help you scale those genuine connections as you continue to see new faces in the hallway.

Find the Mixer code on GitHub


How Etsy Uses Thermodynamics to Help You Search for “Geeky”

Posted on August 31, 2015

Etsy shoppers love the large and diverse selection of our marketplace. But, for those who don’t know exactly what they’re looking for, the sheer number and variety of items available can be more frustrating than delightful. In July, we introduced a new user interface which surfaces the top categories for a search request to help users explore the results for queries like “gift.” Searchers who issue broad queries like this often don’t have a specific type of item in mind, and are especially likely to finish their visit empty-handed. Our team lead, Gio, described our motivations and process in an (excellent) blog post last month, which gives more background on the project. In this post, I’ll focus on how we developed and iterated on our heuristic for classifying queries as “broad.”

Our navigable interface, shown for a query for “geeky gift”


Quantifying “broadness”

When I describe what I’m working on to people outside the team, they often jump in with a guess about how we use machine learning techniques to determine which queries are broad. While we could have used complex, offline signals like click or purchasing behavior to learn which queries should trigger the category interface, we actually base the decision on a single calculation, evaluated entirely at runtime, which uses very basic statistics about the search result set.

There have been several advantages to sticking with a simpler metric. By avoiding query-specific behavioral signals, our approach works for all languages and long-tail queries out of the gate. It’s performant and relatively easy to debug. It’s also (knock on wood) stable and easy to maintain, with very few external dependencies or moving parts. I’ll explain how we do it, and arguably justify the title of this post in the process.

Let’s take “geeky” as an example of a broad query, one that tells us very little about what type of item the user is looking for. Jewelry is the top category for “geeky,” but there are many items in all of the top-level categories.

Top Categories for "Geeky" by Result Count

Compare to the distribution of results for “geeky mug,” which are predictably concentrated in the Home & Living category.

Top Categories for "Geeky Mug" by Result Count

In plainspeak, the calculation we use measures how spread out across the marketplace the items returned for the query are. The distribution of results for “geeky” suggests that the user might benefit from seeing the top categories, which demonstrate the breadth of geeky paraphernalia available on the site, from a periodic table-patterned bow tie to a “cutie pi” mug. The distribution for “geeky mug” is dominated by one category, and shouldn’t trigger the category interface.

The categories shown for a query for “geeky”


Doing the math

In order to quantify how “spread out” items are, we start by taking the number of results returned for the query in each of the top-level categories and deriving the probability that an item is in each category. Since 20% of the items returned are in the Jewelry category and 15% of items are in the Accessories category, the probability values for Jewelry and Accessories would be .2 and .15 respectively. We use these values as the inputs to the Shannon entropy formula:

H(X) = -Σi pi log(pi)

Shannon entropy formula

This formula is a measure of the disorder of a probability distribution. It’s essentially equivalent to the formula used to calculate the thermodynamic entropy of a physical system, which models a similar concept.

For our purposes, let rt be the total number of results and ri be the number of results for category i. Then the probability value in the above equation would be (ri /rt) and the entropy of the distribution of a search result set across its categories can be expressed as:

H = -Σi (ri/rt) log(ri/rt)

Entropy of a search result set

In this way, we can determine when to show categories without using any offline signals. This is not to say that we didn’t use data in our development process at all. To determine the entropy threshold above which we should show categories, we looked at the entropies for a large sample of queries and made a fairly liberal judgement call on a good dividing line (i.e. a low threshold). Once we had results from an AB experiment which showed the new interface to real users, we looked to see how it affected user behavior for queries with lower entropy levels, and refined the cut-off based on the numbers. But this was a one-off analysis; we expect the threshold to be static over time, since the distribution of our marketplace across categories changes slowly.
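The calculation itself is only a few lines. Here is a sketch; the category counts are invented, and the log base (base 2 here) is an arbitrary choice that just rescales the threshold:

```python
import math

def category_entropy(counts):
    """Shannon entropy of a search result set across categories, from raw
    result counts. Base 2 here; the base only rescales the threshold."""
    total = sum(counts.values())
    h = 0.0
    for r in counts.values():
        if r > 0:
            p = r / total  # probability a result falls in this category
            h -= p * math.log2(p)
    return h

# Invented counts: a broad query spreads across categories, a narrow one doesn't.
broad = category_entropy({"Jewelry": 20, "Accessories": 15, "Clothing": 25,
                          "Home & Living": 22, "Art": 18})
narrow = category_entropy({"Home & Living": 90, "Art": 6, "Jewelry": 4})
# `broad` comes out well above `narrow`, so only "geeky"-style queries
# would cross a threshold and trigger the category interface.
```

In production this runs against the per-category counts the search engine already returns, so it adds essentially no extra work per request.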

Taking it to the next level

A broad query may not necessarily have high entropy at the top level of our taxonomy. Results for “geeky jewelry” are unsurprisingly concentrated in our Jewelry category, but there are still many types of items that are returned. We’d like to guide users into more specific subcategories, like Earrings and Necklaces, so we introduced a secondary round of entropy calculations for queries that don’t qualify as broad at the top level. It works like this: if the result set does not have sufficiently high entropy to trigger the categories at the top level, we determine the entropy within the most populous category (i.e. the entropy of its subcategories) and show those subcategories if that value exceeds our entropy cut-off.
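The two-level decision above can be sketched as follows. The THRESHOLD value and the data shapes are assumptions for illustration, not Etsy's tuned cut-off:

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of an iterable of result counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

# Hypothetical cut-off; Etsy tuned theirs empirically from a sample of queries.
THRESHOLD = 1.5

def groups_to_show(top_counts, subcounts):
    """top_counts: top-level category -> result count for the query.
    subcounts: top-level category -> {subcategory -> result count}."""
    if entropy(top_counts.values()) >= THRESHOLD:
        return "top-level", sorted(top_counts, key=top_counts.get, reverse=True)
    # Not broad at the top level: check inside the most populous category.
    biggest = max(top_counts, key=top_counts.get)
    subs = subcounts.get(biggest, {})
    if subs and entropy(subs.values()) >= THRESHOLD:
        return "subcategories", sorted(subs, key=subs.get, reverse=True)
    return None, []

# A "geeky jewelry"-style query: concentrated at the top level,
# spread out within the Jewelry category.
kind, groups = groups_to_show(
    {"Jewelry": 90, "Accessories": 5, "Home & Living": 5},
    {"Jewelry": {"Earrings": 30, "Necklaces": 30, "Rings": 30}},
)  # kind == "subcategories"
```

Because both rounds reuse the same entropy function over counts the search engine already provides, the second round adds only one more pass over the subcategory counts of a single category.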

Top Subcategories for "Jewelry" By Count

The graph above demonstrates the level of spread of results for the query “jewelry” across subcategories of the top-level Jewelry category. This method allows us to dive into top-level categories in cases like this, while sticking to a simple runtime decision based on category counts.

Showing subcategories for a query for “geek jewelry”


Iterating on entropy

While we were testing this approach, we noticed that a query like “shoes,” which we hoped would be high entropy within the Shoes category, was actually high entropy at the top level.


Top-level categories for the “shoes” query… doesn’t seem quite right

Items returned for “shoes” are apparently spread widely enough across the whole marketplace to trigger top-level groups, even though there is an unusually high number of items in the Shoes category.

Top Categories for "Shoes" by Result Count

More generally, items in our marketplace tend to be concentrated in the most popular categories. A result set is likely to have many more Accessories items than Shoes items, because the former category is an order of magnitude larger than the latter. We want to be able to compensate for this uneven global distribution of items when we calculate the probabilities that we use in our entropy calculation.

By dividing the number of items in each category that are returned for the active search by the total number of items in that category, we get a number we can think of as the affinity between the category and the search query. Although fewer than 50% of the results that come back for a query for “shoes” are in the Shoes category, 100% of items in the Shoes category are returned for a query for “shoes,” so its category affinity is much higher than its raw share of the result set.

Top Categories for "Shoes" by Affinity

Normalizing the affinity values so they sum to one, we use these measurements as the inputs to the same Shannon entropy formula that we used in the first iteration. The normalization step ensures that we can compare entropy values across search result sets of different sizes. Letting r_i represent the number of items in category i for the active search query, and t_i represent the total number of items in that category, the affinity value for category i is simply a_i = r_i / t_i. Taking s as the sum of the affinity values over all categories, the affinity-based entropy is:

Affinity-based entropy of a search result set


From a Bayesian perspective, both the original result count-based values and the affinity values calculate the probability that a listing is in a category given that it is returned for the search query. The difference is that the affinity formulation corresponds to a flat prior distribution of categories whereas the original formulation corresponds to the observed category distribution of items in our marketplace. By controlling for the uneven distribution of items across categories on Etsy, affinity-based entropy fixed our “shoes” problem, and improved the quality of our system in general.
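A minimal sketch of the affinity-based calculation follows; the function and variable names are ours, invented for illustration, not taken from Etsy's code.

```python
import math

def affinity_entropy(result_counts, total_counts):
    """Entropy over normalized category affinities a_i = r_i / t_i, where
    r_i is the category's count in this result set and t_i is the
    category's total size in the marketplace."""
    affinities = {c: result_counts[c] / total_counts[c] for c in result_counts}
    s = sum(affinities.values())
    return -sum((a / s) * math.log2(a / s) for a in affinities.values() if a > 0)
```

In the “shoes” example, a category that is small overall but fully covered by the query ends up with affinity near 1.0 and dominates the normalized distribution, pulling the entropy down even though the raw result counts are spread out.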

Refining by recipient on a query for “geeky shoes”


Keeping it simple

Although our iterations on entropy have introduced more complexity than we had at the outset, we still reap the benefits of avoiding opaque offline computations and dependencies on external infrastructure. Big data signals can be incredibly powerful, but they introduce architectural costs that, it turns out, aren’t necessary for a functional broad query classifier.

On the user-facing level, making Etsy easier to explore is something I’ve wanted to work on since before I started working here many years ago. It’s very frustrating for searchers to navigate through the millions of items of all types that we return for many popular queries. If you’ll indulge my thermodynamics metaphor once more, by helping to guide users out of high-entropy result sets, we’re battling the heat death of Etsy search—and that’s literally pretty cool.


Couldn’t stomach that “heat death” joke? Leave a comment or let me know on Twitter.

Huge thanks due to Giovanni Fernandez-Kincade, Stan Rozenraukh, Jaime Delanghe and Rob Hall.


Targeting Broad Queries in Search

Posted by on July 29, 2015 / 5 Comments

We’ve just launched some big improvements to the treatment of broad queries like “father’s day,” “upcycled,” or “boho chic” on Etsy. This is the most dramatic change to the search experience since our switch to relevance by default in 2011. In this post we’d like to give you an introduction to the product and its development process. We think it’s a great example of the values that are at the heart of product engineering at Etsy: leveraging simple techniques, building iteratively, and understanding impact.


Before we make a big investment in an idea, we like to spend some time investigating whether or not that idea represents a reasonable opportunity. The opportunity at the heart of this project is exploratory queries like “silver jewelry” where users don’t have something particular in mind. There are 2.7 MM results for “silver jewelry” on Etsy today. No matter how good we get at ranking results, the universe of silver jewelry is simply so vast that the chances that we will show you something you like are pretty slim.

How big of an opportunity is improving the experience for broad queries? How do we even define a broad query?

That’s a really difficult question. Going through this exercise can easily turn into doing the hardest parts of the “real work.” Instead of doing something clever, we time-boxed our analysis and looked at a handful of heuristics for different levels of user intent. Here’s a sample:

  1. Number of Tokens
  2. Result Set Size
  3. Number of Distinct Categories Represented in the Results

For each heuristic, we looked at the distribution across a week’s worth of search queries, and chose a threshold that generally separated the broad from the specific queries.
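A combined check over those heuristics might look like the sketch below. The thresholds here are invented for illustration; the real ones came from eyeballing a week’s worth of query distributions.

```python
def looks_broad(query, num_results, num_distinct_categories):
    """Flag a query as potentially broad if it trips any heuristic:
    few tokens, a huge result set, or results spanning many categories."""
    return (len(query.split()) <= 2
            or num_results >= 100_000
            or num_distinct_categories >= 10)
```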


We looked at the size of that population and their engagement rates (the green arrow is our target audience):

Click Rate and Population by Search Tokens

None of the heuristics were independently sufficient, but by looking at several we were able to generate a rough estimate: it turns out that a sizable portion of searches on Etsy are broad queries. That matches our intuitions. Etsy is a marketplace of unique goods so it’s hard for consumers to know precisely what to look for.

Having some evidence that this was a worthwhile endeavor, we packed our bags and set off to meet the wizard.

Crafting an Experience

What can we do to improve the experience for users that issue a broad query? What about grouping the results into discrete buckets so users can get a better sense of what types of things are present? Grouping items into their respective categories seemed like an obvious starting place, but we could also group the items by any number of dimensions like style, color, and material.

We started with a few quick-and-dirty iterations of design and user-testing. Our designer fashioned a ton of static mocks that he turned into clickable prototypes using Flinto:


We followed this up with an unshippable prototype of result grouping on mobile web. We did the simplest possible thing: always show result groupings, regardless of how specific the query is. We even simulated a native version using JPEG technology:

Jpeg Tech

People responded really well to these treatments. Many even expressed a desire for the feature before they saw it: “I wish I could just see what types of jewelry there are.”

But the user tests also made it painfully clear how problematic false positives (showing groups when search is definitely not broad) were. There were moments of frustration where users clearly just wanted to see some results and the groups were getting in the way.

On the other hand, showing too many groups didn’t seem as costly. If random or questionably relevant groups appeared towards the end of the list, users often found them interesting or felt they highlighted what made Etsy unique (“I didn’t know you had those!”), adding a serendipitous flavor to the experience.

What’s a broad query?

Armed with a binder full of reasonable UX treatments, it was time to start tackling the algorithmic challenge. The heuristics we used at the beginning of this journey were sufficient for ballpark estimation, but they were fairly imprecise and it was clear that minimizing false positives was a priority.

We quickly settled on using entropy, which you can think of as a measure of the uncertainty in a probability distribution. In this case, we’re looking at the probability that a result belongs to a particular category.
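In symbols: if p_i is the fraction of results that fall in category i, the Shannon entropy of the result set is (the base-2 logarithm is a convention; any fixed base works for thresholding):

```latex
H = -\sum_{i} p_i \log_2 p_i , \qquad p_i = \frac{r_i}{\sum_j r_j}
```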

Probability of Jewelry

As the probabilities get more concentrated around a handful of categories, the entropy approaches zero. For example, this is the probability distribution for the query “shoes” across the top-level categories:


As the distribution gets more dispersed, entropy increases. Here is the same distribution for “father’s day”:

Father's Day

We looked at samples of queries at different entropy levels to manually decide on a reasonable threshold.


Could we have trained a more sophisticated model with some supervised learning algorithms? Probably, but there are a host of challenges with that approach: getting hand-labeled data or dealing with the noise of using behavioral signals for training data, data sparsity/coverage, etc. Ultimately, we already had what we thought was the most discriminating factor, the resulting algorithm had an intuitive explanation that was easy to reason about, and we felt confident that it would scale to cover the long tail.

Conclusions and Coming Next

After a series of A/B experiments, we’re happy to report that result grouping has resulted in a dramatic increase in user engagement and we’re launching it. But this is only the beginning for this feature and for this story.

Henceforth, result grouping will be another lever in the search product toolbox. The work that we’ve been doing for the past year has really been about building a foundation. We’re going to be aggressively iterating on offline evaluation, new treatments, new grouping dimensions, classification algorithms, and group ordering strategies. We’re in this for the long haul and we’re excited about the many doors this work has opened for us.

I hope this post gave you a taste for what went into this effort. In the coming months, we’re going to have many members of the Etsy Search family diving deeper into some of the meatier details on subjects like result grouping performance, iterating on the entropy-based algorithm, and how our new product categories laid the groundwork for these improvements.

Oh yeah, and we’re hiring.


Q2 2015 Site Performance Report

Posted by on July 13, 2015 / 4 Comments

We are kicking off the third quarter of 2015, which means it’s time to update you on how Etsy’s performance changed in Q2. Like in our last report, we’ve taken data from across an entire week in May and are comparing it with the data from an entire week in March. We’ve mixed things up in this report to better visualize our data and the changes in site speed.

As in the past, we’ve split up the sections of this report among members of our performance team. Allison McKnight will be reporting on the server-side portion, Kristyn Reith will be covering the synthetic front-end section and Natalya Hoota will be providing an update on the real user monitoring section. We have to give a special shout out to our bootcamper Emily Smith, who spent a week working with us and digging into the synthetic changes that we saw. So without further ado, let’s take a look at the numbers.

Server-Side Performance


Taking a look at our backend performance, we see that the quartile boundaries for home, listing, shop, and baseline pages haven’t changed much between Q1 and Q2. We see a change in the outliers for the shop and baseline pages – the outliers are more spread out (and the largest outlier is higher) in this quarter compared to the last quarter. For this report, we are going to focus on analyzing only changes in the quartile boundaries while we work on honing our outlier analysis skills and tools for future reports.


On the cart page, we see the top whisker and outliers move down. During the week in May when we pulled this data, we were running an experiment that added pagination to the cart. Some users have many items in their carts; these items take a long time to load on the backend. By limiting the number of items that we load on each cart page, we improve the backend load time for these users especially. If we were to look at the visit data in another format, we might see a bimodal distribution where users exposed to this experiment would have clearly different performance than users who didn’t see the experiment. Unfortunately, box plots don’t let us see whether the user experience actually splits into distinct modes like this (i.e. a multimodal distribution). We’re happy to say that we launched this feature in full earlier this week!


This quarter, the Search team experimented with new infrastructure that should make desktop and mobile experience more streamlined. On the backend, this translated into a slightly higher median time with an improvement for the slower end of users: the top whisker moved down from 511 ms to 447 ms, and the outliers moved down with it. The bottom whisker and the third quartile also moved down slightly while the first quartile moved up.

Taking a look at our timeseries record of search performance across the quarter, we see that a change was made that greatly impacted slower loads and had a smaller impact on median loads:


Synthetic Start Render and Webpage Response

Most things look very stable quarter over quarter for synthetic measurements of our site’s performance.


As we only started our synthetic measurements for the cart page in May, we do not have quarter-over-quarter data.


You can see that the start render time of the search page has gotten slower this quarter but that the webpage response time for search sped up. The regression in start render was caused by experiments being run by our search team, while the improvement in the webpage response time for search resulted from the implementation of the Etsy styleguide toolkit. The toolkit is a set of fully responsive components and utility classes that make layout fast and consistent. Switching to the new toolkit decreased the amount of custom CSS that we deliver on search pages by 85%.


As noted above, we are using a slightly different date range for the listing and shop data so that we can compare apples to apples. Taking a look at the webpage response time box plots, we see improvements to both the listing and shop pages. The faster webpage response time for the listing page can be attributed to an experiment running that reduced the page weight by altering the font-weights. The improvement to shop’s webpage response time is the result of migrating to a new tag manager that is used to track the performance of outside advertising campaigns. This migration allowed us to fully integrate third party platforms in new master tags which reduced the number of JS files for campaigns.

Real User Page Load Time

The software we use for our real user measurements, mPulse, was updated in the middle of this quarter, leading to a number of improvements in timer calculation and data collection and validation. As expected, we saw a much more comprehensive pattern in data outliers (i.e., values falling far above and below the average) on all pages, and are excited for this cleaner data set.


Since Q1 and Q2 data were collected with different versions of the real user monitoring software, it would not be scientifically sound to draw conclusions about this quarter’s user experience relative to the previous one. That said, there does appear to be a slight overall improvement sitewide, a trend we hope to keep throughout next quarter.


Although we saw a few noteworthy changes to individual pages, things remained fairly stable in Q2. Using box plots for this report helped us provide a more holistic representation of the data distribution, range and quality by looking at the quartile ranges and the outliers. For next quarter’s report we are really excited about the opportunity to continue exploring new, more efficient ways to visualize the quarterly data.


Open Source Spring Cleaning

Posted by on July 9, 2015 / 1 Comment

At Etsy, we are big fans of Open Source. Etsy as it is wouldn’t exist without the myriad people who have solved a problem and published their code under an open source license. We serve through the Apache web server running on Linux, our server-side code is mostly written in PHP, we store our data in MySQL, we track metrics with graphs from Ganglia and Graphite to keep us up to date, and we use Nagios to monitor the stability of our systems. And these are only the big examples. In every nook and cranny of our technology stack you can find Open Source code.

Part of everyone’s job at Etsy is what we call “Generosity of Spirit”, which means giving back to the industry. For engineers that means that we strive to give a talk at a conference, write a blog post on this blog or contribute to open source at least once every year. We love to give back to the Open Source community when we’ve created a solution to a problem that we think others might benefit from.

Maintenance and Divergence

This has led to many open sourced projects on our GitHub page and a continuing flow of contributions from our engineers and the Open Source community. We are not shy about open sourcing core parts of our technology stack. We are publicly developing our deployment system, metrics collector, team on-call management tool and our code search tool. We even open sourced the crucial parts of our atomic deployment system. And it has been very rewarding to receive bug fixes and features from the wider community that make our software more mature and stable.

As we open sourced more projects, it became tempting to run an internal fork of the project when we wanted to add new features quickly. These projects with internal forks quickly diverged from the open sourced versions. This meant the work to maintain the project was doubled. Anything fixed or changed internally had to be fixed or changed externally, and vice versa. In a busy engineering organization, the internal version usually was a priority over the public one. Looking at our GitHub page, it wasn’t clear – even to an Etsy engineer – whether or not we were actively maintaining a given project.

We ended up with public projects that hadn’t been committed to in years. Contributors who took the time to file a bug report sometimes didn’t get an answer for years, which didn’t instill confidence in potential users. No one could tell whether a project was a maintained piece of software or a proof of concept that wouldn’t get any updates.

Going forward

We want to do better by the Open Source community, since we’ve benefited so much from existing Open Source Software. We did a bit of Open Source spring cleaning to bring more clarity to the state of our open source projects. Going forward our projects will be clearly labeled as either maintained, not maintained, or archived.


Maintained

Maintained projects are the default and are not specifically labeled as such. For maintained projects, we’re either running the open source version internally or currently working on getting our internal version back in sync with the public version. We already did this for our deployment tool in the past. We are actively working on any maintained projects: merging or commenting on pull requests, answering bug reports, and adding new features.

Not Maintained

We also have a few projects that haven’t seen public updates in years. Usually this is because we haven’t found a way to make the project configurable in such a way that we can run the public version internally without slowing down our development cycles. However, the code as it stands serves as a great proof of concept and illustrates how we approached the problem. Or it might have been a research project that we abandoned because it turned out not to really solve our problem in the long run, but we still wanted to share what we tried. Those projects will stay the way they are and will rarely, if ever, receive updates. We will turn off issues and pull requests on them and make it very clear in the README that they are proofs of concept only.


Archived

We also have a number of projects that we open sourced because we were using them at one time but have since abandoned altogether. We have likely found that a better solution to the problem exists or that the solution hasn’t proven useful in the long run. In those cases we will push a commit to the master branch that removes all code and leaves only the README with a description of the project and its status. The README will link to the last commit containing actual code. This way the code doesn’t just vanish, but the project is clearly not active. Those projects will also have issues and pull requests turned off.

In addition to the archival of those projects we will also start to delete forks of other Open Source projects that we’ve made at some point, but aren’t actively maintaining.

Closing thoughts

We have learned a lot about maintaining Open Source projects over the last couple of years. The main lesson we want to share is that it’s essential to use the Open Source version internally to provide a good experience for other Open Source developers who want to use our software. We strive to always learn and get better at everything we do. If you’ve been waiting for us to respond to an issue or merge a pull request, hopefully this will give you more insight into what has been going on and why it took so long for us to respond, and we hope that our new project labeling system will also give you more clarity about the state of our open source projects. In order to be good open source citizens we want to always do our best to give back in a way that is helpful for everyone. And a little spring cleaning is always a good thing. Even if it’s technically summer already.

You can follow Jared and Daniel on Twitter.


Four Months of statsd-jvm-profiler: A Retrospective

Posted by on May 12, 2015 / 10 Comments

It has been about four months since the initial open source release of statsd-jvm-profiler.  There has been a lot of development on it in that time, including the addition of several major new features.  Rather than just announcing exciting new things, this is a good opportunity to reflect on what has come of the project since open-sourcing it and how these new features came to be.

External Adoption

It has been very exciting to see statsd-jvm-profiler being adopted outside of Etsy, and we’ve learned a lot from talking to these new users.  It was initially built for Scalding, and many of the people who’ve tried it out have been profiling Scalding jobs.  However, I have spoken to people who are using it to profile jobs written in other MapReduce APIs, such as Scrunch, as well as pure MapReduce jobs.  Moreover, others have used it with tools in the broader Hadoop ecosystem, such as Spark or Storm.  Most interestingly, however, there have been a few people using statsd-jvm-profiler outside of Hadoop entirely, on enterprise Java applications.  There was never anything Hadoop-specific about the profiling functionality, but it was very gratifying to see that they were able to apply it unchanged to a domain so far from the initial use case.


One of the major benefits of open-sourcing a project is the ability to accept contributions from the community.  This has definitely been helpful for statsd-jvm-profiler.  There have been several pull requests accepted, both fixing bugs and adding new features.  Also, there are some active forks whose authors will hopefully decide to contribute back.  The community of contributors is small, but the contributions have been valuable.  Questions about how to contribute were common, however, so the project now has contribution guidelines.

An unexpected aspect of community involvement in the project has been the number of questions and suggestions that have come via email instead of through GitHub.  In hindsight, setting up a mailing list for the project would have been a good idea; at the time of the initial release I had thought the utility of a mailing list was low.  I have since created one, but it would have been useful to have those original emails be publicly available.  Nevertheless, the suggestions have been very helpful.  It would be amazing if everyone who suggested improvements also sent pull requests, but I recognize that not everyone is willing or able to do so.  Even so I am grateful that people have been willing to contribute to the project in this way.

Internal Use

The use of statsd-jvm-profiler within Etsy has been less successful than it was externally.  We use Graphite as the backend for StatsD and as we started to use the profiler more, we began to have problems with Graphite.  Someone would start to profile a job, thus creating a fairly large number of new metrics.  This would sometimes cause Graphite to lock up and become unresponsive.  We put in some workarounds, including rate limiting the metric creation and configurable filtering of the metrics produced by CPU profiling, but these were ultimately only beneficial for smaller jobs.  Graphite is an important part of our infrastructure beyond statsd-jvm-profiler, so this was a bad situation.  Being able to profile and improve the performance of our Hadoop jobs is important, but not breaking critical pieces of infrastructure is more important.  The issues with Graphite meant that the ability to use the profiler was heavily restricted.  This was the exact opposite of the goal of easy to use, accessible profiling that motivated the creation of statsd-jvm-profiler.  Finally after breaking Graphite yet again the profiler was disabled entirely.  The project admittedly languished for about a month.  Since we weren’t using it internally, there was less incentive to continue improving it.

New Features

statsd-jvm-profiler was in an interesting state at this point.  There were still external users and internal interest, but it was too risky for us to actually use it.  Rather than abandon the project, I set out to bring it to a better state, one where we could use it without risk to other parts of our production infrastructure.  The contributions from the community were incredibly helpful at this point.  Ultimately the new features were all developed internally, but the suggestions and feedback from the community provided lots of ideas for changes that would both meet our internal needs and provide value externally.  As a result we’re able to use it internally again without DDoSing our Graphite infrastructure.

Multiple Metrics Backends

The idea of supporting multiple backends for metrics collection instead of just StatsD was considered during initial development, but was discarded in order to keep the profiling data flowing through StatsD and Graphite.  We use these extensively at Etsy, and the theory was that keeping the profiling data in a familiar tool would make it more accessible.  In practice, however, the sheer volume of data produced from all the jobs we wanted to profile tended to overwhelm our production infrastructure.

Also, supporting different backends for metrics collection was the most commonly requested feature from the community, and there were a lot of different suggestions for which to use.  StatsD is still the default backend, but it is configurable through the reporter argument to the profiler.  We are trying out InfluxDB as the first new backend.  There are a couple of reasons why it was selected.  First, statsd-jvm-profiler produces very bursty metrics in a very deep hierarchy.  This is fairly different from the normal use case for Graphite, and we came to realize that Graphite was not the right tool for the job.  InfluxDB was very easy to set up and had better support for such metrics without needing any configuration.  Also, InfluxDB has a much richer, SQL-like query language.  With Graphite we had been dumping all of the metrics to a file and processing that, but InfluxDB’s query language allows for more complex visualization and analysis of the profiling data without needing the intermediate step.  So far InfluxDB has been working well.  Moreover, since it is independent from the rest of our production infrastructure, only statsd-jvm-profiler will be affected if problems do arise.

Furthermore, the refactoring done to support InfluxDB in addition to StatsD has created a framework for supporting any number of backends.  This provides a great avenue for community contributions to support some other metric collection service.

New Dashboard

Better tooling for visualizing the data produced by profiling was another common feature request.  The initial release included a script for producing flame graphs, but it was somewhat hard to use.  Also, we had otherwise been using our internal framework for dashboards to get data from Graphite.  With the move to InfluxDB this wouldn’t be possible anymore.  As such we also needed a better visualization tool internally.

To that end statsd-jvm-profiler now includes a simple dashboard.  It is a Node.js application and pulls data from InfluxDB, leveraging its powerful query language.  It expects the metric prefix configured for the profiler to follow a certain pattern, but then you can select a particular process for which to view the profiling data:

Selecting a job from the statsd-jvm-profiler dashboard

From there it will display memory usage over the course of profiling:

Memory metrics

And it will also display the count of garbage collections and the total time spent in GC:

GC metrics

It can also produce an interactive flame graph:

Example flame graph

Embedded HTTP Server

Finally, the ability to disable CPU profiling after execution had started was the other most common feature request.  There was an option to disable it from the start, but not after the profiler was already running.  Both this and the ability to inspect some of the profiler state would have been useful for us while debugging the issues that arose with Graphite initially.  To support both of these features, statsd-jvm-profiler now has an embedded HTTP server.  By default this is accessible from port 5005 on the machine the application being profiled is running on, but this choice of port can be configured with the httpPort option to the profiler.  At present this both exposes some simple information about the profiler’s state and allows disabling collection of CPU or memory metrics.  Adding additional features here is another great place for community contributions.


Unequivocally statsd-jvm-profiler is better for having been open-sourced.  There has been a lot of activity on the project in the months since its initial public release.  It has seen adoption in a variety of use cases, including some quite different from those for which it was initially designed.  There has been a small but helpful community of contributors, both through code and through feedback and suggestions for the project.  When we hit issues using the project internally, the feedback from the community aligned very well with what we needed to get the project back on track and gave us momentum to keep going.

Going forward, keeping up contributions from the community is definitely important to the success of the project.  There is now a mailing list, contribution guidelines, and some suggestions for how to contribute.  If you’d like to get involved or just try out statsd-jvm-profiler, it is available on GitHub!


Experimenting with HHVM at Etsy

Posted by on April 6, 2015 / 37 Comments

In 2014 Etsy’s infrastructure group took on a big challenge: scale Etsy’s API traffic capacity 20X. We launched many efforts simultaneously to meet the challenge, including a migration to HHVM after it showed a promising increase in throughput. Getting our code to run on HHVM was relatively easy, but we encountered many surprises as we gained confidence in the new architecture.

What is HHVM?

Etsy Engineering loves performance, so when Facebook announced the availability of the HipHop Virtual Machine for PHP, its reported leap in performance over current PHP implementations got us really excited.

HipHop Virtual Machine (HHVM) is an open-source virtual machine designed for executing programs written in PHP. HHVM uses a just-in-time (JIT) compilation approach to achieve superior performance while maintaining the development flexibility that PHP provides.

This post focuses on why we became interested in HHVM, how we gained confidence in it as a platform, the problems we encountered and the additional tools that HHVM provides. For more details on HHVM, including information on the JIT compiler, watch Sara Golemon and Paul Tarjan’s presentation from OSCON 2014.


In 2014 engineers at Etsy noticed two major problems with how we were building mobile products. First, we found ourselves rewriting logic originally designed to execute in a web context so that it could execute in an API context. This led to feature drift between the mobile and web platforms as the amount of shared code decreased.

The second problem was how tempting it became for engineers to build lots of general API endpoints that could be called from many different mobile views. If you use too many of these endpoints to generate a single view on mobile you end up degrading that view’s performance. Ilya Grigorik’s “Breaking the 1000ms Time to Glass Mobile Barrier” presentation explains the pitfalls of this approach for mobile devices. To improve performance on mobile, we decided to create API endpoints that were custom to their view. Making one large API request is much more efficient than making many smaller requests. This efficiency cost us some reusability, though. Endpoints designed for Android listing views may not have all the data needed for a new design in iOS. The two platforms necessitate different designs in order to create a product that feels native to the platform. We needed to reconcile performance and reusability.

To do this, we developed “bespoke endpoints”. Bespoke endpoints aggregate smaller, reusable, cacheable REST endpoints. One request from the client triggers many requests on the server side for the reusable components. Each bespoke endpoint is specific to a view.

Consider this example listing view. The client makes one single request to a bespoke endpoint. That bespoke endpoint then makes many requests on behalf of the client. It aggregates the smaller REST endpoints and returns all of the data in one response to the client.

Bespoke Endpoint

Bespoke endpoints don’t just fetch data on behalf of the client, they can also do it concurrently. In the example above, the bespoke endpoint for the web view of a listing will fetch the listing, its overview, and the related listings simultaneously. It can do this thanks to curl_multi. Matt Graham’s talk “Concurrent PHP in the Etsy API” from phpDay 2014 goes into more detail on how we use curl_multi. In a future post we’ll share more details about bespoke endpoints and how they’ve changed both our native app and web development.
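As a rough sketch of the fan-out mechanics, here is how curl_multi can drive several component requests concurrently. The function name and keys are illustrative, not Etsy’s actual code:

```php
<?php
// Sketch of a bespoke endpoint fanning out to several component endpoints
// concurrently via curl_multi. Returns an array of response bodies keyed
// the same way as the input URLs.
function fetch_concurrently(array $urls): array {
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers until every one has finished.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh); // wait for activity instead of spinning
        }
    } while ($running && $status === CURLM_OK);

    $responses = [];
    foreach ($handles as $key => $ch) {
        $responses[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $responses;
}
```

The key point is that the total wall-clock time is roughly that of the slowest component request, not the sum of all of them.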

This method of building views became popular internally. Unfortunately, it also came with some drawbacks.

API traffic growth compared to Web traffic growth

Now that web pages had the potential to hit dozens of API endpoints, traffic on our API cluster grew more quickly than we anticipated. But that wasn’t the only problem.

Bootstrap Time Visualized

This graph represents all the concurrent requests that take place when loading the Etsy homepage. Between the red bars is work that is duplicated across all of the fanned out requests. This duplicate work is necessary because of the shared-nothing process architecture of PHP. For every request, we need to build the world: fetch the signed-in user, their settings, sanitize globals and so on. Although much of this duplicated work is carried out in parallel, the fan-out model still causes unnecessary work for our API cluster. But it does improve the observed response time for the user.
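To make the duplicated bootstrap concrete, here is a hypothetical sketch of the “build the world” step each fanned-out request repeats. The function names and stub return values are illustrative, not Etsy’s actual code:

```php
<?php
// Hypothetical sketch of per-request bootstrap work in a shared-nothing
// model; every fanned-out request must repeat all of it.
function sanitize_globals(array $raw): array {
    // Keep only scalar values; real sanitization is far more involved.
    return array_filter($raw, 'is_scalar');
}

function fetch_signed_in_user(array $cookies): ?array {
    // Stand-in for a session lookup against a backend service.
    return isset($cookies['session']) ? ['id' => 42, 'locale' => 'en_US'] : null;
}

function build_the_world(array $raw_globals, array $cookies): array {
    return [
        'globals' => sanitize_globals($raw_globals),
        'user'    => fetch_signed_in_user($cookies),
    ];
}
```

Under the fan-out model this runs once per component request rather than once per page view, which is exactly the duplicated work between the red bars above.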

After considering many potential solutions to this problem, we concluded that trying to share state between processes in a shared-nothing architecture would inevitably end in tears. Instead, we decided to try speeding up all of our requests significantly, including the duplicated bootstrap work. HHVM seemed well-suited to the task. If this worked, we’d increase throughput on our API cluster and be able to scale much more efficiently.

Following months of iterations, improvements and bug fixes, HHVM now serves all of the fan-out requests for our bespoke endpoints. We used a variety of experiments to gain confidence in HHVM and to discover any bugs prior to deploying it in production.

The Experiments

Minimum Viable Product

The first experiment was simple: how many lines of PHP code do we have to comment out before HHVM will execute an Etsy API endpoint? The results surprised us. We only encountered one language incompatibility. All of the other problems we ran into were with HHVM extensions. There were several incompatibilities with the HHVM memcached extension, all of which we have since submitted pull requests for.

Does it solve our problem?

We then installed both PHP 5.4 and HHVM on a physical server and ran a synthetic benchmark. This benchmark randomly splayed requests across three API endpoints that were verified to work in HHVM, beginning at a rate of 10 requests per second and ramping up to 280 requests per second. The throughput results were promising.

The little green line at the bottom is HHVM response time


Our PHP 5.4 configuration began to experience degraded performance at about 190 requests per second, while the same didn’t happen to HHVM until about 270 requests per second. This validated our assumption that HHVM could lead to higher throughput which would go a long way towards alleviating the load we had placed on our API cluster.

Gaining Confidence

So far we had validated that HHVM could run the Etsy API (at least with a certain amount of work) and that doing so would likely lead to an increase in throughput. Now we had to become confident that HHVM could run it correctly. We wanted to verify that responses returned from HHVM were identical to those returned by PHP. In addition to our API’s full automated test suite and good old-fashioned manual testing, we also turned to another technique: teeing traffic.

You can think of “tee” in this sense like tee on the command line. We wrote an iRule on our f5 load balancer to clone HTTP traffic destined for one pool and send it to another. This allowed us to take production traffic that was being sent to our API cluster and also send it onto our experimental HHVM cluster, as well as an isolated PHP cluster for comparison.

This proved to be a powerful tool. It allowed us to compare performance between two different configurations on the exact same traffic profile.

140 rps peak. Note that this is on powerful hardware.

On the same traffic profile HHVM required about half as much CPU as PHP did. While this wasn’t the reduction seen by the HHVM team, who claimed a third as much CPU should be expected, we were happy with it. Different applications will perform differently on HHVM. We suspect the reason we didn’t see a bigger win is that our internal API was designed to be as lightweight as possible. Internal API endpoints are primarily responsible for fetching data, and as a result tend to be more IO bound than others. HHVM optimizes CPU time, not IO time.

While teeing boosted our confidence in HHVM, there were a couple of hacks we had to put in place to get it to work. We didn’t want teed HTTP requests generating writes in our backend services, so we wrote read-only MySQL, memcached and Redis interfaces to prevent writes. As a result, however, we weren’t yet confident that HHVM would write data correctly, or write the correct data.
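The read-only interfaces might look something like the following sketch. The class and method names are illustrative, not Etsy’s actual code, and a tiny in-memory client stands in for a real memcached connection:

```php
<?php
// Illustrative sketch of a read-only backend interface: reads pass
// through to the real client, writes are silently dropped so that teed
// traffic cannot mutate backend state.
interface CacheClient {
    public function get(string $key);
    public function set(string $key, $value): bool;
}

// Tiny in-memory client standing in for a real memcached connection.
class ArrayCache implements CacheClient {
    private $data = [];
    public function get(string $key) { return $this->data[$key] ?? false; }
    public function set(string $key, $value): bool { $this->data[$key] = $value; return true; }
}

class ReadOnlyCache implements CacheClient {
    private $inner;
    public function __construct(CacheClient $inner) { $this->inner = $inner; }

    public function get(string $key) {
        return $this->inner->get($key); // reads behave normally
    }

    public function set(string $key, $value): bool {
        return true; // drop the write, but report success so callers proceed
    }
}
```

Reporting success for dropped writes keeps the teed request on the same code path a production request would take, which is the point of the exercise.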

Employee Only Traffic

In order to gain confidence in that area we configured our bespoke endpoints to send all requests to the HHVM cluster if the user requesting the page was an employee. This put almost no load on the cluster, but allowed us to ensure that HHVM could communicate with backend services correctly. 

At this point we encountered some more incompatibilities with the memcached extension. We noticed that our API rate limiter was never able to find keys to decrement, which was caused by the decrement function being implemented incorrectly in the HHVM extension. In the process of debugging this we noticed that memcached was returning false for every request HHVM made to it; that turned out to be a bug in the client-side hashing function in HHVM. What we learned is that while the HHVM runtime is rock-solid, a lot of the included extensions aren’t. Facebook thoughtfully wrote many of the extensions specifically for the open source release of HHVM, but many of them are not used internally, since Facebook has its own clients for memcached and MySQL; as a result those extensions have not seen nearly as much production traffic as the rest of the runtime. This is important to keep in mind when working with HHVM. We expect the situation will improve as more teams test HHVM out and contribute patches back to the project, as we at Etsy will continue to do.

After resolving these issues it came time to slowly move production traffic from the PHP API cluster to the HHVM API cluster.

Slow Ramp Up

As we began the slow ramp in production we noticed some strange timestamps in the logs:

[23/janv./2015:22:40:32 +0000]

We even saw timestamps that looked like this:

[23/ 1月/2015:23:37:56]

At first we thought we had encountered a bug with HHVM’s logging system. As we investigated we realized the problem was more fundamental than that.

At Etsy we use the PHP function setlocale() to assist in localization. During a request, after we load a user we call setlocale() to set their locale preferences accordingly. PHP’s setlocale() is implemented with the C library function setlocale(3), which is process-wide: it affects all the threads in a process. Most PHP SAPIs are implemented such that each request is handled by exactly one process, with many processes simultaneously handling many requests, so this is normally not a problem.

HHVM is a threaded SAPI. HHVM runs as a single process with multiple threads where each thread is only handling exactly one request. When you call setlocale(3) in this context it affects the locale for all threads in that process. As a result, requests can come in and trample the locales set by other requests as illustrated in this animation.

locale overwriting

We have submitted a pull request re-implementing the PHP setlocale() function using thread-local equivalents. When migrating to HHVM it’s important to remember that HHVM is threaded, and different from most other SAPIs in common use. Do an audit of extensions you’re including and ensure that none of them cause side effects that could affect the state of other threads.
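The process-wide nature of setlocale() is easy to observe even from a single request. This minimal sketch uses the 'C' locale because it exists on every system; passing "0" queries the current locale without changing it:

```php
<?php
// setlocale() manipulates process-wide state: once one request sets it,
// every thread in the process observes the new value. In a threaded SAPI,
// another request calling setlocale() would change it underneath us.
setlocale(LC_TIME, 'C');              // "request A" sets its locale
$current = setlocale(LC_TIME, '0');   // query without modifying: 'C'
```

In a single-process, single-request SAPI this global state is harmless; under HHVM’s threading model it is a cross-request side effect.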


After rolling HHVM out to just the internal API cluster we saw a noticeable improvement in performance across several endpoints.


HHVM vs PHP on Etsy Internal API

It’s Not Just Speed

In the process of experimenting with HHVM we discovered a few under-documented features that are useful when running large PHP deployments.

Warming up HHVM

The HHVM team recommends that you warm up your HHVM process before having it serve production traffic:

“The cache locality of the JITted code is very important, and you want all your important endpoints code to be located close to each other in memory. The best way to accomplish this is to pick your most important requests (say 5) and cycle through them all serially until you’ve done them all 12 times.”

They show this being accomplished with a simple bash script paired with curl. There is a more robust method in the form of “warmup documents”. 

You specify a warmup document in an HDF file like this:

cmd = 1
url = /var/etsy/current/bin/hhvm/warmup.php // script to execute
remote_host =
remote_port = 35100
headers { // headers to pass into HHVM
  0 {
    name = Accept
    value = */*
  }
  1 {
    name = Host
    value =
  }
  2 {
    name = User-Agent
    value = Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.46 Safari/535.11
  }
}

To tell HHVM to execute that warmup document on startup, simply reference it like so:

Server {
  WarmupRequests {
    * = /var/etsy/current/bin/hhvm/warmup.hdf
  }
}

This will execute /var/etsy/current/bin/hhvm/warmup.php between when the HHVM binary is executed and when the process accepts connections. It will only execute it once, however, and HHVM will not JIT any code until after the twelfth request. To execute a warmup document 12 times, simply reference it 12 times in the config file, like so:

Server {
  WarmupRequests {
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
  }
}

Profiling HHVM with perf(1)

HHVM makes it really easy to profile PHP code. One of the most interesting ways is with Linux’s perf tool.

HHVM is a JIT that converts PHP code into machine instructions. Because these instructions, or symbols, are not in the HHVM binary itself, perf cannot automatically translate them into function names. HHVM creates an interface to aid in this translation. It takes the form of a file in /tmp/ named according to this template:

/tmp/perf-<pid of process>.map 

The first column in the file is the address of the start of that function in memory. The second column is the length of the function in memory. And the third column is the function to print in perf.
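A few lines of such a map file might look like this (the addresses, lengths and symbol names here are made up for illustration; both numeric columns are hex):

```
7f5c60041000 180 PHP::Api_Handler::respond
7f5c60041180  90 HPHP::f_sort
```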

Perf looks up processes it has recorded by their pid in /tmp to find and load these files. (The pid map file needs to be owned by the user running perf report, regardless of the permissions set on the file.) 

If you run

sudo perf record -p <pid> -ag -e instructions -o /tmp/ -- sleep 20

perf will record all of the symbols being executed for the given pid and the amount of CPU time that symbol was responsible for on the CPU over a period of 20 seconds. It stores that data in /tmp/

Once you have gathered data from perf with a command such as the above, you can display that data interactively in the terminal using `perf report`.

perf report


This shows us a list of the most expensive functions (in terms of instructions executed on the CPU). Functions prefixed with HPHP:: are built into the language runtime; for example, HPHP::f_sort accounts for all calls the PHP code makes to sort(). Functions prefixed with PHP:: are programmer-defined PHP functions. Here we can see that 36% of all CPU time occurred in Api_Handler::respond(), for example. Using perf to profile PHP code is powerful on its own, but the ability to jump from a PHP function into an HPHP function lets you see which parts of your codebase HHVM doesn’t handle efficiently. Using this process we were able to determine that sort() calls were slow when enable_zend_sorting was enabled. After patching it to be more efficient, we realized a significant CPU and performance win:

CPU drop

Median perf drop

This change resulted in an additional increase in throughput across our API cluster as well as improved response times.

HHVM Interactive Debugger

HHVM provides an interactive debugger called “hphpd”. hphpd works similarly to gdb: it is a command line based interactive debugger. 

$ hhvm -c /etc/php.d/etsy.ini -m debug bin/test.php
Welcome to HipHop Debugger!
Type "help" or "?" for a complete list of commands.
Program bin/test.php loaded. Type '[r]un' or '[c]ontinue' to go.

Here we set a breakpoint on a function:

hphpd> b Shutdown::registerApiFunctions()
Breakpoint 1 set upon entering Shutdown::registerApiFunctions()
But wont break until class Shutdown has been loaded.
Commence execution until we encounter a breakpoint:
hphpd> continue
Breakpoint 1 reached at Shutdown::registerApiFunctions() on line 101 of /home/dmiller/development/Etsyweb/phplib/Shutdown.php
100     public static function registerApiFunctions() {
101*        self::registerFunction(['Shutdown', 'apiShutdown']);
102     }

Step into that function:

hphpd> step
Break at Shutdown::registerFunction() on line 74 of /home/dmiller/development/Etsyweb/phplib/Shutdown.php
73     public static function registerFunction() {
74*        $callback = func_get_args();

Step over that function:

hphpd> next
Break at Shutdown::registerFunction() on line 76 of /home/dmiller/development/Etsyweb/phplib/Shutdown.php
76*        if (empty($callback)) {
77             $bt = new Dev_Backtrace();
hphpd> next
Break at Shutdown::registerFunction() on line 82 of /home/dmiller/development/Etsyweb/phplib/Shutdown.php
81         }
82*        if (!is_callable($callback[0])) {
83             $bt = new Dev_Backtrace();

After adding a few lines to your configuration file you can use this debugger on any code that executes in HHVM.

Lessons Learned from the Experiment

The process of migrating our API cluster to HHVM taught us a lot about HHVM, as well as how to better perform such migrations in the future. The ability to clone HTTP traffic and tee it to a read-only test cluster allowed us to gain confidence in HHVM much more quickly than we could have otherwise. While HHVM proved to be rock-solid as a language runtime, the extensions proved to be less battle-tested: we frequently encountered bugs and missing features in the MySQL, memcached and OAuth extensions, among others. Finally, it’s important to remember that HHVM is threaded, which can result in a weird interplay between the runtime and process-wide library calls. The resulting behavior can be very surprising.

HHVM met our expectations. We were able to realize a greater throughput on our API cluster, as well as improved performance. Buying fewer servers also means less waste and less power consumption in our data centers, which is important to Etsy as a Benefit Corporation.


You can follow Dan on Twitter @jazzdan.

Special thanks to Sara Golemon, Paul Tarjan and Josh Watzman at Facebook. Extra special thanks to Keyur Govande and Adam Saponara at Etsy.