Managing Hadoop Job Submission to Multiple Clusters

Posted on September 24, 2015

At Etsy we have been running a Hadoop cluster in our datacenter since 2012. This cluster handled both our scheduled production jobs and all ad hoc jobs. After several years of running our entire workload on this one production Hadoop cluster, we recently built a second. This has greatly expanded our capacity and ability to manage production and ad hoc workloads, and we got to have fun coming up with names for them (we settled on Pug and Basset!). However, having more than one cluster has brought new challenges. One of the more interesting issues that came up was how to manage the submission of ad hoc jobs with multiple clusters.

The Problem

As part of building out our second cluster we decided to split our current workload between the two clusters.  Our initial plan was to divide the Hadoop workload by having all scheduled production jobs run on one cluster and all ad hoc jobs on the other.  However, we recognized that those roles would change over time.  First, if there were an outage or we were performing maintenance on one of the clusters, we may shift all the workload to the other.  Also, as our workload changes or we introduce new technology, we may balance the workload differently between the two clusters.

When we had only one Hadoop cluster, users of Hadoop would not have to think about where to run their jobs.  Our goal was to keep it easy to run an ad hoc job without users needing to continually keep abreast of changes in which cluster to use.  The major obstacle for this goal is that all Hadoop users submit jobs from their developer VMs.  This means we would have to ensure that the changes necessary to switch which cluster should be used for ad hoc jobs propagate to all of the VMs in a timely fashion.  Otherwise some users would still be submitting their jobs to the wrong cluster, which could mean those jobs would fail or otherwise be disrupted. To simplify this and avoid such issues, we wanted a centralized mechanism for determining which cluster to use.

Other Issues

There were two related issues that we decided to address at the same time as managing the submission of ad hoc jobs to the correct cluster.  First, we wanted the cluster administrators to have the ability to disable ad hoc job submission entirely.  Previously we had relied on asking users via email and IRC to not submit jobs, which is only effective if everyone checks and sees that request before launching a job.  We wanted a more robust mechanism that would truly prevent running ad hoc jobs.  Also, we wanted a centralized location to view the client-side logs from running ad hoc jobs.  These would normally only be available in the user’s terminal, which complicates sharing these logs when getting help with debugging a problem.  We wanted both of these features regardless of having the second Hadoop cluster.  However, as we considered various approaches for managing ad hoc job submission to multiple clusters, we found that we could solve these problems at the same time.

Our Approach

We chose to use Apache Oozie to manage ad hoc job submission.  Using Oozie had several significant advantages for us.  First, we already were using Oozie for all of our scheduled production workflows.  As such we already understood it well and had it properly operationalized.  It also allowed us to reuse existing infrastructure rather than setting up something new, which greatly reduced the time and effort necessary to complete this project. Next, using Oozie let us distribute the load from the job client processes across the Hadoop cluster.  When ad hoc job submission occurred on users’ VMs, this load was naturally distributed.  Distributing this load across the Hadoop cluster allows this approach to grow with the cluster.  Moreover, using Oozie automatically provided a central location for viewing the client logs from job submission.  Since the clients run on the Hadoop cluster, their logs are available just like the logs from any other Hadoop job.  As such they can be shared and examined without needing to retrieve them from the user’s terminal.

There was one downside to using Oozie: it did not support automatically directing ad hoc jobs to the appropriate cluster or disabling the submission of ad hoc jobs.  We had to build this ourselves, but as Oozie was handling everything else it was very lightweight.  To minimize the amount of new infrastructure for this component, we used our existing internal API framework to manage this state.  We call this component the “state service”.

The Job Submission Process

Previously the process of submitting an ad hoc job looked like this:

Original Job Submission Sequence Diagram
Now submitting an ad hoc job looks like this instead:


Job Submission Server Sequence Diagram


From the perspective of users nothing had changed; they would still launch jobs using our run_scalding script on their VM.  Internally, it would request the active ad hoc cluster using the API for the state service.  This API call would also indicate if ad hoc job submission was disabled, allowing the script to terminate.  Administrators can also set a message that would be displayed to users when this happens, which we use to provide information about why ad hoc jobs were disabled and the ETA on re-enabling them.

Once the script determined the cluster on which the job should run, it would generate an Oozie workflow from a template that would run the user’s job.  This occurs transparently to the user so that they do not have to be concerned about the details of the workflow definition.  The script then submits this generated workflow to Oozie, and the job runs.  The change most visible to users in this process is that the client logs no longer appear in their terminal as the job executes.  We considered trying to stream them from the cluster during execution, but to minimize complexity the script prints a link to the logs on the cluster after the job completes.
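To make this concrete, here is a minimal sketch of the two steps the script performs: interpreting the state service's answer, and generating an Oozie workflow from a template. This is an illustrative reconstruction, not Etsy's actual code; the state-service response fields (`cluster`, `disabled`, `message`) and the bare-bones workflow template are assumptions.

```python
import json


def resolve_adhoc_cluster(state_response):
    """Interpret a state-service response (JSON text): return the active
    ad hoc cluster, or raise with the administrators' message if ad hoc
    job submission is currently disabled."""
    state = json.loads(state_response)
    if state.get("disabled"):
        raise RuntimeError("Ad hoc jobs are disabled: " + state.get("message", ""))
    return state["cluster"]


def render_workflow(main_class, job_args):
    """Fill a minimal Oozie workflow template that runs the user's job as a
    single java action. Generated transparently, so the user never has to
    think about the workflow definition."""
    args = "\n      ".join("<arg>{}</arg>".format(a) for a in job_args)
    return """<workflow-app xmlns="uri:oozie:workflow:0.5" name="adhoc-job">
  <start to="run-job"/>
  <action name="run-job">
    <java>
      <job-tracker>${{jobTracker}}</job-tracker>
      <name-node>${{nameNode}}</name-node>
      <main-class>{main}</main-class>
      {args}
    </java>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>Ad hoc job failed</message></kill>
  <end name="end"/>
</workflow-app>""".format(main=main_class, args=args)
```

The real script would then POST the rendered workflow to the Oozie server and print a link to the client logs once the job completes.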

Other Options

While using Oozie ended up being the best choice for us, there were several other approaches we considered.

Apache Knox

Apache Knox is a gateway for Hadoop REST APIs.  The project primarily focuses on security, so it’s not an immediate drop-in solution for this problem.  However, it provides a gateway, similar to a reverse proxy, that maps externally exposed URLs to the URLs exposed by the actual Hadoop clusters.  We could have used this functionality to define URLs for an “ad hoc” cluster and change the Knox configuration to point that to the appropriate cluster.

Nevertheless, we felt Knox was not a good choice for this problem.  Knox is a complex project with a lot of features, but we would have been using only a small subset of these.  Furthermore, we would be using it outside of its intended use case, which could complicate applying it to solve our problem.  Since we did not have experience operating Knox at scale, we felt it would be better to stick with Oozie, which we already understood and would not have to shoehorn into our use case.

Custom Job Submission Server

We also considered implementing our own custom process to both manage the state of which cluster was currently active for ad hoc jobs as well as handling centralized job submission.  While this would have provided the most flexibility, it also meant building a lot of new infrastructure.  We would have essentially been reimplementing Oozie, but without any of the community testing or support.  Since we were already using Oozie and it met all our requirements, there was no need to build something custom.

Gateway Server

The final approach we considered was having a “gateway server” and requiring users to SSH to that server and launch jobs from there instead of from their VM.  This would have simplified the infrastructure components for job submission.  The Hadoop configuration changes to point ad hoc job submissions to the appropriate cluster or disable job submission entirely would only need to be deployed there.  By its very nature it would provide a central location for the client logs.  However, we would have to manage scaling and load balancing for this approach ourselves.  Furthermore, it would represent a significant departure from how development is normally done at Etsy.  Allowing users to write and run Hadoop jobs from their VM is important for keeping Hadoop as accessible as possible.  Adding the additional step of moving changes and SSH-ing to a gateway server compromises that goal.


Using Oozie to manage ad hoc job submission in this way has worked well for us.  Reusing the Oozie infrastructure we already had let us quickly build this out, and having this new process for running jobs made the transition to having two Hadoop clusters much easier.  Moreover, we were able to keep the process of submitting an ad hoc job almost identical to the previous process, which minimized the disruption for users of Hadoop.

As we were developing this, we found that there was only minimal discussion online about how other organizations have managed ad hoc job submission with multiple clusters.  Our hope is that this review of our approach as well as other options we considered is helpful if you are in the same situation and are looking for ideas for your own process of ad hoc job submission.


Assisted Serendipity – Fostering Peer to Peer Connections in Your Organization

Posted on September 15, 2015

It happens at every growing company – one day you pass someone in the hallway of your office and have no idea whether they work with you, or if they’re just visiting your office. You used to know just about everyone at your company, but you’re growing so fast and hiring so quickly that it’s hard to keep up.  Even the most extroverted of us have a hard time learning everyone’s name when offices start expanding to different floors, different states, and even different countries.

One way to combat this problem is to give employees a means of being randomly introduced to each other.  We’ve already written a bit about culture hacking using a staff database, and the tool we’re open sourcing today takes advantage of this employee data that we make available within the company. The tool that we’re releasing is called Mixer. It’s a simple web app that allows people to join a group and then get randomly paired with another member of that group. It then prompts you to meet each other for a coffee, lunch, or a drink after work.  If the person you get paired up with is working remotely, that’s not a problem — just hop on a video chat.  This encourages people who may not work in the same place to stay in touch and find out what’s going on in each other’s day to day.  The tool keeps a history of the pairings and attempts to match you with someone unique each week; it’s possible to opt in or out of the program at any time.


A lot of managers believe in the value of regular one-on-one meetings with their reports, but it is less common to do so with peers. At Etsy, these meetings between peers have resulted in cross-departmental partnerships that might not have otherwise surfaced, on top of providing an avenue of support for folks to work through difficult situations. These conversations also generally strengthen our culture by introducing people to their co-workers. Other benefits include learning more about what others are working on, brainstorming new collaborative projects that utilize strengths from a diverse set of core skillsets, and getting help with a challenge from someone who is distanced from the situation. Mixer meetings both introduce people who have never met and give folks who know each other a chance to connect in a way they might not have otherwise made time for.

As your company grows, it’s important to facilitate the person-to-person connections that happened naturally when everyone fit in the same small room. These interactions create the fabric of your company’s community and are crucial opportunities for building culture and fostering innovation. Our hope is that the Mixer tool can help you scale those genuine connections as you continue to see new faces in the hallway.

Find the Mixer code on GitHub


How Etsy Uses Thermodynamics to Help You Search for “Geeky”

Posted on August 31, 2015

Etsy shoppers love the large and diverse selection of our marketplace. But, for those who don’t know exactly what they’re looking for, the sheer number and variety of items available can be more frustrating than delightful. In July, we introduced a new user interface which surfaces the top categories for a search request to help users explore the results for queries like “gift.” Searchers who issue broad queries like this often don’t have a specific type of item in mind, and are especially likely to finish their visit empty-handed. Our team lead, Gio, described our motivations and process in an (excellent) blog post last month, which gives more background on the project. In this post, I’ll focus on how we developed and iterated on our heuristic for classifying queries as “broad.”

Our navigable interface, shown for a query for “geeky gift”

Quantifying “broadness”

When I describe what I’m working on to people outside the team, they often jump in with a guess that we use machine learning techniques to determine, in code, which queries are broad. While we could have used complex, offline signals like click or purchasing behavior to learn which queries should trigger the category interface, we actually base the decision on a single calculation, evaluated entirely at runtime, which uses very basic statistics about the search result set.

There have been several advantages to sticking with a simpler metric. By avoiding query-specific behavioral signals, our approach works for all languages and long-tail queries out of the gate. It’s performant and relatively easy to debug. It’s also (knock on wood) stable and easy to maintain, with very few external dependencies or moving parts. I’ll explain how we do it, and arguably justify the title of this post in the process.

Let’s take “geeky” as an example of a broad query, one that tells us very little about what type of item the user is looking for. Jewelry is the top category for “geeky,” but there are many items in all of the top-level categories.

Top Categories for "Geeky" by Result Count

Compare to the distribution of results for “geeky mug,” which are predictably concentrated in the Home & Living category.

Top Categories for "Geeky Mug" by Result Count

In plainspeak, the calculation we use measures how spread out across the marketplace the items returned for the query are. The distribution of results for “geeky” suggests that the user might benefit from seeing the top categories, which demonstrate the breadth of geeky paraphernalia available on the site, from a periodic table-patterned bow tie to a “cutie pi” mug. The distribution for “geeky mug” is dominated by one category, and shouldn’t trigger the category interface.

The categories shown for a query for “geeky”

Doing the math

In order to quantify how “spread out” items are, we start by taking the number of results returned for the query in each of the top-level categories and deriving the probability that an item is in each category. Since 20% of the items returned are in the Jewelry category and 15% of items are in the Accessories category, the probability values for Jewelry and Accessories would be .2 and .15 respectively. We use these values as the inputs to the Shannon entropy formula:

$$H = -\sum_i p_i \log_2 p_i$$

Shannon entropy formula

This formula is a measure of the disorder of a probability distribution. It’s essentially equivalent to the formula used to calculate the thermodynamic entropy of a physical system, which models a similar concept.

For our purposes, let $r_t$ be the total number of results and $r_i$ be the number of results in category $i$. Then the probability value in the above equation is $p_i = r_i / r_t$, and the entropy of the distribution of a search result set across its categories can be expressed as:

$$H = -\sum_i \frac{r_i}{r_t} \log_2 \frac{r_i}{r_t}$$

Entropy of a search result set

In this way, we can determine when to show categories without using any offline signals. This is not to say that we didn’t use data in our development process at all. To determine the entropy threshold above which we should show categories, we looked at the entropies for a large sample of queries and made a fairly liberal judgement call on a good dividing line (i.e. a low threshold). Once we had results from an AB experiment which showed the new interface to real users, we looked to see how it affected user behavior for queries with lower entropy levels, and refined the cut-off based on the numbers. But this was a one-off analysis; we expect the threshold to be static over time, since the distribution of our marketplace across categories changes slowly.
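The entire broadness check fits in a few lines of code. The sketch below is our own illustration rather than Etsy's implementation, and the 2.0-bit threshold and example counts are invented for the demonstration, since the post doesn't publish the real cut-off:

```python
from math import log2


def category_entropy(counts):
    """Shannon entropy (in bits) of a result set's distribution across
    categories; `counts` maps category name -> number of results."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values() if c > 0)


def is_broad(counts, threshold=2.0):
    """Trigger the category interface when entropy exceeds the tuned cut-off."""
    return category_entropy(counts) > threshold


# "geeky": results spread across many top-level categories -> high entropy
geeky = {"Jewelry": 200, "Accessories": 150, "Home & Living": 140,
         "Art": 130, "Clothing": 130, "Toys & Games": 125, "Craft Supplies": 125}

# "geeky mug": dominated by Home & Living -> low entropy
geeky_mug = {"Home & Living": 900, "Art": 50, "Kitchen": 50}
```

With these made-up counts, `is_broad(geeky)` is true while `is_broad(geeky_mug)` is false, matching the behavior described above.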

Taking it to the next level

A broad query may not necessarily have high entropy at the top level of our taxonomy. Results for “geeky jewelry” are unsurprisingly concentrated in our Jewelry category, but there are still many types of items that are returned. We’d like to guide users into more specific subcategories, like Earrings and Necklaces, so we introduced a secondary round of entropy calculations for queries that don’t qualify as broad at the top level. It works like this: if the result set does not have sufficiently high entropy to trigger the categories at the top level, we determine the entropy within the most populous category (i.e. the entropy of its subcategories) and show those subcategories if that value exceeds our entropy cut-off.

Top Subcategories for "Jewelry" By Count

The graph above demonstrates the level of spread of results for the query “jewelry” across subcategories of the top-level Jewelry category. This method allows us to dive into top-level categories in cases like this, while sticking to a simple runtime decision based on category counts.
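A sketch of that two-level decision might look like the following. Again, this is illustrative rather than Etsy's code: the threshold and the data shapes are invented, and the subcategory counts are assumed to be available from the same runtime result-set statistics.

```python
from math import log2


def category_entropy(counts):
    """Shannon entropy (bits) of a count distribution over categories."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values() if c > 0)


def choose_grouping(top_counts, sub_counts, threshold=2.0):
    """Two-level decision: show top-level categories when the result set is
    broad at the top level; otherwise check the most populous category's
    subcategories. `sub_counts` maps a top-level category name to that
    category's subcategory counts. Threshold is an invented example."""
    if category_entropy(top_counts) > threshold:
        return "top-level", sorted(top_counts, key=top_counts.get, reverse=True)
    biggest = max(top_counts, key=top_counts.get)
    inner = sub_counts[biggest]
    if category_entropy(inner) > threshold:
        return biggest, sorted(inner, key=inner.get, reverse=True)
    return None, []  # query is specific; show plain results
```

For a "geeky jewelry"-style query, the top-level entropy is low, so the function falls through to the Jewelry subcategories and returns those instead.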

Showing subcategories for a query for “geek jewelry”

Iterating on entropy

While we were testing this approach, we noticed that a query like “shoes,” which we hoped would be high entropy within the Shoes category, was actually high entropy at the top level.

Top-level categories for the “shoes” query… doesn’t seem quite right

Items returned for “shoes” are apparently sufficiently spread across the whole marketplace to trigger top-level groups, although there is an unusually high number of items in the Shoes category.

Top Categories for "Shoes" by Result Count

More generally, items in our marketplace tend to be concentrated in the most popular categories. A result set is likely to have many more Accessories items than Shoes items, because the former category is an order of magnitude larger than the latter. We want to be able to compensate for this uneven global distribution of items when we calculate the probabilities that we use in our entropy calculation.

By dividing the number of items in each category that are returned for the active search by the total number of items in that category, we get a number we can think of as the affinity between the category and the search query. Although fewer than 50% of the results that come back for a query for “shoes” are in the Shoes category, 100% of items in the Shoes category are returned for a query for “shoes,” so its category affinity is much higher than its raw share of the result set.

Top Categories for "Shoes" by Affinity

Normalizing the affinity values so they sum to one, we use these measurements as the inputs to the same Shannon entropy formula that we used in the first iteration. The normalization step ensures that we can compare entropy values across search result sets of different sizes. Letting $r_i$ represent the number of items in category $i$ for the active search query, and $t_i$ represent the total number of items in that category, the affinity value for category $i$ is simply $a_i = r_i / t_i$. Taking $s$ as the sum of all the affinity values, the affinity-based entropy is:

$$H = -\sum_i \frac{a_i}{s} \log_2 \frac{a_i}{s}$$

Affinity-based entropy of a search result set

From a Bayesian perspective, both the original result count-based values and the affinity values calculate the probability that a listing is in a category given that it is returned for the search query. The difference is that the affinity formulation corresponds to a flat prior distribution of categories whereas the original formulation corresponds to the observed category distribution of items in our marketplace. By controlling for the uneven distribution of items across categories on Etsy, affinity-based entropy fixed our “shoes” problem, and improved the quality of our system in general.
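In code, the affinity-based calculation differs from the count-based one only in the values fed to the entropy formula. The sketch below is ours, with invented inventory numbers chosen to mimic the "shoes" situation:

```python
from math import log2


def affinity_entropy(result_counts, category_totals):
    """Entropy over normalized category affinities, where the affinity of a
    category is the fraction of its total inventory returned for the query."""
    affinity = {c: result_counts[c] / category_totals[c]
                for c in result_counts if result_counts[c] > 0}
    s = sum(affinity.values())
    return -sum((a / s) * log2(a / s) for a in affinity.values())


def count_entropy(result_counts):
    """Plain result-count entropy from the first iteration, for comparison."""
    total = sum(result_counts.values())
    return -sum((c / total) * log2(c / total)
                for c in result_counts.values() if c > 0)
```

For a query that returns all of a small Shoes category plus a thin slice of a huge Accessories category, the affinity-based entropy comes out far lower than the raw count entropy, so the query no longer looks broad at the top level.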

Refining by recipient on a query for “geeky shoes”

Keeping it simple

Although our iterations on entropy have introduced more complexity than we had at the outset, we still reap the benefits of avoiding opaque offline computations and dependencies on external infrastructure. Big data signals can be incredibly powerful, but they introduce architectural costs that it turns out aren’t necessary for a functional broad query classifier.

On the user-facing level, making Etsy easier to explore is something I’ve wanted to work on since before I started working here many years ago. It’s very frustrating for searchers to navigate through the millions of items of all types that we return for many popular queries. If you’ll indulge my thermodynamics metaphor once more, by helping to guide users out of high-entropy result sets, we’re battling the heat death of Etsy search—and that’s literally pretty cool.


Couldn’t stomach that “heat death” joke? Leave a comment or let me know on Twitter.

Huge thanks due to Giovanni Fernandez-Kincade, Stan Rozenraukh, Jaime Delanghe and Rob Hall.


Targeting Broad Queries in Search

Posted on July 29, 2015

We’ve just launched some big improvements to the treatment of broad queries like “father’s day,” “upcycled,” or “boho chic” on Etsy. This is the most dramatic change to the search experience since our switch to relevance by default in 2011. In this post we’d like to give you an introduction to the product and its development process. We think it’s a great example of the values that are at the heart of product engineering at Etsy: leveraging simple techniques, building iteratively, and understanding impact.


Before we make a big investment in an idea, we like to spend some time investigating whether or not that idea represents a reasonable opportunity. The opportunity at the heart of this project is exploratory queries like “silver jewelry” where users don’t have something particular in mind. There are 2.7 MM results for “silver jewelry” on Etsy today. No matter how good we get at ranking results, the universe of silver jewelry is simply so vast that the chances that we will show you something you like are pretty slim.

How big of an opportunity is improving the experience for broad queries? How do we even define a broad query?

That’s a really difficult question. Going through this exercise can easily turn into doing the hardest parts of the “real work.” Instead of doing something clever, we time-boxed our analysis and looked at a handful of heuristics for different levels of user intent. Here’s a sample:

  1. Number of Tokens
  2. Result Set Size
  3. Number of Distinct Categories Represented in the Results

For each heuristic, we looked at the distribution across a week’s worth of search queries, and chose a threshold that generally separated the broad from the specific queries.
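A rough pass over a week of query logs with heuristics like these needs very little code. The sketch below is illustrative; the signal names are ours and the thresholds are invented, not the ones we actually chose:

```python
def broadness_signals(query, result_categories):
    """Compute the three heuristics for one query. `result_categories` is the
    category of each returned listing. Thresholds are invented examples."""
    signals = {
        "num_tokens": len(query.split()),
        "result_set_size": len(result_categories),
        "distinct_categories": len(set(result_categories)),
    }
    # Example thresholding: few tokens, a large result set, and results
    # spanning many categories all hint that the query is broad.
    signals["looks_broad"] = (signals["num_tokens"] <= 2
                              and signals["result_set_size"] >= 1000
                              and signals["distinct_categories"] >= 5)
    return signals
```

Aggregating these signals over a sample of queries is what let us pick rough thresholds and size the broad-query population without building anything clever.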


We looked at the size of that population and their engagement rates (the green arrow is our target audience):

Click Rate and Population by Search Tokens

None of the heuristics were independently sufficient, but by looking at several we were able to generate a rough estimate: it turns out that a sizable portion of searches on Etsy are broad queries. That matches our intuitions. Etsy is a marketplace of unique goods so it’s hard for consumers to know precisely what to look for.

Having some evidence that this was a worthwhile endeavor, we packed our bags and set off to meet the wizard.

Crafting an Experience

What can we do to improve the experience for users that issue a broad query? What about grouping the results into discrete buckets so users can get a better sense of what types of things are present? Grouping items into their respective categories seemed like an obvious starting place, but we could also group the items by any number of dimensions like style, color, and material.

We started with a few quick-and-dirty iterations of design and user-testing. Our designer fashioned a ton of static mocks that he turned into clickable prototypes using Flinto:


We followed this up with an unshippable prototype of result grouping on mobile web. We did the simplest possible thing: always show result groupings, regardless of how specific the query is. We even simulated a native version using JPEG technology:

Jpeg Tech

People responded really well to these treatments. Many even expressed a desire for the feature before they saw it: “I wish I could just see what types of jewelry there are.”

But the user tests also made it painfully clear how problematic false positives (showing groups when the query is definitely not broad) were. There were moments of frustration where users clearly just wanted to see some results and the groups were getting in the way.

On the other hand, showing too many groups didn’t seem as costly. If random or questionably relevant groups appeared towards the end of the list, users often found them interesting or felt they highlighted what made Etsy unique (“I didn’t know you had those!”), adding a serendipitous flavor to the experience.

What’s a broad query?

Armed with a binder full of reasonable UX treatments, it was time to start tackling the algorithmic challenge. The heuristics we used at the beginning of this journey were sufficient for ballpark estimation, but they were fairly imprecise and it was clear that minimizing false positives was a priority.

We quickly settled on using entropy, which you can think of as a measure of the uncertainty in a probability distribution. In this case, we’re looking at the probability that a result belongs to a particular category.

Probability of Jewelry

As the probabilities get more concentrated around a handful of categories, the entropy approaches zero. For example, this is the probability distribution for the query “shoes” amidst the top-level categories:


As the distribution gets more dispersed, entropy increases. Here is the same distribution for “father’s day”:

Father's Day

We looked at samples of queries at different entropy levels to manually decide on a reasonable threshold.


Could we have trained a more sophisticated model with some supervised learning algorithms? Probably, but there are a host of challenges with that approach: getting hand-labeled data or dealing with the noise of using behavioral signals for training data, data sparsity/coverage, etc. Ultimately, we already had what we thought was the most discriminating factor, the resulting algorithm had an intuitive explanation that was easy to reason about, and we felt confident that it would scale to cover the long tail.

Conclusions and Coming Next

After a series of A/B experiments, we’re happy to report that result grouping has resulted in a dramatic increase in user engagement and we’re launching it. But this is only the beginning for this feature and for this story.

Henceforth, result grouping will be another lever in the search product toolbox. The work that we’ve been doing for the past year has really been about building a foundation. We’re going to be aggressively iterating on offline evaluation, new treatments, new grouping dimensions,  classification algorithms, and group ordering strategies. We’re in this for the long haul and we’re excited about the many doors this work has opened for us.

I hope this post gave you a taste for what went into this effort. In the coming months, we’re going to have many members of the Etsy Search family diving deeper into some of the meatier details on subjects like result grouping performance, iterating on the entropy-based algorithm, and how our new product categories laid the groundwork for these improvements.

Oh yeah, and we’re hiring.


Q2 2015 Site Performance Report

Posted on July 13, 2015

We are kicking off the third quarter of 2015, which means it’s time to update you on how Etsy’s performance changed in Q2. Like in our last report, we’ve taken data from across an entire week in May and are comparing it with data from an entire week in March. We’ve also mixed things up in this report to better visualize our data and the changes in site speed.

As in the past, we’ve split up the sections of this report among members of our performance team. Allison McKnight will be reporting on the server-side portion, Kristyn Reith will be covering the synthetic front-end section and Natalya Hoota will be providing an update on the real user monitoring section. We have to give a special shout out to our bootcamper Emily Smith, who spent a week working with us and digging into the synthetic changes that we saw. So without further ado, let’s take a look at the numbers.

Server-Side Performance


Taking a look at our backend performance, we see that the quartile boundaries for home, listing, shop, and baseline pages haven’t changed much between Q1 and Q2. We see a change in the outliers for the shop and baseline pages – the outliers are more spread out (and the largest outlier is higher) in this quarter compared to the last quarter. For this report, we are going to focus on analyzing only changes in the quartile boundaries while we work on honing our outlier analysis skills and tools for future reports.


On the cart page, we see the top whisker and outliers move down. During the week in May when we pulled this data, we were running an experiment that added pagination to the cart. Some users have many items in their carts; these items take a long time to load on the backend. By limiting the number of items that we load on each cart page, we improve the backend load time for these users especially. If we were to look at the visit data in another format, we might see a bimodal distribution where users exposed to this experiment would have clearly different performance than users who didn’t see the experiment. Unfortunately, box plots limit our view on whether user experience could be statistically divided into two separate categories (i.e. multimodal distribution). We’re happy to say that we launched this feature in full earlier this week!


This quarter, the Search team experimented with new infrastructure that should make desktop and mobile experience more streamlined. On the backend, this translated into a slightly higher median time with an improvement for the slower end of users: the top whisker moved down from 511 ms to 447 ms, and the outliers moved down with it. The bottom whisker and the third quartile also moved down slightly while the first quartile moved up.

Taking a look at our timeseries record of search performance across the quarter, we see that a change was made that greatly impacted slower loads and had a smaller impact on median loads:


Synthetic Start Render and Webpage Response

Most things look very stable quarter over quarter for synthetic measurements of our site’s performance.


As we only started our synthetic measurements for the cart page in May, we do not have quarter-over-quarter data.


You can see that the start render time of the search page has gotten slower this quarter but that the webpage response time for search sped up. The regression in start render was caused by experiments being run by our search team, while the improvement in the webpage response time for search resulted from the implementation of the Etsy styleguide toolkit. The toolkit is a set of fully responsive components and utility classes that make layout fast and consistent. Switching to the new toolkit decreased the amount of custom CSS that we deliver on search pages by 85%.


As noted above, we are using a slightly different date range for the listing and shop data so that we can compare apples to apples. Taking a look at the webpage response time box plots, we see improvements to both the listing and shop pages. The faster webpage response time for the listing page can be attributed to an experiment running that reduced the page weight by altering the font-weights. The improvement to shop’s webpage response time is the result of migrating to a new tag manager that is used to track the performance of outside advertising campaigns. This migration allowed us to fully integrate third party platforms in new master tags which reduced the number of JS files for campaigns.

Real User Page Load Time

The software we use for our real user measurements, mPulse, was updated in the middle of this quarter, bringing a number of improvements in timer calculation, data collection, and validation. As expected, we saw a more comprehensive pattern in data outliers (i.e., values falling far above and below the average) on all pages, and we are excited about this cleaner data set.


Since the Q1 and Q2 data were collected with different versions of the real user monitoring software, it would not be scientifically sound to draw conclusions about our user experience this quarter relative to the previous one. That said, the data suggest a slight overall improvement sitewide, a trend we hope to continue through next quarter.


Although we saw a few noteworthy changes to individual pages, things remained fairly stable in Q2. Using box plots for this report helped us provide a more holistic representation of the data distribution, range and quality by looking at the quartile ranges and the outliers. For next quarter’s report we are really excited about the opportunity to continue exploring new, more efficient ways to visualize the quarterly data.


Open Source Spring Cleaning

Posted by on July 9, 2015 / 1 Comment

At Etsy, we are big fans of Open Source. Etsy as it is wouldn’t exist without the myriad of people who have solved a problem and published their code under an open source license. We serve the site through the Apache web server running on Linux, our server-side code is mostly written in PHP, we store our data in MySQL, we track metrics with graphs from Ganglia and Graphite to keep us up to date, and we use Nagios to monitor the stability of our systems. And these are only the big examples. In every nook and cranny of our technology stack you can find Open Source code.

Part of everyone’s job at Etsy is what we call “Generosity of Spirit”, which means giving back to the industry. For engineers that means that we strive to give a talk at a conference, write a blog post on this blog or contribute to open source at least once every year. We love to give back to the Open Source community when we’ve created a solution to a problem that we think others might benefit from.

Maintenance and Divergence

This has led to many open sourced projects on our GitHub page and a continuing flow of contributions from our engineers and the Open Source community. We are not shy about open sourcing core parts of our technology stack. We are publicly developing our deployment system, metrics collector, team on-call management tool and our code search tool. We even open sourced the crucial parts of our atomic deployment system. And it has been very rewarding to receive bug fixes and features from the wider community that make our software more mature and stable.

As we open sourced more projects, it became tempting to run an internal fork of a project when we wanted to add new features quickly. These projects with internal forks quickly diverged from the open sourced versions, which doubled the work of maintaining them: anything fixed or changed internally had to be fixed or changed externally, and vice versa. In a busy engineering organization, the internal version usually took priority over the public one. Looking at our GitHub page, it wasn’t clear – even to an Etsy engineer – whether or not we were actively maintaining a given project.

We ended up with public projects that hadn’t been committed to in years. Contributors who took the time to file a bug report sometimes didn’t get an answer for years, which didn’t instill confidence in potential users. No one could tell whether a project was a maintained piece of software or a proof of concept that wouldn’t get any updates.

Going forward

We want to do better by the Open Source community, since we’ve benefited so much from existing Open Source Software. We did a bit of Open Source spring cleaning to bring more clarity to the state of our open source projects. Going forward our projects will be clearly labeled as either maintained, not maintained, or archived.


Maintained

Maintained projects are the default and are not specifically labeled as such. For maintained projects, we’re either running the open source version internally or currently working on getting our internal version back in sync with the public version, as we already did for our deployment tool. We are actively working on maintained projects: merging or commenting on pull requests, answering bug reports, and adding new features.

Not Maintained

We also have a few projects that haven’t seen public updates in years. Usually this is because we haven’t found a way to make the project configurable enough that we can run the public version internally without slowing down our development cycles. However, the code as it stands serves as a great proof of concept and illustrates how we approach the problem. Or it might have been a research project that we abandoned because it turned out not to solve our problem in the long run, but we still wanted to share what we tried. Those projects will stay the way they are and will rarely receive updates. We will turn off issues and pull requests on them and make it very clear in the README that they are proofs of concept only.


Archived

We also have a number of projects that we open sourced because we were using them at one time but have since abandoned altogether. We have likely found that there exists a better solution to the problem or that the solution hasn’t proven useful in the long run. In those cases we will push a commit to the master branch that removes all code and leaves only the README with a description of the project and its status. The README will link to the last commit containing actual code. This way the code doesn’t just vanish, but the project is clearly not active. Those projects will also have issues and pull requests turned off.

In addition to the archival of those projects we will also start to delete forks of other Open Source projects that we’ve made at some point, but aren’t actively maintaining.

Closing thoughts

We have learned a lot about maintaining Open Source projects over the last couple of years. The main lesson we want to share is that it’s essential to use the Open Source version internally to provide a good experience for other Open Source developers who want to use our software. We strive to always learn and get better at everything we do. If you’ve been waiting for us to respond to an issue or merge a pull request, hopefully this will give you more insight into what has been going on and why it took so long for us to respond, and we hope that our new project labeling system will also give you more clarity about the state of our open source projects. In order to be good open source citizens we want to always do our best to give back in a way that is helpful for everyone. And a little spring cleaning is always a good thing. Even if it’s technically summer already.

You can follow Jared and Daniel on Twitter.


Four Months of statsd-jvm-profiler: A Retrospective

Posted by on May 12, 2015 / 10 Comments

It has been about four months since the initial open source release of statsd-jvm-profiler.  There has been a lot of development on it in that time, including the addition of several major new features.  Rather than just announcing exciting new things, this is a good opportunity to reflect on what has come of the project since open-sourcing it and how these new features came to be.

External Adoption

It has been very exciting to see statsd-jvm-profiler being adopted outside of Etsy, and we’ve learned a lot from talking to these new users.  It was initially built for Scalding, and many of the people who’ve tried it out have been profiling Scalding jobs.  However, I have spoken to people who are using it to profile jobs written in other MapReduce APIs, such as Scrunch, as well as pure MapReduce jobs.  Moreover, others have used it with tools in the broader Hadoop ecosystem, such as Spark or Storm.  Most interestingly, however, there have been a few people using statsd-jvm-profiler outside of Hadoop entirely, on enterprise Java applications.  There was never anything Hadoop-specific about the profiling functionality, but it was very gratifying to see that they were able to apply it unchanged to a domain so far from the initial use case.


One of the major benefits of open-sourcing a project is the ability to accept contributions from the community.  This has definitely been helpful for statsd-jvm-profiler.  Several pull requests have been accepted, both fixing bugs and adding new features.  There are also some active forks whose authors will hopefully contribute their changes back.  The community of contributors is small, but the contributions have been valuable.  Questions about how to contribute were common, however, so the project now has contribution guidelines.

An unexpected aspect of community involvement in the project has been the number of questions and suggestions that have come via email instead of through Github.  In hindsight, setting up a mailing list for the project would have been a good idea; at the time of the initial release I had thought the utility of a mailing list for the project was low.  I have since created a mailing list, but it would have been useful to have those original emails be publicly available.  Nevertheless, the suggestions have been very helpful.  It would be amazing if everyone who had suggested improvements also sent pull requests, but I recognize that not everyone is willing or able to do so.  Even so I am grateful that people have been willing to contribute to the project in this way.

Internal Use

The use of statsd-jvm-profiler within Etsy has been less successful than it was externally.  We use Graphite as the backend for StatsD, and as we started to use the profiler more, we began to have problems with Graphite.  Someone would start to profile a job, thus creating a fairly large number of new metrics.  This would sometimes cause Graphite to lock up and become unresponsive.  We put in some workarounds, including rate limiting the metric creation and configurable filtering of the metrics produced by CPU profiling, but these were ultimately only beneficial for smaller jobs.  Graphite is an important part of our infrastructure beyond statsd-jvm-profiler, so this was a bad situation.  Being able to profile and improve the performance of our Hadoop jobs is important, but not breaking critical pieces of infrastructure is more important.  The issues with Graphite meant that the ability to use the profiler was heavily restricted.  This was the exact opposite of the goal of easy-to-use, accessible profiling that motivated the creation of statsd-jvm-profiler.  Finally, after breaking Graphite yet again, the profiler was disabled entirely.  The project admittedly languished for about a month.  Since we weren’t using it internally, there was less incentive to continue improving it.

New Features

statsd-jvm-profiler was in an interesting state at this point.  There were still external users and internal interest, but it was too risky for us to actually use it.  Rather than abandon the project, I set out to bring it to a better state, one where we could use it without risk to other parts of our production infrastructure.  The contributions from the community were incredibly helpful at this point.  Ultimately the new features were all developed internally, but the suggestions and feedback from the community provided lots of ideas for changes that would both meet our internal needs and provide value externally.  As a result we’re able to use it internally again without DDOSing our Graphite infrastructure.

Multiple Metrics Backends

The idea of supporting multiple backends for metrics collection instead of just StatsD was considered during initial development, but was discarded to keep the profiling data flowing through StatsD and Graphite.  We use these extensively at Etsy, and the theory was that keeping the profiling data in a familiar tool would make it more accessible.  In practice, however, the sheer volume of data produced from all the jobs we wanted to profile tended to overwhelm our production infrastructure.

Also, supporting different backends for metric collection was the most commonly requested feature from the community, and there were a lot of different suggestions for which to use.  StatsD is still the default backend, but it is configurable through the reporter argument to the profiler.  We are trying out InfluxDB as the first new backend.  There are a couple of reasons why it was selected.  First, statsd-jvm-profiler produces very bursty metrics in a very deep hierarchy.  This is fairly different from the normal use case for Graphite, and we came to realize that Graphite was not the right tool for the job.  InfluxDB was very easy to set up and had better support for such metrics without needing any configuration.  Also, InfluxDB has a much richer, SQL-like query language.  With Graphite we had been dumping all of the metrics to a file and processing that, but InfluxDB’s query language allows for more complex visualization and analysis of the profiling data without needing the intermediate step.  So far InfluxDB has been working well.  Moreover, since it is independent from the rest of our production infrastructure, only statsd-jvm-profiler will be affected if problems do arise.
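As a rough sketch, switching backends via the reporter argument looks something like the JVM agent arguments below. The exact option names and the jar path should be checked against the statsd-jvm-profiler README; the hostnames here are placeholders, and the ports are simply the usual StatsD and InfluxDB defaults:

```
# Default StatsD backend (placeholder host; 8125 is StatsD's usual port):
-javaagent:statsd-jvm-profiler.jar=server=statsd.example.com,port=8125

# Switching to the InfluxDB backend via the reporter argument
# (placeholder host; 8086 is InfluxDB's usual HTTP port):
-javaagent:statsd-jvm-profiler.jar=server=influxdb.example.com,port=8086,reporter=InfluxDBReporter
```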

Furthermore, the refactoring done to support InfluxDB in addition to StatsD has created a framework for supporting any number of backends.  This provides a great avenue for community contributions to support some other metric collection service.

New Dashboard

Better tooling for visualizing the data produced by profiling was another common feature request.  The initial release included a script for producing flame graphs, but it was somewhat hard to use.  Also, we had otherwise been using our internal framework for dashboards to get data from Graphite.  With the move to InfluxDB this wouldn’t be possible anymore.  As such we also needed a better visualization tool internally.

To that end statsd-jvm-profiler now includes a simple dashboard.  It is a Node.js application and pulls data from InfluxDB, leveraging its powerful query language.  It expects the metric prefix configured for the profiler to follow a certain pattern, but then you can select a particular process for which to view the profiling data:

Selecting a job from the statsd-jvm-profiler dashboard

From there it will display memory usage over the course of profiling:

Memory metrics

And it will also display the count of garbage collections and the total time spent in GC:

GC metrics

It can also produce an interactive flame graph:

Example flame graph

Embedded HTTP Server

Finally, the ability to disable CPU profiling after execution had started was the other most common feature request.  There was an option to disable it from the start, but not after the profiler was already running.  Both this and the ability to inspect some of the profiler state would have been useful for us while debugging the issues that arose with Graphite initially.  To support both of these features, statsd-jvm-profiler now has an embedded HTTP server.  By default this is accessible on port 5005 of the machine running the application being profiled, but the port can be configured with the httpPort option to the profiler.  At present this both exposes some simple information about the profiler’s state and allows disabling collection of CPU or memory metrics.  Adding additional features here is another great place for community contributions.


Unequivocally statsd-jvm-profiler is better for having been open-sourced.  There has been a lot of activity on the project in the months since its initial public release.  It has seen adoption in a variety of use cases, including some quite different from those for which it was initially designed.  There has been a small but helpful community of contributors, both through code and through feedback and suggestions for the project.  When we hit issues using the project internally, the feedback from the community aligned very well with what we needed to get the project back on track and gave us momentum to keep going.

Going forward, keeping up contributions from the community is definitely important to the success of the project.  There is now a mailing list, contribution guidelines, and some suggestions for how to contribute.  If you’d like to get involved or just try out statsd-jvm-profiler, it is available on Github!


Experimenting with HHVM at Etsy

Posted by on April 6, 2015 / 34 Comments

In 2014 Etsy’s infrastructure group took on a big challenge: scale Etsy’s API traffic capacity 20X. We launched many efforts simultaneously to meet the challenge, including a migration to HHVM after it showed a promising increase in throughput. Getting our code to run on HHVM was relatively easy, but we encountered many surprises as we gained confidence in the new architecture.

What is HHVM?

Etsy Engineering loves performance, so when Facebook announced the availability of the HipHop Virtual Machine for PHP, its reported leap in performance over current PHP implementations got us really excited.

HipHop Virtual Machine (HHVM) is an open-source virtual machine designed for executing programs written in PHP. HHVM uses a just-in-time (JIT) compilation approach to achieve superior performance while maintaining the development flexibility that PHP provides.

This post focuses on why we became interested in HHVM, how we gained confidence in it as a platform, the problems we encountered and the additional tools that HHVM provides. For more details on HHVM, including information on the JIT compiler, watch Sara Golemon and Paul Tarjan’s presentation from OSCON 2014.


In 2014 engineers at Etsy noticed two major problems with how we were building mobile products. First, we found ourselves having to rewrite logic that was designed for being executed in a web context to be executed in an API context. This led to feature drift between the mobile and web platforms as the amount of shared code decreased.

The second problem was how tempting it became for engineers to build lots of general API endpoints that could be called from many different mobile views. If you use too many of these endpoints to generate a single view on mobile you end up degrading that view’s performance. Ilya Grigorik’s “Breaking the 1000ms Time to Glass Mobile Barrier” presentation explains the pitfalls of this approach for mobile devices. To improve performance on mobile, we decided to create API endpoints that were custom to their view. Making one large API request is much more efficient than making many smaller requests. This efficiency cost us some reusability, though. Endpoints designed for Android listing views may not have all the data needed for a new design in iOS. The two platforms necessitate different designs in order to create a product that feels native to the platform. We needed to reconcile performance and reusability.

To do this, we developed “bespoke endpoints”. Bespoke endpoints aggregate smaller, reusable, cacheable REST endpoints. One request from the client triggers many requests on the server side for the reusable components. Each bespoke endpoint is specific to a view.

Consider this example listing view. The client makes one single request to a bespoke endpoint. That bespoke endpoint then makes many requests on behalf of the client. It aggregates the smaller REST endpoints and returns all of the data in one response to the client.

Bespoke Endpoint

Bespoke endpoints don’t just fetch data on behalf of the client, they can also do it concurrently. In the example above, the bespoke endpoint for the web view of a listing will fetch the listing, its overview, and the related listings simultaneously. It can do this thanks to curl_multi. Matt Graham’s talk “Concurrent PHP in the Etsy API” from phpDay 2014 goes into more detail on how we use curl_multi. In a future post we’ll share more details about bespoke endpoints and how they’ve changed both our native app and web development.
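As a rough sketch of this fan-out-and-aggregate pattern (not Etsy’s actual implementation, which is PHP built on curl_multi), here is the same idea in Python using asyncio; the endpoint names and payloads are invented for illustration:

```python
import asyncio

# Stand-in for a call to one of the smaller, reusable REST endpoints.
async def fetch(endpoint):
    await asyncio.sleep(0.01)  # simulate network latency of a component call
    return {endpoint: f"data for {endpoint}"}

# A "bespoke endpoint" for a hypothetical listing view: one client request
# fans out into several concurrent component requests on the server side.
async def bespoke_listing_view(listing_id):
    parts = await asyncio.gather(
        fetch(f"/listings/{listing_id}"),
        fetch(f"/listings/{listing_id}/overview"),
        fetch(f"/listings/{listing_id}/related"),
    )
    # Aggregate the component responses into one response for the client.
    response = {}
    for part in parts:
        response.update(part)
    return response

result = asyncio.run(bespoke_listing_view(42))
print(sorted(result))
```

Because the three component fetches run concurrently, the bespoke endpoint’s latency is roughly that of the slowest component rather than the sum of all three.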

This method of building views became popular internally. Unfortunately, it also came with some drawbacks.

API traffic growth compared to Web traffic growth

Now that web pages had the potential to hit dozens of API endpoints, traffic on our API cluster grew more quickly than we anticipated. But that wasn’t the only problem.

Bootstrap Time Visualized

This graph represents all the concurrent requests that take place when loading the Etsy homepage. Between the red bars is work that is duplicated across all of the fanned out requests. This duplicate work is necessary because of the shared-nothing process architecture of PHP. For every request, we need to build the world: fetch the signed-in user, their settings, sanitize globals and so on. Although much of this duplicated work is carried out in parallel, the fan-out model still causes unnecessary work for our API cluster. But it does improve the observed response time for the user.

After considering many potential solutions to this problem, we concluded that trying to share state between processes in a shared-nothing architecture would inevitably end in tears. Instead, we decided to try speeding up all of our requests significantly, including the duplicated bootstrap work. HHVM seemed well-suited to the task. If this worked, we’d increase throughput on our API cluster and be able to scale much more efficiently.

Following months of iterations, improvements and bug fixes, HHVM now serves all of the fan-out requests for our bespoke endpoints. We used a variety of experiments to gain confidence in HHVM and to discover any bugs prior to deploying it in production.

The Experiments

Minimum Viable Product

The first experiment was simple: how many lines of PHP code do we have to comment out before HHVM will execute an Etsy API endpoint? The results surprised us. We only encountered one language incompatibility. All of the other problems we ran into were with HHVM extensions. There were several incompatibilities with the HHVM memcached extension, all of which we have since submitted pull requests for.

Does it solve our problem?

We then installed both PHP 5.4 and HHVM on a physical server and ran a synthetic benchmark. This benchmark randomly splayed requests across three API endpoints that were verified to work in HHVM, beginning at a rate of 10 requests per second and ramping up to 280 requests per second. The throughput results were promising.

The little green line at the bottom is HHVM response time


Our PHP 5.4 configuration began to experience degraded performance at about 190 requests per second, while the same didn’t happen to HHVM until about 270 requests per second. This validated our assumption that HHVM could lead to higher throughput which would go a long way towards alleviating the load we had placed on our API cluster.

Gaining Confidence

So far we had validated that HHVM could run the Etsy API (at least with a certain amount of work) and that doing so would likely lead to an increase in throughput. Now we had to become confident that HHVM could run it correctly. We wanted to verify that responses returned from HHVM were identical to those returned by PHP. In addition to our API’s full automated test suite and good old-fashioned manual testing, we also turned to another technique: teeing traffic.

You can think of “tee” in this sense like tee on the command line. We wrote an iRule on our f5 load balancer to clone HTTP traffic destined for one pool and send it to another. This allowed us to take production traffic that was being sent to our API cluster and also send it onto our experimental HHVM cluster, as well as an isolated PHP cluster for comparison.

This proved to be a powerful tool. It allowed us to compare performance between two different configurations on the exact same traffic profile.


140 rps peak. Note that this is on powerful hardware.

On the same traffic profile HHVM required about half as much CPU as PHP did. While this wasn’t the reduction seen by the HHVM team, who claimed a third as much CPU should be expected, we were happy with it. Different applications will perform differently on HHVM. We suspect the reason we didn’t see a bigger win is that our internal API was designed to be as lightweight as possible. Internal API endpoints are primarily responsible for fetching data, and as a result tend to be more IO bound than others. HHVM optimizes CPU time, not IO time.

While teeing boosted our confidence in HHVM, there were a couple of hacks we had to put in place to get it to work. We didn’t want teed HTTP requests generating writes in our backend services, so we wrote read-only MySQL, memcached and redis interfaces to prevent writes. As a result, however, we weren’t yet confident that HHVM would write data correctly, or write the correct data.

Employee Only Traffic

In order to gain confidence in that area we configured our bespoke endpoints to send all requests to the HHVM cluster if the user requesting the page was an employee. This put almost no load on the cluster, but allowed us to ensure that HHVM could communicate with backend services correctly. 

At this point we encountered some more incompatibilities with the memcached extension. We noticed that our API rate limiter was never able to find keys to decrement. This was caused by the decrement function being implemented incorrectly in the HHVM extension. In the process of debugging this we noticed that memcached was always returning false for every request HHVM made to it. This turned out to be a bug in the client-side hashing function present in HHVM. What we learned from this is that while the HHVM runtime is rock-solid, a lot of the included extensions aren’t. Facebook thoughtfully wrote a lot of the extensions specifically for the open source release of HHVM. However, many of them are not used internally because Facebook has their own clients for memcached and MySQL, and as a result have not seen nearly as much production traffic as the rest of the runtime. This is important to keep in mind when working with HHVM. We expect this situation will improve as more and more teams test it out and contribute patches back to the project, as we at Etsy will continue to do.

After resolving these issues it came time to slowly move production traffic from the PHP API cluster to the HHVM API cluster.

Slow Ramp Up

As we began the slow ramp in production we noticed some strange timestamps in the logs:

[23/janv./2015:22:40:32 +0000]

We even saw timestamps that looked like this:

[23/ 1月/2015:23:37:56]

At first we thought we had encountered a bug with HHVM’s logging system. As we investigated we realized the problem was more fundamental than that.

At Etsy we use the PHP function setlocale() to assist in localization. During a request, after we load a user we call setlocale() to set their locale preferences accordingly. The PHP function setlocale() is implemented using the C library function setlocale(3). This function is process-wide, affecting all the threads in a process. Most PHP SAPIs are implemented such that each request is handled by exactly one process, with many processes simultaneously handling many requests.

HHVM is a threaded SAPI: it runs as a single process with multiple threads, where each thread handles exactly one request at a time. When you call setlocale(3) in this context, it affects the locale for all threads in the process. As a result, requests can come in and trample the locales set by other requests, as illustrated in this animation.

locale overwriting
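The race is easy to reproduce in any threaded runtime. The sketch below simulates it in Python with a plain module-level variable standing in for setlocale(3)’s process-global state (no real libc calls are made); a barrier forces both requests to “set” their locale before either reads it back:

```python
import threading

process_locale = "en_US"  # stands in for setlocale(3)'s process-global state
barrier = threading.Barrier(2)

def handle_request(request_locale, results, i):
    global process_locale
    process_locale = request_locale  # like calling setlocale() per request
    barrier.wait()                   # both requests have now "set" a locale
    results[i] = process_locale      # the later write wins for both threads

results = [None, None]
threads = [threading.Thread(target=handle_request, args=(loc, results, i))
           for i, loc in enumerate(["fr_FR", "ja_JP"])]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results[0] == results[1])  # True: one request trampled the other's locale
```

In a process-per-request SAPI this interference cannot happen, which is why the bug only surfaced after the move to HHVM.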

We have submitted a pull request re-implementing the PHP setlocale() function using thread-local equivalents. When migrating to HHVM it’s important to remember that HHVM is threaded, and different from most other SAPIs in common use. Do an audit of extensions you’re including and ensure that none of them cause side effects that could affect the state of other threads.


After rolling HHVM out to just the internal API cluster we saw a noticeable improvement in performance across several endpoints.


HHVM vs PHP on Etsy Internal API

It’s Not Just Speed

In the process of experimenting with HHVM we discovered a few under-documented features that are useful when running large PHP deployments.

Warming up HHVM

The HHVM team recommends that you warm up your HHVM process before having it serve production traffic:

“The cache locality of the JITted code is very important, and you want all your important endpoints code to be located close to each other in memory. The best way to accomplish this is to pick your most important requests (say 5) and cycle through them all serially until you’ve done them all 12 times.”

They show this being accomplished with a simple bash script paired with curl. There is a more robust method in the form of “warmup documents”. 

You specify a warmup document in an HDF file like this:

cmd = 1
url = /var/etsy/current/bin/hhvm/warmup.php // script to execute
remote_host =
remote_port = 35100
headers { // headers to pass into HHVM
  0 {
    name = Accept
    value = */*
  }
  1 {
    name = Host
    value =
  }
  2 {
    name = User-Agent
    value = Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.46 Safari/535.11
  }
}

To tell HHVM to execute that warmup document on startup, simply reference it like so:

Server {
  WarmupRequests {
    * = /var/etsy/current/bin/hhvm/warmup.hdf
  }
}

This will execute /var/etsy/current/bin/hhvm/warmup.php between when the HHVM binary is executed and when the process accepts connections. It will only execute it once, however, and HHVM will not JIT any code until after the twelfth request. To execute a warmup document 12 times, simply reference it 12 times from the config file, like so:

Server {
  WarmupRequests {
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
  }
}

Profiling HHVM with perf(1)

HHVM makes it really easy to profile PHP code. One of the most interesting ways is with Linux’s perf tool.

HHVM is a JIT that converts PHP code into machine instructions. Because these instructions, or symbols, are not in the HHVM binary itself, perf cannot automatically translate them into function names. HHVM provides an interface to aid in this translation: a file in /tmp/ named according to this template:

/tmp/perf-<pid of process>.map 

The first column in the file is the address where the function starts in memory, the second column is the length of the function in memory, and the third column is the function name for perf to print.
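For illustration, here is what reading those three columns might look like (the address, length, and symbol below are invented values, not real map entries):

```shell
# A perf map line has three whitespace-separated columns:
# start address, length (both in hex), and the symbol name to display.
# The values below are invented for illustration.
map_line='7f3a2c001000 1a0 PHP::Api_Handler::respond'
addr=$(echo "$map_line" | awk '{print $1}')
size=$(echo "$map_line" | awk '{print $2}')
symbol=$(echo "$map_line" | awk '{print $3}')
echo "symbol $symbol starts at 0x$addr and spans 0x$size bytes"
```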

Perf looks up processes it has recorded by their pid in /tmp to find and load these files. (The pid map file needs to be owned by the user running perf report, regardless of the permissions set on the file.) 

If you run

sudo perf record -p <pid> -ag -e instructions -o /tmp/ -- sleep 20

perf will record, over a period of 20 seconds, all of the symbols being executed by the given pid and the amount of CPU time each symbol was responsible for. It stores that data in /tmp/

Once you have gathered data from perf with a command such as the above, you can display that data interactively in the terminal using `perf report`.

perf report


This shows us a list of the most expensive functions (in terms of instructions executed on the CPU) being executed. Functions prefixed with HPHP:: are built into the language runtime; for example, HPHP::f_sort accounts for all calls the PHP code makes to sort(). Functions prefixed with PHP:: are programmer-defined PHP functions. Here we can see, for example, that 36% of all CPU time occurred in Api_Handler::respond().

Using perf to profile PHP code is powerful on its own, but the ability to jump from a PHP function into an HPHP function lets you see which parts of your codebase HHVM doesn't handle efficiently. Using this process we were able to determine that sort() calls were slow when enable_zend_sorting was enabled. After patching it to be more efficient, we realized a significant CPU and performance win:

CPU drop

Median perf drop

This change resulted in an additional increase in throughput across our API cluster as well as improved response times.

HHVM Interactive Debugger

HHVM provides an interactive debugger called “hphpd”. hphpd works similarly to gdb: it is a command line based interactive debugger. 

$ hhvm -c /etc/php.d/etsy.ini -m debug bin/test.php
Welcome to HipHop Debugger!
Type "help" or "?" for a complete list of commands.
Program bin/test.php loaded. Type '[r]un' or '[c]ontinue' to go.

Here we set a breakpoint on a function:

hphpd> b Shutdown::registerApiFunctions()
Breakpoint 1 set upon entering Shutdown::registerApiFunctions()
But wont break until class Shutdown has been loaded.
Commence execution until we encounter a breakpoint:
hphpd> continue
Breakpoint 1 reached at Shutdown::registerApiFunctions() on line 101 of /home/dmiller/development/Etsyweb/phplib/Shutdown.php
100     public static function registerApiFunctions() {
101*        self::registerFunction(['Shutdown', 'apiShutdown']);
102     }

Step into that function:

hphpd> step
Break at Shutdown::registerFunction() on line 74 of /home/dmiller/development/Etsyweb/phplib/Shutdown.php
73     public static function registerFunction() {
74*        $callback = func_get_args();

Step over that function:

hphpd> next
Break at Shutdown::registerFunction() on line 76 of /home/dmiller/development/Etsyweb/phplib/Shutdown.php
76*        if (empty($callback)) {
77             $bt = new Dev_Backtrace();
hphpd> next
Break at Shutdown::registerFunction() on line 82 of /home/dmiller/development/Etsyweb/phplib/Shutdown.php
81         }
82*        if (!is_callable($callback[0])) {
83             $bt = new Dev_Backtrace();

After adding a few lines to your configuration file you can use this debugger on any code that executes in HHVM.
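The relevant configuration is along these lines (key names as documented for HHVM at the time; treat the port number as a placeholder):

```ini
; Hypothetical ini fragment enabling the debugger; the port is a placeholder.
hhvm.debugger.enable_debugger = true
hhvm.debugger.enable_debugger_server = true
hhvm.debugger.port = 8089
```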

Lessons Learned from the Experiment

The process of migrating our API cluster to HHVM taught us a lot about HHVM as well as how to better perform such migrations in the future. The ability to clone HTTP traffic and tee it to a read-only test cluster allowed us to gain confidence in HHVM much more quickly than we could have otherwise. While HHVM proved to be rock-solid as a language runtime, extensions proved to be less battle-tested. We frequently encountered bugs and missing features in the MySQL, Memcached and OAuth extensions, among others. Finally, it's important to remember that HHVM is threaded, which can result in a weird interplay between the runtime and system calls. The resulting behavior can be very surprising.

HHVM met our expectations. We were able to realize a greater throughput on our API cluster, as well as improved performance. Buying fewer servers also means less waste and less power consumption in our data centers, which is important to Etsy as a Benefit Corporation.


You can follow Dan on Twitter @jazzdan.

Special thanks to Sara Golemon, Paul Tarjan and Josh Watzman at Facebook. Extra special thanks to Keyur Govande and Adam Saponara at Etsy.


Q1 2015 Site Performance Report

Posted by on March 30, 2015 / 1 Comment

Spring has finally arrived, which means it’s time to share our Site Performance Report for the first quarter of 2015. Like last quarter, in this report, we’ve taken data from across an entire week in March and are comparing it with data from an entire week in December. Since we are constantly trying to improve our data reporting, we will be shaking things up with our methodology for the Q2 report. For backend performance, we plan to randomly sample the data throughout the quarter so it is more statistically sound and a more accurate representation of the full period of time.

We’ve split up the sections of this report among the Performance team, and different members will be authoring each section. Allison McKnight is updating us on server-side performance, Natalya Hoota is covering the real user monitoring portion, and as a performance bootcamper, I will have the honor of reporting on synthetic front-end monitoring. Over the last three months, front-end and backend performance have remained relatively stable, with some variations on specific pages such as the baseline, home, and profile pages. Now, without further ado, let’s dive into the numbers.

Server-Side Performance – from Allison McKnight

Let’s take a look at the server-side performance for the quarter. These are times seen by real users (signed-in and signed-out). The baseline page includes code that is used by all of our pages but has no additional content.


Check that out! None of our pages got significantly slower (slower by at least 10% of their values from last quarter). We do see a 50 ms speedup in the homepage median and a 30 ms speedup in the baseline 95th percentile. Let’s take a look at what happened.

First, we see about a 25 ms speedup in the 95th percentile of the baseline backend time. The baseline page has three components: loading bootstrap/web.php, a bootstrap file for all web requests; creating a controller to render the page; and rendering the baseline page from the controller. We use StatsD, a tool that aggregates data and records it in Graphite, to graph each step we take to load the baseline page. Since we have a timer for each step, I was able to drill down to see that the upper bound for the bootstrap/web step dropped significantly at the end of January:
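As a rough sketch of that kind of per-step timing (the metric name and StatsD host below are invented placeholders; the actual instrumentation lives in our PHP code), a step timer reduces to building a timer datagram and sending it over UDP:

```shell
# Hypothetical sketch of timing one step and reporting it to StatsD.
# The metric name and StatsD host are invented placeholders.
start_ms=$(( $(date +%s%N) / 1000000 ))
# ... the step being timed (e.g. loading bootstrap/web.php) would run here ...
end_ms=$(( $(date +%s%N) / 1000000 ))
# StatsD timers use the plain-text format "<metric>:<value>|ms"
payload="timers.baseline.bootstrap_web:$(( end_ms - start_ms ))|ms"
echo "$payload"
# StatsD listens for UDP datagrams, so sending is a one-liner:
#   printf '%s' "$payload" | nc -u -w1 statsd.example.com 8125
```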

We haven’t been able to pin down the change that caused this speedup. The bootstrap file performs a number of tasks to set up infrastructure – for example, setting up logging and security checks – and it seems likely that an optimization to one of these processes resulted in a faster bootstrap.

We also see a 50 ms drop in the homepage median backend time. This improvement is from rolling out HHVM for our internal API traffic.

HHVM is a virtual machine developed at Facebook to run PHP and Hack, a PHP-like language designed to give PHP programmers access to language features that are unavailable in PHP. HHVM uses just-in-time (JIT) compilation: it first compiles PHP and Hack to bytecode, then translates frequently executed code into machine instructions at runtime, allowing optimizations such as code caching. Both HHVM and Hack are open-source.

This quarter we started sending all of our internal API v3 requests to six servers running HHVM and saw some pretty sweet performance wins. Overall CPU usage in our API cluster dropped by about 20% as the majority of our API v3 traffic was directed to the HHVM boxes; we expect we’ll see an even larger speedup when we move the rest of our API traffic to HHVM.

Time spent in API endpoints dropped. Most notably, we saw a speedup in the search listings endpoint (200 ms faster on the median and 100 ms faster on the 90th percentile) and the fetch listing endpoint (100 ms faster on the median and 50 ms faster on the 90th percentile).

Since these endpoints are used mainly in our native apps, mobile users will have seen a speed boost when searching and viewing listings. Desktop users also saw some benefits: namely, the median homepage backend time for signed-in users, whose homepages we personalize with listings that they might like, dropped by 95 ms. This is what caused the 50 ms drop in the median backend time for all homepage views this quarter.

The transition to using HHVM for our internal API requests was headed by Dan Miller on our Core Platform team. At Etsy, we like to celebrate the work done on different teams to improve Performance by naming a Performance Hero when exciting improvements are made. Dan was named the first Performance Hero of 2015 for his work on HHVM. Go, Dan!

To learn more about how we use HHVM at Etsy and the benefits that it’s brought us, you can see the slides from his talk HHVM at Etsy, which he gave at PHP UK 2015 Conference. A Code as Craft post about HHVM at Etsy will appear from him in the future, so keep checking back!

Synthetic Front-End Performance – from Kristyn Reith

Below is the synthetic front-end performance data for Q1. For synthetic testing, a third party simulates actions taken by a user and then continuously monitors these actions to generate performance metrics. For this report, the data was collected by Catchpoint, which runs tests every ten minutes on IE9 in New York, London, Chicago, Seattle and Miami. Catchpoint defines the webpage response metric as the time it takes from the request being issued until the last byte of the final element of the page is received. These numbers are all medians and here is the data for the week of March 8-15th 2015 compared to the week of December 15-22nd 2014.


To calculate error ranges for our median values, we use Catchpoint’s standard deviation. Based on these standard deviations, the only statistically significant performance regression we saw was for the homepage, in both the start render and webpage response times. Looking further into this, we dug into the homepage’s waterfall charts and discovered that Catchpoint’s “response time” metric includes page elements that load asynchronously. The webpage response time should not account for elements loaded after the document is considered complete. Therefore, this regression is actually no more than a measurement tooling problem and not representative of a real slowdown.

Based on these standard deviations, we saw several improvements. The most noteworthy of these are the start render and webpage response times for the listing page. After investigating potential causes for this performance win, we discovered that this was no more than an error in data collection on our end. The Etsy shop that owns the listing page that we use to collect data in Catchpoint had been put on vacation mode, which temporarily puts the shop “on hold” and hides listings, prior to us pulling the Q1 data. While on vacation mode, the listing for the listing page in question expired on March 7th. So all the data pulled for the week we measured in March does not represent the same version of the listing page that was measured in our previous report, since the expired listing page includes additional suggested items. To avoid having an error like this occur in the future, the performance team will be creating a new shop with a collection of listings, specifically designated for performance testing.

Although the synthetic data for this quarter may seem to suggest that there were major changes, it turned out that the biggest of these were merely errors in our data collection. As we note in the conclusion, we’re going to be overhauling a number of ways we gather data for these reports.

Real User Front-End Performance – from Natalya Hoota

As in our past reports, we are using real user monitoring (RUM) data from mPulse. Real user data, as opposed to synthetic measurements, is sent from users’ browsers in real time.


It does look like the overall trend is a global increase in page load time. On closer examination, it appears that most of the slowdown is coming from the front end. A few things to note here: the difference is not significant (less than 10%), with the exception of the homepage and profile page.

Homepage load time was affected slightly more than the rest due to two experiments, one with real-time recommendations and one with page content grouping, both of which are currently ramped down. The profile page showed no outstanding increase in the median values; the long tail (95th percentile), however, saw a greater change for the worse.

Another interesting nugget we found was that devices send a different set of metrics to mPulse depending on whether their browsers support navigation timing. The navigation timing API was proposed by the W3C in 2012, and major browsers have gradually rolled out support for it. Notably, Apple added it to Safari last July, allowing RUM vendors better insight into users’ experience. For our data analysis this means the following: we should examine navigation and resource timing metrics separately, since the underlying data sets are not identical.

In order to make a definitive conclusion, we would need to test statistical validity of that data. In the next quarter we are hoping to incorporate changes that will include better precision in our data collection, analysis and visualization.

Conclusion – from Kristyn Reith

The first quarter of 2015 has included some exciting infrastructure changes. We’ve already begun to see the benefits that have resulted from the introduction of HHVM and we are looking forward to seeing how this continues to impact performance as we transition the rest of our API traffic over.

Keeping with the spirit of exciting changes, and acknowledging the data collection issues we’ve discovered, we will be rolling out a whole new approach to this report next quarter. We will partner with our data engineering team to revamp the way we collect our backend data for better statistical analysis. We will also experiment with different methods of evaluation and visualization to better-represent the speed findings in the data. We’ve also submitted a feature request to Catchpoint to add an alert that’s only triggered if bytes *before* document complete have regressed. With these changes, we look forward to bringing you a more accurate representation of the data across the quarter, so please check back with us in Q2.

1 Comment

Re-Introducing Deployinator, now as a gem!

Posted by on February 20, 2015 / 1 Comment

If you aren’t familiar with Deployinator, it’s a tool we wrote to deploy code to Etsy.com. We deploy code about 40 times per day. This allows us to push smaller changes we are confident about and experiment at a fast rate. Deployinator does a lot of heavy lifting for us. This includes updating source repositories on build machines, minifying/building javascript and css dependencies, kicking off automated tests and updating our staging environment before launching live. But Deployinator doesn’t just deploy: it also manages deploys for a myriad of internal tools, such as our Virtual Machine provisioning system, and can even deploy itself. Within Deployinator, we call each of these independent deployments “stacks”. Deployinator includes a number of helper modules that make writing deployment stacks easy. Our current modules provide helpers for versioning, git operations, and for utilizing DSH. Deployinator works so well for us we thought it best to share.

Four years ago we open sourced Deployinator for OSCON. At the time we created a new project on GitHub with the Etsy-related code removed and forked it internally. The two diverged and were difficult to maintain for a few reasons. The original public release of Deployinator mixed core and stack-related code, creating a tightly coupled codebase, and configuration and code committed to our internal fork could not be pushed to public GitHub. Naturally, every internal commit that included private data invariably included changes to the Deployinator core as well. Untangling the public and private bits made merging back into the public fork difficult and, over time, impossible. If (for educational reasons) you are interested in the old code, it is still available here.

Today we’d like to announce our re-release of Deployinator as an open source ruby gem (rubygems link).  We built this release with open-source in mind from the start by changing our internal deployinator repository (renamed to DeployinatorStacks for clarity) to include an empty gem created on our public github. Each piece of core deployinator code was then individually untangled and moved into the gem. Since we now depend on the same public Deployinator core we should no longer have problems keeping everything in sync.

While in the process of migrating Deployinator core into the gem it became apparent that we needed a way to hook into common functionality to extend it for our specific implementations. For example, we use graphite to record duration of deploys and the steps within. An example of some of the steps we track are template compilations, javascript and css asset building and rsync times. Since the methods to complete these steps are entirely within the gem, implementing a plugin architecture allows everyone to extend core gem functionality without needing a pull request merged. Our README explains how to create deployment stacks using the gem and includes an example to help you get up and running.

(Example of how deployinator looks with many stacks)

Major Changes

Deployinator now comes bundled with a simple service to tail the logs of running deploys to the front end. This replaces some overly complicated streaming middleware that was known to have problems. Deploys are now separate unix processes with descriptive proc titles; before, they were hard to discern from other requests running under your web server. The combination of these two changes decouples deploys from the web request, allowing a deploy to continue uninterrupted in the case of network failures or accidental browser closings. Having separate processes also enables operators to monitor and manipulate deploys using traditional command-line unix tools like ps and kill.

This gem release also introduces some helpful namespacing. This means we’re doing the right thing now.  In the previous open source release all helper and stack methods were mixed into every deploy stack and view. This caused name collisions and made it hard to share code between deployment stacks. Now helpers are only mixed in when needed and stacks are actual classes extending from a base class.

We think this new release makes Deployinator more intuitive to use and contribute to, and we encourage everyone interested to try out the new gem. Please submit feedback as GitHub issues and pull requests. The new code is available on our GitHub. Deployinator is at the core of Etsy’s development and deployment model and helps keep both fast. Bringing you this release embodies our “generosity of spirit” engineering principle. If this sort of work interests you, our team is hiring.

1 Comment