Q2 2015 Site Performance Report

Posted on July 13, 2015

We are kicking off the third quarter of 2015, which means it’s time to update you on how Etsy’s performance changed in Q2. Like in our last report, we’ve taken data from across an entire week in May and are comparing it with the data from an entire week in March. We’ve also mixed things up in this report to better visualize our data and the changes in site speed: this quarter we’re using box plots, which show the distribution of load times (quartile boundaries, whiskers and outliers) rather than a single median or 95th percentile number.

As in the past, we’ve split up the sections of this report among members of our performance team. Allison McKnight will be reporting on the server-side portion, Kristyn Reith will be covering the synthetic front-end section and Natalya Hoota will be providing an update on the real user monitoring section. We have to give a special shout out to our bootcamper Emily Smith, who spent a week working with us and digging into the synthetic changes that we saw. So without further ado, let’s take a look at the numbers.

Server-Side Performance

[Box plots: server-side performance for the home, listing, shop and baseline pages]

Taking a look at our backend performance, we see that the quartile boundaries for home, listing, shop, and baseline pages haven’t changed much between Q1 and Q2. We see a change in the outliers for the shop and baseline pages – the outliers are more spread out (and the largest outlier is higher) in this quarter compared to the last quarter. For this report, we are going to focus on analyzing only changes in the quartile boundaries while we work on honing our outlier analysis skills and tools for future reports.
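For readers less familiar with box plots, here is a rough sketch of how the quartile boundaries, whiskers and outlier cutoffs in charts like these can be computed. It is purely illustrative: the numbers are made up, it assumes the conventional Tukey definition of whiskers at 1.5 times the interquartile range, and it is not the code behind our reports.

<?php
// Illustrative only: box plot statistics for a set of backend load times (ms).
function quartile(array $sorted, $fraction) {
    // Linear interpolation between the two closest ranks.
    $index  = $fraction * (count($sorted) - 1);
    $lower  = (int) floor($index);
    $upper  = (int) ceil($index);
    $weight = $index - $lower;
    return $sorted[$lower] * (1 - $weight) + $sorted[$upper] * $weight;
}

$times = array(112, 118, 121, 125, 131, 140, 142, 155, 170, 210, 480);
sort($times);

$q1     = quartile($times, 0.25);
$median = quartile($times, 0.50);
$q3     = quartile($times, 0.75);
$iqr    = $q3 - $q1;

// Tukey convention: whiskers extend to the most extreme data points still
// within 1.5 * IQR of the quartile boundaries; anything beyond those fences
// is drawn as an individual outlier (here, the 480 ms request).
$lowerFence = $q1 - 1.5 * $iqr;
$upperFence = $q3 + 1.5 * $iqr;

printf("Q1=%.1f median=%.1f Q3=%.1f fences=[%.1f, %.1f]\n",
    $q1, $median, $q3, $lowerFence, $upperFence);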

[Box plot: server-side performance for the cart page]

On the cart page, we see the top whisker and outliers move down. During the week in May when we pulled this data, we were running an experiment that added pagination to the cart. Some users have many items in their carts; these items take a long time to load on the backend. By limiting the number of items that we load on each cart page, we especially improve the backend load time for these users. If we were to look at the visit data in another format, we might see a bimodal distribution in which users exposed to this experiment had clearly different performance than users who didn’t see the experiment. Unfortunately, box plots don’t show whether the user experience actually splits into separate groups like this (i.e. a multimodal distribution). We’re happy to say that we launched this feature in full earlier this week!
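To make the mechanism concrete, here is a hedged sketch of the pagination idea. The function and variable names are hypothetical and this is not our actual cart code; it just shows how capping the items loaded per request bounds the backend work for users with very large carts.

<?php
// Hypothetical sketch: hydrate at most $perPage cart items per request
// instead of loading every item in a very large cart on one page view.
function loadCartPage(array $allItemIds, $page, $perPage = 20) {
    $pageItemIds = array_slice($allItemIds, ($page - 1) * $perPage, $perPage);

    $items = array();
    foreach ($pageItemIds as $itemId) {
        // Each hydration hits the database/cache, so the backend work per
        // request is now bounded by $perPage rather than by total cart size.
        $items[] = hydrateCartItem($itemId);
    }
    return $items;
}

// Stand-in for whatever actually fetches a single cart item's data.
function hydrateCartItem($itemId) {
    return array('listing_id' => $itemId);
}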

[Box plot: server-side performance for the search page]

This quarter, the Search team experimented with new infrastructure that should make the desktop and mobile search experiences more streamlined. On the backend, this translated into a slightly higher median time with an improvement for the slower end of users: the top whisker moved down from 511 ms to 447 ms, and the outliers moved down with it. The bottom whisker and the third quartile also moved down slightly while the first quartile moved up.

Taking a look at our timeseries record of search performance across the quarter, we see that a change was made that greatly impacted slower loads and had a smaller impact on median loads:

[Time series: search server-side performance across the quarter]

Synthetic Start Render and Webpage Response

Most things look very stable quarter over quarter for synthetic measurements of our site’s performance.

[Box plots: synthetic start render and webpage response for the baseline, home and cart pages]

As we only started our synthetic measurements for the cart page in May, we do not have quarter-over-quarter data.

[Box plots: synthetic start render and webpage response for the search page]

You can see that the start render time of the search page has gotten slower this quarter but that the webpage response time for search sped up. The regression in start render was caused by experiments being run by our search team, while the improvement in the webpage response time for search resulted from the implementation of the Etsy styleguide toolkit. The toolkit is a set of fully responsive components and utility classes that make layout fast and consistent. Switching to the new toolkit decreased the amount of custom CSS that we deliver on search pages by 85%.

[Box plots: synthetic start render and webpage response for the listing and shop pages]

As noted above, we are using a slightly different date range for the listing and shop data so that we can compare apples to apples. Taking a look at the webpage response time box plots, we see improvements to both the listing and shop pages. The faster webpage response time for the listing page can be attributed to an experiment running that reduced the page weight by altering the font-weights. The improvement to shop’s webpage response time is the result of migrating to a new tag manager that is used to track the performance of outside advertising campaigns. This migration allowed us to fully integrate third party platforms in new master tags which reduced the number of JS files for campaigns.

Real User Page Load Time

mPulse, the software we use for real user monitoring, was updated in the middle of this quarter, bringing a number of improvements to timer calculation, data collection and validation. As expected, we saw a much more comprehensive picture of data outliers (i.e., values falling far above and below the average) on all pages, and we are excited about this cleaner data set.

[Box plots: real user page load time by page, Q1 vs Q2]

Since the Q1 and Q2 data was collected with different versions of the real user monitoring software, it would not be scientifically sound to draw conclusions about our users’ experience this quarter relative to the previous one. That said, it does look like a slight overall improvement sitewide, a trend we hope to keep up throughout next quarter.

Conclusion

Although we saw a few noteworthy changes to individual pages, things remained fairly stable in Q2. Using box plots for this report helped us provide a more holistic representation of the data distribution, range and quality by looking at the quartile ranges and the outliers. For next quarter’s report we are really excited about the opportunity to continue exploring new, more efficient ways to visualize the quarterly data.


Open Source Spring Cleaning

Posted on July 9, 2015

At Etsy, we are big fans of Open Source. Etsy as it is wouldn’t exist without the myriad of people who have solved a problem and published their code under an open source license. We serve etsy.com through the Apache web server running on Linux, our server-side code is mostly written in PHP, we store our data in MySQL, we track metrics using graphs from Ganglia and Graphite to keep us up to date, and we use Nagios to monitor the stability of our systems. And these are only the big examples. In every nook and cranny of our technology stack you can find Open Source code.

Part of everyone’s job at Etsy is what we call “Generosity of Spirit”, which means giving back to the industry. For engineers that means that we strive to give a talk at a conference, write a blog post on this blog or contribute to open source at least once every year. We love to give back to the Open Source community when we’ve created a solution to a problem that we think others might benefit from.

Maintenance and Divergence

This has led to many open sourced projects on our GitHub page and a continuing flow of contributions from our engineers and the Open Source community. We are not shy about open sourcing core parts of our technology stack. We are publicly developing our deployment system, metrics collector, team on-call management tool and our code search tool. We even open sourced the crucial parts of our atomic deployment system. And it has been very rewarding to receive bug fixes and features from the wider community that make our software more mature and stable.

As we open sourced more projects, it became tempting to run an internal fork of the project when we wanted to add new features quickly. These projects with internal forks quickly diverged from the open sourced versions. This meant the work to maintain the project was doubled. Anything fixed or changed internally had to be fixed or changed externally, and vice versa. In a busy engineering organization, the internal version usually was a priority over the public one. Looking at our GitHub page, it wasn’t clear – even to an Etsy engineer – whether or not we were actively maintaining a given project.

We ended up with public projects that hadn’t been committed to in years. Open sourcers who took the time to file a bug report sometimes didn’t get an answer on the issue for years, which didn’t instill confidence in potential users. No one could tell whether a project was a maintained piece of software or a proof of concept that wouldn’t get any updates.

Going forward

We want to do better by the Open Source community, since we’ve benefited so much from existing Open Source Software. We did a bit of Open Source spring cleaning to bring more clarity to the state of our open source projects. Going forward our projects will be clearly labeled as either maintained, not maintained, or archived.

Maintained

Maintained projects are the default and are not specifically labeled as such. For maintained projects, we’re either running the open source version internally or currently working on getting our internal version back in sync with the public version. We already did this for our deployment tool in the past. We are actively working on any maintained projects: merging or commenting on pull requests, answering bug reports, and adding new features.

Not Maintained

We also have a few projects that haven’t seen public updates in years. Usually this is because we haven’t found a way to make the project configurable in a way that lets us run the public version internally without slowing down our development cycles. However, the code as it is available serves as a great proof of concept and illustrates how we approach the problem. Or it might have been a research project that we abandoned because it turned out to not really solve our problem in the long run, but whose approach we still wanted to share. Those projects will just stay the way they are and will rarely, if ever, receive updates. We will turn off issues and pull requests on those and make it very clear in the README that this is a proof of concept only.

Archived

We also have a number of projects that we have open sourced because we were using them at one time but have since abandoned altogether. We have likely found that there exists a better solution to the problem or that the solution hasn’t proven useful in the long run. In those cases we will push a commit to the master branch that removes all code and only leaves the README with a description of the project and its status. The README will link to the last commit containing actual code. This way the code doesn’t just vanish, but the project is clearly not active. Those projects will also have issues and pull requests turned off.

In addition to the archival of those projects we will also start to delete forks of other Open Source projects that we’ve made at some point, but aren’t actively maintaining.

Closing thoughts

We have learned a lot about maintaining Open Source projects over the last couple of years. The main lesson we want to share is that it’s essential to use the Open Source version internally to provide a good experience for other Open Source developers who want to use our software. We strive to always learn and get better at everything we do. If you’ve been waiting for us to respond to an issue or merge a pull request, hopefully this will give you more insight into what has been going on and why it took so long for us to respond, and we hope that our new project labeling system will also give you more clarity about the state of our open source projects. In order to be good open source citizens we want to always do our best to give back in a way that is helpful for everyone. And a little spring cleaning is always a good thing. Even if it’s technically summer already.

You can follow Jared on Twitter here and Daniel here.


Four Months of statsd-jvm-profiler: A Retrospective

Posted on May 12, 2015

It has been about four months since the initial open source release of statsd-jvm-profiler.  There has been a lot of development on it in that time, including the addition of several major new features.  Rather than just announcing exciting new things, this is a good opportunity to reflect on what has come of the project since open-sourcing it and how these new features came to be.

External Adoption

It has been very exciting to see statsd-jvm-profiler being adopted outside of Etsy, and we’ve learned a lot from talking to these new users.  It was initially built for Scalding, and many of the people who’ve tried it out have been profiling Scalding jobs.  However, I have spoken to people who are using it to profile jobs written in other MapReduce APIs, such as Scrunch, as well as pure MapReduce jobs.  Moreover, others have used it with tools in the broader Hadoop ecosystem, such as Spark or Storm.  Most interestingly, however, there have been a few people using statsd-jvm-profiler outside of Hadoop entirely, on enterprise Java applications.  There was never anything Hadoop-specific about the profiling functionality, but it was very gratifying to see that they were able to apply it unchanged to a domain so far from the initial use case.

Contributions

One of the major benefits of open-sourcing a project is the ability to accept contributions from the community.  This has definitely been helpful for statsd-jvm-profiler.  There have been several pull requests accepted, both fixing bugs and adding new features.  There are also some active forks that we hope the authors will decide to contribute back.  The community of contributors is small, but the contributions have been valuable.  Questions about how to contribute were common, however, so the project now has contribution guidelines.

An unexpected aspect of community involvement in the project has been the number of questions and suggestions that have come via email instead of through Github.  In hindsight, setting up a mailing list for the project would have been a good idea; at the time of the initial release I had thought the utility of a mailing list for the project was low.  I have since created a mailing list for the project, but it would have been useful to have those original emails be publicly available.  Nevertheless, the suggestions have been very helpful.  It would be amazing if everyone who had suggested improvements also sent pull requests, but I recognize that not everyone is willing or able to do so.  Even so, I am grateful that people have been willing to contribute to the project in this way.

Internal Use

The use of statsd-jvm-profiler within Etsy has been less successful than it was externally.  We use Graphite as the backend for StatsD and as we started to use the profiler more, we began to have problems with Graphite.  Someone would start to profile a job, thus creating a fairly large number of new metrics.  This would sometimes cause Graphite to lock up and become unresponsive.  We put in some workarounds, including rate limiting the metric creation and configurable filtering of the metrics produced by CPU profiling, but these were ultimately only beneficial for smaller jobs.  Graphite is an important part of our infrastructure beyond statsd-jvm-profiler, so this was a bad situation.  Being able to profile and improve the performance of our Hadoop jobs is important, but not breaking critical pieces of infrastructure is more important.  The issues with Graphite meant that the ability to use the profiler was heavily restricted.  This was the exact opposite of the goal of easy to use, accessible profiling that motivated the creation of statsd-jvm-profiler.  Finally after breaking Graphite yet again the profiler was disabled entirely.  The project admittedly languished for about a month.  Since we weren’t using it internally, there was less incentive to continue improving it.

New Features

statsd-jvm-profiler was in an interesting state at this point.  There were still external users and internal interest, but it was too risky for us to actually use it.  Rather than abandon the project, I set out to bring it to a better state, one where we could use it without risk to other parts of our production infrastructure.  The contributions from the community were incredibly helpful at this point.  Ultimately the new features were all developed internally, but the suggestions and feedback from the community provided lots of ideas for changes that would both meet our internal needs and provide value externally.  As a result we’re able to use it internally again without DDOSing our Graphite infrastructure.

Multiple Metrics Backends

The idea of supporting multiple backends for metrics collection instead of just StatsD was considered during initial development, but was discarded to keep the profiling data flowing through StatsD and Graphite.  We use these extensively at Etsy, and the theory was that keeping the profiling data in a familiar tool would make it more accessible.  In practice, however, the sheer volume of data produced from all the jobs we wanted to profile tended to overwhelm our production infrastructure.

Also, supporting different backends for metric collections was the most commonly requested feature from the community, and there were a lot of different suggestions for which to use.  StatsD is still the default backend, but it is configurable through the reporter argument to the profiler.  We are trying out InfluxDB as the first new backend.  There are a couple of reasons why it was selected.  First, statsd-jvm-profiler produces very bursty metrics in a very deep hierarchy.  This is fairly different than the normal use case for Graphite and we came to realise that Graphite was not the right tool for the job.  InfluxDB was very easy to set up and had better support for such metrics without needing any configuration.  Also, InfluxDB has a much richer, SQL-like query language.  With Graphite we had been dumping all of the metrics to a file and processing that, but InfluxDB’s query language allows for more complex visualization and analysis of the profiling data without needing the intermediate step.  So far InfluxDB has been working well.  Moreover, since it is independent from the rest of our production infrastructure only statsd-jvm-profiler will be affected if problems do arise.

Furthermore, the refactoring done to support InfluxDB in addition to StatsD has created a framework for supporting any number of backends.  This provides a great avenue for community contributions to support some other metric collection service.

New Dashboard

Better tooling for visualizing the data produced by profiling was another common feature request.  The initial release included a script for producing flame graphs, but it was somewhat hard to use.  Also, we had otherwise been using our internal framework for dashboards to get data from Graphite.  With the move to InfluxDB this wouldn’t be possible anymore.  As such we also needed a better visualization tool internally.

To that end statsd-jvm-profiler now includes a simple dashboard.  It is a Node.js application and pulls data from InfluxDB, leveraging its powerful query language.  It expects the metric prefix configured for the profiler to follow a certain pattern, but then you can select a particular process for which to view the profiling data:

Selecting a job from the statsd-jvm-profiler dashboard

From there it will display memory usage over the course of profiling:

Memory metrics

And it will also display the count of garbage collections and the total time spent in GC:

GC metrics

It can also produce an interactive flame graph:

Example flame graph

Embedded HTTP Server

Finally, the ability to disable CPU profiling after execution had started was the other most common feature request.  There was an option to disable it from the start, but not after the profiler was already running.  Both this and the ability to inspect some of the profiler state would have been useful for us while debugging the issues that arose with Graphite initially.  To support both of these features, statsd-jvm-profiler now has an embedded HTTP server.  By default this is accessible from port 5005 on the machine the application being profiled is running on, but this choice of port can be configured with the httpPort option to the profiler.  At present this both exposes some simple information about the profiler’s state and allows disabling collection of CPU or memory metrics.  Adding additional features here is another great place for community contributions.

Conclusions

Unequivocally statsd-jvm-profiler is better for having been open-sourced.  There has been a lot of activity on the project in the months since its initial public release.  It has seen adoption in a variety of use cases, including some quite different from those for which it was initially designed.  There has been a small but helpful community of contributors, both through code and through feedback and suggestions for the project.  When we hit issues using the project internally, the feedback from the community aligned very well with what we needed to get the project back on track and gave us momentum to keep going.

Going forward, continued contributions from the community will be important to the success of the project.  There is a mailing list now, contribution guidelines, as well as some suggestions for how to contribute.  If you’d like to get involved or just try out statsd-jvm-profiler, it is available on Github!


Experimenting with HHVM at Etsy

Posted on April 6, 2015

In 2014 Etsy’s infrastructure group took on a big challenge: scale Etsy’s API traffic capacity 20X. We launched many efforts simultaneously to meet the challenge, including a migration to HHVM after it showed a promising increase in throughput. Getting our code to run on HHVM was relatively easy, but we encountered many surprises as we gained confidence in the new architecture.

What is HHVM?

Etsy Engineering loves performance, so when Facebook announced the availability of the HipHop Virtual Machine for PHP, its reported leap in performance over current PHP implementations got us really excited.

HipHop Virtual Machine (HHVM) is an open-source virtual machine designed for executing programs written in PHP. HHVM uses a just-in-time (JIT) compilation approach to achieve superior performance while maintaining the development flexibility that PHP provides.

This post focuses on why we became interested in HHVM, how we gained confidence in it as a platform, the problems we encountered and the additional tools that HHVM provides. For more details on HHVM, including information on the JIT compiler, watch Sara Golemon and Paul Tarjan’s presentation from OSCON 2014.

Why HHVM?

In 2014 engineers at Etsy noticed two major problems with how we were building mobile products. First, we found ourselves having to rewrite logic that was designed for being executed in a web context to be executed in an API context. This led to feature drift between the mobile and web platforms as the amount of shared code decreased.

The second problem was how tempting it became for engineers to build lots of general API endpoints that could be called from many different mobile views. If you use too many of these endpoints to generate a single view on mobile you end up degrading that view’s performance. Ilya Grigorik’s “Breaking the 1000ms Time to Glass Mobile Barrier” presentation explains the pitfalls of this approach for mobile devices. To improve performance on mobile, we decided to create API endpoints that were custom to their view. Making one large API request is much more efficient than making many smaller requests. This efficiency cost us some reusability, though. Endpoints designed for Android listing views may not have all the data needed for a new design in iOS. The two platforms necessitate different designs in order to create a product that feels native to the platform. We needed to reconcile performance and reusability.

To do this, we developed “bespoke endpoints”. Bespoke endpoints aggregate smaller, reusable, cacheable REST endpoints. One request from the client triggers many requests on the server side for the reusable components. Each bespoke endpoint is specific to a view.

Consider this example listing view. The client makes one single request to a bespoke endpoint. That bespoke endpoint then makes many requests on behalf of the client. It aggregates the smaller REST endpoints and returns all of the data in one response to the client.

[Diagram: a bespoke endpoint aggregating smaller REST endpoints on behalf of the client]

Bespoke endpoints don’t just fetch data on behalf of the client, they can also do it concurrently. In the example above, the bespoke endpoint for the web view of a listing will fetch the listing, its overview, and the related listings simultaneously. It can do this thanks to curl_multi. Matt Graham’s talk “Concurrent PHP in the Etsy API” from phpDay 2014 goes into more detail on how we use curl_multi. In a future post we’ll share more details about bespoke endpoints and how they’ve changed both our native app and web development.
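Here is a minimal sketch of that fan-out pattern using curl_multi. It is illustrative only: the URLs are made up, and our real bespoke endpoint framework layers caching, error handling and authentication on top of this.

<?php
// Fetch several internal REST endpoints concurrently and aggregate the results.
$urls = array(
    'listing'  => 'https://api.example.com/v3/listings/123',
    'overview' => 'https://api.example.com/v3/listings/123/overview',
    'related'  => 'https://api.example.com/v3/listings/123/related',
);

$multi   = curl_multi_init();
$handles = array();
foreach ($urls as $key => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($multi, $ch);
    $handles[$key] = $ch;
}

// Drive all of the transfers until every request has completed.
do {
    $status = curl_multi_exec($multi, $active);
    if ($active) {
        curl_multi_select($multi);
    }
} while ($active && $status === CURLM_OK);

// Collect each response; the bespoke endpoint returns them in one payload.
$response = array();
foreach ($handles as $key => $ch) {
    $response[$key] = json_decode(curl_multi_getcontent($ch), true);
    curl_multi_remove_handle($multi, $ch);
    curl_close($ch);
}
curl_multi_close($multi);

echo json_encode($response);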

This method of building views became popular internally. Unfortunately, it also came with some drawbacks.

API traffic growth compared to Web traffic growth

Now that web pages had the potential to hit dozens of API endpoints, traffic on our API cluster grew more quickly than we anticipated. But that wasn’t the only problem.

Bootstrap Time Visualized

This graph represents all the concurrent requests that take place when loading the Etsy homepage. Between the red bars is work that is duplicated across all of the fanned out requests. This duplicate work is necessary because of the shared-nothing process architecture of PHP. For every request, we need to build the world: fetch the signed-in user, their settings, sanitize globals and so on. Although much of this duplicated work is carried out in parallel, the fan-out model still causes unnecessary work for our API cluster. But it does improve the observed response time for the user.

After considering many potential solutions to this problem, we concluded that trying to share state between processes in a shared-nothing architecture would inevitably end in tears. Instead, we decided to try speeding up all of our requests significantly, including the duplicated bootstrap work. HHVM seemed well-suited to the task. If this worked, we’d increase throughput on our API cluster and be able to scale much more efficiently.

Following months of iterations, improvements and bug fixes, HHVM now serves all of the fan-out requests for our bespoke endpoints. We used a variety of experiments to gain confidence in HHVM and to discover any bugs prior to deploying it in production.

The Experiments

Minimum Viable Product

The first experiment was simple: how many lines of PHP code do we have to comment out before HHVM will execute an Etsy API endpoint? The results surprised us. We only encountered one language incompatibility. All of the other problems we ran into were with HHVM extensions. There were several incompatibilities with the HHVM memcached extension, all of which we have since submitted pull requests for.

Does it solve our problem?

We then installed both PHP 5.4 and HHVM on a physical server and ran a synthetic benchmark. This benchmark randomly splayed requests across three API endpoints that were verified to work in HHVM, beginning at a rate of 10 requests per second and ramping up to 280 requests per second. The throughput results were promising.

The little green line at the bottom is HHVM response time

Our PHP 5.4 configuration began to experience degraded performance at about 190 requests per second, while the same didn’t happen to HHVM until about 270 requests per second. This validated our assumption that HHVM could lead to higher throughput which would go a long way towards alleviating the load we had placed on our API cluster.

Gaining Confidence

So far we had validated that HHVM could run the Etsy API (at least with a certain amount of work) and that doing so would likely lead to an increase in throughput. Now we had to become confident that HHVM could run etsy.com correctly. We wanted to verify that responses returned from HHVM were identical to those returned by PHP. In addition to our API’s full automated test suite and good old-fashioned manual testing, we also turned to another technique: teeing traffic.

You can think of “tee” in this sense like tee on the command line. We wrote an iRule on our f5 load balancer to clone HTTP traffic destined for one pool and send it to another. This allowed us to take production traffic that was being sent to our API cluster and also send it onto our experimental HHVM cluster, as well as an isolated PHP cluster for comparison.

This proved to be a powerful tool. It allowed us to compare performance between two different configurations on the exact same traffic profile.

140 rps peak. Note that this is on powerful hardware.

On the same traffic profile HHVM required about half as much CPU as PHP did. While this wasn’t the reduction seen by the HHVM team, who claimed a third as much CPU should be expected, we were happy with it. Different applications will perform differently on HHVM. We suspect the reason we didn’t see a bigger win is that our internal API was designed to be as lightweight as possible. Internal API endpoints are primarily responsible for fetching data, and as a result tend to be more IO bound than others. HHVM optimizes CPU time, not IO time.

While teeing boosted our confidence in HHVM, there were a couple of hacks we had to put in place to get it to work. We didn’t want teed HTTP requests generating writes in our backend services. To that end we wrote read-only MySQL, memcached and Redis interfaces to prevent writes. As a result, however, we weren’t yet confident that HHVM would write data correctly, or write the correct data.
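As an illustration of what we mean by a read-only interface, here is a hedged sketch (not our actual implementation) of a memcached wrapper in which reads pass through but writes quietly become no-ops, so teed traffic can exercise the read path without mutating shared state:

<?php
// Hypothetical read-only wrapper around the standard Memcached client.
class ReadOnlyMemcached {
    private $client;

    public function __construct(Memcached $client) {
        $this->client = $client;
    }

    // Reads behave normally so the teed request exercises real code paths.
    public function get($key) {
        return $this->client->get($key);
    }

    // Writes are swallowed so that duplicated (teed) requests cannot modify
    // data that the real production traffic depends on.
    public function set($key, $value, $expiration = 0) {
        return true;
    }

    public function delete($key, $time = 0) {
        return true;
    }

    public function increment($key, $offset = 1) {
        return false;
    }
}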

Employee Only Traffic

In order to gain confidence in that area we configured our bespoke endpoints to send all requests to the HHVM cluster if the user requesting the page was an employee. This put almost no load on the cluster, but allowed us to ensure that HHVM could communicate with backend services correctly. 

At this point we encountered some more incompatibilities with the memcached extension. We noticed that our API rate limiter was never able to find keys to decrement. This was caused by the decrement function being implemented incorrectly in the HHVM extension. In the process of debugging this we noticed that memcached was always returning false for every request HHVM made to it. This turned out to be a bug in the client-side hashing function present in HHVM. What we learned from this is that while the HHVM runtime is rock-solid, a lot of the included extensions aren’t. Facebook thoughtfully wrote a lot of the extensions specifically for the open source release of HHVM. However, many of them are not used internally because Facebook has their own clients for memcached and MySQL, and as a result have not seen nearly as much production traffic as the rest of the runtime. This is important to keep in mind when working with HHVM. We expect this situation will improve as more and more teams test it out and contribute patches back to the project, as we at Etsy will continue to do.

After resolving these issues it came time to slowly move production traffic from the PHP API cluster to the HHVM API cluster.

Slow Ramp Up

As we began the slow ramp in production we noticed some strange timestamps in the logs:

[23/janv./2015:22:40:32 +0000]

We even saw timestamps that looked like this:

[23/ 1月/2015:23:37:56]

At first we thought we had encountered a bug with HHVM’s logging system. As we investigated we realized the problem was more fundamental than that.

At Etsy we use the PHP function setlocale() to assist in localization. During a request, after we load a user we call setlocale() to set their locale preferences accordingly. The PHP function setlocale() is implemented using the system call setlocale(3). This system call is process-wide, affecting all the threads in a process. Most PHP SAPIs are implemented such that each request is handled by exactly one process, with many processes simultaneously handling many requests. 

HHVM is a threaded SAPI. HHVM runs as a single process with multiple threads, where each thread handles exactly one request at a time. When you call setlocale(3) in this context it affects the locale for all threads in that process. As a result, requests can come in and trample the locales set by other requests, as illustrated in this animation.

[Animation: concurrent requests overwriting each other's locale]
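To make the failure mode concrete, here is a hedged sketch of the pattern that bites you under a threaded SAPI. The surrounding function and user array are hypothetical; only the setlocale()/strftime() behavior is the point.

<?php
// Under a process-per-request SAPI this is safe: the locale change only
// affects the one process handling the current request. Under a threaded
// SAPI like HHVM's server mode, setlocale(3) is process-wide.
function handleRequest(array $user) {
    // e.g. 'fr_FR.UTF-8' or 'ja_JP.UTF-8', loaded from the user's settings.
    setlocale(LC_ALL, $user['locale']);

    // ...time passes; another thread serving a different user may call
    // setlocale() right here...

    // Locale-aware formatting now uses whichever locale was set most
    // recently by ANY thread, not necessarily this request's locale.
    return strftime('%d/%b/%Y:%H:%M:%S');
}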

We have submitted a pull request re-implementing the PHP setlocale() function using thread-local equivalents. When migrating to HHVM it’s important to remember that HHVM is threaded, and different from most other SAPIs in common use. Do an audit of extensions you’re including and ensure that none of them cause side effects that could affect the state of other threads.

Release!

After rolling HHVM out to just the internal API cluster we saw a noticeable improvement in performance across several endpoints.

[Chart: public_search_listings endpoint response time, HHVM vs PHP]

HHVM vs PHP on Etsy Internal API

It’s Not Just Speed

In the process of experimenting with HHVM we discovered a few under-documented features that are useful when running large PHP deployments.

Warming up HHVM

The HHVM team recommends that you warm up your HHVM process before having it serve production traffic:

“The cache locality of the JITted code is very important, and you want all your important endpoints code to be located close to each other in memory. The best way to accomplish this is to pick your most important requests (say 5) and cycle through them all serially until you’ve done them all 12 times. “ 

They show this being accomplished with a simple bash script paired with curl. There is a more robust method in the form of “warmup documents”. 

You specify a warmup document in an HDF file like this:

cmd = 1
url = /var/etsy/current/bin/hhvm/warmup.php // script to execute
remote_host = 127.0.0.1
remote_port = 35100
headers { // headers to pass into HHVM
  0 {
    name = Accept
    value = */*
  }
  1 {
    name = Host
    value = www.etsy.com
  }
  2 {
    name = User-Agent
    value = Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.46 Safari/535.11
  }
}

To tell HHVM to execute that warmup document on startup, simply reference it like so:

Server {
  WarmupRequests {
    * = /var/etsy/current/bin/hhvm/warmup.hdf
  }
}

This will execute /var/etsy/current/bin/hhvm/warmup.php between when the HHVM binary is executed and when the process accepts connections. It will only execute it once however, and HHVM will not JIT any code until after the twelfth request. To execute a warmup document 12 times simply reference it 12 times from the config file, like so:

Server {
  WarmupRequests {
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
    * = /var/etsy/current/bin/hhvm/warmup.hdf
  }
}

Profiling HHVM with perf(1)

HHVM makes it really easy to profile PHP code. One of the most interesting ways is with Linux’s perf tool.

HHVM is a JIT that converts PHP code into machine instructions. Because these instructions, or symbols, are not in the HHVM binary itself, perf cannot automatically translate these symbols into functions names. HHVM creates an interface to aid in this translation. It takes the form of a file in /tmp/ named according to this template:

/tmp/perf-<pid of process>.map 

The first column in the file is the address of the start of that function in memory. The second column is the length of the function in memory. And the third column is the function to print in perf.

Perf looks up processes it has recorded by their pid in /tmp to find and load these files. (The pid map file needs to be owned by the user running perf report, regardless of the permissions set on the file.) 
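For illustration, the contents of such a map file look roughly like this. The addresses, lengths and function names below are made up; only the three-column format matches the description above.

7f3a81c42000 1b0 PHP::Api_Handler::respond
7f3a81c421b0 2d4 PHP::Listing::findById
7f3a81c42484 9c HPHP::f_sort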

If you run

sudo perf record -p <pid> -ag -e instructions -o /tmp/perf.data -- sleep 20

perf will record all of the symbols being executed for the given pid and the amount of CPU time that symbol was responsible for on the CPU over a period of 20 seconds. It stores that data in /tmp/perf.data.

Once you have gathered data from perf with a command such as the above, you can display that data interactively in the terminal using `perf report`.

[Screenshot: perf report output]

This shows us a list of the most expensive functions (in terms of instructions executed on the CPU) being executed. Functions prefixed with HPHP:: are functions built into the language runtime. For example HPHP::f_sort accounts for all calls the PHP code makes to sort(). Functions prefixed with PHP:: are programmer-defined PHP functions. Here we can see that 36% of all CPU time occurred in Api_Handler::respond(), for example. Using perf to profile PHP code is powerful on its own, but having the ability to jump from a PHP function into an HPHP function allows you to see what parts of your codebase HHVM doesn’t handle efficiently. Using this process we were able to determine that sort() calls were slow when enable_zend_sorting was enabled. After patching it to be more efficient, we realized a significant CPU and performance win:

[Charts: CPU usage drop and median performance drop after the sort() patch]

This change resulted in an additional increase in throughput across our API cluster as well as improved response times.

HHVM Interactive Debugger

HHVM provides an interactive debugger called “hphpd”. hphpd works similarly to gdb: it is a command line based interactive debugger. 

$ hhvm -c /etc/php.d/etsy.ini -m debug bin/test.php
Welcome to HipHop Debugger!
Type "help" or "?" for a complete list of commands.
 
Program bin/test.php loaded. Type '[r]un' or '[c]ontinue' to go.

Here we set a breakpoint on a function:

hphpd> b Shutdown::registerApiFunctions()
Breakpoint 1 set upon entering Shutdown::registerApiFunctions()
But wont break until class Shutdown has been loaded.

Commence execution until we encounter a breakpoint:

hphpd> continue
Breakpoint 1 reached at Shutdown::registerApiFunctions() on line 101 of /home/dmiller/development/Etsyweb/phplib/Shutdown.php
100     public static function registerApiFunctions() {
101*        self::registerFunction(['Shutdown', 'apiShutdown']);
102     }

Step into that function:

hphpd> step
Break at Shutdown::registerFunction() on line 74 of /home/dmiller/development/Etsyweb/phplib/Shutdown.php
73     public static function registerFunction() {
74*        $callback = func_get_args();
75

Step over that function:

hphpd> next
Break at Shutdown::registerFunction() on line 76 of /home/dmiller/development/Etsyweb/phplib/Shutdown.php
75
76*        if (empty($callback)) {
77             $bt = new Dev_Backtrace();
 
hphpd> next
Break at Shutdown::registerFunction() on line 82 of /home/dmiller/development/Etsyweb/phplib/Shutdown.php
81         }
82*        if (!is_callable($callback[0])) {
83             $bt = new Dev_Backtrace();

After adding a few lines to your configuration file you can use this debugger on any code that executes in HHVM.

Lessons Learned from the Experiment

The process of migrating our API cluster to HHVM taught us a lot about HHVM as well as how to better perform such migrations in the future. The ability to clone HTTP traffic and tee it to a read-only test cluster allowed us to gain confidence in HHVM much more quickly than we could have otherwise. While HHVM proved to be rock-solid as a language runtime, extensions proved to be less battle-tested. We frequently encountered bugs and missing features in the MySQL, Memcached and OAuth extensions, among others. Finally it’s important to remember that HHVM is threaded, which can result in a weird interplay between the runtime and system calls. The resulting behavior can be very surprising.

HHVM met our expectations. We were able to realize a greater throughput on our API cluster, as well as improved performance. Buying fewer servers also means less waste and less power consumption in our data centers, which is important to Etsy as a Benefit Corporation.

 

You can follow Dan on Twitter @jazzdan.

Special thanks to Sara Golemon, Paul Tarjan and Josh Watzman at Facebook. Extra special thanks to Keyur Govande and Adam Saponara at Etsy.


Q1 2015 Site Performance Report

Posted on March 30, 2015

Spring has finally arrived, which means it’s time to share our Site Performance Report for the first quarter of 2015. Like last quarter, in this report, we’ve taken data from across an entire week in March and are comparing it with data from an entire week in December. Since we are constantly trying to improve our data reporting, we will be shaking things up with our methodology for the Q2 report. For backend performance, we plan to randomly sample the data throughout the quarter so it is more statistically sound and a more accurate representation of the full period of time.

We’ve split up the sections of this report among the Performance team and different members will be authoring each section. Allison McKnight is updating us on the server-side performance section, Natalya Hoota is covering the real user monitoring portion, and as a performance bootcamper, I will have the honor of reporting on the synthetic front-end monitoring. Over the last three months, front-end and backend performance have remained relatively stable, with some variations on specific pages such as baseline, home, and profile. Now without further ado, let’s dive into the numbers.

Server-Side Performance – from Allison McKnight

Let’s take a look at the server-side performance for the quarter. These are times seen by real users (signed-in and signed-out). The baseline page includes code that is used by all of our pages but has no additional content.

[Q1 2015 server-side performance: median and 95th percentile by page]

Check that out! None of our pages got significantly slower (slower by at least 10% of their values from last quarter). We do see a 50 ms speedup in the homepage median and a 30 ms speedup in the baseline 95th percentile. Let’s take a look at what happened.

First, we see about a 25 ms speedup in the 95th percentile of the baseline backend time. The baseline page has three components: loading bootstrap/web.php, a bootstrap file for all web requests; creating a controller to render the page; and rendering the baseline page from the controller. We use StatsD, a tool that aggregates data and records it in Graphite, to graph each step we take to load the baseline page. Since we have a timer for each step, I was able to drill down to see that the upper bound for the bootstrap/web step dropped significantly at the end of January:
[Chart: upper bound of the bootstrap/web step timing, dropping at the end of January]

We haven’t been able to pin down the change that caused this speedup. The bootstrap file performs a number of tasks to set up infrastructure – for example, setting up logging and security checks – and it seems likely that an optimization to one of these processes resulted in a faster bootstrap.
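As a rough sketch of what those per-step timers look like in PHP, the pattern is simply to wrap each step of building the baseline page in a timer that StatsD aggregates into Graphite. The client class and metric names below are hypothetical stand-ins, not our actual instrumentation.

<?php
// Hypothetical example: time each step of serving the baseline page and send
// the durations to StatsD, which aggregates them into Graphite.
class StatsdClient {
    public function timing($metric, $milliseconds) {
        // A real client would send "metric:value|ms" over UDP to statsd.
        error_log(sprintf('%s:%0.2f|ms', $metric, $milliseconds));
    }
}

function timeStep(StatsdClient $statsd, $metric, callable $step) {
    $start  = microtime(true);
    $result = $step();
    $statsd->timing($metric, (microtime(true) - $start) * 1000);
    return $result;
}

$statsd = new StatsdClient();

// Each step gets its own timer, which is what lets us drill down and see
// that it was the bootstrap/web step that got faster.
timeStep($statsd, 'baseline.bootstrap_web', function () {
    /* require bootstrap/web.php */
});
timeStep($statsd, 'baseline.create_controller', function () {
    /* construct the controller that renders the page */
});
timeStep($statsd, 'baseline.render', function () {
    /* render the baseline page from the controller */
});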

We also see a 50 ms drop in the homepage median backend time. This improvement is from rolling out HHVM for our internal API traffic.

HHVM is a virtual machine developed at Facebook to run PHP and Hack, a PHP-like language developed by Facebook and designed to give PHP programmers access to language features that are unavailable in PHP. HHVM uses just-in-time (JIT) compilation to compile PHP and Hack into bytecode during runtime, allowing for optimizations such as code caching. Both HHVM and Hack are open-source.

This quarter we started sending all of our internal API v3 requests to six servers running HHVM and saw some pretty sweet performance wins. Overall CPU usage in our API cluster dropped by about 20% as the majority of our API v3 traffic was directed to the HHVM boxes; we expect we’ll see an even larger speedup when we move the rest of our API traffic to HHVM.

Time spent in API endpoints dropped. Most notably, we saw a speedup in the search listings endpoint (200 ms faster on the median and 100 ms faster on the 90th percentile) and the fetch listing endpoint (100 ms faster on the median and 50 ms faster on the 90th percentile).

Since these endpoints are used mainly in our native apps, mobile users will have seen a speed boost when searching and viewing listings. Desktop users also saw some benefits: namely, the median homepage backend time for signed-in users, whose homepages we personalize with listings that they might like, dropped by 95 ms. This is what caused the 50 ms drop in the median backend time for all homepage views this quarter.

The transition to using HHVM for our internal API requests was headed by Dan Miller on our Core Platform team. At Etsy, we like to celebrate the work done on different teams to improve Performance by naming a Performance Hero when exciting improvements are made. Dan was named the first Performance Hero of 2015 for his work on HHVM. Go, Dan!

To learn more about how we use HHVM at Etsy and the benefits that it’s brought us, you can see the slides from his talk HHVM at Etsy, which he gave at the PHP UK 2015 Conference. He’ll also be writing a Code as Craft post about HHVM at Etsy in the future, so keep checking back!

Synthetic Front-End Performance – from Kristyn Reith

Below is the synthetic front-end performance data for Q1. For synthetic testing, a third party simulates actions taken by a user and then continuously monitors these actions to generate performance metrics. For this report, the data was collected by Catchpoint, which runs tests every ten minutes on IE9 in New York, London, Chicago, Seattle and Miami. Catchpoint defines the webpage response metric as the time it takes from the request being issued until the last byte of the final element of the page is received. These numbers are all medians and here is the data for the week of March 8-15th 2015 compared to the week of December 15-22nd 2014.

[Q1 2015 synthetic front-end performance: median start render and webpage response times by page]

To calculate error ranges for our median values, we use Catchpoint’s standard deviation. Based on these standard deviations, the only statistically significant performance regression we saw was for the homepage for both the start render and webpage response times. Looking further into this, we dug into the homepage’s waterfall charts and discovered that Catchpoint’s “response time” metrics are including page elements that load asynchronously. The webpage response time should not account for elements loaded after the document is considered complete. Therefore, this regression is actually no more than a measurement tooling problem and not representative of a real slowdown.

Based on these standard deviations, we saw several improvements. The most noteworthy of these are the start render and webpage response times for the listing page. After investigating potential causes for this performance win, we discovered that this was no more than an error in data collection on our end. The Etsy shop that owns the listing page that we use to collect data in Catchpoint had been put on vacation mode, which temporarily puts the shop “on hold” and hides listings, prior to us pulling the Q1 data. While on vacation mode, the listing for the listing page in question expired on March 7th. So all the data pulled for the week we measured in March does not represent the same version of the listing page that was measured in our previous report, since the expired listing page includes additional suggested items. To avoid having an error like this occur in the future, the performance team will be creating a new shop with a collection of listings, specifically designated for performance testing.

Although the synthetic data for this quarter may seem to suggest that there were major changes, it turned out that the biggest of these were merely errors in our data collection. As we note in the conclusion, we’re going to be overhauling a number of ways we gather data for these reports.

Real User Front-End Performance – from Natalya Hoota

As in our past reports, we are using real user monitoring (RUM) data from mPulse. Real user data, as opposed to synthetic measurements, is sent from users’ browsers in real time.

[Q1 2015 real user monitoring: page load times by page]

It does look like the overall trend is a global increase in page load time. After some investigation, it appears that most of the slowdown is coming from the front end. A few things to note here – the difference is not significant (less than 10%), with the exception of the homepage and profile page.

Homepage load time was affected slightly more than the rest due to two experiments with real-time recommendations and page content grouping, both of which are currently ramped down. The profile page showed no notable increase in the median values; the long tail (95th percentile), however, saw a greater change for the worse.

Another interesting nugget that we found was that devices send a different set of metrics to mPulse based on whether their browsers support navigation timing. The navigation timing API was proposed by the W3C in 2012, and major browsers gradually rolled out support for it. Notably, Apple added it to Safari last July, allowing RUM vendors better insight into users’ experience. For our data analysis it means the following: we should examine navigation timing and resource timing metrics separately, since the underlying data sets are not identical.

In order to make a definitive conclusion, we would need to test statistical validity of that data. In the next quarter we are hoping to incorporate changes that will include better precision in our data collection, analysis and visualization.

Conclusion – from Kristyn Reith

The first quarter of 2015 has included some exciting infrastructure changes. We’ve already begun to see the benefits that have resulted from the introduction of HHVM and we are looking forward to seeing how this continues to impact performance as we transition the rest of our API traffic over.

Keeping with the spirit of exciting changes, and acknowledging the data collection issues we’ve discovered, we will be rolling out a whole new approach to this report next quarter. We will partner with our data engineering team to revamp the way we collect our backend data for better statistical analysis. We will also experiment with different methods of evaluation and visualization to better-represent the speed findings in the data. We’ve also submitted a feature request to Catchpoint to add an alert that’s only triggered if bytes *before* document complete have regressed. With these changes, we look forward to bringing you a more accurate representation of the data across the quarter, so please check back with us in Q2.


Re-Introducing Deployinator, now as a gem!

Posted on February 20, 2015

If you aren’t familiar with Deployinator, it’s a tool we wrote to deploy code to Etsy.com. We deploy code about 40 times per day. This allows us to push smaller changes we are confident about and experiment at a fast rate. Deployinator does a lot of heavy lifting for us. This includes updating source repositories on build machines, minifying/building javascript and css dependencies, kicking off automated tests and updating our staging environment before launching live. But Deployinator doesn’t just deploy Etsy.com, it also manages deploys for a myriad of internal tools, such as our Virtual Machine provisioning system, and can even deploy itself. Within Deployinator, we call each of these independent deployments “stacks”. Deployinator includes a number of helper modules that make writing deployment stacks easy. Our current modules provide helpers for versioning, git operations, and for utilizing DSH. Deployinator works so well for us we thought it best to share. 

Four years ago we open sourced Deployinator for OSCON. At the time we created a new project on github with the Etsy-related code removed and forked it internally. This diverged and was difficult to maintain for a few reasons. The original public release of Deployinator mixed core and stack-related code, creating a tightly coupled codebase, and configuration and code that was committed to our internal fork could not be pushed to public github. Naturally, every internal commit that included private data invariably included changes to the Deployinator core as well. Untangling the public and private bits made merging back into the public fork difficult and over time impossible. If (for educational reasons) you are interested in the old code, it is still available here.

Today we’d like to announce our re-release of Deployinator as an open source ruby gem (rubygems link).  We built this release with open-source in mind from the start by changing our internal deployinator repository (renamed to DeployinatorStacks for clarity) to include an empty gem created on our public github. Each piece of core deployinator code was then individually untangled and moved into the gem. Since we now depend on the same public Deployinator core we should no longer have problems keeping everything in sync.

While in the process of migrating Deployinator core into the gem it became apparent that we needed a way to hook into common functionality to extend it for our specific implementations. For example, we use graphite to record the duration of deploys and the steps within them. Examples of the steps we track include template compilations, javascript and css asset building, and rsync times. Since the methods that complete these steps live entirely within the gem, implementing a plugin architecture allows everyone to extend core gem functionality without needing a pull request merged. Our README explains how to create deployment stacks using the gem and includes an example to help you get up and running.

(Example of how deployinator looks with many stacks)

Major Changes

Deployinator now comes bundled with a simple service to tail running deploys’ logs to the front end. This replaces some overly complicated streaming middleware that was known to have problems. Deploys are now separate unix processes with descriptive proc titles; before, they were hard-to-discern requests running under your web server. The combination of these two things decouples deploys from the web request, allowing uninterrupted flow in the case of network failures or accidental browser closings. Having separate processes also enables operators to monitor and manipulate deploys using traditional command line unix tools like ps and kill.

This gem release also introduces some helpful namespacing. This means we’re doing the right thing now.  In the previous open source release all helper and stack methods were mixed into every deploy stack and view. This caused name collisions and made it hard to share code between deployment stacks. Now helpers are only mixed in when needed and stacks are actual classes extending from a base class.

We think this new release makes Deployinator more intuitive to use and contribute to, and we encourage everyone interested to try out the new gem. Please submit feedback as github issues and pull requests. The new code is available on our github. Deployinator is at the core of Etsy’s development and deployment model and helps keep both fast. Bringing you this release embodies our “Generosity of Spirit” engineering principle. If this sort of work interests you, our team is hiring.


The Art of the Dojo

Posted on February 17, 2015

According to the Wikipedia page I read earlier while waiting in line for a latte, dojo literally means place of the way in Japanese, but in a modern context it’s the gathering spot for students of martial arts. At Etsy and elsewhere, dojos refer to collaborative group coding experiences.

 “Collaborative group coding experience” is a fancy way of saying “we all take turns writing code together” in service of solving a problem. At the beginning of the dojo we set an achievable goal, usually unrelated to our work at Etsy (but often related to a useful skill), and we try to solve that goal in about two hours.

 What that means is a group of us (usually around five at any one time) goes into a conference room with a TV on the wall, plugs one of our laptops into the TV, and each of us takes turns writing code to solve a problem. We literally take turns sitting at the keyboard writing code. We keep a strict three minute timer at the keyboard; after three minutes are up, the person at the keyboard has to get up and let another person use the keyboard. We pick an editor at least one person knows and stick with it–invariably someone will use an editor that isn’t their first choice and that’s fine.

 I often end up organizing the dojos and I’m a sucker for video games, so I usually say, “Hey y’all, let’s make a simple video game using JavaScript and HTML5’s Canvas functionality,” and people say, “kay.” HTML5 games work very well in a short dojo environment because there is instantaneous feedback (good for seeing change happen in a three minute period), there are a variety of challenges to tackle (good for when there are multiple skillsets present), the games we decide to make usually have very simple game mechanics (good for completing the task within a couple of hours), and there is an enormous amount of reward in playing the game you just built from scratch. Some examples of games we’ve successfully done include Pong, Flappy Bird, and Snake.

 That’s it. Easy peasy lemon squeezy. Belying this simplicity is an enormous amount of value to be derived from the dojo experience. Some of the potential applications and benefits of dojos:

 Onboarding. Hook a couple of experienced engineers up with a batch of new hires and watch the value flow. The newbies get immediate facetime with institutional members, they get to see how coding is done in the organization, and they get a chance to cut loose with the cohort they got hired with. The veterans get to meet the newest members of the engineering team, they get to impart some good practices, and they get to share knowledge and teach. These kinds of experiences have a tendency to form strong bonds that stick with people for their entire time at the company. It’s like a value conduit. Plus it’s fun.

 In-House Networking. Most of Etsy’s engineers, PMs, and designers work on teams divided along product or infrastructural lines. This is awesome for incremental improvements to the product and for accumulating domain knowledge, but not so great when you need to branch out to solve problems that exist outside your team. Dojos put engineers, PMs or designers from different teams (or different organizations) together and have them solve a problem together. Having a dojo with people you don’t normally interact with also gives you points of contact all over the company. It’s hard to overstate the value of this–knowing I can go talk to people I’ve dojo-ed with about a problem I’m not sure how to solve makes the anxiety about approaching someone for help evaporate.

 Knowledge Transfer. Sharing knowledge and techniques gained through experience (particularly encoded knowledge) is invaluable within an organization. Being on the receiving end of knowledge transfer is kind of like getting power leveled in your career. Some of my bread and butter skills I use every day I learned from watching other people do them and wouldn’t have figured them out on my own for a long, long time. The most exciting part of a dojo is when someone watching the coder shouts “whoa! waitwaitwaitwait, how did you do that!? Show me!”

 Practice Communicating. Dojos allow for tons of practice at the hardest part of the job: communicating effectively. Having a brilliant idea is useless if you can’t communicate it to anyone. Dojo-ing helps hone communication skills because for the majority of the dojo, you’re not on the keyboard. If you want to explain an idea or help someone who’s stuck on the keyboard, you have to be able to communicate with them.

 Training. I work with a lot of really talented designers who are always looking to improve their skills on the front end (funny how talent and drive to improve seem to correlate). Rather than me sitting over their shoulder telling them what to do, or worse, forcing them to watch me do some coding, a couple of engineers can dojo up with a couple of designers and share techniques and knowledge. This is also applicable across engineering disciplines. We’re currently exploring dojos as a way for us web folks to broaden our iOS and Android skills.

The Sheer Fun of it All. I’m a professional software engineer in the sense that I get paid to do what I love, and I take good engineering seriously within the context of engineering and the business it serves. Thinking hard about whether I should name this variable in the past tense or present tense, while important, is exhausting. Kicking back for a couple of hours (often with a beer in hand–but there’s no pressure to drink either) and just hacking together a solution with duct tape and staples, knowing I’m not going to look at it after the dojo, is a welcome break. It reminds me of the chaos and fun of when I first learned to program. It also reinforces the idea that the dojo is less about the code and more about the experience. We play the games after we finish them, sometimes as a competition. Playing the game you just built with the people you built it with is rewarding and satisfying, and it gives a sense of purpose and cohesion to the whole experience.

Our process evolved organically. Our first dojo was a small team of us solving an interview question we asked candidates. We realized that in addition to being helpful, the dojo was also really fun. So we went from there. We moved on to typical coding katas before we finally hit our stride on video games. I encourage anyone starting out with dojos to iterate and find a topic or style that works for you. Some other considerations when running a dojo: there will likely be different skill levels involved. Be sure to encourage people who want to contribute – designers, PMs, or anyone in between. It’s scary to put yourself out there in front of people whose opinions you respect; a little bit of reassurance will go a long way towards making people comfortable. Negative feedback and cutting people down should be actively discouraged, as they do nothing in a dojo but make people feel alienated, unwelcome, and stupid. Be sure to have a whiteboard; dojos are a great reminder that a lot of the actual problem solving of software engineering happens away from the keyboard. Make sure you have a timer that everyone can see, hear, and use (if it’s a phone, make sure it has enough charge; the screen usually stays on for the whole dojo, which drains the battery quick-like). Apply your usual approach to learning to your dojos and see if you can’t find something that works for you. Most importantly, have fun.

 


Sahale: Visualizing Cascading Workflows at Etsy

Posted by on February 11, 2015 / 9 Comments

The Problem

If you know anything about Etsy engineering culture, you know we like to measure things. Frequent readers of Code As Craft have become familiar with Etsy mantras like “If it moves, graph it.”

When I arrived at Etsy, our Hadoop infrastructure, like most systems at Etsy, was well instrumented at the operational level via Ganglia and the standard suite of exported Hadoop metrics, but was still quite opaque to our end users, the data analysts.

Our visibility problem had to do with a tooling decision made by most Hadoop shops: to abandon tedious and counterintuitive “raw MapReduce” coding in favor of a modern Domain Specific Language like Pig, Hive, or Scalding.

At Etsy, DSL programming on Hadoop has been a huge win. A DSL enables developers to construct complex Hadoop workflows in a fraction of the time it takes to write and schedule a DAG of individual MapReduce jobs. DSLs abstract away the complexities of MapReduce and allow new developers to ramp up on Hadoop quickly.

Etsy’s DSL of choice, Scalding, is a Scala-based wrapper for the Cascading framework. It features a vibrant community, an expressive syntax, and a growing number of large companies like Twitter scaling it out in production.

However, DSLs on Hadoop share a couple of drawbacks that must be addressed with additional tooling. The one that concerned users at Etsy the most was a lack of runtime visibility into the composition and behavior of their Scalding jobs.

The core of the issue is this: Hadoop only understands submitted work at the granularity of individual MapReduce jobs. The JobTracker (or ResourceManager on Hadoop2) is blissfully unaware of the relationships among tasks executed on the cluster, and its web interface reflects this ignorance:

Figure 1: YARN Resource Manager. It’s 4am – do you know where your workflow is?

A bit dense, isn’t it? This is fine if you are running one raw MapReduce job at a time, but the Cascading framework compiles each workflow down into a set of interdependent MapReduce jobs. These jobs are submitted to the Hadoop cluster asynchronously, in parallel, and ordered only by data dependencies between them.

The result: Etsy users tracking a Cascading workflow via the JobTracker/ResourceManager interface did not have a clear idea how their source code became a Cascading workflow plan, how their Cascading workflows mapped to jobs submitted to Hadoop, which parts of the workflow were currently executing, or how to determine what happened when things went badly.

This situation also meant any attempt to educate users about optimizing workflows, best practices, or MapReduce itself was relegated to a whiteboard, Confluence doc, or (worse) an IRC channel, soon to be forgotten.

What Etsy needed was a user-facing tool for tracking cluster activity at the granularity of Cascading workflows, not MapReduce jobs – a tool that would make core concepts tangible and allow for proactive, independent assessment of workflow behavior at runtime.

Etsy’s Data Platform team decided to develop some internal tooling to solve these problems. The guidelines we used were:

Fast forward to now: I am pleased to announce Etsy’s open-source release of Sahale, a tool for visualizing Cascading workflows, to you, the Hadooping public!

Design Overview

Let’s take a look at how it works under the hood. The design we arrived at involves several discrete components which are deployed separately, as illustrated below:

Figure 2: Design overview.

The components of the tool are a FlowTracker that runs in the user’s client process, a database that stores the collected metrics, and the Sahale web application itself.

The workflow is as follows:

  1. A User launches a Cascading workflow, and a FlowTracker is instantiated.
  2. The FlowTracker begins to dispatch periodic job metric updates to Sahale.
  3. Sahale commits incoming metrics to the database tables.
  4. The user points her browser to the Sahale web app.

About FlowTracker 

That’s all well and good, but how does the FlowTracker capture workflow metrics at runtime?

First, some background. In the Cascading framework, each Cascade is a collection of one or more Flows. Each Flow is compiled by Cascading into one or more MapReduce jobs, which the user’s client process submits to the Hadoop cluster. With a reference to the running Flow, we can track various metrics about the Hadoop jobs the Cascading workflow submits.
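
To make that hierarchy concrete, here is a minimal sketch of our own (not code from the Sahale repo) that walks it using the standard Cascading API; each FlowStep in a Flow corresponds to one planned MapReduce job:

import cascading.cascade.Cascade
import scala.collection.JavaConverters._

// Walk a Cascade and report how many MapReduce jobs (FlowSteps)
// each of its Flows was planned into.
def describe(cascade: Cascade): Unit =
  for (flow <- cascade.getFlows.asScala) {
    println(s"Flow '${flow.getName}' -> ${flow.getFlowSteps.size} MapReduce job(s)")
  }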

Sahale attaches a FlowTracker instance to each Flow in the Cascade. The FlowTracker can capture its reference to the Flow in a variety of ways depending on your preferred Cascading DSL.

By way of an example, let’s take a look at Sahale’s TrackedJob class. Users of the Scalding DSL need only inherit TrackedJob to visualize their own Scalding workflows with Sahale. In the TrackedJob class, Scalding’s Job.run method is overridden to capture a reference to the Flow at runtime, and a FlowTracker is launched in a background thread to handle the tracking:


/**
 * Your jobs should inherit from this class to inject job tracking functionality.
 */
class TrackedJob(args: Args) extends com.twitter.scalding.Job(args) {
  @transient private val done = new AtomicBoolean(false)

  override def run(implicit mode: Mode) = {
    mode match {
      // only track Hadoop cluster jobs marked "--track-job"
      case Hdfs(_, _) => if (args.boolean("track-job")) runTrackedJob else super.run
      case _ => super.run
    }
  }

  def runTrackedJob(implicit mode: Mode) = {
    try {
      val flow = buildFlow
      trackThisFlow(flow)
      flow.complete
      flow.getFlowStats.isSuccessful // return Boolean
    } catch {
      case t: Throwable => throw t
    } finally {
      // ensure all threads are cleaned up before we propagate exceptions or complete the run.
      done.set(true)
      Thread.sleep(100)
    }
  }

  private def trackThisFlow(f: Flow[_]): Unit = { (new Thread(new FlowTracker(f, done))).start }
}

The TrackedJob class is specific to the Scalding DSL, but the pattern is easy to extend. We have used the tool internally to track Cascading jobs generated by several different DSLs.
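
By way of a rough illustration of that extension (a hedged sketch, not code from the Sahale repo), the same FlowTracker-plus-AtomicBoolean handshake used in TrackedJob above can wrap any Cascading Flow, for example one built directly with a FlowConnector; the import path for FlowTracker below is an assumption:

import java.util.concurrent.atomic.AtomicBoolean
import cascading.flow.Flow
import com.etsy.sahale.FlowTracker // package name assumed; point this at wherever FlowTracker lives in your build

// Start the tracker thread, run the Flow to completion, then signal the
// tracker to shut down, mirroring runTrackedJob above.
def runTracked(flow: Flow[_]): Boolean = {
  val done = new AtomicBoolean(false)
  new Thread(new FlowTracker(flow, done)).start()
  try {
    flow.complete()
    flow.getFlowStats.isSuccessful
  } finally {
    done.set(true)
  }
}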

Sahale selects a mix of aggregated and fine-grained metrics from the Cascading Flow and the underlying Hadoop MapReduce jobs it manages. The FlowTracker also takes advantage of the Cascading client’s caching of recently-polled job data whenever possible.

This approach is simple, and it has avoided prohibitive data transfer costs or latency, even when many tracked jobs are executing at once. This has proven critical at Etsy, where it is common to run tens or hundreds of jobs simultaneously on the same Hadoop cluster. The data model is also easily extendable if additional metrics are desired. Support for a larger range of default Hadoop counters is forthcoming.
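
To give a feel for what “fine-grained” means here, the sketch below (ours, not Sahale’s actual data model) reads one Hadoop counter per FlowStep through the standard CascadingStats accessors; the group and counter names in the usage comment are ordinary Hadoop 2 counters chosen purely for illustration:

import cascading.flow.Flow
import scala.collection.JavaConverters._

// Read a single Hadoop counter for each planned MapReduce job (FlowStep)
// in a Flow, keyed by the step name.
def counterPerStep(flow: Flow[_], group: String, counter: String): Map[String, Long] =
  flow.getFlowStats.getFlowStepStats.asScala.map { step =>
    step.getName -> step.getCounterValue(group, counter)
  }.toMap

// e.g. counterPerStep(flow, "org.apache.hadoop.mapreduce.TaskCounter", "MAP_OUTPUT_RECORDS")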

About Sahale

Figure 3: Workflow overview page.

The Sahale web app’s home page consists of two tables. The first enumerates running Cascading workflows. The second displays all recently completed workflows. Each table entry includes a workflow’s name, the submitting user, start time, job duration, the number of MapReduce jobs in the workflow, workflow status, and workflow progress.

The navigation bar at the top of the page provides access to the help page to orient new users, and a search feature to expose the execution history for any matching workflows.

Each named workflow links to a detail page that exposes a richer set of metrics:

Figure 4: Workflow details page, observing a job in progress.

The detail page exposes the workflow structure and job metrics for a single run of a single Cascading workflow. As before, the navigation bar provides links to the help page, back to the overview page, and to a history view for all recent runs of the workflow.

 The detail page consists of five panels:

Users can leverage these displays to access a lot of actionable information as a workflow progresses, empowering them to quickly answer questions like:

In the event of a workflow failure, even novice users can identify the offending job stage or stages simply by locating the red nodes in the graph view. Selecting failed nodes provides easy access to the relevant Hadoop job logs, and the Sources/Sinks tab makes it easy to map a single failed Hadoop job back to the user’s source code. Before Sahale, this was a frequent pain point for Hadoop users at Etsy.

Figure 5: A failed workflow. The offending MapReduce job is easy to identify and drill down into.

If a user notices a performance regression during iterative workflow development, the Job History link in the navigation bar will expose a comparative view of all recent runs of the same workflow:

Figure 6: History page, including aggregated per-workflow metrics for easy comparisons.

Here, users can view a set of bar graphs comparing various aggregated metrics from historical runs of the same workflow over time. As in the workflow graphs, color is used to indicate the final status of each charted run or the magnitude of the I/O involved with a particular run.

 Hovering on any bar in a chart displays a popup with additional information. Clicking a bar takes users to the detail page for that run, where they can drill down into the fine-grained metrics. The historical view makes mapping changes in source code back to changes in workflow performance a cinch.

Sahale at Etsy

After running Sahale at Etsy for a year, we’ve seen some exciting changes in the way our users interact with our big data stack. One of the most gratifying for me is the way new users can ramp up quickly and gain confidence self-assessing their workflows’ performance characteristics and storage impact on the cluster. Here’s one typical example timeline, with workflow and user name redacted to protect the excellent:

Around January 4th, one of our new users got a workflow up and running. By January 16th, this user had simplified the work needed to arrive at the desired result, cutting the size of the workflow nearly in half. By removing an unneeded aggregation step, the user further optimized the workflow down to a tight 3 stages by Feb. 9th. All of this occurred without extensive code reviews or expert intervention, just some Q&A in our internal IRC channel.

The graph visualizations for this workflow illustrate its evolution across this timeline much better than words can:

Figure 7a: Before (Jan 4th)

Figure 7b: During (Jan. 16th)

Figure 7c: After (Feb. 9th)

It’s one thing for experienced analysts to recommend that new users try to filter unneeded data as early in the workflow as possible, or to minimize the number of aggregation steps in a workflow. It’s another thing for our users to intuitively reason about these best practices themselves. I believe Sahale has been a great resource for bridging that gap.

Conclusion

Sahale, for me, represents one of many efforts in our company-wide initiative to democratize Etsy’s data. Visualizing workflows at runtime has enabled our developers to iterate faster and with greater confidence in the quality of their results. It has also reduced the time my team spends determining if a workflow is production-ready before deployment. By open sourcing the tool, my hope is that Sahale can offer the same benefits to the rest of the Big Data community.

Future plans include richer workflow visualizations, more charts, additional metrics/Hadoop counter collection on the client side, more advanced mappings from workflow to users’ source code (for users of Cascading 2.6+), and out-of-the-box support for more Cascading DSLs.

Thanks for reading! As a parting gift, here’s a screenshot of one of our largest workflows in progress:

Figure 8: Etsy runs more Hadoop jobs by 7am than most companies do all day.

 


Q4 2014 Site performance report

Posted by on February 9, 2015 / 2 Comments

Happy 2015!  We’re back with the site performance report for the fourth quarter of 2014.

We’re doing this report a bit differently than we have in the past. In our past reports, we compared metrics collected on one Wednesday in the current quarter to metrics collected on one Wednesday in the previous quarter. In an effort to get a more normalized look at our performance, for this report we’ve taken data across an entire week in December and are comparing it with data from an entire week in October. We plan to continue to explore new methods for discussing our site’s performance over time in future reports, so stay tuned!

In the meantime, let’s see how the numbers look for this quarter. Like last time, different members of the Performance team will write different sections – Natalya will discuss our server side performance, I will cover our synthetic front-end monitoring, and Jonathan will tell us about our real user front-end monitoring.

Server Side Performance – from Natalya Hoota

“Server side” means literally that: the time it takes for a request to execute on the server. We obtained this data by querying our web logs for requests made on five popular pages and our baseline page (a bare-bones page with no content except for a header and footer). The median and 95th percentile numbers for our server side performance are presented below.

Q4 2014 ServerSide

We did not see any significant trend in the data; both the medians and the 95th percentile results were relatively flat, with deltas of less than 5% of each metric.

What we came to realize, however, is that we need to reevaluate both the statistical validity of our data and our measurement approach to make sure we can filter the signal from the noise. We are planning to do so in the Q1 2015 report. There are a few open questions here – for example, what is considered a significant difference, how we should deal with seasonality, and what is a better way to represent our findings.

Synthetic Front-end Performance – from Allison McKnight

Here is the synthetic front-end performance data for Q4.  This data is collected by Catchpoint running tests on IE9 in New York, London, Chicago, Seattle, and Miami every two hours.  Catchpoint’s webpage response metric measures the time from the request being issued until the last byte of the final element of the page is received.
Q4 2014 RUM

Most of the changes in the median values can be attributed to noise. Catchpoint gives us a standard deviation, which we used to calculate error ranges for our median values. The only statistically significant differences were improvements in webpage response time for the homepage and in start render time for the listing and search pages.

These improvements can be attributed in part to the removal of some resources – serving fewer images on the homepage and the listing page led to a decrease in webpage response time, and two webfonts were removed from the search page.

Real User Monitoring – from Jonathan Klein

Here is the RUM (Real User Monitoring) data collected by mPulse. We measure median and 95th percentile page load time (literally the “Page Load” timer in mPulse).
Q4 2014 Synthetic

The news here is good, albeit unsurprising. The medians effectively haven’t changed since Q3, and the 95th percentiles are down noticeably. The steady medians match what we saw on the synthetic side, which is great to see.

As we mentioned in our last report, we had an issue at one of our CDNs where some CSS files were not being gzipped. This issue was resolved before we pulled this data, and our assumption is that this brought the 95th percentile down. One obvious question is why the homepage had such a large decrease relative to the other pages. We believe this is due to product level changes on our signed-in homepage, but as we mentioned last time we are currently unable to filter signed-in and signed-out users in mPulse. We’ve been told that a feature is on the way to enable that segmentation, and we’ll start differentiating in this report when that happens.

Conclusion – from Natalya Hoota

The last quarter of 2014 was a relatively quiet time in terms of infrastructure changes, and it showed in the performance metrics. We have seen a slight improvement throughout the site as a result of an experiment with serving fewer images and fonts. Fixing a CSS compression bug at one of our CDNs helped our user experience as well.

We would like to find a simple and comprehensive way to both represent the performance trends over the quarter and give a snapshot of how we are doing at its end. Using averages loses the details of our users’ experience on the site, while using a daily (or perhaps weekly) median taken at the end of the quarter does not account for effects that occurred during other parts of the quarter. In the reports to come, we will focus on exploring the methodology of our data collection and interpretation.


Rebuilding the Foundation of Etsy’s Seller Tools

Posted by on February 5, 2015 / 8 Comments

At its core, the Etsy marketplace is powered by its sellers. The Shop Management team at Etsy is made up of engineers, designers, product managers, usability researchers, and data analysts who work together on the tools that sellers use every day to manage their shops. These tools are a critical part of how the members of our community run their businesses, and as a result, are a critical part of Etsy’s business as well.

At the start of 2014, we were at an impasse. After years of iteration by many different teams, our seller tools were starting to show their age. New tools that had been added over time were often disconnected from core user workflows. This made new features hard to discover and often cumbersome to integrate into a seller’s daily activities. Design patterns had evolved, making the tools inconsistent and unintuitive, even to experienced shop owners. And despite increasing mobile traffic, few of our seller tools were optimized for mobile web.

From a technical perspective, some of our most important pages had also grown to be our slowest. Much of the business logic of our code was built into continually expanding page-specific web controllers, making it difficult to build and share functionality between the web stack and our native apps.

All of this led to a gradual degradation of our user experience: it was difficult for us to make tools that were intuitive and fit our sellers’ workflows, and it slowed our ability to rapidly develop and execute on new ideas.

We decided to re-evaluate how our seller tools are built. One year later, we are extremely excited to announce the release of our new Listings Manager. The task wasn’t simple and it required our team to collaborate on a whole new level to pull it off. Here’s a brief overview of how it all came together.

EtsyListingsManagerLarge

Rewrites are difficult

Anyone who has been in the software industry for a substantial amount of time knows that full-scale rewrites are generally a bad idea. In a famous article, Joel Spolsky referred to them as “the single worst strategic mistake that any software company can make.”

And with good reason! Rewrites take much longer than anyone estimates, they can cause new feature development to grind to a halt while chasing parity with your existing project, and, worst of all, you are trading your existing, battle-tested production code for brand new code with its own new bugs and overlooked nuances.

Aside from the technical challenges, rewritten products are often built in isolation from their users. Without regular feedback from real users to keep you on track, you can reach the end of the project and discover that you’ve drifted into building the wrong product — something that makes sense to you, but which fails to actually address the day-to-day problems faced by your users.

Faced with the challenge of solving some massive and fundamental problems with our seller tools, we needed to craft an approach that kept our rewrite technically sustainable and focused on the needs of our seller community. We built a plan that centered around a few major tenets:

Rethinking our CSS

Once the project kicked off, we wanted to get feedback as soon as possible. We started by creating static mockups and put them in front of sellers during remote research sessions. As the form of what we were going to build began to solidify, we found ourselves wanting to prototype features and rapidly iterate on them for the next round of testing. However, creating HTML prototypes was going to be cumbersome with the way we had previously been writing CSS.

In the past, building new features had always involved writing a fair amount of CSS, little of which was reusable. Our CSS was written in terms of pages and features instead of built around much broader patterns. The design team had an existing style guide that documented basic patterns, but the patterns were beginning to show their age; they weren’t responsive and reflected outdated visual styles. A redesign of the seller tools provided us with the perfect opportunity to experiment with mobile-optimized, patternized ways of constructing CSS.

Etsy-Styleguide

Inspired by frameworks like Bootstrap and methodologies like object-oriented CSS, we built a brand new style guide that powered the development of the new seller tools. This style guide was focused around two patterns: component classes and utility classes. Components are abstracted visual patterns reused in many places, like the grid system or a listing card:

// Base alert styles
.alert {
    padding: 20px;
}

// Alert layouts
.alert--inline {
    display: inline-block;
}
.alert--fixed-top {
    position: fixed;
    top: 0;
}

// Alert themes
.alert--red {
    background-color: $red;
}
.alert--green {
    background-color: $green;
}
<div class="alert alert--inline alert--green">
    Your listing has been updated!
</div>

Utilities change visual details in one-off situations, like adding margin or making text bold:

.text-gray-lighter {
    color: $gray-lighter;
}
.text-smaller {
    font-size: 12px;
}
.strong {
    font-weight: bold;
}
<p class="strong">
    Listings Manager
</p>
<p class="text-smaller text-gray-lighter">
    Manage the listings in your Etsy shop
</p>

Everything was written to be responsive and independent of its surroundings. We wanted to be able to quickly build fully responsive pages using components and utilities while using little to no page-specific CSS.

With the style guide in place, spinning up a static HTML prototype for a usability testing session could be done in under an hour. Building multiple versions of a page was no sweat. When user reactions suggested a change in design direction, it was cheap to rip out old UI and build something new in its place, or even iterate on the style guide itself. And when we had settled on a direction after a round of research, we could take the prototype and start hooking it up in the product immediately with little design time needed.

A very intentional side effect of creating building blocks of CSS was that it was easy for engineers to style their work without depending on a designer. We invested a lot of time up front in documenting how to use the style guide so there was a very low barrier to entry for creating a clean, consistent layout. The style guide made it easy to build layouts and style design elements like typography and forms, allowing designers more time to focus on some of the more complex pieces of UI.

Architecting a rich web application

Early on in the design and usability testing process, we saw that part of what didn’t work with the current app was the isolation of features into standalone pages rather than workflows. We began thinking about highly interactive modules that could be used wherever it made sense in a flow. The idea of a single page app began to take hold. And when we thought about the shared API framework and infrastructure we were already building to support mobile apps, it became easy to think of our JavaScript client as just another consumer.

A thick-client JavaScript page was new territory for us. Over the course of the project, we had to address numerous challenges as we tried to integrate this new approach into our current architecture.

One of the most daunting parts of a rewrite is working with unknown techniques and technologies. We evaluated some frameworks and libraries and eventually found ourselves leaning toward Backbone. At Etsy, we value using mature, well-understood technologies, and Backbone was already our go-to solution for adding more structure to the client-side code. We also really liked the simplicity and readability of the library, and we could leverage the existing Backbone code that we had already written across the site.

As we began talking about the app architecture we found ourselves looking for a little more structure, so we began to look at some layers on top of Backbone. We arrived at Marionette, a library whose classes fit the vocabulary we were already using to describe parts of the app: Regions, Lists, Layouts, and Modules. In digging deeper, we found an easy-to-read codebase and a great community. Shortly after we started to build our new platform with Marionette, one of our engineers became a core contributor to the framework. Marionette struck a great balance between providing structure and not being too opinionated. For example, using Behaviors (a way of building reusable interactions), a flexible modal pattern and a particular instance of it might be defined as:

var OverlayBehavior = Backbone.Marionette.Behavior.extend({
    ui: {
        close: '[data-close]'
    },

    events: {
        'click @ui.close': 'onClickClose',
    },

    onClickClose: function() {
        Radio.commands.execute('app', 'mask:close');
    }
});

var RenewOverlay = Backbone.Marionette.ItemView.extend({
    behaviors: {
        Overlay: {
            behaviorClass: OverlayBehavior
        }
    }
});

And launched using Marionette’s command method:

Radio.commands.execute('app', 'overlay:show', new RenewOverlay());

The app will respond by passing this newly constructed view into an overlay region and taking care of the mask and animations.

With these decisions in place, we began prototyping for our user research while fleshing out some of the open questions we had about our application.

Shipping data to the client efficiently

Our old seller tools were some of the slowest parts of the site, making performance one of our primary concerns. Part of what made these pages slow was the amount of data we needed to access from across our cluster of database shards. Fortunately, at this time the API team was hard at work building a new internal API framework to power fast, custom views to all of our clients by concurrently fanning out from a single endpoint to many others:

<?php
class Api_BespokeAjax_Shop_Listings_Form {
    public function handle($client, $input, $response = null) {
        $listing = $client->shop->listingsGet($input->listing_id);

        $proxies = [
            'listing' => $listing,
            'shipping' => $client->shop->listingsGetShipping($input->listing_id),
            'processing' => $client->shop->listingsGetProcessing($input->listing_id),
            'variations' => $client->shop->listingsGetVariations($input->listing_id),
        ];

        $response_data = Callback_Orchestrator::runAndExtract($proxies, __CLASS__);
        Api_Response::httpCode($response, 200);
        return $response_data;
    }
}

In the above example, the orchestrator resolves the results of the fanout and provides us with a single, composed result. Wrapping existing endpoints meant we didn’t have to rewrite chunks of functionality. The new API endpoints and the style guide gave us the flexibility to put live, high-fidelity working prototypes in front of users with about the same ease as faking it.

So what’s next?

Now that we’ve gone through building a major product on this new platform, we’re already setting our sights on how to improve these foundational components, and evaluating other projects that might make sense to approach in a similar fashion in the future.

Even more exciting is seeing how other features at Etsy can start leveraging some of these innovations today, whether it’s building out features on our new API to share functionality between the web and our apps, or using the client-side architecture and style guide to quickly build out richer, more interactive experiences on Etsy.

With today’s launch of the new Listings Manager, we’ve reached a milestone with this new tooling, but there is still much to do. As we keep iterating on these components and architectures and apply them to new and different products, we’ll no doubt find new problems to solve. We’ll let you know what we find.

 

This article was written as a collaboration between Diana Mounter, Jason Huff, Jessica Harllee, Justin Donato, and Russ Posluszny.
