Q1 2015 Site Performance Report

Posted on March 30, 2015

Spring has finally arrived, which means it’s time to share our Site Performance Report for the first quarter of 2015. As we did last quarter, we’ve taken data from a full week in March and are comparing it with data from a full week in December. Since we are constantly trying to improve our data reporting, we will be shaking things up with our methodology for the Q2 report. For backend performance, we plan to randomly sample the data throughout the quarter so that it is more statistically sound and a more accurate representation of the full period.

We’ve split up the sections of this report among the Performance team, with different members authoring each section. Allison McKnight is updating us on server-side performance, Natalya Hoota is covering the real user monitoring portion, and, as a performance bootcamper, I will have the honor of reporting on synthetic front-end monitoring. Over the last three months, front-end and backend performance have remained relatively stable, with some variation on specific pages such as the baseline, home, and profile pages. Now, without further ado, let’s dive into the numbers.

Server-Side Performance – from Allison McKnight

Let’s take a look at the server-side performance for the quarter. These are times seen by real users (signed-in and signed-out). The baseline page includes code that is used by all of our pages but has no additional content.


Check that out! None of our pages got significantly slower (slower by at least 10% of their values from last quarter). We do see a 50 ms speedup in the homepage median and a 30 ms speedup in the baseline 95th percentile. Let’s take a look at what happened.

First, we see about a 25 ms speedup in the 95th percentile of the baseline backend time. The baseline page has three components: loading bootstrap/web.php, a bootstrap file for all web requests; creating a controller to render the page; and rendering the baseline page from the controller. We use StatsD, a tool that aggregates data and records it in Graphite, to graph each step we take to load the baseline page. Since we have a timer for each step, I was able to drill down and see that the upper bound for the bootstrap/web step dropped significantly at the end of January:

We haven’t been able to pin down the change that caused this speedup. The bootstrap file performs a number of tasks to set up infrastructure – for example, setting up logging and security checks – and it seems likely that an optimization to one of these processes resulted in a faster bootstrap.
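The per-step instrumentation works roughly like this: each component of the baseline request is wrapped in a timer whose samples are aggregated and graphed. Below is a minimal sketch in Python, using a toy in-process aggregator in place of a real StatsD client; the step names are illustrative, not our actual metric keys.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class StepTimer:
    """Toy stand-in for a StatsD timer: collects millisecond samples
    per named step, from which Graphite-style aggregates are derived."""
    def __init__(self):
        self.samples = defaultdict(list)

    @contextmanager
    def timer(self, step):
        start = time.monotonic()
        try:
            yield
        finally:
            self.samples[step].append((time.monotonic() - start) * 1000.0)

    def upper(self, step):
        # the "upper" series (max per window) is the kind of aggregate
        # that revealed the bootstrap/web drop at the end of January
        return max(self.samples[step])

timers = StepTimer()
with timers.timer("baseline.bootstrap_web"):
    pass  # stand-in for loading bootstrap/web.php
with timers.timer("baseline.controller"):
    pass  # stand-in for creating the controller
with timers.timer("baseline.render"):
    pass  # stand-in for rendering the baseline page
```

With one timer per step, a regression (or a speedup) can be localized to a single component rather than the whole page.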

We also see a 50 ms drop in the homepage median backend time. This improvement is from rolling out HHVM for our internal API traffic.

HHVM is a virtual machine developed at Facebook to run PHP and Hack, a PHP-like language, also from Facebook, designed to give PHP programmers access to language features that standard PHP lacks. HHVM first compiles PHP and Hack into an intermediate bytecode, then uses just-in-time (JIT) compilation to translate that bytecode into optimized machine code at runtime, allowing for optimizations such as code caching. Both HHVM and Hack are open source.

This quarter we started sending all of our internal API v3 requests to six servers running HHVM and saw some pretty sweet performance wins. Overall CPU usage in our API cluster dropped by about 20% as the majority of our API v3 traffic was directed to the HHVM boxes; we expect we’ll see an even larger speedup when we move the rest of our API traffic to HHVM.

Time spent in API endpoints dropped. Most notably, we saw a speedup in the search listings endpoint (200 ms faster on the median and 100 ms faster on the 90th percentile) and the fetch listing endpoint (100 ms faster on the median and 50 ms faster on the 90th percentile).

Since these endpoints are used mainly in our native apps, mobile users will have seen a speed boost when searching and viewing listings. Desktop users also saw some benefits: namely, the median homepage backend time for signed-in users, whose homepages we personalize with listings that they might like, dropped by 95 ms. This is what caused the 50 ms drop in the median backend time for all homepage views this quarter.

The transition to using HHVM for our internal API requests was headed by Dan Miller on our Core Platform team. At Etsy, we like to celebrate the work done across different teams to improve performance by naming a Performance Hero when exciting improvements are made. Dan was named the first Performance Hero of 2015 for his work on HHVM. Go, Dan!

To learn more about how we use HHVM at Etsy and the benefits it’s brought us, you can see the slides from Dan’s talk, HHVM at Etsy, which he gave at the PHP UK Conference 2015. A Code as Craft post from him about HHVM at Etsy is coming in the future, so keep checking back!

Synthetic Front-End Performance – from Kristyn Reith

Below is the synthetic front-end performance data for Q1. For synthetic testing, a third party simulates actions taken by a user and then continuously monitors these actions to generate performance metrics. For this report, the data was collected by Catchpoint, which runs tests every ten minutes on IE9 in New York, London, Chicago, Seattle, and Miami. Catchpoint defines the webpage response metric as the time from when the request is issued until the last byte of the final element of the page is received. These numbers are all medians; here is the data for the week of March 8-15, 2015 compared to the week of December 15-22, 2014.


To calculate error ranges for our median values, we use Catchpoint’s standard deviation. Based on these standard deviations, the only statistically significant performance regression we saw was in the homepage’s start render and webpage response times. Digging into the homepage’s waterfall charts, we discovered that Catchpoint’s response time metrics include page elements that load asynchronously. The webpage response time should not account for elements loaded after the document is considered complete, so this regression is really a measurement tooling problem, not a real slowdown.
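The kind of check we apply can be sketched as follows. This is a simplified stand-in for our actual methodology: treat a median shift as significant only when it exceeds the combined standard deviation of the two measurement weeks (the numbers below are made up for illustration).

```python
import math

def is_significant(old_median, old_std, new_median, new_std):
    """Hypothetical significance rule: a change is 'real' only when the
    median shift exceeds the combined standard deviation of the two
    samples (quadrature sum)."""
    return abs(new_median - old_median) > math.hypot(old_std, new_std)

# A 120 ms shift with tight error bars is flagged as significant,
# while a 30 ms wobble inside the noise is not:
assert is_significant(1800, 40, 1920, 50)
assert not is_significant(1800, 40, 1830, 50)
```

Applying a rule like this keeps us from chasing quarter-over-quarter changes that are indistinguishable from measurement noise.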

Based on these standard deviations, we also saw several improvements. The most noteworthy of these were the start render and webpage response times for the listing page. After investigating potential causes for this performance win, we discovered that it was no more than an error in data collection on our end. Prior to us pulling the Q1 data, the Etsy shop that owns the listing page we use to collect data in Catchpoint had been put on vacation mode, which temporarily puts the shop “on hold” and hides its listings. While the shop was in vacation mode, the listing in question expired on March 7th. An expired listing page includes additional suggested items, so the data pulled for the week we measured in March does not represent the same version of the listing page measured in our previous report. To avoid an error like this in the future, the performance team will be creating a new shop with a collection of listings specifically designated for performance testing.

Although the synthetic data for this quarter may seem to suggest that there were major changes, it turned out that the biggest of these were merely errors in our data collection. As we note in the conclusion, we’re going to be overhauling a number of ways we gather data for these reports.

Real User Front-End Performance – from Natalya Hoota

As in our past reports, we are using real user monitoring (RUM) data from mPulse. Real user data, as opposed to synthetic measurements, is sent from users’ browsers in real time.


The overall trend is a global increase in page load time. On closer examination, most of the slowdown appears to come from the front end. A few things to note here: the difference is not significant (less than 10%), with the exception of the homepage and profile page.

Homepage load time was affected slightly more than the rest due to two experiments with real time recommendations and page content grouping, both of which are currently ramped down. The profile page showed no notable increase in median load time; the long tail (95th percentile), however, regressed more noticeably.

Another interesting nugget we found was that devices send a different set of metrics to mPulse depending on whether their browsers support navigation timing. The Navigation Timing API was proposed by the W3C in 2012, and major browsers gradually rolled out support for it. Notably, Apple added it to Safari last July, allowing RUM vendors better insight into users’ experience. For our data analysis this means the following: we should examine navigation timing and resource timing metrics separately, since the underlying data sets are not identical.
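In practice this means partitioning beacons before computing any aggregate. A hedged sketch, with hypothetical beacon field names (mPulse’s real field names may differ):

```python
# Hypothetical names for Navigation Timing fields on a RUM beacon:
NAV_TIMING_FIELDS = {"nt_nav_st", "nt_res_end"}

def split_by_navtiming(beacons):
    """Partition RUM beacons by whether the browser reported Navigation
    Timing fields, so the two populations are analyzed separately."""
    with_nt, without_nt = [], []
    for beacon in beacons:
        bucket = with_nt if NAV_TIMING_FIELDS <= beacon.keys() else without_nt
        bucket.append(beacon)
    return with_nt, without_nt

modern, legacy = split_by_navtiming([
    {"nt_nav_st": 100, "nt_res_end": 2100, "page_load_ms": 2000},  # nav timing supported
    {"page_load_ms": 2400},                                        # older browser
])
```

Comparing medians computed over the mixed population would conflate a browser-mix shift with a real performance change; splitting first avoids that.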

In order to draw definitive conclusions, we would need to test the statistical validity of that data. In the next quarter we are hoping to incorporate changes that will bring better precision to our data collection, analysis, and visualization.

Conclusion – from Kristyn Reith

The first quarter of 2015 has included some exciting infrastructure changes. We’ve already begun to see the benefits that have resulted from the introduction of HHVM and we are looking forward to seeing how this continues to impact performance as we transition the rest of our API traffic over.

Keeping with the spirit of exciting changes, and acknowledging the data collection issues we’ve discovered, we will be rolling out a whole new approach to this report next quarter. We will partner with our data engineering team to revamp the way we collect our backend data for better statistical analysis. We will also experiment with different methods of evaluation and visualization to better represent the speed findings in the data. We’ve also submitted a feature request to Catchpoint to add an alert that’s only triggered if bytes *before* document complete have regressed. With these changes, we look forward to bringing you a more accurate representation of the data across the quarter, so please check back with us in Q2.


Re-Introducing Deployinator, now as a gem!

Posted on February 20, 2015

If you aren’t familiar with Deployinator, it’s a tool we wrote to deploy code to Etsy.com. We deploy code about 40 times per day. This allows us to push smaller changes we are confident about and experiment at a fast rate. Deployinator does a lot of heavy lifting for us, including updating source repositories on build machines, minifying and building JavaScript and CSS dependencies, kicking off automated tests, and updating our staging environment before launching live. But Deployinator doesn’t just deploy Etsy.com; it also manages deploys for a myriad of internal tools, such as our virtual machine provisioning system, and can even deploy itself. Within Deployinator, we call each of these independent deployments “stacks”. Deployinator includes a number of helper modules that make writing deployment stacks easy. Our current modules provide helpers for versioning, git operations, and utilizing DSH. Deployinator works so well for us that we thought it best to share.
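Conceptually, a stack is just a named, ordered sequence of deploy steps. This toy Python model (not Deployinator’s actual Ruby API; the stack and step names are made up) shows the idea:

```python
class Stack:
    """Toy model of a deploy 'stack': a named, ordered list of steps
    run in sequence; a failing step raises and halts the deploy."""
    def __init__(self, name, steps):
        self.name = name
        self.steps = steps  # list of (label, callable) pairs

    def deploy(self):
        completed = []
        for label, step in self.steps:
            step()              # a real step would shell out, rsync, etc.
            completed.append(label)
        return completed

# Hypothetical stack mirroring the pipeline described above:
etsy_web = Stack("etsy_web", [
    ("update_repos", lambda: None),
    ("build_assets", lambda: None),
    ("run_tests", lambda: None),
    ("push_staging", lambda: None),
    ("push_prod", lambda: None),
])
```

Each internal tool gets its own `Stack` with its own steps, which is why one Deployinator instance can manage many unrelated deployments.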

Four years ago we open sourced Deployinator for OSCON. At the time we created a new project on GitHub with the Etsy-related code removed and forked it internally. The two diverged and were difficult to maintain for a few reasons. The original public release of Deployinator mixed core and stack-related code, creating a tightly coupled codebase, and configuration and code committed to our internal fork could not be pushed to public GitHub. Naturally, every internal commit that included private data invariably included changes to the Deployinator core as well. Untangling the public and private bits made merging back into the public fork difficult, and over time impossible. If (for educational reasons) you are interested in the old code, it is still available here.

Today we’d like to announce our re-release of Deployinator as an open source Ruby gem (rubygems link). We built this release with open source in mind from the start by changing our internal Deployinator repository (renamed DeployinatorStacks for clarity) to include an empty gem created on our public GitHub. Each piece of core Deployinator code was then individually untangled and moved into the gem. Since we now depend on the same public Deployinator core, we should no longer have problems keeping everything in sync.

While migrating the Deployinator core into the gem, it became apparent that we needed a way to hook into common functionality and extend it for our specific implementations. For example, we use Graphite to record the duration of deploys and the steps within them. Examples of the steps we track include template compilation, JavaScript and CSS asset building, and rsync times. Since the methods that complete these steps live entirely within the gem, implementing a plugin architecture allows everyone to extend core gem functionality without needing a pull request merged. Our README explains how to create deployment stacks using the gem and includes an example to help you get up and running.
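A plugin architecture like this boils down to a registry of callbacks fired around core steps. A minimal sketch in Python (Deployinator itself is Ruby, and its real hook names differ; the event name here is hypothetical):

```python
class Hooks:
    """Minimal plugin registry: stacks register callbacks that fire
    around core steps, so site-specific instrumentation lives outside
    the core code."""
    def __init__(self):
        self._listeners = {}

    def register(self, event, fn):
        self._listeners.setdefault(event, []).append(fn)

    def fire(self, event, *args):
        for fn in self._listeners.get(event, []):
            fn(*args)

hooks = Hooks()
durations = {}  # e.g. what a Graphite-reporting plugin would record

# a hypothetical timing plugin subscribes to a step-completion event
hooks.register("step_done", lambda step, ms: durations.__setitem__(step, ms))

# core code fires the hook without knowing who is listening
hooks.fire("step_done", "asset_build", 4200)
```

The core never imports the plugin; it only announces events, which is what lets everyone extend the gem without a merged pull request.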

(Example of how Deployinator looks with many stacks)

Major Changes

Deployinator now comes bundled with a simple service that tails running deploys’ logs to the front end. This replaces some overly complicated streaming middleware that was known to have problems. Deploys are now separate Unix processes with descriptive proc titles; before, they were hard-to-discern requests running under your web server. The combination of these two things decouples deploys from the web request, allowing uninterrupted flow in the case of network failures or accidental browser closings. Having separate processes also enables operators to monitor and manipulate deploys using traditional command line Unix tools like ps and kill.

This gem release also introduces some helpful namespacing. This means we’re doing the right thing now.  In the previous open source release all helper and stack methods were mixed into every deploy stack and view. This caused name collisions and made it hard to share code between deployment stacks. Now helpers are only mixed in when needed and stacks are actual classes extending from a base class.

We think this new release makes Deployinator more intuitive to use and contribute to, and we encourage everyone interested to try out the new gem. Please submit feedback as GitHub issues and pull requests. The new code is available on our GitHub. Deployinator is at the core of Etsy’s development and deployment model and helps keep both fast. Bringing you this release embodies our “generosity of spirit” engineering principle. If this sort of work interests you, our team is hiring.


The Art of the Dojo

Posted on February 17, 2015

According to the Wikipedia page I read earlier while waiting in line for a latte, dojo literally means place of the way in Japanese, but in a modern context it’s the gathering spot for students of martial arts. At Etsy and elsewhere, dojos refer to collaborative group coding experiences.

 “Collaborative group coding experience” is a fancy way of saying “we all take turns writing code together” in service of solving a problem. At the beginning of the dojo we set an achievable goal, usually unrelated to our work at Etsy (but often related to a useful skill), and we try to solve that goal in about two hours.

What that means is a group of us (usually around five at any one time) goes into a conference room with a TV on the wall, plugs one of our laptops into the TV, and each of us takes turns writing code to solve a problem. We literally take turns sitting at the keyboard writing code. We keep a strict three-minute timer; when the three minutes are up, the person at the keyboard has to get up and let another person take over. We pick an editor at least one person knows and stick with it; invariably someone will end up using an editor that isn’t their first choice, and that’s fine.

 I often end up organizing the dojos and I’m a sucker for video games, so I usually say, “Hey y’all, let’s make a simple video game using JavaScript and HTML5’s Canvas functionality,” and people say, “kay.” HTML5 games work very well in a short dojo environment because there is instantaneous feedback (good for seeing change happen in a three minute period), there are a variety of challenges to tackle (good for when there are multiple skillsets present), the games we decide to make usually have very simple game mechanics (good for completing the task within a couple of hours), and there is an enormous amount of reward in playing the game you just built from scratch. Some examples of games we’ve successfully done include Pong, Flappy Bird, and Snake.

 That’s it. Easy peasy lemon squeezy. Belying this simplicity is an enormous amount of value to be derived from the dojo experience. Some of the potential applications and benefits of dojos:

 Onboarding. Hook a couple of experienced engineers up with a batch of new hires and watch the value flow. The newbies get immediate facetime with institutional members, they get to see how coding is done in the organization, and they get a chance to cut loose with the cohort they got hired with. The veterans get to meet the newest members of the engineering team, they get to impart some good practices, and they get to share knowledge and teach. These kinds of experiences have a tendency to form strong bonds that stick with people for their entire time at the company. It’s like a value conduit. Plus it’s fun.

 In-House Networking. Most of Etsy’s engineers, PMs, and designers work on teams divided along product or infrastructural lines. This is awesome for incremental improvements to the product and for accumulating domain knowledge, but not so great when you need to branch out to solve problems that exist outside your team. Dojos put engineers, PMs or designers from different teams (or different organizations) together and have them solve a problem together. Having a dojo with people you don’t normally interact with also gives you points of contact all over the company. It’s hard to overstate the value of this–knowing I can go talk to people I’ve dojo-ed with about a problem I’m not sure how to solve makes the anxiety about approaching someone for help evaporate.

 Knowledge Transfer. Sharing knowledge and techniques gained through experience (particularly encoded knowledge) is invaluable within an organization. Being on the receiving end of knowledge transfer is kind of like getting power leveled in your career. Some of my bread and butter skills I use every day I learned from watching other people do them and wouldn’t have figured them out on my own for a long, long time. The most exciting part of a dojo is when someone watching the coder shouts “whoa! waitwaitwaitwait, how did you do that!? Show me!”

 Practice Communicating. Dojos allow for tons of practice at the hardest part of the job: communicating effectively. Having a brilliant idea is useless if you can’t communicate it to anyone. Dojo-ing helps hone communication skills because for the majority of the dojo, you’re not on the keyboard. If you want to explain an idea or help someone who’s stuck on the keyboard, you have to be able to communicate with them.

 Training. I work with a lot of really talented designers who are always looking to improve their skills on the front end (funny how talent and drive to improve seem to correlate). Rather than me sitting over their shoulder telling them what to do, or worse, forcing them to watch me do some coding, a couple of engineers can dojo up with a couple of designers and share techniques and knowledge. This is also applicable across engineering disciplines. We’re currently exploring dojos as a way for us web folks to broaden our iOS and Android skills.

The Sheer Fun of it All. I’m a professional software engineer in the sense that I get paid to do what I love, and I take good engineering seriously within the context of engineering and the business it serves. Thinking hard about whether I should name this variable in the past tense or present tense, while important, is exhausting. Kicking back for a couple of hours (often with a beer in hand, though there’s no pressure to drink) and just hacking together a solution with duct tape and staples, knowing I’m not going to look at it after the dojo, is a welcome break. It reminds me of the chaos and fun of when I first learned to program. It also reinforces the idea that the dojo is less about the code and more about the experience. We play the games after we finish them, sometimes as a competition. Playing the game you just built with the people you built it with is a rewarding and satisfying experience, and it serves to give a sense of purpose and cohesion to the whole event.

Our process evolved organically. Our first dojo was a small team of us solving an interview question we asked candidates. We realized that in addition to being helpful, the dojo was also really fun. So we went from there. We moved on to typical coding katas before we finally hit our stride on video games. I encourage anyone starting out with dojos to iterate and find a topic or style that works for you.

Some other considerations when running a dojo: there will likely be different skill levels involved, so be sure to encourage people who want to contribute, whether they are designers, PMs, or anyone in between. It’s scary to put yourself out there in front of people whose opinions you respect; a little bit of reassurance goes a long way toward making people comfortable. Negative feedback and cutting people down should be actively discouraged; they do nothing in a dojo but make people feel alienated, unwelcome, and stupid. Be sure to have a whiteboard; dojos are a great reminder that a lot of the actual problem solving of software engineering happens away from the keyboard. Make sure you have a timer everyone can see, hear, and use (if it’s a phone, make sure it has enough charge; the screen usually stays on for the whole dojo, which drains the battery quickly). Apply your typical learning to your dojos and see if you can’t find something that works for you. Most importantly, have fun.



Sahale: Visualizing Cascading Workflows at Etsy

Posted on February 11, 2015

The Problem

If you know anything about Etsy engineering culture, you know we like to measure things. Frequent readers of Code As Craft have become familiar with Etsy mantras like “If it moves, graph it.”

When I arrived at Etsy, our Hadoop infrastructure, like most systems at Etsy, was well instrumented at the operational level via Ganglia and the standard suite of exported Hadoop metrics, but was still quite opaque to our end users, the data analysts.

Our visibility problem had to do with a tooling decision made by most Hadoop shops: to abandon tedious and counterintuitive “raw MapReduce” coding in favor of a modern Domain Specific Language like Pig, Hive, or Scalding.

At Etsy, DSL programming on Hadoop has been a huge win. A DSL enables developers to construct complex Hadoop workflows in a fraction of the time it takes to write and schedule a DAG of individual MapReduce jobs. DSLs abstract away the complexities of MapReduce and allow new developers to ramp up on Hadoop quickly.

Etsy’s DSL of choice, Scalding, is a Scala-based wrapper for the Cascading framework. It features a vibrant community, an expressive syntax, and a growing number of large companies like Twitter scaling it out in production.

However, DSLs on Hadoop share a couple of drawbacks that must be addressed with additional tooling. The one that concerned users at Etsy the most was a lack of runtime visibility into the composition and behavior of their Scalding jobs.

The core of the issue is this: Hadoop only understands submitted work at the granularity of individual MapReduce jobs. The JobTracker (or ResourceManager on Hadoop2) is blissfully unaware of the relationships among tasks executed on the cluster, and its web interface reflects this ignorance:

Figure 1: YARN Resource Manager. It’s 4am - do you know where your workflow is?


A bit dense, isn’t it? This is fine if you are running one raw MapReduce job at a time, but the Cascading framework compiles each workflow down into a set of interdependent MapReduce jobs. These jobs are submitted to the Hadoop cluster asynchronously, in parallel, and ordered only by data dependencies between them.
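The compiled workflow is a DAG of jobs gated only by data dependencies. The sketch below uses Python’s standard `graphlib` to illustrate the submission order Cascading effectively produces: jobs whose inputs exist can run in parallel as a “wave” (the job names are hypothetical).

```python
from graphlib import TopologicalSorter

def submission_waves(deps):
    """deps maps each compiled MapReduce job to the jobs whose output it
    consumes. Returns 'waves' of jobs that may be submitted in parallel,
    mirroring how a workflow's jobs are released as upstream data lands."""
    ts = TopologicalSorter(deps)
    ts.prepare()
    waves = []
    while ts.is_active():
        ready = sorted(ts.get_ready())   # all jobs whose inputs now exist
        waves.append(ready)
        ts.done(*ready)                  # pretend the whole wave completed
    return waves

# A small workflow compiled into four interdependent jobs:
waves = submission_waves({
    "load_a": set(),
    "load_b": set(),
    "join": {"load_a", "load_b"},
    "aggregate": {"join"},
})
# → [['load_a', 'load_b'], ['join'], ['aggregate']]
```

None of this structure is visible in the JobTracker UI, which sees only four unrelated MapReduce jobs arriving asynchronously.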

The result: Etsy users tracking a Cascading workflow via the JobTracker/ResourceManager interface did not have a clear idea how their source code became a Cascading workflow plan, how their Cascading workflows mapped to jobs submitted to Hadoop, which parts of the workflow were currently executing, or how to determine what happened when things went badly.

This situation also meant any attempt to educate users about optimizing workflows, best practices, or MapReduce itself was relegated to a whiteboard, Confluence doc, or (worse) an IRC channel, soon to be forgotten.

What Etsy needed was a user-facing tool for tracking cluster activity at the granularity of Cascading workflows, not MapReduce jobs – a tool that would make core concepts tangible and allow for proactive, independent assessment of workflow behavior at runtime.

Etsy’s Data Platform team decided to develop some internal tooling to solve these problems. The guidelines we used were:

Fast forward to now: I am pleased to announce Etsy’s open-source release of Sahale, a tool for visualizing Cascading workflows, to you, the Hadooping public!

Design Overview

Let’s take a look at how it works under the hood. The design we arrived at involves several discrete components which are deployed separately, as illustrated below:

Figure 2: Design overview.


The components of the tool are: 

The workflow is as follows:

  1. A user launches a Cascading workflow, and a FlowTracker is instantiated.
  2. The FlowTracker begins to dispatch periodic job metric updates to Sahale.
  3. Sahale commits incoming metrics to the database tables.
  4. The user points her browser to the Sahale web app.
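Steps 1-3 amount to a polling loop in the client process. A toy sketch, with `send` standing in for the HTTP POST to the Sahale web app and the snapshot fields invented for illustration:

```python
def run_tracker(poll_results, send):
    """Toy FlowTracker loop: each poll of the running Flow yields a
    metrics snapshot that is pushed to the server (step 2), which
    commits it to its tables (step 3). `poll_results` stands in for
    successive polls of the live flow."""
    for snapshot in poll_results:
        send(snapshot)
    send({"status": "FLOW_DONE"})  # final update when the flow finishes

received = []  # pretend server-side store, appended to on each POST
run_tracker(
    [{"jobs_total": 4, "jobs_done": 1}, {"jobs_total": 4, "jobs_done": 4}],
    received.append,
)
```

Because updates are periodic rather than one final report, the web app (step 4) can show progress while the workflow is still running.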

About FlowTracker 

That’s all well and good, but how does the FlowTracker capture workflow metrics at runtime?

First, some background. In the Cascading framework, each Cascade is a collection of one or more Flows. Each Flow is compiled by Cascading into one or more MapReduce jobs, which the user’s client process submits to the Hadoop cluster. With a reference to the running Flow, we can track various metrics about the Hadoop jobs the Cascading workflow submits.

Sahale attaches a FlowTracker instance to each Flow in the Cascade. The FlowTracker can capture its reference to the Flow in a variety of ways depending on your preferred Cascading DSL.

By way of example, let’s take a look at Sahale’s TrackedJob class. Users of the Scalding DSL need only inherit from TrackedJob to visualize their own Scalding workflows with Sahale. In the TrackedJob class, Scalding’s Job.run method is overridden to capture a reference to the Flow at runtime, and a FlowTracker is launched in a background thread to handle the tracking:

/**
 * Your jobs should inherit from this class to inject job tracking functionality.
 */
class TrackedJob(args: Args) extends com.twitter.scalding.Job(args) {
  @transient private val done = new AtomicBoolean(false)

  override def run(implicit mode: Mode) = {
    mode match {
      // only track Hadoop cluster jobs marked "--track-job"
      case Hdfs(_, _) => if (args.boolean("track-job")) runTrackedJob else super.run
      case _ => super.run
    }
  }

  def runTrackedJob(implicit mode: Mode) = {
    try {
      val flow = buildFlow
      trackThisFlow(flow)
      flow.complete
      flow.getFlowStats.isSuccessful // return Boolean
    } finally {
      // ensure all threads are cleaned up before we propagate exceptions or complete the run
      done.set(true)
    }
  }

  private def trackThisFlow(f: Flow[_]): Unit = { (new Thread(new FlowTracker(f, done))).start }
}

The TrackedJob class is specific to the Scalding DSL, but the pattern is easy to extend. We have used the tool internally to track Cascading jobs generated by several different DSLs.

Sahale selects a mix of aggregated and fine-grained metrics from the Cascading Flow and the underlying Hadoop MapReduce jobs it manages. The FlowTracker also takes advantage of the Cascading client’s caching of recently-polled job data whenever possible.

This approach is simple and has avoided incurring prohibitive data transfer costs or latency, even when many tracked jobs are executing at once. This has proven critical at Etsy, where it is common to run tens or hundreds of jobs simultaneously on the same Hadoop cluster. The data model is also easily extendable if additional metrics are desired. Support for a larger range of default Hadoop counters is forthcoming.
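The caching the tracker leans on can be sketched as a small TTL cache in front of the job-status fetch. Everything here is illustrative (the TTL, the counter shape, and the method names are not Sahale’s actual values):

```python
class CachedPoller:
    """Sketch of client-side caching: reuse recently polled job counters
    instead of asking the cluster on every update."""
    def __init__(self, fetch, ttl_seconds, clock):
        self.fetch = fetch            # the expensive call to the cluster
        self.ttl = ttl_seconds
        self.clock = clock
        self.cache = {}               # job_id -> (fetched_at, counters)

    def counters(self, job_id):
        now = self.clock()
        hit = self.cache.get(job_id)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]             # fresh enough: serve from cache
        counters = self.fetch(job_id)
        self.cache[job_id] = (now, counters)
        return counters

calls, t = [], [0]                    # record fetches; fake clock
poller = CachedPoller(lambda job: calls.append(job) or {"maps": 4},
                      ttl_seconds=10, clock=lambda: t[0])
poller.counters("job_1")   # cold: fetches from the cluster
poller.counters("job_1")   # warm: served from cache
t[0] = 11
poller.counters("job_1")   # TTL expired: fetched again
```

With many workflows polling at once, bounding fetches per job per window is what keeps the tracking overhead negligible.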

About Sahale

Figure 3: Workflow overview page.


The Sahale web app’s home page consists of two tables. The first enumerates running Cascading workflows. The second displays all recently completed workflows. Each table entry includes a workflow’s name, the submitting user, start time, job duration, the number of MapReduce jobs in the workflow, workflow status, and workflow progress.

The navigation bar at the top of the page provides access to the help page to orient new users, and a search feature to expose the execution history for any matching workflows.

Each named workflow links to a detail page that exposes a richer set of metrics:

Figure 4: Workflow details page, observing a job in progress.


The detail page exposes the workflow structure and job metrics for a single run of a single Cascading workflow. As before, the navigation bar provides links to the help page, back to the overview page, and to a history view for all recent runs of the workflow.

 The detail page consists of five panels:

Users can leverage these displays to access a lot of actionable information as a workflow progresses, empowering them to quickly answer questions like:

In the event of a workflow failure, even novice users can identify the offending job stage or stages as easily as locating the red nodes in the graph view. Selecting failed nodes provides easy access to the relevant Hadoop job logs, and the Sources/Sinks tab makes it easy to map a single failed Hadoop job back to the user’s source code. Before Sahale this was a frequent pain point for Hadoop users at Etsy.

Figure 5: A failed workflow. The offending MapReduce job is easy to identify and drill down into.


If a user notices a performance regression during iterative workflow development, the Job History link in the navigation bar will expose a comparative view of all recent runs of the same workflow:

Figure 6: History page, including aggregated per-workflow metrics for easy comparisons.


Here, users can view a set of bar graphs comparing various aggregated metrics from historical runs of the same workflow over time. As in the workflow graphs, color is used to indicate the final status of each charted run or the magnitude of the I/O involved with a particular run.

 Hovering on any bar in a chart displays a popup with additional information. Clicking a bar takes users to the detail page for that run, where they can drill down into the fine-grained metrics. The historical view makes mapping changes in source code back to changes in workflow performance a cinch.

Sahale at Etsy

After running Sahale at Etsy for a year, we’ve seen some exciting changes in the way our users interact with our Big Data stack. One of the most gratifying for me is the way new users can quickly ramp up and gain confidence self-assessing their workflows’ performance characteristics and storage impact on the cluster. Here’s one typical example timeline, with workflow and user name redacted to protect the excellent:

Around January 4th, one of our new users got a workflow up and running. By January 16th, this user had simplified the work needed to arrive at the desired result, cutting the size of the workflow nearly in half. By removing an unneeded aggregation step, the user further optimized the workflow down to a tight 3 stages by Feb. 9th. All of this occurred without extensive code reviews or expert intervention, just some Q&A in our internal IRC channel.

Viewing the graph visualizations for this workflow illustrates its evolution across the timeline much better:


Figure 7a: Before (Jan 4th)


Figure 7b: During (Jan. 16th)


Figure 7c: After (Feb. 9th)

It’s one thing for experienced analysts to recommend that new users try to filter unneeded data as early in the workflow as possible, or to minimize the number of aggregation steps in a workflow. It’s another thing for our users to intuitively reason about these best practices themselves. I believe Sahale has been a great resource for bridging that gap.


Sahale, for me, represents one of many efforts in our company-wide initiative to democratize Etsy’s data. Visualizing workflows at runtime has enabled our developers to iterate faster and with greater confidence in the quality of their results. It has also reduced the time my team spends determining if a workflow is production-ready before deployment. By open sourcing the tool, my hope is that Sahale can offer the same benefits to the rest of the Big Data community.

Future plans include richer workflow visualizations, more charts, additional metrics/Hadoop counter collection on the client side, more advanced mappings from workflow to users’ source code (for users of Cascading 2.6+), and out-of-the-box support for more Cascading DSLs.

Thanks for reading! As a parting gift, here’s a screenshot of one of our largest workflows in progress:

Figure 8: Etsy runs more Hadoop jobs by 7am than most companies do all day.




Q4 2014 Site performance report

Posted by and on February 9, 2015 / 2 Comments

Happy 2015!  We’re back with the site performance report for the fourth quarter of 2014.

We’re doing this report a bit differently than we have in the past. In our past reports, we compared metrics collected on one Wednesday in the current quarter to metrics collected on one Wednesday in the previous quarter. In an effort to get a more normalized look at our performance, for this report we’ve taken data across an entire week in December and are comparing it with data from an entire week in October. We plan to continue to explore new methods for discussing our site’s performance over time in future reports, so stay tuned!

In the meantime, let’s see how the numbers look for this quarter. Like last time, different members of the Performance team will write different sections – Natalya will discuss our server side performance, I will cover our synthetic front-end monitoring, and Jonathan will tell us about our real user front-end monitoring.

Server Side Performance – from Natalya Hoota

“Server side” means literally that: the time it takes for a request to execute on the server. We obtained this data by querying our web logs for requests made to five popular pages and our baseline page (a barebones version of a page with no content except a header and footer). The median and 95th percentile numbers for our server-side performance are presented below.

Q4 2014 ServerSide

We did not see any significant trends in the data; both the median and 95th percentile results were relatively flat, with deltas of less than 5% of each metric.

What we came to realize, however, is that we need to reevaluate both the statistical validity of our data and our measurement approach to make sure we can separate signal from noise. We are planning to do so in the Q1 2015 report. A few open questions remain: what should be considered a significant difference, how should we deal with seasonality, and what is a better way to represent our findings?

Synthetic Front-end Performance – from Allison McKnight

Here is the synthetic front-end performance data for Q4.  This data is collected by Catchpoint running tests on IE9 in New York, London, Chicago, Seattle, and Miami every two hours.  Catchpoint’s webpage response metric measures the time from the request being issued until the last byte of the final element of the page is received.
Q4 2014 Synthetic

Most of the changes in the median value were attributed to noise. Catchpoint gives us a standard deviation, which we used to calculate error ranges for our median values. The only statistically valid differences were improvements in the homepage web response time and start render time for listing and search pages.
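The noise check described above can be sketched as a simple comparison of medians against their error ranges. This is a simplified illustration of the idea, not Catchpoint’s actual methodology; the function names and sample figures are ours:

```javascript
// Decide whether a quarter-over-quarter change is distinguishable from noise
// by comparing the difference in medians against the combined error ranges
// derived from the reported standard deviations.
function errorRange(stdDev, sampleCount) {
    // Standard error of the estimate, scaled to a ~95% interval.
    return 1.96 * (stdDev / Math.sqrt(sampleCount));
}

function isSignificantChange(oldMedian, newMedian, oldErr, newErr) {
    // Treat the change as real only if the error ranges do not overlap.
    return Math.abs(newMedian - oldMedian) > (oldErr + newErr);
}

// Hypothetical numbers: a 110 ms drop in median webpage response time.
var q3 = { median: 1120, err: errorRange(340, 2000) };
var q4 = { median: 1010, err: errorRange(330, 2000) };
console.log(isSignificantChange(q3.median, q4.median, q3.err, q4.err)); // true
```

A change well inside the combined error ranges, by contrast, would be attributed to noise.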

These improvements can be attributed in part to the removal of some resources – serving fewer images on the homepage and the listing page led to a decrease in webpage response time, and two webfonts were removed from the search page.

Real User Monitoring – from Jonathan Klein

Here is the RUM (Real User Monitoring) data collected by mPulse. We measure median and 95th percentile page load time (literally the “Page Load” timer in mPulse).
Q4 2014 RUM

The news here is good, albeit unsurprising. The medians effectively haven’t changed since Q3, and the 95th percentiles are down noticeably. The steady medians match what we saw on the synthetic side, which is great to see.

As we mentioned in our last report, we had an issue at one of our CDNs where some CSS files were not being gzipped. This issue was resolved before we pulled this data, and our assumption is that this brought the 95th percentile down. One obvious question is why the homepage had such a large decrease relative to the other pages. We believe this is due to product level changes on our signed-in homepage, but as we mentioned last time we are currently unable to filter signed-in and signed-out users in mPulse. We’ve been told that a feature is on the way to enable that segmentation, and we’ll start differentiating in this report when that happens.

Conclusion – from Natalya Hoota

The last quarter of 2014 was a relatively quiet time in terms of infrastructure changes, and it showed in our performance metrics. We saw a slight improvement throughout the site as a result of an experiment with serving fewer images and fonts. Fixing a CSS compression bug at one of our CDNs helped our user experience as well.

We would like to find a simple and comprehensive way to represent performance trends over the quarter while also giving a snapshot of how we are doing at its end. Using averages loses the details of our users’ experience on the site, while using a daily (or weekly) median taken at the end of the quarter ignores effects that occurred during the rest of the quarter. In the reports to come, we will focus on refining the methodology of our data collection and interpretation.


Rebuilding the Foundation of Etsy’s Seller Tools

At its core, the Etsy marketplace is powered by its sellers. The Shop Management team at Etsy is comprised of engineers, designers, product managers, usability researchers, and data analysts who work together on the tools that sellers use every day to manage their shops.  These tools are a critical part of how the members of our community run their businesses, and as a result, are a critical part of Etsy’s business as well.

At the start of 2014, we were at an impasse. After years of iteration by many different teams, our seller tools were starting to show their age. New tools that had been added over time were often disconnected from core user workflows. This made new features hard to discover and often cumbersome to integrate into a seller’s daily activities. Design patterns had evolved, making the tools inconsistent and unintuitive, even to experienced shop owners. And despite increasing mobile traffic, few of our seller tools were optimized for mobile web.

From a technical perspective, some of our most important pages had also grown to be our slowest. Much of the business logic of our code was built into continually expanding page-specific web controllers, making it difficult to build and share functionality between the web stack and our native apps.

All of this led to a gradual degradation of our user experience. It was difficult for us to make tools that were intuitive and fit our sellers’ workflows, and the state of the code slowed our ability to rapidly develop and execute on new ideas.

We decided to re-evaluate how our seller tools are built. One year later, we are extremely excited to announce the release of our new Listings Manager. The task wasn’t simple and it required our team to collaborate on a whole new level to pull it off. Here’s a brief overview of how it all came together.


Rewrites are difficult

Anyone who has been in the software industry for a substantial amount of time knows that full-scale rewrites are generally a bad idea. In a famous article, Joel Spolsky referred to them as “the single worst strategic mistake that any software company can make.”

And with good reason! Rewrites take much longer than anyone estimates, they can cause new feature development to grind to a halt while chasing parity with your existing project, and, worst of all, you are trading your existing, battle-tested production code for brand new code with its own new bugs and overlooked nuances.

Aside from the technical challenges, rewritten products are often built in isolation from their users. Without regular feedback from real users to keep you on track, you can reach the end of the project and discover that you’ve drifted into building the wrong product — something that makes sense to you, but which fails to actually address the day-to-day problems faced by your users.

Faced with the challenge of solving some massive and fundamental problems with our seller tools, we needed to craft an approach that kept our rewrite technically sustainable and focused on the needs of our seller community. We built a plan that centered around a few major tenets:

Rethinking our CSS

Once the project kicked off, we wanted to get feedback as soon as possible. We started by creating static mockups and put them in front of sellers during remote research sessions. As the form of what we were going to build began to solidify, we found ourselves wanting to prototype features and rapidly iterate on them for the next round of testing. However, creating HTML prototypes was going to be cumbersome with the way we had previously been writing CSS.

In the past, building new features had always involved writing a fair amount of CSS, little of which was reusable. Our CSS was written in terms of pages and features instead of built around much broader patterns. The design team had an existing style guide that documented basic patterns, but the patterns were beginning to show their age; they weren’t responsive and reflected outdated visual styles. A redesign of the seller tools provided us with the perfect opportunity to experiment with mobile-optimized, patternized ways of constructing CSS.


Inspired by frameworks like Bootstrap and methodologies like object-oriented CSS, we built a brand new style guide that powered the development of the new seller tools. This style guide was focused around two patterns: component classes and utility classes. Components are abstracted visual patterns reused in many places, like the grid system or a listing card:

// Base alert styles
.alert {
    padding: 20px;
}

// Alert layouts
.alert--inline {
    display: inline-block;
}

.alert--fixed-top {
    position: fixed;
    top: 0;
}

// Alert themes
.alert--red {
    background-color: $red;
}

.alert--green {
    background-color: $green;
}

<div class="alert alert--inline alert--green">
    Your listing has been updated!
</div>

Utilities change visual details in one-off situations, like adding margin or making text bold:

.text-gray-lighter {
    color: $gray-lighter;
}

.text-smaller {
    font-size: 12px;
}

.strong {
    font-weight: bold;
}

<p class="strong">
    Listings Manager
</p>

<p class="text-smaller text-gray-lighter">
    Manage the listings in your Etsy shop
</p>

Everything was written to be responsive and independent of its surroundings. We wanted to be able to quickly build fully responsive pages using components and utilities while using little to no page-specific CSS.

With the style guide in place, spinning up a static HTML prototype for a usability testing session could be done in under an hour. Building multiple versions of a page was no sweat. When user reactions suggested a change in design direction, it was cheap to rip out old UI and build something new in its place, or even iterate on the style guide itself. And when we had settled on a direction after a round of research, we could take the prototype and start hooking it up in the product immediately with little design time needed.

A very intentional side effect of creating building blocks of CSS was that it was easy for engineers to style their work without depending on a designer. We invested a lot of time up front in documenting how to use the style guide so there was a very low barrier to entry for creating a clean, consistent layout. The style guide made it easy to build layouts and style design elements like typography and forms, allowing designers more time to focus on some of the more complex pieces of UI.

Architecting a rich web application

Early on in the design and usability testing process, we saw that part of what didn’t work with the current app was the isolation of features into standalone pages rather than workflows. We began thinking about highly interactive modules that could be used wherever it made sense in a flow. The idea of a single page app began to take hold. And when we thought about the shared API framework and infrastructure we were already building to support mobile apps, it became easy to think of our JavaScript client as just another consumer.

A thick-client JavaScript page was new territory for us. Over the course of the project, we had to address numerous challenges as we tried to integrate this new approach into our current architecture.

One of the most daunting parts of a rewrite is working with unknown techniques and technologies. We evaluated some frameworks and libraries and eventually found ourselves leaning toward Backbone. At Etsy, we value using mature, well-understood technologies, and Backbone was already our go-to solution for adding more structure to the client-side code. We also really liked the simplicity and readability of the library, and we could leverage the existing Backbone code that we had already written across the site.

As we began talking about the app architecture we found ourselves looking for a little more structure, so we began to look at some layers on top of Backbone. We arrived at Marionette, a library whose classes fit the vocabulary we were already using to describe parts of the app: Regions, Lists, Layouts, and Modules. In digging deeper, we found an easy-to-read codebase and a great community. Shortly after we started to build our new platform with Marionette, one of our engineers became a core contributor to the framework. Marionette struck a great balance between providing structure and not being too opinionated. For example, using Behaviors (a way of building reusable interactions) a flexible modal pattern and a particular instance might be defined as:

var OverlayBehavior = Backbone.Marionette.Behavior.extend({
    ui: {
        close: '[data-close]'
    },

    events: {
        'click @ui.close': 'onClickClose'
    },

    onClickClose: function() {
        Radio.commands.execute('app', 'mask:close');
    }
});

var RenewOverlay = Backbone.Marionette.ItemView.extend({
    behaviors: {
        Overlay: {
            behaviorClass: OverlayBehavior
        }
    }
});

And launched using Marionette’s command method:

Radio.commands.execute('app', 'overlay:show', new RenewOverlay());

The app will respond by passing this newly constructed view into an overlay region and taking care of the mask and animations.

With these decisions in place, we began prototyping for our user research while fleshing out some of the open questions we had about our application.

Shipping data to the client efficiently

Our old seller tools were some of the slowest parts of the site, making performance one of our primary concerns. Part of what made these pages slow was the amount of data we needed to access from across our cluster of database shards. Fortunately, at this time the API team was hard at work building a new internal API framework to power fast, custom views to all of our clients by concurrently fanning out from a single endpoint to many others:

class Api_BespokeAjax_Shop_Listings_Form {
    public function handle($client, $input, $response = null) {
        $listing = $client->shop->listingsGet($input->listing_id);

        $proxies = [
            'listing' => $listing,
            'shipping' => $client->shop->listingsGetShipping($input->listing_id),
            'processing' => $client->shop->listingsGetProcessing($input->listing_id),
            'variations' => $client->shop->listingsGetVariations($input->listing_id),
        ];

        $response_data = Callback_Orchestrator::runAndExtract($proxies, __CLASS__);
        Api_Response::httpCode($response, 200);
        return $response_data;
    }
}

In the above example, the orchestrator resolves the results of the fan-out and provides us with a single, composed result. Wrapping existing endpoints meant we didn’t have to rewrite chunks of functionality. The new API endpoints and the style guide gave us the flexibility to put live, high-fidelity working prototypes in front of users with about the same ease as faking it.
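Conceptually, the orchestrator’s fan-out resembles issuing several requests concurrently and composing their results into one payload. A minimal JavaScript sketch of the same pattern (the fetcher functions and data here are hypothetical stand-ins, not Etsy’s actual API):

```javascript
// Fan out to several endpoints concurrently and compose a single response.
// Each fetcher below is a placeholder for a real endpoint call.
function fetchListing(id)    { return Promise.resolve({ id: id, title: 'Hand-knit scarf' }); }
function fetchShipping(id)   { return Promise.resolve({ id: id, cost: 450 }); }
function fetchVariations(id) { return Promise.resolve({ id: id, colors: ['red', 'green'] }); }

function listingFormData(id) {
    return Promise.all([fetchListing(id), fetchShipping(id), fetchVariations(id)])
        .then(function(results) {
            // Compose the fanned-out results into one object, analogous to
            // the single composed result the orchestrator returns in PHP.
            return { listing: results[0], shipping: results[1], variations: results[2] };
        });
}
```

Because the requests run concurrently, the composed response is bounded by the slowest endpoint rather than the sum of all of them.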

So what’s next?

Now that we’ve gone through building a major product on this new platform, we’re already setting our sights on how to improve these foundational components, and evaluating other projects that might make sense to approach in a similar fashion in the future.

Seeing how other features at Etsy can start leveraging some of these innovations today is even more exciting; whether it’s building out features on our new API to share functionality between the web and our apps, or using the client-side architecture and style guide to quickly build out more rich and interactive experiences on Etsy.

With today’s launch of the new Listings Manager, we’ve reached a milestone with this new tooling, but there is still much to do. As we keep iterating on these components and architectures and apply them to new and different products, we’ll no doubt find new problems to solve. We’ll let you know what we find.


This article was written as a collaboration between Diana Mounter, Jason Huff, Jessica Harllee, Justin Donato, and Russ Posluszny.


Transitioning to SCSS at Scale

Posted by on February 2, 2015 / 18 Comments

Naively, CSS appears easy to comprehend — it doesn’t have many programming constructs, and it’s a declarative syntax that describes the appearance of the DOM rather than an executable language. Ironically it’s this lack of functionality that can make CSS difficult to reason about. The inability to add scripting around where and when selectors are executed can make wide-reaching changes to CSS risky.

CSS preprocessors introduce advanced features that the current iteration of the CSS specification does not. This functionality commonly includes variables, functions, mixins and execution scope, meaning that developers can embed logic that determines how CSS is written and executed. If correctly applied, preprocessors can go a long way towards making CSS more modular and DRY, which in turn results in long-term maintainability wins for a codebase.

One of the goals of the Front-end Infrastructure Team for 2014 was to fully transition the CSS codebase at Etsy to SCSS [1]. SCSS is a mature, versatile CSS preprocessor, and Etsy’s designers and developers decided to integrate it into our tech stack.  However, we knew that this effort would be non-trivial with a codebase of our size.  As of October 2014, we had 400,000+ lines of CSS split over 2000+ files.

In tandem with a team of designers, the Front-end Infrastructure Team began developing the processes to deploy SCSS support to all development environments and our build pipeline. In this post I’ll cover the logic behind our decisions, the potential pitfalls of a one-time CSS-to-SCSS conversion, and how we set up tooling to optimize for maintainability moving forward.


The biggest validation of the potential for SCSS at Etsy was the small team of designers beta-testing it for more than six months before our work began.  Since designers at Etsy actively push code, a single product team led the initial charge to integrate SCSS into their workflow.  They met regularly to discuss what was and was not working for them and began codifying their work into a set of best practices for their own project.

It was through the initial work of this product team that the rest of the company began to see the value and viability of introducing a CSS preprocessor.  The input of these designers proved invaluable when the Front-end Infrastructure Team began meeting to hatch a plan for the deployment of SCSS company-wide, and their list of best practices evolved into an SCSS style guide for future front-end development.

After evaluating the landscape of CSS preprocessors we decided to move forward with SCSS. SCSS is an extremely popular project with an active developer community, it is feature rich with excellent documentation, and because the SCSS (Sassy CSS) syntax is a superset of CSS3, developers wouldn’t have to learn a new syntax to start using SCSS immediately. With regards to performance, the Sass team prioritizes the development and feature parity of libsass, a C/C++ port of the Sass engine [2]. We assumed that using libsass via the NodeJS bindings provided by node-sass would enable us to integrate an SCSS compilation step into our builds without sacrificing speed or build times.

We were also excited about software released by the larger SCSS community, particularly tools like scss-lint. In order to compile SCSS we knew that a conversion of CSS files to SCSS meant remedying any syntactical bugs within our CSS code base.  Since our existing CSS did not have consistently-applied coding conventions, we took the conversion as an opportunity to create a consistent, enforceable style across our existing CSS.  Coupling this remediation with a well-defined style guide and a robust lint, we could implement tooling to keep our SCSS clean, performant and maintainable moving forward.

An Old and New CSS Pipeline

Our asset build pipeline is called “builda” (short for “build assets”). It was previously a set of PHP scripts that handled all JS and CSS concatenation/minification/versioning. When using libraries written in other languages (e.g. minification utilities), builda would shell out to those services from PHP. On developer virtual machines (VMs) builda would build CSS dynamically per request, while it would write concatenated, minified and versioned CSS files to disk in production builds.

We replaced the CSS component of builda with an SCSS pipeline written in NodeJS. We chose Node for three reasons. First, we had already re-written the JavaScript build component in Node a year ago, so the tooling and strategies for deploying another Node service internally were familiar to us. Second, we’ve found that writing front-end tools in JavaScript opens the door for collaboration and pull requests from developers throughout the organization. Finally, a survey of the front-end software ecosystem reveals a strong preference towards JavaScript, so writing our build tools in JS would allow us to keep a consistent codebase when integrating third-party code.

One of our biggest worries in the planning stages was speed. SCSS would add another compilation step to an already extensive build process, and we weren’t willing to lengthen development workflows or build times. Fortunately, we found that by using libsass we could achieve a minimum of 10x faster compilation speeds over the canonical Ruby gem.

We were dedicated to ensuring that the SCSS builda service was a seamless upgrade from the old one. We envisioned writing SCSS in your favorite text editor, refreshing the browser, and having the CSS render automatically from a service already running on your VM — just like the previous CSS workflow. In production, the build pipeline would still output properly compiled, minified and versioned CSS to our web servers.

Despite a complete rewrite of the CSS service, with a robust conversion process and frequent sanity checking, we were able to replace CSS with SCSS and avoid any disruptions. Workflows were identical to before the rewrite and developers began writing SCSS from day one.

Converting Legacy Code

In theory, converting CSS to SCSS is as simple as changing the file extension from .css to .scss.  In practice it’s much more complicated.

Here’s what’s hard about CSS: it fails quietly. If selectors are malformed or parameters are written incorrectly (e.g. #0000000 instead of #000000), the browser simply ignores the rule. These errors were a blocker on our conversion, because when SCSS is compiled, syntax errors prevent a file from compiling entirely.

But errors were only one part of the problem. What about intentionally malformed selectors in the form of IE-hacks? Or, what about making changes to legacy CSS in order to conform to new lint rules that we’d impose on our SCSS? For example, we wanted to replace every instance of a CSS color-keyword with its hex value.

Our conversion was going to touch a lot of code in a lot of places. Would we break our site by fixing our CSS?  How could we be confident that our changes wouldn’t cause visual regressions?

Conventionally there are some patterns to solve this problem. A smaller site might remedy the syntax bugs, iterate every page with a headless browser and create visual diffs for changes. Alternatively, given a certain size it might even be possible to manually regression test each page to make sure the fixes render smoothly.

Unfortunately, our scale and continuous experimentation infrastructure make both options impossible, as there are simply far too many combinations of pages and experiments to test against, all subject to change at a moment’s notice. A back-of-the-envelope calculation puts the number of possible variants of Etsy.com at any given time at ~1.2M.

We needed to clean any incorrect CSS and enforce new lint rules before we performed the SCSS rename, and we needed to confirm that those fixes wouldn’t visually break our site without the option to look at every page. We broke the solution into two distinct steps: the “SCSS Clean” and the “SCSS Diff.”

SCSS Clean

We evaluated various ways to perform the CSS fixes, initially involving an extensive list of regular expressions to transform incorrect patterns we identified in the code. But that method quickly became untenable as our list of regular expressions was difficult to reason about.

Eventually we settled on our final method: using parsers to convert any existing source CSS/SCSS code into Abstract Syntax Trees (ASTs), which we could then manipulate to transform specific types of nodes. For the unfamiliar, an AST is a representation of the structure of parsed source code. We used the Reworkcss CSS parser to generate CSS ASTs and gonzales-pe to generate SCSS ASTs, and wrote a custom adapter between the two formats to streamline our style and syntax changes. For an example of what a generated AST looks like, there’s a great one in the Reworkcss CSS parser’s documentation.

By parsing our existing CSS/SCSS into ASTs, we could correct errors at a much more granular level by targeting selectors or errors of specific types. Going back to the color-keyword example, this gave us a cleaner way to replace properties that specified color values as color-keywords (“black”) with their equivalent hexadecimal representation (#000000).  By using an AST we could perform the replacement without running the risk of replacing color words in unintended locations (e.g. selectors: “.black-header”) or navigating a jungle of regular expressions.

In summary, our cleaning process was:

  1. Generate an AST for the existing CSS/SCSS file.
  2. Run a script we created to operate over the AST to identify and fix errors/discrepancies on a per-property level.
  3. Save the output as .scss.
  4. Run the .scss file through the libsass compiler until the successful compilation of all files.
  5. Iterate on steps #2-4, including manual remediation efforts on specific files as necessary.
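The per-property fixes in step 2 can be illustrated with a toy AST transform. The tree shape below is a deliberately simplified stand-in for what Reworkcss or gonzales-pe actually produces, and the keyword table is abbreviated:

```javascript
// Replace color keywords with hex values in declaration values only,
// leaving selectors (e.g. ".black-header") untouched. Operating on the
// parsed tree avoids the false positives a regex over raw CSS would hit.
var COLOR_KEYWORDS = { black: '#000000', white: '#ffffff', red: '#ff0000' };

function cleanAst(ast) {
    ast.rules.forEach(function(rule) {
        rule.declarations.forEach(function(decl) {
            var hex = COLOR_KEYWORDS[decl.value];
            if (hex) {
                decl.value = hex;  // e.g. "black" -> "#000000"
            }
        });
    });
    return ast;
}

var ast = {
    rules: [{
        selector: '.black-header',  // contains "black" but is never rewritten
        declarations: [{ property: 'color', value: 'black' }]
    }]
};
cleanAst(ast);
```

After the transform, the declaration value is hex while the selector text is untouched, which is exactly the property-level precision the regex approach lacked.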


Cleaning our CSS was only half the battle. We also needed a way to confirm that our cleaned CSS wouldn’t break our site in unexpected ways, and to make that determination automatically across thousands of files.

Again we turned to ASTs. ASTs strip away superficial differences in source code, reducing it to core language constructs. Thus we could conclude that if two ASTs were deeply equivalent, regardless of superficial differences in their sources, they would result in the same rendered CSS.

We used our Continuous Integration (Jenkins) server to execute the following process and alert us after each push to production:

  1. Run the old builda process with the original, untouched CSS, resulting in the minified, concatenated and versioned CSS that gets deployed to production servers and the live site.  Build an AST from this output.
  2. Concurrently to step 1, run the SCSS conversion/clean, generating SCSS files from CSS.  Run these SCSS files through SCSS builda, resulting in minified, concatenated, and versioned CSS from which we could generate an AST.
  3. Diff the ASTs from steps 1 and 2.
  4. Display and examine the diff.  Iterate on steps 1-3, modifying the cleaning script, the SCSS builda code or manually addressing issues in CSS source until the ASTs are equivalent.
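The diff in step 3 reduces to a deep-equality check between the two trees. In JavaScript terms, using plain objects in place of real (much larger) CSS ASTs:

```javascript
// Two ASTs are considered equivalent when they are deeply equal, even if
// the source files that produced them differ in whitespace or formatting.
function astEqual(a, b) {
    if (a === b) return true;
    if (typeof a !== 'object' || typeof b !== 'object' || a === null || b === null) {
        return false;
    }
    var keysA = Object.keys(a), keysB = Object.keys(b);
    if (keysA.length !== keysB.length) return false;
    return keysA.every(function(key) {
        return astEqual(a[key], b[key]);
    });
}

// "body{color:#000}" and "body { color: #000; }" parse to the same tree.
var fromOldBuild = { selector: 'body', declarations: [{ property: 'color', value: '#000' }] };
var fromScssBuild = { selector: 'body', declarations: [{ property: 'color', value: '#000' }] };
console.log(astEqual(fromOldBuild, fromScssBuild)); // true
```

In practice a real diff also reports *where* the trees diverge, but equivalence is the signal that gates the conversion.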


With equivalent ASTs we gained confidence that despite touching the thousands of CSS files across Etsy.com, the site would look exactly the same before and after our SCSS conversion.  Integrating the process into CI gave us a quick and safe way to surface the limits of our cleaning/SCSS implementations by using live code but not impacting production.

Going Live

With sufficient confidence via AST diffing, our next step was to determine how to deploy to production safely. Here was our deployment strategy:


Using the original CSS as source, we added the SCSS builda process to our deployment pipeline. On a production push it would take the original CSS, clean it, create SCSS files and then compile them to CSS files in a separate directory on our production servers. We continued to serve all traffic the CSS output of our existing build process and kept the new build flagged off for production users.  This allowed us to safely run a dress rehearsal of the new conversion and build systems during deployments and monitor the system for failures.


Once the SCSS builda process ran for several days (with 25-50 pushes per day) without incident, we used our feature flagging infrastructure to ramp up 50% of our production users to use the new SCSS builda output. We monitored graphs for issues.
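Percentage-based ramping of this kind is typically implemented by bucketing users deterministically, so the same user always sees the same variant. A hedged sketch of the idea (not Etsy’s actual feature-flag code; the hash is a toy stand-in for a production-quality one):

```javascript
// Deterministically bucket a user into [0, 100) based on the user id and
// flag name, then compare against the ramp percentage.
function bucket(userId, flagName) {
    var str = flagName + ':' + userId;
    var hash = 0;
    for (var i = 0; i < str.length; i++) {
        hash = ((hash << 5) - hash + str.charCodeAt(i)) | 0;
    }
    return Math.abs(hash) % 100;
}

function isEnabled(userId, flagName, rampPercent) {
    return bucket(userId, flagName) < rampPercent;
}

// At 100% everyone sees the new SCSS builda output; at 0% nobody does.
console.log(isEnabled(42, 'scss_builda', 100)); // true
console.log(isEnabled(42, 'scss_builda', 0));   // false
```

Salting the hash with the flag name keeps a user’s bucket for one experiment independent of their bucket for another.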

After several days at 50%, we ramped up SCSS builda output to 100% of Etsy.com users and continued to monitor graphs.


The final step was to take a few hours to hold production pushes and convert our CSS source to the converted SCSS.  Since our SCSS builda process generated its own cleaned SCSS, transitioning our source was as simple as replacing the contents of our css/ directory with those generated SCSS files.

One 1.2M-line deployment later, Etsy.com was running entirely on SCSS source code.

Optimizing for Developer Productivity and Code Maintainability

We knew that integrating a new technology such as SCSS into the stack would require up-front work on our end with regard to communication, teaching, and developer tools. Beyond the work related to the build pipeline, it was important to make sure developers felt confident writing SCSS from day one.

Communication and Teaching

The style guide and product work by the initial SCSS design team were key in showing the value of adopting SCSS to others throughout the organization. The speed at which new, consistent, and beautiful pages could be created with the new style guide was impressive.  We worked closely with the designers on email communication and lunch-and-learn sessions before the official SCSS launch day and crafted documentation within our internal wiki.

Developer Tools and Maintainability

Beyond syntax differences, there are a few core pitfalls/pain points for developers when using SCSS:

  1. SCSS is compiled, so syntax errors explode compilation and no CSS hits the page.
  2. You can accidentally bloat your CSS by performing seemingly harmless operations (here’s looking at you, @extend).
  3. Nested @import’s within SCSS files can complicate tracing the source files for specific selectors.

We found the best way to remedy these was to integrate feedback into development environments.

For broken SCSS, a missing/non-compiling CSS file becomes an error message at the top of the document:


For maintainability, the integration of a live, in-browser SCSS lint was invaluable:


The lint rules defined by our designers help keep our SCSS error-free and consistent and are used within both our pre-deployment test suites and in-browser lint.  Luckily the fantastic open source project scss-lint has a variety of configurable lint rules right out of the box.
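scss-lint reads its rules from a YAML file. A hypothetical `.scss-lint.yml` sketch follows; the linter names are from the scss-lint project, but the specific settings here are illustrative rather than our actual configuration:

```yaml
linters:
  Indentation:
    severity: error
    width: 2
  PropertySortOrder:
    enabled: false
  SelectorDepth:
    max_depth: 3
```

Because the same file drives both the in-browser lint and the pre-deployment test suite, developers see the identical rule set at every stage.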

Lastly, due to nested SCSS file structures, source maps for inspecting file dependencies in browser developer tools were a must. These were straightforward to implement since libsass provides source map support.

With the SCSS build processes, live lint, source maps, test suite upgrades, and education around the new style guide in place, our final internal conversion step was pushing environment updates to all developer VMs. As with the SCSS production pipeline, the developer environment rollout involved rigorous testing and iteration, and gathering feedback from an opt-in developer test group was key before rolling out the tooling to the entire company.

Conclusions and Considerations

The key to making any sweeping change within a complex system is building confidence, and transitioning from CSS to SCSS was no different.  We had to be confident that our cleaning process wouldn’t produce SCSS that broke our site, and we had to be confident that we built the right tools to keep our SCSS clean and maintainable moving forward. With proper education, tooling and sanity checks throughout the process, we were able to move Etsy to SCSS with minimal disruption to developer workflows or production users.

  1. We use SCSS to refer to the CSS-syntax version of Sass. For all intents and purposes, SCSS and Sass are interchangeable throughout this post.
  2. In order to maintain our old system’s build behavior and prevent redundant CSS imports, we forked libsass to support compass-style import-once behavior.
  3. Graphics from Daniel Espeset – Making Maps: The Role of Frontend Infrastructure at Etsy – Fronteers 2014 Presentation (http://talks.desp.in/fronteers2014/)

You can follow Dan on Twitter at @dxna.


Announcing Hound: A Lightning Fast Code Search Tool

Posted by and on January 27, 2015 / 31 Comments

Today we are open sourcing a new tool to help you search large, complex codebases at lightning speed. We are calling this tool Hound. We’ve been using it internally for a few months, and it has become an indispensable tool that many engineers use every day.

The Problem Hound Solves

Before Hound, most engineers used ack or grep to search code, but with our growing codebase this started taking longer and longer. Even worse, searching across multiple repositories was so slow and cumbersome that it was becoming frustrating and error-prone. Since it is easy to overlook dependencies between repositories, and hard to even know which you might need to search, this was a big problem. Searching multiple repositories was especially painful if you wanted to search a repo that you didn’t have cloned on your machine. Due to this frustration, Kelly started working on a side project to try out some of the ideas and code from this article by Russ Cox. The code was robust, but we wanted to tweak some of the criteria for excluding files, and create a web UI that could talk to the search engine.

The end goal was to provide a simple web front-end with linkable search results that could perform regular expression searches on all of our repositories quickly and accurately. Hound accomplishes this with a static React front-end that talks to a Go backend. The backend keeps an up-to-date index for each repository and answers searches through a minimal API.

Why Create a New Tool?

Some people might point out that tools like this already exist – OpenGrok comes to mind immediately. Our main beef with OpenGrok is that it is difficult to deploy, and it has a number of hard requirements that are not trivial to install and configure. To run Hound you just need Go 1.3+. That’s it. A browser helps to see the web UI, but with a command line version on the way and a Sublime Text plugin already live, the browser is optional. We wanted to lower the barrier to entry for this tool so much that anyone, anywhere, could start using Hound to search their code in seconds or minutes, not hours.
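As a taste of that low barrier, Hound's setup is essentially a single config.json listing the repositories to index. This is a hedged sketch based on the project's quick start at the time of writing; check the README for the current format:

```json
{
  "dbpath": "data",
  "repos": {
    "Hound": {
      "url": "https://github.com/etsy/hound.git"
    }
  }
}
```

With a config like this in place, running the houndd server clones and indexes each repo and serves the search UI locally.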

Get It

Hound is easy enough to use that we recommend you just clone it and see for yourself. We have committed to using the open source version of Hound internally, so we hope to address issues and pull requests quickly. That’s enough jibber jabber – check out the quick start guide and get searching!


Hound trained at Etsy by Jonathan Klein and Kelly Norton


Introducing statsd-jvm-profiler: A JVM Profiler for Hadoop

Posted by on January 14, 2015 / 13 Comments

At Etsy we run thousands of Hadoop jobs over hundreds of terabytes of data every day.  When operating at this scale optimizing jobs is vital: we need to make sure that users get the results they need quickly, while also ensuring we use our cluster’s resources efficiently.  Actually doing that optimizing is the hard part, however.  To make accurate decisions you need measurements, and so we have created statsd-jvm-profiler: a JVM profiler that sends the profiling data to StatsD.

Why Create a New Profiler?

There are already many profilers for the JVM, including VisualVM, YourKit, and hprof.  Why do we need another one?  Those profilers are all excellent tools, and statsd-jvm-profiler is not intended to entirely supplant them.  Instead, statsd-jvm-profiler, inspired by riemann-jvm-profiler, is designed for a specific use-case: quickly and easily profiling Hadoop jobs.

Profiling Hadoop jobs is a complex process.  Each map and reduce task gets a separate JVM, so one job could have hundreds or even thousands of distinct JVMs, running across the many nodes of the Hadoop cluster.  Using frameworks like Scalding complicates it further: one Scalding job will run multiple Hadoop jobs, each with many distinct JVMs.  As such it is not trivial to determine exactly where the code you want to profile is running.  Moreover, storing and transferring the snapshot files produced by some profilers has also been problematic for us due to the large size of the snapshots.  Finally, at Etsy we want our big data stack to be accessible to as many people as possible, and this includes tools for optimizing jobs.  StatsD and Graphite are used extensively throughout Etsy, so by sending data to StatsD, statsd-jvm-profiler enables users to use tools they are already familiar with to explore the profiling data.

Writing the Profiler

For simplicity, we chose to write statsd-jvm-profiler as a Java agent, which means it runs in the same JVM as the process being instrumented.  The agent code runs before the main method of that process.  Implementing an agent is straightforward: define a class that has a premain method with this signature:

package com.etsy.agent;

import java.lang.instrument.Instrumentation;

public class ExampleAgent {
    public static void premain(String args, Instrumentation instrumentation) {
        // Agent code here runs before the application's main method
    }
}
The agent class should be packaged in a JAR whose manifest specifies the Premain-Class attribute:

Premain-Class: com.etsy.agent.ExampleAgent

We are using Maven to build statsd-jvm-profiler, so we use the maven-shade-plugin’s ManifestResourceTransformer to set this property, but other build tools have similar facilities.
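For reference, the maven-shade-plugin configuration looks roughly like the fragment below; this is a sketch of the relevant transformer only, so consult the plugin documentation for the full plugin setup:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <transformers>
      <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
        <manifestEntries>
          <Premain-Class>com.etsy.agent.ExampleAgent</Premain-Class>
        </manifestEntries>
      </transformer>
    </transformers>
  </configuration>
</plugin>
```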

Finally, we used the JVM’s management interface to actually obtain the profiling data.  java.lang.management.ManagementFactory provides a number of MXBeans that expose information about various components of the JVM, including memory usage, the garbage collector, and running threads.  By pushing this data to StatsD, statsd-jvm-profiler removes the need to worry about where the code is running – all the metrics are available in a central location.
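To make that concrete, here is a minimal sketch of reading a few of those MXBeans. The actual metrics statsd-jvm-profiler collects are richer, but the access pattern is the same:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.ThreadMXBean;

public class MXBeanSketch {
    public static void main(String[] args) {
        // Heap and non-heap usage, in bytes.
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        System.out.println("heap.used " + memory.getHeapMemoryUsage().getUsed());
        System.out.println("nonheap.used " + memory.getNonHeapMemoryUsage().getUsed());

        // Cumulative GC counts and times, one bean per collector.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println("gc." + gc.getName() + ".count " + gc.getCollectionCount());
            System.out.println("gc." + gc.getName() + ".time " + gc.getCollectionTime());
        }

        // Live threads, whose stack traces are the raw material for CPU sampling.
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        System.out.println("threads.live " + threads.getThreadCount());
    }
}
```

Instead of printing, the profiler formats values like these as metric names and ships them to StatsD on a timer.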


There were some issues that came up as we developed statsd-jvm-profiler.  First, statsd-jvm-profiler uses a ScheduledExecutorService to periodically run the threads that actually perform the profiling.  However, the default ScheduledExecutorService runs non-daemon threads, which will keep the JVM alive even after the main thread has exited.  This is not ideal for a profiler: the JVM stays up and continues to report profiling data even though nothing is running except the profiler itself.  Guava has functionality to create a ScheduledExecutorService that exits when the application is complete, which statsd-jvm-profiler uses to work around this issue.
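The plain-JDK version of that workaround is to mark the executor's threads as daemon threads, which is essentially what Guava's helper arranges for you. A minimal sketch, not the profiler's actual code:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DaemonScheduler {
    // A scheduler whose threads are daemons, so they never keep the
    // JVM alive after the application's main thread exits.
    public static ScheduledExecutorService newDaemonScheduler() {
        return Executors.newSingleThreadScheduledExecutor(runnable -> {
            Thread t = new Thread(runnable, "profiler-sampler");
            t.setDaemon(true);
            return t;
        });
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = newDaemonScheduler();
        scheduler.scheduleAtFixedRate(
                () -> System.out.println("sample"), 0, 10, TimeUnit.MILLISECONDS);
        Thread.sleep(50);
        // No shutdown() needed: when main exits, the daemon thread dies with it.
    }
}
```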

Safepoints are another interesting aspect of profiling the JVM.  A thread is at a safepoint when it is in a known state: all roots for garbage collection are known and all heap contents are consistent.  At a safepoint, a thread’s state can be safely observed or manipulated by other threads.  Garbage collection must occur at a safepoint, but a safepoint is also required to sample thread state the way statsd-jvm-profiler does.  However, the JVM can optimize safepoints out of hot methods.  As such, statsd-jvm-profiler’s sampling can be biased towards cold methods.  This is not a problem unique to statsd-jvm-profiler – any profiler that samples thread state has the same bias.  In practice this bias may not be that meaningful.  It is important to be aware of, but an incomplete view of application performance that still enables you to make improvements is better than no information.

How to Use statsd-jvm-profiler

statsd-jvm-profiler will profile heap and non-heap memory usage, garbage collection, and the aggregate time spent executing each function.  You will need the statsd-jvm-profiler jar on the host where the JVM you want to profile will run.  Since statsd-jvm-profiler is a Java agent, it is enabled with the -javaagent argument to the JVM.  You are required to provide the hostname and port number for the StatsD instance to which statsd-jvm-profiler should send metrics.  You can also optionally specify a prefix for the metrics emitted by statsd-jvm-profiler as well as filters for the functions to profile. 
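An invocation looks something like the following. The agent-argument names (server, port, prefix) and the jar path are placeholders here; confirm the exact syntax against the project README:

```shell
java \
  -javaagent:/path/to/statsd-jvm-profiler.jar=server=statsd.example.com,port=8125,prefix=profiling.myjob \
  -jar my-application.jar
```

For Hadoop jobs, the same flag goes into the JVM options for the map and reduce tasks rather than a direct java invocation.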


An example of using statsd-jvm-profiler to profile Scalding jobs is provided with the code.

statsd-jvm-profiler will output metrics under the “statsd-jvm-profiler” prefix by default, or you can specify a custom prefix.  Once the application being profiled has finished, all of the data statsd-jvm-profiler produced will be available in whatever backend you are using with StatsD.  What do you do with all that data? Graph it!  We have found flame graphs to be a useful method of visualizing the CPU profiling data, and a script to output data from Graphite into a format suitable for generating a flame graph is included with statsd-jvm-profiler:

Example Flame Graph

The memory usage and garbage collection metrics can be visualized directly:

Example Metrics

Using the Profiler’s Results

We’ve already used the data from statsd-jvm-profiler to determine how best to optimize jobs.  For example, we wanted to profile a job after some changes that had made it slower.  The flame graph made it obvious where the job was spending its time.  The wide bars on the left and right of this image are from data serialization/deserialization.  As such we knew that speeding up the job would come from improving the serialization or reducing the amount of data being moved around – not in optimizing the logic of the job itself. 

Flame Graph with Serialization

We also made a serendipitous discovery while profiling that job: it had been given 3 GB of heap, but it was not using anywhere near that much, so we could reduce its heap size.  Such chance findings are a great advantage of making profiling simple.  You are more likely to make these chance discoveries if you profile often and make analysis of your profiling data easier.  statsd-jvm-profiler and Graphite solve this problem for us.

Get statsd-jvm-profiler

Want to try it out yourself?  statsd-jvm-profiler is available on Github now!


Q3 2014 Site Performance Report

Posted by , and on December 22, 2014 / 6 Comments

We’re well into the fourth quarter of 2014, and it’s time to update you on how we did in Q3. This is either a really late performance report or an early Christmas present – you decide! The short version is that server side performance improved due to infrastructure changes, while front-end performance got worse because of third party content and increased amounts of CSS/JS across the site. In most cases the additional CSS/JS came from redesigns and new features that were part of winning variants that we tested on site. We’ve also learned that we need some better front-end monitoring and analysis tools to detect regressions and figure out what caused them.

We are also trying something new for this post – the entire performance team is involved in writing it! Allison McKnight is authoring the server side performance section, Natalya Hoota is updating us on the real user monitoring numbers, and Jonathan Klein is doing synthetic monitoring, this introduction, and the conclusion. Let’s get right to it – enter Allison:

Server Side Performance – From Allison McKnight

Here are median and 95th percentile times for signed-in users on October 22nd:



The homepage saw a significant jump in median load times as well as an increase in 95th percentile load times. This is due to our launch of the new signed-in homepage in September. The new homepage shows the activity feed of the signed-in user and requires more back-end time to gather that data. While the new personalization on the homepage did create a performance regression, the redesign had a significant increase in business and engagement metrics. The redesign also features items from a more diverse set of shops on the homepage and displays content that is more relevant to our users. The new version of the homepage is a better user experience, and we feel that this justifies the increase in load time. We are planning to revisit the homepage to further tune its performance in the beginning of 2015.

The rest of the pages saw a nice decrease in both median and 95th percentile back-end times. This is due to a rollout of some new hardware – we upgraded our production memcached boxes.

We use memcached for a lot of things at Etsy. In addition to caching the results of queries to the database, we use memcached to cache things like a user’s searches or activity feed. If you read our last site performance report, you may remember that in June we rolled out a dedicated memcached cluster for our listing cards, landing a 100% cache hit rate for listing data along with a nice drop in search page server side times.

Our old memcached boxes had 1Gb networking and 48GB of memory. We traded them out for new machines with 10Gb networking, 128GB of memory, faster clocks, and a newer architecture. After the switch we saw a drop in memcached latency. This resulted in a visible drop in both the median and 95th percentile back-end times of our pages, especially during our peak traffic times. This graph compares the profile page’s median and 95th percentile server side times on the day after the new hardware was installed and on the same day a week before the switch:


All of the pages in this report saw a similar speedup.

At this point, you may be wondering why the search page has seen such a small performance improvement since July if it was also affected by the new memcached machines. In fact, the search page saw about the same speedup from the new boxes as the rest of the pages did, but this speedup was balanced out by other changes to the page throughout the past three months. A lot of work has been going on with the search page recently, and it has proven hard to track down any specific changes that have caused regressions in performance. All told, in the past three months we’ve seen a small improvement in the search page’s performance and many changes and improvements to the page from the Search team. Since this is one of our faster pages, we’re happy with a small performance improvement along with the progress we’ve made towards our business goals in the last quarter.

Synthetic Front-End Performance – From Jonathan Klein

As a reminder, these tests are run with Catchpoint. They use IE9, and they run from New York, London, Chicago, Seattle, and Miami every 30 minutes (so 240 runs per page per day). The “Webpage Response” metric is defined as the time it took from the request being issued to receiving the last byte of the final element on the page. These numbers are medians, and here is the data for October 15th, 2014.


Unfortunately load time increased across all pages, with a larger increase on the search page. The overall increase is due to some additional assets that are served on every page. Specifically we added some CSS/JS for header/footer changes that we are experimenting with, as well as some new third party assets.

The larger increase on the search page is due to a redesign of that page, which increased the overall amount of CSS/JS on the page significantly. This pushed up both start render (extra CSS) and Webpage Response (additional page weight, more JS, etc.).

Seeing an increase of 100ms across the board isn’t a huge cause for concern, but the large increase on search warrants a closer look. Our search team is looking into ways to retain the new functionality while reducing the amount of CSS/JS that we have to serve.

Real User Front-End Performance – From Natalya Hoota

So, what are our users experiencing? Let us look at the RUM (Real User Monitoring) data collected by mPulse. First, median and 95th percentile total page load time, measured in seconds.


As you see, real user data in Q3 showed an overall, and quite significant, performance regression on all pages.

Page load time can be viewed as a sum of back-end time (time until the browser receives the first byte of the response) and front-end time (time from the first byte until the page is finished rendering). Since we had a significant discrepancy between our RUM and synthetic data, we need a more detailed analysis.

Back-End Time – RUM Data

Our RUM metrics for server-side performance, defined as time to the first byte in mPulse, did not show a significant change since Q2, which matched closely with both our server-side performance analysis and synthetic data. The differences are small enough to be almost within rounding error, so there was effectively no regression.


Front-End Time – RUM Data

While synthetic data showed only a slight increase in load time, our real user numbers showed this regression amplified. It could be a number of things: internal changes such as our experiments with the header and footer, an increased number of assets and their handling on the client side, or external factors. To figure out the reasons behind our RUM numbers, we asked ourselves a number of questions:

Q. Was our Q3 RUM data significantly larger than Q2 due to an increase in site traffic?
A. No. We found no immediate relation between the performance degradation and a change in beacon volume. We double-checked against our internally recorded traffic patterns and found that the beacon count is consistent with them, so we crossed this hypothesis off the list.

Q. Has our user base breakdown changed?
A. Not according to the RUM data. We broke the dataset down by region and operating system and found that our traffic patterns did not change geographically. If anything, we had more RUM signals coming from desktop in Q3 than in Q2.

What remains is taking a closer look at internal factors.

One clue was our profile page metrics. From the design perspective, the only change on that page was the global header/footer experiments. For the profile page alone, those resulted in a 100% and 45% increase in the number of CSS files and JS assets, respectively. The profile page also suffered one of the highest increases in median and 95th percentile load time.

We already know that the global header and footer experiments, along with the additional CSS assets served, affected all pages to some degree. It is possible that on other pages the change was balanced out by architecture and design improvements, while on the profile page we are seeing the isolated impact of this change.

To prove or dismiss this, we’ve learned that we need better analysis tooling than we currently have. Our synthetic tests run only on signed-out pages, so it would be more accurate to compare them to a similar RUM data set. However, we are currently unable to filter signed-in and signed-out RUM user data per page. We plan to add this capability early in 2015, which will give us the ability to better connect a particular performance regression to the page experiment that caused it.

Lastly, for much of the quarter we were experiencing a persistent issue where assets failed to be served in a gzipped format from one of our CDNs. This amplified the impact of the asset growth, causing a significant performance regression in the front-end timing. This issue has been resolved, but it was present on the day when the data for this report was pulled.

Conclusion – From Jonathan Klein

The summary of this report is “faster back-end, slower front-end”. As expected the memcached upgrade was a boon for server side numbers, and the continued trend of larger web pages and more requests didn’t miss Etsy this quarter. These results are unsurprising – server hardware continues to get faster and we can reap the benefits of these upgrades. On the front-end, we have to keep pushing the envelope from a feature and design point of view, and our users respond positively to these changes, even if our pages don’t respond quite as quickly. We also learned this quarter that it’s time to build better front-end performance monitoring and data analysis tools for our site, and we’ll be focusing on this early in 2015.

The continuing challenge for us is to deliver the feature rich, beautiful experiences that our users enjoy while still keeping load time in check. We have a few projects in the new year that will hopefully help here: testing HTTP/2, a more nuanced approach to responsive images, and some CSS/JS/font cleanup.