Q3 2014 Site Performance Report

Posted by Allison McKnight, Natalya Hoota and Jonathan Klein on December 22, 2014 / 6 Comments

We’re well into the fourth quarter of 2014, and it’s time to update you on how we did in Q3. This is either a really late performance report or an early Christmas present – you decide! The short version is that server side performance improved due to infrastructure changes, while front-end performance got worse because of third party content and increased amounts of CSS/JS across the site. In most cases the additional CSS/JS came from redesigns and new features that were part of winning variants that we tested on site. We’ve also learned that we need some better front-end monitoring and analysis tools to detect regressions and figure out what caused them.

We are also trying something new for this post – the entire performance team is involved in writing it! Allison McKnight is authoring the server side performance section, Natalya Hoota is updating us on the real user monitoring numbers, and Jonathan Klein is doing synthetic monitoring, this introduction, and the conclusion. Let’s get right to it – enter Allison:

Server Side Performance – From Allison McKnight

Here are median and 95th percentile times for signed-in users on October 22nd:



The homepage saw a significant jump in median load times as well as an increase in 95th percentile load times. This is due to our launch of the new signed-in homepage in September. The new homepage shows the signed-in user’s activity feed and requires more back-end time to gather that data. While the new personalization on the homepage did create a performance regression, the redesign drove a significant increase in business and engagement metrics. The redesign also features items from a more diverse set of shops on the homepage and displays content that is more relevant to our users. The new version of the homepage is a better user experience, and we feel that this justifies the increase in load time. We are planning to revisit the homepage to further tune its performance in the beginning of 2015.

The rest of the pages saw a nice decrease in both median and 95th percentile back-end times. This is due to a rollout of some new hardware – we upgraded our production memcached boxes.

We use memcached for a lot of things at Etsy. In addition to caching the results of queries to the database, we use memcached to cache things like a user’s searches or activity feed. If you read our last site performance report, you may remember that in June we rolled out a dedicated memcached cluster for our listing cards, landing a 100% cache hit rate for listing data along with a nice drop in search page server side times.

Our old memcached boxes had 1Gb networking and 48GB of memory. We swapped them out for new machines with 10Gb networking, 128GB of memory, faster clocks, and a newer architecture. After the switch we saw a drop in memcached latency, which resulted in a visible drop in both the median and 95th percentile back-end times of our pages, especially during peak traffic. This graph compares the profile page’s median and 95th percentile server side times on the day after the new hardware was installed and on the same day a week before the switch:


All of the pages in this report saw a similar speedup.

At this point, you may be wondering why the search page has seen such a small performance improvement since July if it was also affected by the new memcached machines. In fact, the search page saw about the same speedup from the new boxes as the rest of the pages did, but this speedup was balanced out by other changes to the page throughout the past three months. A lot of work has been going on with the search page recently, and it has proven hard to track down any specific changes that have caused regressions in performance. All told, in the past three months we’ve seen a small improvement in the search page’s performance and many changes and improvements to the page from the Search team. Since this is one of our faster pages, we’re happy with a small performance improvement along with the progress we’ve made towards our business goals in the last quarter.

Synthetic Front-End Performance – From Jonathan Klein

As a reminder, these tests are run with Catchpoint. They use IE9, and they run from New York, London, Chicago, Seattle, and Miami every 30 minutes (so 240 runs per page per day). The “Webpage Response” metric is defined as the time it took from the request being issued to receiving the last byte of the final element on the page. These numbers are medians, and here is the data for October 15th, 2014.


Unfortunately load time increased across all pages, with a larger increase on the search page. The overall increase is due to some additional assets that are served on every page. Specifically we added some CSS/JS for header/footer changes that we are experimenting with, as well as some new third party assets.

The larger increase on the search page is due to a redesign of that page, which increased the overall amount of CSS/JS on the page significantly. This pushed up both start render (extra CSS) and Webpage Response (additional page weight, more JS, etc.).

Seeing an increase of 100ms across the board isn’t a huge cause for concern, but the large increase on search warrants a closer look. Our search team is looking into ways to retain the new functionality while reducing the amount of CSS/JS that we have to serve.

Real User Front-End Performance – From Natalya Hoota

So, what are our users experiencing? Let us look at the RUM (Real User Monitoring) data collected by mPulse. First, median and 95th percentile total page load time, measured in seconds.


As you can see, real user data for Q3 showed a significant performance regression across all pages.

Page load time can be viewed as a sum of back-end time (time until the browser receives the first byte of the response) and front-end time (time from the first byte until the page is finished rendering). Since we had a significant discrepancy between our RUM and synthetic data, we need a more detailed analysis.
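As an illustration of that decomposition, here is a small sketch that splits hypothetical beacon timings into back-end and front-end components and summarizes them (the field names and numbers are made up, not our mPulse schema):

```python
import statistics

# Hypothetical RUM beacons: time-to-first-byte and total load time, in ms.
beacons = [
    {"ttfb": 420, "load": 2300},
    {"ttfb": 510, "load": 3100},
    {"ttfb": 380, "load": 1900},
    {"ttfb": 650, "load": 4800},
    {"ttfb": 400, "load": 2200},
]

def p95(values):
    # Nearest-rank 95th percentile.
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]

backend = [b["ttfb"] for b in beacons]               # request -> first byte
frontend = [b["load"] - b["ttfb"] for b in beacons]  # first byte -> rendered

print("back-end median:", statistics.median(backend))
print("front-end median:", statistics.median(frontend))
print("front-end p95:", p95(frontend))
```

Splitting the two components this way is what lets us compare RUM back-end numbers against the server-side report, and RUM front-end numbers against the synthetic tests.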

Back-End Time – RUM Data

Our RUM metrics for server-side performance, defined as time to the first byte in mPulse, did not show a significant change since Q2, which matched closely with both our server-side performance analysis and synthetic data. The differences are small enough to be almost within rounding error, so there was effectively no regression.


Front-End Time – RUM Data

If synthetic data showed a slight increase in load time, our real user numbers showed this regression amplified. It could be a number of things: internal changes such as our experiments with the header and footer, an increased number of assets and their handling on the client side, or external factors. To figure out the reasons behind our RUM numbers, we asked ourselves a number of questions:

Q. Was our Q3 RUM data significantly larger than Q2 due to an increase in site traffic?
A. No. We found no immediate relation between the performance degradation and any change in beacon volume. We double-checked against our internally recorded traffic patterns and found that the beacon count was consistent with them, so we crossed this hypothesis off the list.

Q. Has our user base breakdown changed?
A. Not according to the RUM data. We broke the dataset down by region and operating system and found that our traffic patterns did not change geographically, although we did see more RUM signals coming from desktop in Q3 than in Q2.

What remains is taking a closer look at internal factors.

One clue was our profile page metrics. From a design perspective, the only change on that page was the global header/footer experiments. For the profile page alone, those resulted in a 100% and 45% increase in the number of CSS files and JS assets on the page, respectively. The profile page also suffered one of the largest increases in median and 95th percentile load time.

We already know that the global header and footer experiments, along with the additional CSS assets served, affected all pages to some degree. It is possible that on other pages the change was balanced out by architecture and design improvements, while on the profile page we are seeing the isolated impact of this change.

To prove or dismiss this, we’ve learned that we need better analysis tooling than we currently have. Our synthetic tests run only on signed-out pages, so it would be more accurate to compare them to a similar RUM data set. However, we are currently unable to filter RUM data by signed-in and signed-out users per page. We plan to add this capability early in 2015, which will let us better connect a particular performance regression to the page experiment that caused it.

Lastly, for much of the quarter we were experiencing a persistent issue where assets failed to be served in a gzipped format from one of our CDNs. This amplified the impact of the asset growth, causing a significant performance regression in the front-end timing. This issue has been resolved, but it was present on the day when the data for this report was pulled.

Conclusion – From Jonathan Klein

The summary of this report is “faster back-end, slower front-end”. As expected the memcached upgrade was a boon for server side numbers, and the continued trend of larger web pages and more requests didn’t miss Etsy this quarter. These results are unsurprising – server hardware continues to get faster and we can reap the benefits of these upgrades. On the front-end, we have to keep pushing the envelope from a feature and design point of view, and our users respond positively to these changes, even if our pages don’t respond quite as quickly. We also learned this quarter that it’s time to build better front-end performance monitoring and data analysis tools for our site, and we’ll be focusing on this early in 2015.

The continuing challenge for us is to deliver the feature rich, beautiful experiences that our users enjoy while still keeping load time in check. We have a few projects in the new year that will hopefully help here: testing HTTP/2, a more nuanced approach to responsive images, and some CSS/JS/font cleanup.


We Invite Everyone at Etsy to Do an Engineering Rotation: Here’s why

Posted by Dan on December 22, 2014 / 16 Comments

At Etsy, it’s not just engineers who write and deploy code – our designers and product managers regularly do too. And now any Etsy employee can sign up for an “engineering rotation” to get a crash course in how Etsy codes, and ultimately work with an engineer to write and deploy the code that adds their photo to our about page. In the past year, 70 employees have completed engineering rotations. Our engineers have been pushing on day one for a while now, but it took a bit more work to get non-coders prepared to push as soon as their second week. In this post I’ll explain why we started engineering rotations and what an entire rotation entails.

What are rotations and why are they important?

Since 2010, Etsy employees have participated in “support rotations” every quarter, where they spend about two hours replying to support requests from our members. Even our CEO participates. What started as a way to help our Member Operations team during their busiest time of year has evolved into a program that facilitates cross-team communication, builds company-wide empathy, and provides no shortage of user insights or fun!

This got us thinking about starting up engineering rotations, where people outside of the engineering organization spend some time learning how the team works and doing some engineering tasks. Armed with our excellent continuous deployment tools, we put together a program that could have an employee with no technical knowledge deploying code live to the website in three hours. This includes time spent training and the deploy itself.

The Engineering Rotation Program

The program is split into three parts: homework, an in-person class, and a hands-on deployment. The code that participants change and deploy adds their photos to the Etsy about page. It’s a nice visual payoff, and lets new hires publicly declare themselves part of the Etsy team.

Before class begins, we assign homework in order to prepare participants for the code change they’ll deploy. We ask them to complete interactive tutorials, including HTML levels 1 and 2 on Codecademy and our in-house Unix Command Line 101. We also ask them to read Marc Cohen’s excellent article “How the Web Works – In One Easy Lesson,” followed by a blog post on Etsy.com discussing how the engineering organization deals with outages. These resources help familiarize each participant with the technologies they’ll work with, and introduce them to some of Etsy’s core engineering tenets, such as blameless post-mortems.


Next up is a class for all participants. It has five sections. The first section picks up where How The Web Works left off and explains how Etsy works. Then we introduce the standard three-tier architecture and walk through some example requests: viewing, creating, searching for and purchasing a listing. Next we take a deep-dive into database sharding. We explain what it is, why it’s necessary, why we shard by data owner and how we rebalance our shards. We then explain Content Delivery Networks and why we use them. After that, we move away from the hard technical discussion to talk about continuous deployment. We discuss the philosophy behind it, and describe why it’s safe to change the website fifty times per day and how we ensure that each change does exactly what we expect. We wrap up this session by giving an overview of all the engineering teams at Etsy and their responsibilities.

At this point we pair each participant with an engineer who will guide them through the process of making and testing the code change, and ultimately pressing the big green Deploy to Production button. These one-on-one sessions can take up to two hours as the pair discuss the different tools that exist at each step – some as simple as IRC or as complex as our Continuous Integration cluster. As the participant begins the process of deploying their code change, they’ll see their name appear atop our dashboards.

What have we learned?

The benefits of engineering rotations parallel those of our support rotations in many ways. It’s an opportunity for Admin throughout Etsy to work with people they normally wouldn’t, and to learn more about each other personally and professionally. To an outsider, the more technical aspects of Etsy might feel unapproachable – even a bit mysterious – but demystifying them encourages even more collaboration. Here’s what some of the participants of Etsy’s Engineering Rotations have said:

“Understanding the work of your colleagues breeds empathy and it’s been great having a better understanding of what working at Etsy means to others.”

“The best part about this program is that it pulls back the curtain on how software development works at Etsy. For people who don’t work with code every day, this can appear to be some sort of magic, which it’s not – it’s just a different kind of work, with different kinds of tools. Without this program, we would miss out on a huge opportunity for different groups to empathize with each other, which I think is crucial for a company to feel like a real team.”

“The internet goes under the sea in a cable. Whoa.”

Participants in both the engineering and support rotations come away with many lessons beyond the curriculum. More than a few times, support rotations have exposed engineers to parts of the site that generate lots of inquiries, and they were able to fix them immediately. And in one engineering rotation, someone pointed out that a lot of IRC tools we’ve built can’t be used by all employees because they don’t have access to our internal code snippet service. So we’re now looking at how we can give everyone that access. I led one session with our International Counsel, and we ended up having a fascinating discussion about the legality of deleting data. That sprang from my explanation of how we do database migrations!

But perhaps the biggest thing we’ve learned from the engineering rotations is that everyone involved likes doing them. They get to meet new people, learn new things, and use a tool called “the Deployinator.” What’s not to like?

You can follow Dan on Twitter @jazzdan.


Make Performance Part of Your Workflow

Posted on December 11, 2014 / 1 Comment

The following is an excerpt from Chapter 7, “Weighing Aesthetics and Performance”, of Designing for Performance by Lara Callender Hogan (Etsy’s Senior Engineering Manager of Performance), which has just been released by O’Reilly.

One way to minimize the operational cost of performance work is to incorporate it into your daily workflow by implementing tools and developing a routine of benchmarking performance.

There are a variety of tools mentioned throughout this book that you can incorporate into your daily development workflow.

By making performance work part of your daily routine and automating as much as possible, you’ll be able to minimize the operational costs of this work over time. Your familiarity with tools will increase, the habits you create will allow you to optimize even faster, and you’ll have more time to work on new things and teach others how to do performance right.

Your long-term routine should include performance as well. Continually benchmark improvements and any resulting performance gains as part of your project cycle so you can defend the cost of performance work in the future. Find opportunities to repurpose existing design patterns and document them. As your users grow up, so does modern browser technology; routinely check in on your browser-specific stylesheets, hacks, and other outdated techniques to see what you can clean up. All of this work will minimize the operational costs of performance work over time and allow you to find more ways to balance aesthetics and performance.

Approach New Designs with a Performance Budget

One key to making decisions when weighing aesthetics and page speed is understanding what wiggle room you have. By creating a performance budget early on, you can make performance sacrifices in one area of a page and make up for them in another. In Table 7-3 I’ve illustrated a few measurable performance goals for a site.

TABLE 7-3. Example performance budget

Measure                Goal       Tool                                                   Applies to
Total page load time   2 seconds  WebPagetest, median from five runs on 3G               All pages
Total page load time   2 seconds  Real user monitoring tool, median across geographies   All pages
Total page weight      800 KB     WebPagetest                                            All pages
Speed Index            1,000      WebPagetest using Dulles location in Chrome on 3G      All pages except home page
Speed Index            600        WebPagetest using Dulles location in Chrome on 3G      Home page

You can favor aesthetics in one area and favor performance in another by defining your budget up front. That way, it’s not always about making choices that favor page speed; you have an opportunity to favor more complex graphics, for example, if you can find page speed wins elsewhere that keep you within your budget. You can add a few more font weights because you found equivalent savings by removing some image requests. You can negotiate killing a marketing tracking script in order to add a better hero image. By routinely measuring how your site performs against your goals, you can continue to find that balance.
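Checking measured numbers against a budget like Table 7-3 is easy to automate; the sketch below uses made-up metric names and values rather than anything from the book:

```python
# Performance budget limits (hypothetical metric names).
budget = {
    "total_load_time_s": 2.0,
    "total_page_weight_kb": 800,
    "speed_index": 1000,
}

# Latest measurements for a page, e.g. pulled from a WebPagetest run.
measured = {
    "total_load_time_s": 1.7,
    "total_page_weight_kb": 845,
    "speed_index": 980,
}

def over_budget(budget, measured):
    # Return each metric whose measured value exceeds its budgeted limit.
    return {metric: (measured[metric], limit)
            for metric, limit in budget.items()
            if measured[metric] > limit}

for metric, (value, limit) in over_budget(budget, measured).items():
    print(f"{metric}: {value} exceeds budget of {limit}")
```

Wiring a check like this into a build or deploy pipeline turns the budget from a document into something that flags regressions as they happen.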

To decide on what your performance goals will be, you can conduct a competitive analysis. See how your competitors are performing and make sure your budget is well below their results. You can also use industry standards for your budget: aim for two seconds or less total page time, as you know that’s how fast users expect sites to load.

Iterate upon your budget as you start getting better at performance and as industry standards change. Continue to push yourself and your team to make the site even faster. If you have a responsively designed site, determine a budget for your breakpoints as well, like we did in Chapter 5.

Your outlined performance goals should always be measurable. Be sure to detail the specific number to beat, the tool you’ll use to measure it, as well as any details of what or whom you’re measuring. Read more about how to measure performance in Chapter 6, and make it easy for anyone on your team to learn about this budget and measure his or her work against it.

Designing for Performance by Lara Callender Hogan
ISBN 978-1-4919-0251-6
Copyright 2014 O’Reilly Media, Inc. All rights reserved. Used with permission.


Juggling Multiple Elasticsearch Instances on a Single Host

Posted by Shikhar on December 4, 2014 / 5 Comments

Elasticsearch is a distributed search engine built on top of Apache Lucene. At Etsy we use Elasticsearch in a number of different configurations: for Logstash, powering user-facing search on some large indexes, some analytics usage, and many internal applications.

Typically, it is assumed that there is a 1:1 relationship between ES instances and machines. This is straightforward and makes sense if your instance requirements line up well with the host – whether physical, virtualized or containerized. We run our clusters on bare metal, and for some of them we have more ES instances than physical hosts. We have good reasons for doing this, and here I’ll share some of the rationale, and the configuration options that we’ve found to be worth tuning.


Managing JVMs with large heaps is scary business due to garbage collection pause times. 31GB is the magic threshold above which you lose the ability to use CompressedOops. In our experience, it is better to have even smaller heaps: not only do GC pause times stay low, but it’s easier to capture and analyze heap dumps!

To get optimal Lucene performance, it is also important to have sufficient RAM available for OS file caching of the index files.

At the same time, we are running on server-class hardware with plenty of CPU cores and RAM. Our newest search machines are Ivy Bridge with 20 physical (40 virtual) cores, 128GB of RAM, and of course SSDs. If we ran a single node with a small heap on this hardware, we would be wasting both CPU and RAM, and the size of the shards such an instance could support would also be smaller.

We currently run 4 ES JVMs per machine with 8GB of heap each. This works out great for us: GC has not been a concern and we are utilizing our hardware effectively.

The settings


Elasticsearch uses the processors setting to configure thread pool and queue sizes. It defaults to Runtime.getRuntime().availableProcessors(). With multiple instances per host, it is better to spread the CPU resources across them.

We set this to ($(nproc) / $nodes_per_host). So if we are running 4 nodes per host on 40-core machines, each of them will configure thread pools and queues as if there were 10 cores.


By default, Elasticsearch picks a random Marvel comic character as the node.name at startup. In production, we want something that lets us find the node we want with as little thought and effort as possible, so we set this to $hostname-$nodeId (which results in names like “search64-es0” – less whimsical, but far more practical when you’re trying to get to the bottom of an issue).

http.port, transport.tcp.port

If these ports are not specified, ES tries to pick the next available port at startup, starting at a base of 9200 for HTTP and 9300 for its internal transport.

We prefer to be explicit and assign ports as $basePort+$nodeIdx from the startup script. This prevents surprises such as an instance that you expect to be down still being bound to its port, causing the ‘next available’ one to be higher than expected.


A key way to achieve failure tolerance with ES is to use replicas, so that if one host goes down, the affected shards stay available. If you’re running multiple instances on each physical host, it’s entirely possible for all replicas of a shard to be automatically allocated to the same host, which isn’t going to help you! Thankfully this is avoidable with shard allocation awareness: you can set the hostname as a node attribute on each instance and use that attribute as a factor in shard assignments.

ES_JAVA_OPTS="$ES_JAVA_OPTS -Des.cluster.routing.allocation.awareness.attributes=host"


Without a dedicated log directory (path.logs) for each instance, you would end up with multiple JVMs trying to write to the same log files.

An alternative, which we rely on, is to prefix the filenames in logging.yml with the property ${node.name} so that each node’s logs are labelled by host and node ID. Another reason to be explicit about node naming!
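Pulling these settings together, the configuration for one instance might look something like the following hypothetical sketch (host names, paths, and values are illustrative; in practice the startup script derives them per node):

```yaml
# Hypothetical elasticsearch.yml for node 0 of 4 on a 40-core host "search64".
node.name: search64-es0
processors: 10                    # 40 cores / 4 nodes per host
http.port: 9200                   # $basePort + $nodeIdx
transport.tcp.port: 9300
node.host: search64               # custom node attribute: the physical hostname
cluster.routing.allocation.awareness.attributes: host
path.logs: /var/log/elasticsearch/search64-es0
```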

Minimum Viable Configuration

Elasticsearch has lots of knobs, and it’s worth trying to minimize configuration. That said, a lot of ES is optimized for cloud environments so we occasionally find things worth adjusting, like allocation and recovery throttling. What do you end up tweaking?

You can follow Shikhar on Twitter at @shikhrr


Personalized Recommendations at Etsy

Posted on November 17, 2014 / 4 Comments

Providing personalized recommendations is important to our online marketplace.  It benefits both buyers and sellers: buyers are shown interesting products that they might not have found on their own, and products get more exposure beyond the seller’s own marketing efforts.  In this post we review some of the methods we use for making recommendations at Etsy.  The MapReduce implementations of all these methods are now included in our open-source machine learning package “Conjecture” which was described in a previous post.

Computing recommendations basically consists of two stages. In the first stage we build a model of users’ interests based on a matrix of historic data, for example their past purchases or their favorite listings (those unfamiliar with matrices and linear algebra may want to see, e.g., this review). The models provide vector representations of users and items, and their inner products give an estimate of the level of interest a user will have in an item (higher values denote a greater degree of estimated interest). In the second stage, we compute recommendations by finding, for each user, a set of items which approximately maximizes this interest estimate.

The model of users and items can be also used in other ways, such as finding users with similar interests, items which are similar from a “taste” perspective, items which complement each other and could be purchased together, etc.

Matrix Factorization

The first stage in producing recommendations is to fit a model of users and items to the data. At Etsy, we deal with “implicit feedback” data, where we observe only indicators of users’ interactions with items (e.g., favorites or purchases). This is in contrast to “explicit feedback,” where users give ratings (e.g., 3 of 5 stars) to items they’ve experienced. We represent this implicit feedback data as a binary matrix whose elements are one where the user liked the item (i.e., favorited it) and zero where they did not. The zeros do not necessarily indicate that the user is not interested in an item, only that they have not expressed an interest so far. This may be due to disinterest or indifference, or simply to the user not yet having seen that item while browsing.



An implicit feedback dataset in which a set of users have “favorited” various items. Note that we do not observe explicit dislikes, only the presence or absence of favorites.


The underpinning assumption that matrix factorization models make is that the affinity between a user and an item is explained by a low-dimensional linear model. This means that each item and user corresponds to an unobserved real vector of some small dimension. The coordinates of the space correspond to latent features of the items (these could be things like: whether the item is clothing, whether it has chevrons, whether the background of the picture is brown, etc.), and the elements of the user vector describe the user’s preferences for these features. We may stack these vectors into matrices, one for users and one for items; the observed data is then, in theory, generated by taking the product of these two unknown matrices and adding noise:


The underpinning low dimensional model from which the observed implicit feedback data is generated, “d” is the dimension of the model.


We therefore find a vector representation for each user and each item.  We compute these vectors so that the inner product between a user vector and item vector will approximate the observed value in the implicit feedback matrix (i.e., it will be close to one in the case the user favorited that item and close to zero if they didn’t).


The results of fitting a two-dimensional model to the above dataset. In this small example, the first discovered feature roughly corresponds to whether the item is a shelf, and the second to whether it is in a “geometric” style.


Since the zeros in the matrix do not necessarily indicate disinterest in the item, we don’t want to force the model to fit them, since the user may actually be interested in some of those items. Therefore we find the decomposition which minimizes a weighted error function, where the weights for the nonzero entries of the data matrix are higher than those for the zero entries, following the paper which suggested this method. How to set these weights depends on how sparse the matrix is; good values can be found through some form of cross-validation.

When we optimize the weighted loss function described above, the reconstructed matrix (the product of the two factors) will often have positive elements where the input matrix has zeros, since we don’t force the model to fit these as well as the non-zeros. These are the items which the user may be interested in but has not seen yet. This happens because, in order for the model to fit well, users who have shown interest in overlapping sets of items will have similar vectors, and likewise for items. Therefore the unexplored items which are liked by other users with similar interests will often have a high value in the reconstructed matrix.

Alternating Least Squares

To optimize the model, we alternate between computing the item matrix and the user matrix, at each stage minimizing the weighted squared error while holding the other matrix fixed (hence the name “alternating least squares”). At each stage we can compute the exact minimizer of the weighted squared error, since an analytic solution is available. This means that each iteration is guaranteed not to increase the total error, and to decrease it unless the two matrices already constitute a local minimum of the error function. Therefore the entire procedure gradually decreases the error until a local minimum is reached. The quality of these minima can vary, so it may be reasonable to repeat the procedure and select the best result, although we do not do this. A demo of this method in R is available here.
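To make the alternating updates concrete, here is a small NumPy sketch of weighted ALS on a toy favorites matrix (the weights, dimension, and data are illustrative; our production implementation is the MapReduce code in Conjecture):

```python
import numpy as np

def weighted_als(P, w_pos=10.0, w_neg=1.0, d=2, reg=0.1, iters=20, seed=0):
    """Weighted ALS on a binary implicit-feedback matrix P (users x items).
    Nonzero entries get weight w_pos, zeros get w_neg, so the model is not
    forced to fit unobserved pairs as closely as observed favorites."""
    rng = np.random.default_rng(seed)
    n_users, n_items = P.shape
    X = rng.normal(scale=0.1, size=(n_users, d))  # user vectors
    Y = rng.normal(scale=0.1, size=(n_items, d))  # item vectors
    W = np.where(P > 0, w_pos, w_neg)             # per-entry weights

    def update(A, B, Wm, Pm):
        # For each row u of A, solve the weighted least squares problem:
        # minimize sum_i Wm[u,i] * (Pm[u,i] - A[u] @ B[i])^2 + reg*|A[u]|^2
        for u in range(A.shape[0]):
            Wu = Wm[u]
            M = (B * Wu[:, None]).T @ B + reg * np.eye(d)
            A[u] = np.linalg.solve(M, B.T @ (Wu * Pm[u]))

    for _ in range(iters):
        update(X, Y, W, P)        # update users, items held fixed
        update(Y, X, W.T, P.T)    # update items, users held fixed
    return X, Y

# Toy favorites matrix: 3 users x 4 items.
P = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)
X, Y = weighted_als(P)
scores = X @ Y.T   # reconstructed interest estimates
```

After fitting, entries of `scores` that are large where P is zero correspond to the unexplored items the user may be interested in.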

This computation lends itself very naturally to implementation in MapReduce since, e.g., when updating the vector for a user, all that is needed are the vectors for the items they have interacted with, and the small square matrix formed by multiplying the item matrix by its own transpose. This way the computation for each user can typically be done even with limited memory, and each user may be updated in parallel; likewise for updating items. Some users favorite huge numbers of items, and some items are favorited by many users, and those computations require more memory. In these cases we can sub-sample the input matrix, either by filtering out these items or by taking only the most recent favorites for each user.

After we are satisfied with the model, we can continue to update it as we observe more information, by repeating a few steps of the alternating least squares every night, as more items, users, and favorites come online.  New items and users can be folded into the model easily, so long as there are sufficiently many interactions between them and existing users and items in the model respectively.  Productionizable MapReduce code for this method is available here.

Stochastic SVD

The alternating least squares described above gives us an easy way to factorize the matrix of user preferences in MapReduce. However, it has the disadvantage of requiring several iterations, and can take a long time to converge to a quality solution. An attractive alternative is the stochastic SVD.  This is a recent method which approximates the well-known singular value decomposition of a large matrix, and which admits a non-iterative MapReduce implementation.  We implement this as a function which can be called from any Scalding Hadoop MapReduce job.

A fundamental result in linear algebra is that the matrix formed by truncating the singular value decomposition after some number of dimensions is the best approximation to the original matrix (in terms of squared error) among all matrices of that rank.  Note, however, that with this method we cannot apply the same “weighting” to the error as we did when optimizing via alternating least squares.  Nevertheless, for datasets where the zeros do not completely overwhelm the non-zeros, the method is viable.  For example, we use it to build a model from the favorites, whereas it fails to provide a useful model from purchases, which are much more sparse and where this weighting is necessary.
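For intuition, here is a compact single-machine numpy sketch of the randomized approximation (a simplified stand-in for our MapReduce implementation; the oversampling amount is an arbitrary choice):

```python
import numpy as np

def stochastic_svd(A, k, oversample=10, seed=0):
    """Approximate rank-k SVD via a random sketch of A's column space.

    One big multiply to sketch the range, an orthonormalization, and then
    an exact SVD of a small (k + oversample) x n matrix.
    """
    rng = np.random.default_rng(seed)
    Omega = rng.normal(size=(A.shape[1], k + oversample))
    Yk = A @ Omega                        # sketch the column space of A
    Q, _ = np.linalg.qr(Yk)               # orthonormal basis for the sketch
    B = Q.T @ A                           # small matrix carrying A's action
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]
```

The two large products here (A·Ω and Qᵀ·A) each map naturally onto a MapReduce pass, which is what makes the method non-iterative at scale.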

An advantage of this method is that it produces matrices with a nice orthonormal structure, which makes it easy to construct the vectors for new users on the fly (outside of a nightly recomputation of the whole model), since no matrix inversions are required.  We also use this method to produce vector representations of other lists of items besides those a user favorited, for example treasuries and other user curated lists on Etsy.  This way we may suggest other relevant items for those lists.
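Because the item factors are orthonormal, building the vector for a new user or list really is just a projection.  A small numpy sketch (names are illustrative):

```python
import numpy as np

def fold_in(fav_item_ids, V):
    """Represent a new user (or a list such as a treasury) in factor space.

    With orthonormal item factors V from the SVD, the least-squares vector
    for a binary preference vector p is simply V^T p: a sum of the favorited
    items' rows, with no matrix inversion required.
    """
    p = np.zeros(V.shape[0])
    p[fav_item_ids] = 1.0
    return V.T @ p
```

This is what lets us compute vectors on the fly, outside the nightly recomputation of the whole model.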

Producing Recommendations

Once we have a model of users and items we use it to build product recommendations.  This is a step which seems to be mostly overlooked in the research literature.  For example, we cannot hope to compute the product of the user and item matrices, and then find the best unexplored items for each user, since this requires time proportional to the product of the number of items and the number of users, both of which are in the hundreds of millions.

One research paper suggests using a tree data structure to allow for a non-exhaustive search of the space, by pruning away entire sets of items where the inner products would be too small.  However, we observed that this method does not work well in practice, possibly due to the curse of dimensionality with the type of models we were using (with hundreds of dimensions).

Therefore we use approximate methods to compute the recommendations.  The idea is to first produce a candidate set of items, then rank them according to their inner products with the user’s vector, and take the highest ones.  There are a few ways to produce candidates: for example, the listings from a user’s favorite shops, or those textually similar to his existing favorites.  The main method we use, however, is “locality sensitive hashing” (LSH), where we divide the space of user and item vectors into several hash bins, and then take for each user the set of items mapped to the same bin.

Locality Sensitive Hashing

Locality sensitive hashing is a technique used to find approximate nearest neighbors in large datasets.  There are several variants, but we focus on one designed to handle real-valued data and to approximate the nearest neighbors in the Euclidean distance.

The idea of the method is to partition the space into a set of hash buckets, so that points which are near to each other in space are likely to fall into the same bucket.  The way we do this is by constructing some number “p” of planes in the space so that they all pass through the origin.  This divides the space up into 2^p convex cones, each of which constitutes a hash bucket.

Practically, we implement this by representing the planes in terms of their normal vectors.  The side of the plane that a point falls on is then determined by the sign of the inner product between the point and the normal vector (if the planes are random then the inner products are non-zero almost surely; in principle, points lying exactly on a plane could be assigned arbitrarily to one side or the other).  To generate these normal vectors we just need directions distributed uniformly at random in space, and it is well known that draws from an isotropic Gaussian distribution have this property.

We number the hash buckets so that the i^th bit of the hash-code is 1 if the inner product between a point and the i^th plane is positive, and 0 otherwise.  This means that each plane is responsible for a bit of the hash code.

After we map each point to its respective hash bucket, we can compute approximate nearest neighbors, or equivalently, approximate recommendations, by examining only the vectors in the bucket.  On average the number of points in each bucket will be 2^{-p} times the total, so using more planes makes the procedure very efficient.  However, it also reduces the accuracy of the approximation, since it reduces the chance that the points near a target point will share its bucket.  Therefore, to achieve a good tradeoff between efficiency and quality, we repeat the hashing procedure multiple times and combine the outputs.  Finally, to keep the computational demands of the procedure under control, we throw away any hash bins which are too large to allow efficient computation of the nearest neighbors.  This is implemented in Conjecture here.
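Putting the pieces together, here is a small numpy sketch of the hashing step (the production version is the Conjecture code linked above; the number of planes here is an arbitrary choice):

```python
import numpy as np

def lsh_codes(X, p, seed=0):
    """Hash each row of X into one of 2^p buckets using p random hyperplanes
    through the origin; normals drawn from an isotropic Gaussian are
    uniformly random directions."""
    rng = np.random.default_rng(seed)
    normals = rng.normal(size=(X.shape[1], p))
    bits = (X @ normals > 0).astype(int)    # which side of each plane
    return bits @ (1 << np.arange(p))       # plane i sets bit i of the code

def candidate_items(user_vec, item_vecs, p=8, seed=0):
    """Indices of items landing in the same bucket as the user."""
    codes = lsh_codes(np.vstack([user_vec, item_vecs]), p, seed)
    return np.flatnonzero(codes[1:] == codes[0])
```

One consequence of the buckets being cones through the origin is that rescaling a vector never changes its bucket; only its direction matters.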

Other Thoughts

Above are the basic techniques for generating personalized recommendations.  Over the course of developing these recommender systems, we found a few modifications we could make to improve the quality of the recommendations.


In summary we described how we can build recommender systems for e-commerce based on implicit feedback data.  We built a system which computes recommendations on Hadoop, which is now part of our open source machine learning package “Conjecture.”  Finally we shared some additional tweaks that can be made to potentially improve the quality of recommendations.


Building A Better Build: Our Transition From Ant To SBT

Posted by on September 30, 2014 / 7 Comments

A build tool is fundamental to any non-trivial software project.  A good build tool should be as unobtrusive as possible, letting you focus on the code instead of the mechanics of compilation.  At Etsy we had been using Ant to build our big data stack.  While Ant did handle building our projects adequately, it was a common source of questions and frustration for new and experienced users alike.  When we analyzed the problems users were having with the build process, we decided to replace Ant with SBT, as it was a better fit for our use cases.  In this post I’ll discuss the reasons we chose SBT as well as some of the details of the actual process of switching.

Why Did We Switch?

There were two perspectives we considered when choosing a replacement for Ant.  The first is that of a user of our big data stack.  The build tool should stay out of the way of these users as much as possible.  No one should ever feel it is preventing them from being productive, but instead that it is making them more productive.  SBT has a number of advantages in this regard:

  1. Built-in incremental compiler: We used the stand-alone incremental compiler zinc for our Ant build.  However, this required a custom Ant task, and both zinc and that task needed to be installed properly before users could start building.  This was a common source of questions for users new to our big data stack.  SBT ships with the incremental compiler and uses it automatically.
  2. Better environment handling: Because of the size of our code base, we need to tune certain JVM options when compiling it.  With Ant, these options had to be set in the ANT_OPTS environment variable.  Not having ANT_OPTS set properly was another common source of problems for users.  There is an existing community-supported SBT wrapper that solves this problem.  The JVM options we need are set in a .jvmopts file that is checked in with the code.
  3. Triggered execution: If you prefix any SBT command or sequence of commands with a tilde, it will automatically run that command every time a source file is modified.  This is a very powerful tool for getting immediate feedback on changes.  Users can compile code, run tests, or even launch a Scalding job automatically as they work.
  4. Build console: SBT provides an interactive console for running build commands.  Unlike running commands from the shell, this console supports tab-completing arguments to commands and allows temporarily modifying settings.

The other perspective is that of someone modifying the build.  It should be straightforward to make changes.  Furthermore, it should be easy to take advantage of existing extensions from the community.  SBT is also compelling in this regard:

  1. Build definition in Scala: The majority of the code for our big data stack is Scala.  With Ant, modifying the build requires a context switch to its XML-based definition.  An SBT build is defined using Scala, so no context switching is necessary.  Using Scala also provides much more power when defining build tasks.  We were able to replace several Ant tasks that invoked external shell scripts with pure Scala implementations.  Defining the build in Scala does introduce more opportunities for bugs, but you can use scripted to test parts of your build definition.
  2. Plugin system: To extend Ant, you have to give it a JAR file, either on the command line or by placing it in certain directories.  These JAR files then need to be made available to everyone using the build.  With SBT, all you need to do is add a line like
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

    to the build definition.  SBT will then automatically download that plugin and its dependencies.  By default SBT will download dependencies from Maven Central, but you can configure it to use only an internal repository:

    resolvers += "internal-repo" at "<repository URL>"
    externalResolvers <<= resolvers map { rs =>
        Resolver.withDefaultResolvers(rs, mavenCentral = false)
    }
  3. Ease of inspecting the build: SBT has an “inspect” command that will provide a lot of information about a setting or task: the type, any related settings or tasks, and forward and reverse dependencies.  This is invaluable when trying to understand how some part of the build works.
  4. Debugging support: Most SBT tasks produce verbose debug logs that are not normally displayed.  The “last” command will output those logs, making tracking down the source of an error much easier.

This is not to say that SBT is perfect.  There are some aspects of SBT that are less than ideal:

  1. Terse syntax: The SBT build definition DSL is full of hieroglyphic symbols, such as the various assignment operators: <<=, <+=, and <++=.  This terse syntax can be intimidating for those that are not used to it.
  2. Support for compiling a subset of files: The Scala compiler can be slow, so our Ant build supported compiling only a subset of files.  This was very easy to add in Ant.  SBT required a custom plugin that delved into SBT’s internal APIs.  This does work, but we have already had to do some minor refactoring between SBT 0.13.1 and 0.13.2, and we have to accept that such incompatibility is likely to recur between future minor versions.
  3. Less mature: SBT is a relatively new tool, and the plugin ecosystem is still growing.  As such, there were several times we had to write custom plugins for functionality that would likely already be supported by another tool.  We also experienced a couple of bugs in SBT itself during the switch.

At this point it’s fair to ask why SBT, and not Maven, Gradle, or some other tool.  You can certainly get features like these in other tools.  We did not do a detailed comparison of multiple tools to replace Ant.  SBT addressed our pain points with the Ant build, and it was already popular with users of our big data stack.

The Transition Process

It may sound cliché, but the initial hurdle in actually switching to SBT was gaining an in-depth understanding of the tool.  SBT’s terminology and model of a build are very different from Ant’s.  SBT’s documentation, however, is very good and very thorough.  Starting the switch without having read through it would have resulted in a lot of wasted time.  It’s a lot easier to find answers online when you know how SBT names things!

The primary goal when switching was to have the SBT build be a drop-in replacement for the Ant build.  This removed the need to modify any external processes that use the build artifacts.  It also allowed us to have the Ant and SBT builds in parallel for a short time in case of bugs in the SBT build.  Unfortunately, our project layout did not conform to SBT’s conventions — but SBT is fully configurable in this regard.  SBT’s “inspect” command was helpful here for discovering which settings to tweak.  It was also good that SBT supports using an external ivy.xml file for dependency management.  We were already using Ivy with our Ant build, so we were able to define our dependencies in one place while running both builds in parallel.

With the configuration to match our project layout taken care of, actually defining the build with SBT was straightforward.  Unlike Ant, SBT has built-in tasks for common operations, like compiling, running tests, and creating a JAR.  SBT’s multi-project build support also cut down on the amount of configuration necessary.  We have multiple projects that build together, and in Ant every single task had to be defined to build them in the correct order; once we configured the dependencies between the projects in SBT, it guaranteed the ordering automatically.

We also took advantage of SBT’s aliasing feature to make the switch easier on users.  The names of SBT’s built-in tasks did not always align with the names we had picked when defining the corresponding task in Ant.  It’s very easy to alias commands, however:

addCommandAlias("jar", "package")

With such aliases users were able to start using SBT just as they had used Ant, without needing to learn a whole set of new commands.  Aliases also make it easy to define sequences of commands without needing to create an entirely new task, such as

addCommandAlias("rebuild", ";clean; compile; package")

The process of switching was not entirely smooth, however.  We did run into two bugs in SBT itself.  The first is triggered when you define an alias that is the prefix of a task in the alias definition.  Tab-completing or executing that alias causes SBT to hang for some time and eventually die with a StackOverflowError.  This bug is easy to avoid if you define your alias names appropriately, and it is fixed in SBT 0.13.2.  The other bug only comes up with multi-module projects.  Even though our modules have many dependencies in common, SBT will re-resolve these dependencies for each module.  There is now a fix for this in SBT 0.13.6, the most recently released version, that can be enabled by adding

updateOptions := updateOptions.value.withConsolidatedResolution(true)

to your build definition.  We saw about a 25% decrease in time spent in dependency resolution as a result.

Custom Plugins

As previously mentioned, we had to write several custom plugins during the process of switching to SBT to reproduce all the functionality of our Ant build.  We are now open-sourcing two of these SBT plugins.  The first is sbt-checkstyle-plugin.  This plugin allows running Checkstyle over Java sources with the checkstyle task.  The second is sbt-compile-quick-plugin.  This plugin allows you to compile and package a single file with the compileQuick and packageQuick tasks, respectively.  Both of these plugins are available in Maven Central.


Switching build tools isn’t a trivial task, but it has paid off.  Switching to SBT has allowed us to address multiple pain points with our previous Ant build.  We’ve been using SBT for several weeks now.  As with any new tool, there was a need for some initial training to get everyone started.  Overall though, the switch has been a success!   The number of issues encountered with the build has dropped.  As users learn SBT they are taking advantage of features like triggered execution to increase their productivity.


Expanding Diversity Efforts with Hacker School

Posted by on September 25, 2014 / 2 Comments

Today we’re proud to announce that Etsy will provide $210,000 in Etsy Hacker Grants to Hacker School applicants in the coming year. These grants extend our support of Hacker School’s diversity initiatives, which first focused on the gender imbalance of applicants to their program and in the wider industry, and will now expand to support applicants with racial backgrounds that are underrepresented in software engineering.

The grants will provide up to $7,000 in support for at least 35 accepted applicants in the next three batches of Hacker School, and are designed to help with a student’s living expenses during their three-month curriculum in New York City.

Diversity and opportunity lie at the heart of what Etsy is about, a place where anyone can start a business for $0.20. Today we wrote a post talking about how we think about diversity at Etsy: More Than Just Numbers. As an engineering team, diversity and opportunity are core values for us as well — we believe a diverse environment is a resilient one. We love what we do and want to extend that opportunity to anyone who wants it.

This doesn’t mean the work to support diversity is easy or that we’ve mastered it — but we are committed to continuing to improve. Over the years, we’ve focused on educational programs looking at unconscious bias, bringing in speakers from NCWIT, and building out our internal leadership courses to support a broad swath of new leaders.

Hacker School grants have been one of our favorite and most effective programs since sponsoring our first batch of students in summer 2012. We’ve even given talks about how well it went. Hacker School’s selection process and environment combine to create a group of students diverse across a number of axes, including gender, race, experience and technical background, but that are also culturally and technically aligned with Etsy’s engineering team. Hacker School’s welcoming “programming retreat” approach produces the sort of broad, deep, tool-building and curious system engineers that work well in our distributed, iterative, transparent and scaled environment. We have Hacker School alums across almost every team in Etsy engineering and at nearly every level, from just starting their first job to very senior engineers.

We know that racial diversity is a complicated issue, and we are by no means the experts. But we believe that together with Hacker School we are making a small step in the right direction.

And we need your help. This program only works if qualified applicants hear that it’s happening, and know that we really want them to apply. If you know someone who you think would be great, please let them know, and encourage them to apply to an upcoming batch!


Come find Etsy at Velocity NY 2014

Posted by on September 10, 2014

Velocity is our kind of conference, and Velocity NY happens in our backyard, in that funny little borough on the other side of the river (Manhattan). Make sure to come find us; we’ll be there teaching, speaking, and keynoting:

Monday 9am “PostMortem Facilitation: Theory and Practice of “New View” Debriefings” – John Allspaw

Monday 1:30pm “Building a Device Lab” – Lara Swanson and Destiny Montague

Tuesday 1:15pm “A Woman’s Place is at the Command Prompt” – Lara Swanson (moderator), Katherine Daniels (Etsy), Jennifer Davis (Chef), Bridget Kromhout (DramaFever), Samantha Thomas (UMN)

Tuesday 3:30pm “It’s 3AM, Do You Know Why You Got Paged?” – Ryan Frantz

Tuesday 5pm “Etsy’s Journey to Building a Continuous Integration Infrastructure for Mobile Apps” – Nassim Kammah

Wednesday 11:20am “Unpacking the Black Box: Benchmarking JS Parsing and Execution on Mobile Devices” – Daniel Espeset

Holding office hours

Nassim Kammah of our continuous integration team, the Autobot

Ryan Frantz our sleep and alert design obsessed operations engineer

John Allspaw, he should look familiar to you by now

Signing books

Jon Cowie will be signing his book “Customizing Chef”
Lara Swanson will be signing galleys of “Designing for Performance”
John Allspaw will be signing “Web Operations” and “Capacity Planning”


Teaching Testing: Our Testing 101 Materials

Posted by on August 20, 2014 / 2 Comments

Etsy engineers have a wide variety of backgrounds, strengths, and weaknesses, so there are no engineering skills we can take for granted. And there are things you can’t assume engineers will learn for themselves just because you throw a codebase and a workflow at them.

I work on Etsy’s continuous deployment team, which advises on automated testing of our code, and I felt that we could use some stronger means of teaching (and establishing as a conscious value) the skills of testing and design in code. To that end, I recently wrote two “Testing 101” materials for use by all Etsy engineers. They’re now both on our public Github: the Etsy Testing Best Practices Guide, and our Testing 101 Code Lab for hands-on practice applying its ideas. Both use PHP and PHPUnit.

We called it the “Testing Best Practices Guide” because we love misnomers: it’s more about design than testing, it describes few concrete practices, and we don’t call any of them “best”.

Within Etsy, we supplement mere documents with activities like team rotations (“bootcamps”) for new hires, technical talks, and dojos (collaborative coding exercises) to practice and have fun with coding topics as a group. And of course, we do code review.

Deeper technical considerations are often invisible in code, so you have to find a way, whether by process, tooling, or teaching, to make them visible.


Q2 2014 Site Performance Report

Posted by on August 1, 2014 / 3 Comments

As the summer really starts to heat up, it’s time to update you on how our site performed in Q2. The methodology for this report is identical to the Q1 report. Overall it’s a mixed bag: some pages are faster and some are slower. We have context for all of it, so let’s take a look.

Server Side Performance

Here are the median and 95th percentile load times for signed in users on our core pages on Wednesday, July 16th:

Server Side Performance

A few things stand out in this data:

Median homepage performance improved, but the 95th percentile got slower. This is due to a specific variant we are testing which is slower than the current page. We made some code changes that improved load time for the majority of pageviews, but the one slower test variant brings up the higher percentiles.

The listing page saw a fairly large increase in both median and 95th percentile load time. There isn’t a single smoking gun for this, but rather a number of small changes over the last three months that each added a little to the load time.

Search saw a significant decrease across the board. This is due to a dedicated memcached cluster that we rolled out to cache “listing cards” on the search results page. This brought our cache hit rate for listing related data up to 100%, since we automatically refresh the cache when the data changes. This was a nice win that will be sustainable over the long term.

The shop page saw a big jump at the 95th percentile. This is again due to experiments we are running on this page. A few of the variants we are testing for a shop redesign are slower than the existing page, which has a big impact on the higher percentiles. It remains to be seen which of these variants will win, and which version of the page we will end up with.

Overall we saw more increases than decreases on the backend, but we had a couple of performance wins from code/architecture changes, which is always nice to see. Looking ahead, we are planning on replacing the hardware in our memcached cluster in the next couple of months, and tests show that this should have a positive performance impact across the entire site.

Synthetic Front-end Performance

As a reminder, these tests are run with Catchpoint. They use IE9, and they run from New York, London, Chicago, Seattle, and Miami every two hours. The “Webpage Response” metric is defined as the time it took from the request being issued to receiving the last byte of the final element on the page. Here is that data:

Synthetic Performance

The render start metrics are pretty much the same across the board, with a couple of small decreases that aren’t really worth calling out due to rounding error and network variability. The “webpage response” numbers, on the other hand, are up significantly across the board. This is easily explained: we recently rolled out full site TLS, and changed our synthetic tests to hit https URLs. The added TLS negotiation time for all assets on the page bumped up the overall page load time everywhere. One thing we noticed with this change is that due to most browsers making six TCP connections per domain, we pay this TLS negotiation cost many times per page. We are actively investigating SPDY with the goal of sending all of our assets over one connection and only doing this negotiation once.

Real User Front-end Performance

As always, these numbers come from mPulse, and are measured via JavaScript running in real users’ browsers:

RUM Data

One change here is that we are showing an extra significant figure in the RUM data. We increased the number of beacons that we send to mPulse, and our error margin dropped to 0.00 seconds, so we feel confident showing 10ms resolution. We see the expected drop in search load time because of the backend improvement, and everything else is pretty much neutral. The homepage ticked up slightly, which is expected due to the experiment that I mentioned in the server side load time section.

One obvious question is: “Why did the synthetic numbers change so much while the RUM data is pretty much neutral?”. Remember, the synthetic numbers changed primarily because of a change to the tests themselves. The switch to https caused a step change in our synthetic monitoring, but for real users the rollout was gradual. In addition, real users that see more than one page have some site resources in their browser cache, mitigating some of the extra TLS negotiations. Our synthetic tests always operate with an empty cache, which is a bit unrealistic. This is one of the reasons why we have both synthetic and RUM metrics: if one of them looks a little wonky we can verify the difference with other data. Here’s a brief comparison of the two, showing where each one excels:

Synthetic Monitoring | Real User Monitoring
Browser Instrumentation | Navigation Timing API
Consistent trending over time | Can be highly variable as browsers and networks change
Largely in your control | Last mile difficulties
Great for identifying regressions | Great for comparing across geographies/browsers
Not super realistic from an absolute number point of view | “Real User Monitoring”
A/B tests can show outsized results due to empty caches | A/B tests will show the real world impact


This report had some highs and some lows, but at the end of the day our RUM data shows that our members are getting roughly the same experience they were a few months ago performance wise, with faster search results pages. We’re optimistic that upgrading our memcached cluster will put a dent in our backend numbers for the next report, and hopefully some of our experiments will have concluded with positive results as well. Look for another report from us as the holiday season kicks off!