Make Performance Part of Your Workflow

Posted on December 11, 2014

The following is an excerpt from Chapter 7, “Weighing Aesthetics and Performance,” from Designing for Performance by Lara Callender Hogan (Etsy’s Senior Engineering Manager of Performance), which has just been released by O’Reilly.

One way to minimize the operational cost of performance work is to incorporate it into your daily workflow by implementing tools and developing a routine of benchmarking performance.

There are a variety of tools mentioned throughout this book that you can incorporate into your daily development workflow:

By making performance work part of your daily routine and automating as much as possible, you’ll be able to minimize the operational costs of this work over time. Your familiarity with tools will increase, the habits you create will allow you to optimize even faster, and you’ll have more time to work on new things and teach others how to do performance right.

Your long-term routine should include performance as well. Continually benchmark improvements and any resulting performance gains as part of your project cycle so you can defend the cost of performance work in the future. Find opportunities to repurpose existing design patterns and document them. As your users grow up, so does modern browser technology; routinely check in on your browser-specific stylesheets, hacks, and other outdated techniques to see what you can clean up. All of this work will minimize the operational costs of performance work over time and allow you to find more ways to balance aesthetics and performance.

Approach New Designs with a Performance Budget

One key to making decisions when weighing aesthetics and page speed is understanding what wiggle room you have. By creating a performance budget early on, you can make performance sacrifices in one area of a page and make up for them in another. In Table 7-3 I’ve illustrated a few measurable performance goals for a site.

TABLE 7-3. Example performance budget

Measure | Maximum | Tool | Notes
Total page load time | 2 seconds | WebPagetest, median from five runs on 3G | All pages
Total page load time | 2 seconds | Real user monitoring tool, median across geographies | All pages
Total page weight | 800 KB | WebPagetest | All pages
Speed Index | 1,000 | WebPagetest using Dulles location in Chrome on 3G | All pages except home page
Speed Index | 600 | WebPagetest using Dulles location in Chrome on 3G | Home page

You can favor aesthetics in one area and favor performance in another by defining your budget up front. That way, it’s not always about making choices that favor page speed; you have an opportunity to favor more complex graphics, for example, if you can find page speed wins elsewhere that keep you within your budget. You can call a few more font weights because you found equivalent savings by removing some image requests. You can negotiate killing a marketing tracking script in order to add a better hero image. By routinely measuring how your site performs against your goals, you can continue to find that balance.

To decide what your performance goals will be, you can conduct a competitive analysis. See how your competitors are performing and make sure your budget is well below their results. You can also use industry standards for your budget: aim for a total page load time of two seconds or less, since that is how fast users expect sites to load.

Iterate upon your budget as you start getting better at performance and as industry standards change. Continue to push yourself and your team to make the site even faster. If you have a responsively designed site, determine a budget for your breakpoints as well, like we did in Chapter 5.

Your outlined performance goals should always be measurable. Be sure to detail the specific number to beat, the tool you’ll use to measure it, as well as any details of what or whom you’re measuring. Read more about how to measure performance in Chapter 6, and make it easy for anyone on your team to learn about this budget and measure their work against it.
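As a rough illustration of keeping a budget measurable, here is a minimal Python sketch that stores the budget lines from Table 7-3 and compares a set of measurements against one of them. The `BudgetLine` and `check` names and the sample run times are hypothetical and used only for illustration; in practice the numbers would come from whichever tool the budget line names.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class BudgetLine:
    measure: str
    maximum: float  # in the unit implied by the measure (seconds, KB, or Speed Index)
    tool: str
    notes: str


# The budget lines from Table 7-3.
BUDGET = [
    BudgetLine("Total page load time", 2.0,
               "WebPagetest, median from five runs on 3G", "All pages"),
    BudgetLine("Total page load time", 2.0,
               "Real user monitoring tool, median across geographies", "All pages"),
    BudgetLine("Total page weight", 800, "WebPagetest", "All pages"),
    BudgetLine("Speed Index", 1000,
               "WebPagetest using Dulles location in Chrome on 3G",
               "All pages except home page"),
    BudgetLine("Speed Index", 600,
               "WebPagetest using Dulles location in Chrome on 3G", "Home page"),
]


def check(line, measurements):
    """Return (median of the measurements, whether it is within the budget line)."""
    observed = median(measurements)
    return observed, observed <= line.maximum


# Hypothetical load times (seconds) from five test runs of one page.
observed, within_budget = check(BUDGET[0], [1.8, 2.4, 1.9, 2.1, 1.7])
print(f"{BUDGET[0].measure}: median {observed}s, within budget: {within_budget}")
```

Keeping the tool and notes next to each number means anyone on the team can see how a measurement is supposed to be taken before comparing their work against it.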

Designing for Performance by Lara Callender Hogan
ISBN 978-1-4919-0251-6
Copyright 2014 O’Reilly Media, Inc. All rights reserved. Used with permission.


Juggling Multiple Elasticsearch Instances on a Single Host

Posted on December 4, 2014

Elasticsearch is a distributed search engine built on top of Apache Lucene. At Etsy we use Elasticsearch in a number of different configurations: for Logstash, powering user-facing search on some large indexes, some analytics usage, and many internal applications.

Typically, it is assumed that there is a 1:1 relationship between ES instances and machines. This is straightforward and makes sense if your instance requirements line up well with the host – whether physical, virtualized or containerized. We run our clusters on bare metal, and for some of them we have more ES instances than physical hosts. We have good reasons for doing this, and here I’ll share some of the rationale, and the configuration options that we’ve found to be worth tuning.


Managing JVMs with large heaps is scary business due to garbage collection run times. 31GB is the magic threshold above which you lose the ability to use CompressedOops. In our experience, it is better to have even smaller heaps. Not only do GC pause times stay low, but it’s easier to capture and analyze heap dumps!

To get optimal Lucene performance, it is also important to have sufficient RAM available for OS file caching of the index files.

At the same time, we are running on server-class hardware with plenty of CPU cores and RAM. Our newest search machines are Ivy Bridge with 20 physical (40 virtual) cores, 128GB of RAM, and of course SSDs. If we ran a single node with a small heap on this hardware we would be wasting both CPU and RAM, because such an instance can only support correspondingly smaller shards.

We currently run 4 ES JVMs per machine with 8GB of heap each. This works out great for us: GC has not been a concern and we are utilizing our hardware effectively.
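The arithmetic behind this layout can be sketched in a few lines of Python. The `memory_plan` function is a hypothetical helper written for this illustration, not part of any Elasticsearch tooling, and the RAM outside the heaps is only an upper bound on what the OS file cache can use, since each JVM also needs some off-heap memory.

```python
def memory_plan(total_ram_gb, jvms_per_host, heap_gb, compressed_oops_limit_gb=31):
    """Check a per-host heap layout and report how much RAM is left outside the heaps."""
    if heap_gb > compressed_oops_limit_gb:
        raise ValueError("heap is above the CompressedOops threshold")
    total_heap_gb = jvms_per_host * heap_gb
    if total_heap_gb > total_ram_gb:
        raise ValueError("heaps do not fit in physical RAM")
    return {
        "total_heap_gb": total_heap_gb,
        # Upper bound on what remains for the OS file cache.
        "ram_outside_heaps_gb": total_ram_gb - total_heap_gb,
    }


# The layout described above: 4 JVMs with 8GB of heap each on a 128GB machine.
print(memory_plan(total_ram_gb=128, jvms_per_host=4, heap_gb=8))
```

With these numbers the heaps take 32GB in total, which leaves up to 96GB for the file cache that Lucene benefits from.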

The settings


Elasticsearch uses its processor-count setting to configure thread pool and queue sizing. It defaults to Runtime.getRuntime().availableProcessors(). With multiple instances on one host, it is better to divide the CPU resources among them.

We set this to ($(nproc) / $nodes_per_host). So if we are running 4 nodes per host on 40-core machines, each of them will configure thread pools and queues as if there were 10 cores.

The default node name is a random Marvel comic character picked at startup. In production, we want something that lets us find the node we want with as little thought and effort as possible. We set the name to $hostname-$nodeId (which results in names like “search64-es0”: less whimsical, but far more practical when you’re trying to get to the bottom of an issue).

http.port, transport.port

If these ports are not specified, ES tries to pick the next available port at startup, starting at a base of 9200 for HTTP and 9300 for its internal transport.

We prefer to be explicit and assign ports as $basePort+$nodeIdx from the startup script. This prevents surprises such as an instance that you expect to be down still being bound to its port, which pushes the ‘next available’ port higher than expected.
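To make the per-node values concrete, here is a minimal Python sketch of how a startup script might derive them. The `node_settings` function and its field names are hypothetical, integer division for the processor count is an assumption, and using 9200 and 9300 as the base ports simply reuses the Elasticsearch defaults mentioned above.

```python
def node_settings(hostname, cores, nodes_per_host, http_base=9200, transport_base=9300):
    """Derive explicit per-node values for each ES instance on one host."""
    processors = cores // nodes_per_host  # assumed integer division of the core count
    return [
        {
            "name": f"{hostname}-es{idx}",         # $hostname-$nodeId
            "processors": processors,
            "http_port": http_base + idx,          # $basePort + $nodeIdx
            "transport_port": transport_base + idx,
        }
        for idx in range(nodes_per_host)
    ]


# 4 nodes per host on a 40-core machine, as in the example above.
for settings in node_settings("search64", cores=40, nodes_per_host=4):
    print(settings)
```

Because every value is a pure function of the host name and node index, two instances can never silently end up with the same name or port.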


A key way to achieve failure tolerance with ES is to use replicas, so that if one host goes down, the affected shards stay available. If you’re running multiple instances on each physical host, it’s entirely possible for all replicas of a shard to be allocated automatically to the same host, which isn’t going to help you! Thankfully this is avoidable with the use of shard allocation awareness. You can set the hostname as a node attribute on each instance and use that attribute as a factor in shard assignments.

ES_JAVA_OPTS="$ES_JAVA_OPTS -Des.cluster.routing.allocation.awareness.attributes=host"


Without having a dedicated log directory for each instance, you would end up with multiple JVMs trying to write to the same log files.

An alternative, which we rely on, is to prefix the filenames in logging.yml with the property ${} so that each node’s logs are labelled by host and node ID. Another reason to be explicit about node naming!

Minimal Viable Configuration

Elasticsearch has lots of knobs, and it’s worth trying to minimize configuration. That said, a lot of ES is optimized for cloud environments so we occasionally find things worth adjusting, like allocation and recovery throttling. What do you end up tweaking?

You can follow Shikhar on Twitter at @shikhrr


Personalized Recommendations at Etsy

Posted on November 17, 2014

Providing personalized recommendations is important to our online marketplace.  It benefits both buyers and sellers: buyers are shown interesting products that they might not have found on their own, and products get more exposure beyond the seller’s own marketing efforts.  In this post we review some of the methods we use for making recommendations at Etsy.  The MapReduce implementations of all these methods are now included in our open-source machine learning package “Conjecture” which was described in a previous post.

Computing recommendations basically consists of two stages.  In the first stage we build a model of users’ interests based on a matrix of historic data, for example, their past purchases or their favorite listings (those unfamiliar with matrices and linear algebra can see, e.g., this review).  The models provide vector representations of users and items, and their inner products give an estimate of the level of interest a user will have in the item (higher values denote a greater degree of estimated interest).  In the second stage, we compute recommendations by finding a set of items for each user which approximately maximizes the estimate of the interest.

The model of users and items can be also used in other ways, such as finding users with similar interests, items which are similar from a “taste” perspective, items which complement each other and could be purchased together, etc.

Matrix Factorization

The first stage in producing recommendations is to fit a model of users and items to the data.  At Etsy, we deal with “implicit feedback” data where we observe only the indicators of users’ interactions with items (e.g., favorites or purchases).  This is in contrast to “explicit feedback” where users give ratings (e.g., 3 of 5 stars) to items they’ve experienced.  We represent this implicit feedback data as a binary matrix whose elements are one where the user liked the item (i.e., favorited it) and zero where they did not.  The zeros do not necessarily indicate that the user is not interested in that item, but only that they have not expressed an interest so far.  This may be due to disinterest or indifference, or due to the user not having seen that item yet while browsing.



An implicit feedback dataset in which a set of users have “favorited” various items.  Note that we do not observe explicit dislikes, only the presence or absence of favorites.


The underpinning assumption that matrix factorization models make is that the affinity between a user and an item is explained by a low-dimensional linear model.  This means that each item and user really corresponds to an unobserved real vector of some small dimension.  The coordinates of the space correspond to latent features of the items (these could be things like whether the item is clothing, whether it has chevrons, or whether the background of the picture is brown), and the elements of the user vector describe the user’s preferences for these features.  We may stack these vectors into matrices, one for users and one for items; the observed data is then in theory generated by taking the product of these two unknown matrices and adding noise:


The underpinning low-dimensional model from which the observed implicit feedback data is generated; “d” is the dimension of the model.


We therefore find a vector representation for each user and each item.  We compute these vectors so that the inner product between a user vector and item vector will approximate the observed value in the implicit feedback matrix (i.e., it will be close to one in the case the user favorited that item and close to zero if they didn’t).
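In symbols, one way to write this model is sketched below.  The notation is our own choice for illustration: R is the n × m binary feedback matrix for n users and m items, U and V stack the d-dimensional user and item vectors as rows, and E is the noise.

```latex
R = U V^{\top} + E,
\qquad U \in \mathbb{R}^{n \times d},
\quad V \in \mathbb{R}^{m \times d},
\qquad \hat{r}_{ij} = u_i^{\top} v_j \approx r_{ij}
```

Here u_i and v_j are the i-th row of U and the j-th row of V, so the estimate for user i and item j is just the inner product of their two vectors.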


The results of fitting a two-dimensional model to the above dataset.  In this small example, the first discovered feature roughly corresponds to whether the item is a shelf or not, and the second to whether it is in a “geometric” style.


Since the zeros in the matrix do not necessarily indicate disinterest in the item, we don’t want to force the model to fit to them, since the user may actually be interested in some of those items.  Therefore we find the decomposition which minimizes a weighted error function, where the weights for nonzero entries in the data matrix are higher than those of the zero entries.  This follows a paper which suggested this method.  How to set these weights depends on how sparse the matrix is, and could be found through some form of cross validation.
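One way to write such a weighted error function is sketched below.  The symbols are assumptions introduced for illustration: r_ij is the binary entry for user i and item j, u_i and v_j are their d-dimensional vectors, and w_+ and w_0 are the weights for nonzero and zero entries respectively.

```latex
\min_{U, V} \; \sum_{i=1}^{n} \sum_{j=1}^{m} w_{ij} \left( r_{ij} - u_i^{\top} v_j \right)^2,
\qquad
w_{ij} =
\begin{cases}
  w_{+} & \text{if } r_{ij} = 1 \\
  w_{0} & \text{if } r_{ij} = 0
\end{cases},
\qquad w_{+} > w_{0} > 0
```

The sparser the matrix, the more the zero entries outnumber the ones, which is why the ratio between the two weights matters and is worth tuning.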

When we optimize the weighted loss function described above, the reconstructed matrix (the product of the two factors) will often have positive elements where the input matrix has zeros, since we don’t force the model to fit the zeros as closely as the non-zeros.  These are the items which the user may be interested in but has not seen yet.  The reason this happens is that in order for the model to fit well, users who have shown interest in overlapping sets of items will have similar vectors, and likewise for items.  Therefore the unexplored items which are liked by other users with similar interests will often have a high value in the reconstructed matrix.

Alternating Least Squares

To optimize the model, we alternate between computing item matrix and user matrix, and at each stage we minimize the weighted squared error, holding the other matrix fixed (hence the name “alternating least squares”).  At each stage, we can compute the exact minimizer of the weighted square error, since an analytic solution is available.  This means that each iteration is guaranteed not to increase the total error, and to decrease it unless the two matrices already constitute a local minimum of the error function.  Therefore the entire procedure gradually decreases the error until a local minimum is reached.  The quality of these minima can vary, so it may be a reasonable idea to repeat the procedure and select the best one, although we do not do this.  A demo of this method in R is available here.

This computation lends itself very naturally to implementation in MapReduce, since e.g., when updating a vector for a user, all that is needed are the vectors for the items which they have interacted with, and the small square matrix formed by multiplying the items matrix by its own transpose.  This way the computation for each user typically can be done even with limited amounts of memory available, and each user may be updated in parallel.  Likewise for updating items.  There are some users who favorite huge numbers of items, and likewise items favorited by many users, and those computations require more memory.  In these cases we can sub-sample the input matrix, either by filtering out these items, or by taking only the most recent favorites for each user.
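The update described above can be sketched in plain Python on a toy matrix.  This is an illustrative single-machine version, not the MapReduce code in Conjecture: the function names, the toy data, the weights, and the small ridge penalty `reg` (added so that every linear system is well conditioned) are all assumptions of this sketch.  Note how each row update needs only the shared d × d Gram matrix plus the vectors that row interacted with.

```python
import random


def solve(a, b):
    """Solve the small dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - rest) / m[i][i]
    return x


def update_side(liked, other, d, w_pos, w_zero, reg):
    """Exact weighted least-squares update of every vector on one side.

    liked[i] lists the indices (into `other`) that row i interacted with.
    """
    # d x d Gram matrix of the fixed side, computed once and shared by all rows.
    gram = [[sum(v[p] * v[q] for v in other) for q in range(d)] for p in range(d)]
    updated = []
    for indices in liked:
        a = [[w_zero * gram[p][q] + (reg if p == q else 0.0) for q in range(d)]
             for p in range(d)]
        b = [0.0] * d
        for j in indices:  # only the vectors this row interacted with
            v = other[j]
            for p in range(d):
                b[p] += w_pos * v[p]
                for q in range(d):
                    a[p][q] += (w_pos - w_zero) * v[p] * v[q]
        updated.append(solve(a, b))
    return updated


def objective(r, users, items, w_pos, w_zero, reg):
    """Weighted squared error plus the (assumed) ridge penalty."""
    total = reg * sum(x * x for vec in users + items for x in vec)
    for i, u in enumerate(users):
        for j, v in enumerate(items):
            weight = w_pos if r[i][j] else w_zero
            total += weight * (r[i][j] - sum(x * y for x, y in zip(u, v))) ** 2
    return total


def als(r, d=2, iterations=10, w_pos=1.0, w_zero=0.1, reg=0.01, seed=0):
    """Alternate exact updates of the user and item vectors, recording the objective."""
    rng = random.Random(seed)
    n_users, n_items = len(r), len(r[0])
    user_likes = [[j for j in range(n_items) if r[i][j]] for i in range(n_users)]
    item_likes = [[i for i in range(n_users) if r[i][j]] for j in range(n_items)]
    items = [[rng.gauss(0.0, 0.1) for _ in range(d)] for _ in range(n_items)]
    history = []
    for _ in range(iterations):
        users = update_side(user_likes, items, d, w_pos, w_zero, reg)
        items = update_side(item_likes, users, d, w_pos, w_zero, reg)
        history.append(objective(r, users, items, w_pos, w_zero, reg))
    return users, items, history


# Hypothetical toy data: rows are users, columns are items, 1 = favorited.
FAVORITES = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
users, items, history = als(FAVORITES)
print(history)
```

Because each half-step solves its least-squares problem exactly, the recorded objective never increases from one iteration to the next, which matches the convergence argument given earlier.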

After we are satisfied with the model, we can continue to update it as we observe more information, by repeating a few steps of the alternating least squares every night, as more items, users, and favorites come online.  New items and users can be folded into the model easily, so long as there are sufficiently many interactions between them and existing users and items in the model respectively.  Productionizable MapReduce code for this method is available here.

Stochastic SVD

The alternating least squares described above gives us an easy way to factorize the matrix of user preferences in MapReduce. However, this technique has the disadvantage of requiring several iterations, sometimes taking a long time to converge to a quality solution. An attractive alternative is the Stochastic SVD.  This is a recent method which approximates the well-known Singular Value Decomposition of a large matrix, and which admits a non-iterative MapReduce implementation.  We implement this as a function which can be called from any Scalding Hadoop MapReduce job.

A fundamental result in linear algebra is that the matrix formed by truncating the singular value decomposition after some number of dimensions is the best approximation to that matrix (in terms of square error) among all matrices of that rank.  However, we note that using this method we cannot apply the same “weighting” to the error as we did when optimizing via alternating least squares.  Nevertheless, this method is viable for datasets where the zeros do not completely overwhelm the non-zeros.  For example, we use it to build a model from the favorites, whereas it fails to provide a useful model from purchases, which are much more sparse and where this weighting is necessary.

An advantage of this method is that it produces matrices with a nice orthonormal structure, which makes it easy to construct the vectors for new users on the fly (outside of a nightly recomputation of the whole model), since no matrix inversions are required.  We also use this method to produce vector representations of other lists of items besides those a user favorited, for example treasuries and other user curated lists on Etsy.  This way we may suggest other relevant items for those lists.

Producing Recommendations

Once we have a model of users and items we use it to build product recommendations.  This is a step which seems to be mostly overlooked in the research literature.  For example, we cannot hope to compute the product of the user and item matrices, and then find the best unexplored items for each user, since this requires time proportional to the product of the number of items and the number of users, both of which are in the hundreds of millions.

One research paper suggests using a tree data structure to allow for a non-exhaustive search of the space, by pruning away entire sets of items where the inner products would be too small.  However, we found that this method did not work well in practice, possibly due to the curse of dimensionality with the type of models we were using (with hundreds of dimensions).

Therefore we use approximate methods to compute the recommendations.  The idea is to first produce a candidate set of items, then to rank them according to the inner products, and take the highest ones.  There are a few ways to produce candidates, for example, the listings from favorite shops of a user, or those textually similar to their existing favorites.  However, the main way we use is “locality sensitive hashing” (LSH), where we divide the space of user and item vectors into several hash bins, then take the set of items which are mapped to the same bin as each user.
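The ranking half of this procedure is simple enough to sketch directly.  The following Python is a minimal illustration under our own assumptions: the `recommend` function, the item identifiers, and the toy vectors are hypothetical, and the candidate set is taken as given, however it was produced.

```python
import heapq


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def recommend(user_vec, item_vecs, candidate_ids, already_seen, k):
    """Rank unseen candidates by inner product with the user and keep the top k."""
    scored = (
        (dot(user_vec, item_vecs[item_id]), item_id)
        for item_id in candidate_ids
        if item_id not in already_seen
    )
    return [item_id for _, item_id in heapq.nlargest(k, scored)]


# Hypothetical two-dimensional model with three candidate items.
ITEM_VECS = {"a": [0.9, 0.0], "b": [0.2, 1.0], "c": [0.5, 0.5]}
print(recommend([1.0, 0.0], ITEM_VECS, {"a", "b", "c"}, already_seen={"a"}, k=2))
```

The cost per user is proportional to the size of the candidate set rather than to the total number of items, which is the point of generating candidates first.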

Locality Sensitive Hashing

Locality sensitive hashing is a technique used to find approximate nearest neighbors in large datasets.  There are several variants, but we focus on one designed to handle real-valued data and to approximate the nearest neighbors in the Euclidean distance.

The idea of the method is to partition the space into a set of hash buckets, so that points which are near to each other in space are likely to fall into the same bucket.  The way we do this is by constructing some number “p” of planes in the space so that they all pass through the origin.  This divides the space up into 2^p convex cones, each of which constitutes a hash bucket.

Practically we implement this by representing the planes in terms of their normal vectors.  The side of the plane that a point falls on is then determined by the sign of the inner product between the point and the normal vector (with random planes the inner product is non-zero almost surely; a point with a zero inner product could in principle be assigned arbitrarily to one side or the other).  To generate these normal vectors we just need directions drawn uniformly at random in space.  It is well known that draws from an isotropic Gaussian distribution have this property.

We number the hash buckets so that the i^th bit of the hash-code is 1 if the inner product between a point and the i^th plane is positive, and 0 otherwise.  This means that each plane is responsible for a bit of the hash code.

After we map each point to its respective hash bucket, we can compute approximate nearest neighbors, or equivalently, approximate recommendations, by examining only the vectors in the bucket.  On average the number in each bucket will be 2^{-p} times the total number of points, so using more planes makes the procedure very efficient.  However it also reduces the accuracy of the approximation, since it reduces the chance that nearby points to any target point will be in the same bucket.  Therefore to achieve a good tradeoff between efficiency and quality, we repeat the hashing procedure multiple times, and then combine the outputs.  Finally, to add more control to the computational demands of the procedure, we throw away all the hash bins which are too large to allow efficient computation of the nearest neighbors.  This is implemented in Conjecture here.
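The whole procedure can be sketched in a few lines of Python.  This is an illustrative single-machine version rather than the Conjecture implementation: the function names, the random toy vectors, and the parameter values (4 planes, 4 repetitions, 8 dimensions) are all assumptions chosen for the example.

```python
import random


def random_planes(p, d, rng):
    """Draw p normal vectors from an isotropic Gaussian in d dimensions."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(p)]


def hash_code(point, planes):
    """Bit i of the code is 1 when the inner product with the i-th normal is positive."""
    code = 0
    for i, normal in enumerate(planes):
        if sum(x * n for x, n in zip(point, normal)) > 0:
            code |= 1 << i
    return code


def build_table(item_vecs, planes, max_bucket_size):
    """Map every item to its bucket, then drop buckets that are too large."""
    buckets = {}
    for item_id, vec in item_vecs.items():
        buckets.setdefault(hash_code(vec, planes), []).append(item_id)
    return {code: ids for code, ids in buckets.items() if len(ids) <= max_bucket_size}


def candidates(user_vec, tables):
    """Combine the user's bucket from every repetition of the hashing."""
    found = set()
    for planes, buckets in tables:
        found.update(buckets.get(hash_code(user_vec, planes), []))
    return found


# Hypothetical data: 200 random 8-dimensional item vectors, hashed 4 times with 4 planes.
rng = random.Random(0)
ITEM_VECS = {f"item{i}": [rng.gauss(0.0, 1.0) for _ in range(8)] for i in range(200)}
TABLES = []
for _ in range(4):
    planes = random_planes(p=4, d=8, rng=rng)
    TABLES.append((planes, build_table(ITEM_VECS, planes, max_bucket_size=200)))

nearby = candidates(ITEM_VECS["item0"], TABLES)
print(len(nearby), "candidates out of", len(ITEM_VECS))
```

Raising the number of planes shrinks each bucket, while raising the number of repetitions widens the combined candidate set, so the two parameters together control the tradeoff between efficiency and quality described above.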

Other Thoughts

Above are the basic techniques for generating personalized recommendations.  Over the course of developing these recommender systems, we found a few modifications we could make to improve the quality of the recommendations.


In summary we described how we can build recommender systems for e-commerce based on implicit feedback data.  We built a system which computes recommendations on Hadoop, which is now part of our open source machine learning package “Conjecture.”  Finally we shared some additional tweaks that can be made to potentially improve the quality of recommendations.


Building A Better Build: Our Transition From Ant To SBT

Posted on September 30, 2014

A build tool is fundamental to any non-trivial software project.  A good build tool should be as unobtrusive as possible, letting you focus on the code instead of the mechanics of compilation.  At Etsy we had been using Ant to build our big data stack.  While Ant did handle building our projects adequately, it was a common source of questions and frustration for new and experienced users alike.  When we analyzed the problems users were having with the build process, we decided to replace Ant with SBT, as it was a better fit for our use cases.  In this post I’ll discuss the reasons we chose SBT as well as some of the details of the actual process of switching.

Why Did We Switch?

There were two perspectives we considered when choosing a replacement for Ant.  The first is that of a user of our big data stack.  The build tool should stay out of the way of these users as much as possible.  No one should ever feel it is preventing them from being productive, but instead that it is making them more productive.  SBT has a number of advantages in this regard:

  1. Built-in incremental compiler: We used the stand-alone incremental compiler zinc for our Ant build.  However, this required a custom Ant task, and both zinc and that task needed to be installed properly before users could start building.  This was a common source of questions for users new to our big data stack.  SBT ships with the incremental compiler and uses it automatically.
  2. Better environment handling: Because of the size of our code base, we need to tune certain JVM options when compiling it.  With Ant, these options had to be set in the ANT_OPTS environment variable.  Not having ANT_OPTS set properly was another common source of problems for users.  There is an existing community-supported SBT wrapper that solves this problem.  The JVM options we need are set in a .jvmopts file that is checked in with the code.
  3. Triggered execution: If you prefix any SBT command or sequence of commands with a tilde, it will automatically run that command every time a source file is modified.  This is a very powerful tool for getting immediate feedback on changes.  Users can compile code, run tests, or even launch a Scalding job automatically as they work.
  4. Build console: SBT provides an interactive console for running build commands.  Unlike running commands from the shell, this console supports tab-completing arguments to commands and allows temporarily modifying settings.

The other perspective is that of someone modifying the build.  It should be straightforward to make changes.  Furthermore, it should be easy to take advantage of existing extensions from the community.  SBT is also compelling in this regard:

  1. Build definition in Scala: The majority of the code for our big data stack is Scala.  With Ant, modifying the build requires a context switch to its XML-based definition.  An SBT build is defined using Scala, so no context switching is necessary.  Using Scala also provides much more power when defining build tasks.  We were able to replace several Ant tasks that invoked external shell scripts with pure Scala implementations.  Defining the build in Scala does introduce more opportunities for bugs, but you can use scripted to test parts of your build definition.
  2. Plugin system: To extend Ant, you have to give it a JAR file, either on the command line or by placing it in certain directories.  These JAR files then need to be made available to everyone using the build.  With SBT, all you need to do is add a line like
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

    to the build definition.  SBT will then automatically download that plugin and its dependencies.  By default SBT will download dependencies from Maven Central, but you can configure it to use only an internal repository:

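    // Register the internal repository as a resolver.
    resolvers += "internal-repo" at "<repository URL>"
    // Resolve only from the configured resolvers, skipping Maven Central.
    externalResolvers <<= resolvers map { rs =>
        Resolver.withDefaultResolvers(rs, mavenCentral = false)
    }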
  3. Ease of inspecting the build: SBT has an “inspect” command that will provide a lot of information about a setting or task: the type, any related settings or tasks, and forward and reverse dependencies.  This is invaluable when trying to understand how some part of the build works.
  4. Debugging support: Most SBT tasks produce verbose debug logs that are not normally displayed.  The “last” command will output those logs, making tracking down the source of an error much easier.

This is not to say that SBT is perfect.  There are some aspects of SBT that are less than ideal:

  1. Terse syntax: The SBT build definition DSL is full of hieroglyphic symbols, such as the various assignment operators: <<=, <+=, and <++=.  This terse syntax can be intimidating for those that are not used to it.
  2. Support for compiling a subset of files: The Scala compiler can be slow, so our Ant build supported compiling only a subset of files.  This was very easy to add in Ant.  SBT required a custom plugin that delved into SBT’s internal APIs.  This does work, but we have already had to do some minor refactoring between SBT 0.13.1 and 0.13.2.  We have to accept that such incompatibility is likely to occur between future minor versions.
  3. Less mature: SBT is a relatively new tool, and the plugin ecosystem is still growing.  As such, there were several times we had to write custom plugins for functionality that would likely already be supported by another tool.  We also experienced a couple of bugs in SBT itself during the switch.

At this point it’s fair to ask why SBT, and not Maven, Gradle, or some other tool.  You can certainly get features like these in other tools.  We did not do a detailed comparison of multiple tools to replace Ant.  SBT addressed our pain points with the Ant build, and it was already popular with users of our big data stack.

The Transition Process

It may sound cliché, but the initial hurdle for actually switching to SBT was gaining an in-depth understanding of the tool.  SBT’s terminology and model of a build is very different from Ant.  SBT’s documentation is very good and very thorough, however.  Starting the switch without having read through it would have resulted in a lot of wasted time.  It’s a lot easier to find answers online when you know how SBT names things!

The primary goal when switching was to have the SBT build be a drop-in replacement for the Ant build.  This removed the need to modify any external processes that use the build artifacts.  It also allowed us to have the Ant and SBT builds in parallel for a short time in case of bugs in the SBT build.  Unfortunately, our project layout did not conform to SBT’s conventions — but SBT is fully configurable in this regard.  SBT’s “inspect” command was helpful here for discovering which settings to tweak.  It was also good that SBT supports using an external ivy.xml file for dependency management.  We were already using Ivy with our Ant build, so we were able to define our dependencies in one place while running both builds in parallel.

With the configuration to match our project layout taken care of, actually defining the build with SBT was a straightforward task.  Unlike Ant, SBT has built-in tasks for common operations, like compiling, running tests, and creating a JAR.  SBT’s multi-project build support also cut down on the amount of configuration necessary.  We have multiple projects building together, and every single Ant task had to be defined to build them in the correct order.  Once we configured the dependencies between the projects in SBT, it automatically guaranteed this.

We also took advantage of SBT’s aliasing feature to make the switch easier on users.  The names of SBT’s built-in tasks did not always align with the names we had picked when defining the corresponding task in Ant.  It’s very easy to alias commands, however:

addCommandAlias("jar", "package")

With such aliases users were able to start using SBT just as they had used Ant, without needing to learn a whole set of new commands.  Aliases also make it easy to define sequences of commands without needing to create an entirely new task, such as

addCommandAlias("rebuild", ";clean; compile; package")

The process of switching was not entirely smooth, however.  We did run into two bugs in SBT itself.  The first is triggered when you define an alias that is the prefix of a task in the alias definition.  Tab-completing or executing that alias causes SBT to hang for some time and eventually die with a StackOverflowError.  This bug is easy to avoid if you define your alias names appropriately, and it is fixed in SBT 0.13.2.  The other bug only comes up with multi-module projects.  Even though our modules have many dependencies in common, SBT will re-resolve these dependencies for each module.  There is now a fix for this in SBT 0.13.6, the most recently released version, that can be enabled by adding

updateOptions := updateOptions.value.withConsolidatedResolution(true)

to your build definition.  We saw about a 25% decrease in time spent in dependency resolution as a result.

Custom Plugins

As previously mentioned, we had to write several custom plugins during the process of switching to SBT to reproduce all the functionality of our Ant build.  We are now open-sourcing two of these SBT plugins.  The first is sbt-checkstyle-plugin.  This plugin allows running Checkstyle over Java sources with the checkstyle task.  The second is sbt-compile-quick-plugin.  This plugin allows you to compile and package a single file with the compileQuick and packageQuick tasks, respectively.  Both of these plugins are available in Maven Central.


Switching build tools isn’t a trivial task, but it has paid off.  Switching to SBT has allowed us to address multiple pain points with our previous Ant build.  We’ve been using SBT for several weeks now.  As with any new tool, there was a need for some initial training to get everyone started.  Overall though, the switch has been a success!   The number of issues encountered with the build has dropped.  As users learn SBT they are taking advantage of features like triggered execution to increase their productivity.


Expanding Diversity Efforts with Hacker School

Posted by on September 25, 2014 / 2 Comments

Today we’re proud to announce that Etsy will provide $210,000 in Etsy Hacker Grants to Hacker School applicants in the coming year. These grants extend our support of Hacker School’s diversity initiatives, which first focused on the gender imbalance of applicants to their program and in the wider industry, and will now expand to support applicants with racial backgrounds that are underrepresented in software engineering.

The grants will provide up to $7,000 in support for at least 35 accepted applicants in the next three batches of Hacker School, and are designed to help with a student’s living expenses during their three-month curriculum in New York City.

Diversity and opportunity lie at the heart of what Etsy is about, a place where anyone can start a business for $0.20. Today we wrote a post talking about how we think about diversity at Etsy: More Than Just Numbers. As an engineering team, diversity and opportunity are core values for us as well — we believe a diverse environment is a resilient one. We love what we do and want to extend that opportunity to anyone who wants it.

This doesn’t mean the work to support diversity is easy or that we’ve mastered it — but we are committed to continuing to improve. Over the years, we’ve focused on educational programs looking at unconscious bias, bringing in speakers from NCWIT, and building out our internal leadership courses to support a broad swath of new leaders.

Hacker School grants have been one of our favorite and most effective programs since sponsoring our first batch of students in summer 2012. We’ve even given talks about how well it went. Hacker School’s selection process and environment combine to create a group of students who are diverse across a number of axes, including gender, race, experience and technical background, but who are also culturally and technically aligned with Etsy’s engineering team. Hacker School’s welcoming “programming retreat” approach produces the sort of broad, deep, curious, tool-building system engineers that work well in our distributed, iterative, transparent and scaled environment. We have Hacker School alums across almost every team in Etsy engineering and at nearly every level, from just starting their first job to very senior engineers.

We know that racial diversity is a complicated issue, and we are by no means the experts. But we believe that together with Hacker School we are making a small step in the right direction.

And we need your help. This program only works if qualified applicants hear that it’s happening, and know that we really want them to apply. If you know someone who you think would be great, please let them know, and encourage them to apply to an upcoming batch!


Come find Etsy at Velocity NY 2014

Posted by on September 10, 2014

Velocity is our kind of conference, and Velocity NY happens in our backyard, in that funny little borough on the other side of the river (Manhattan). Make sure to come find us; we’ll be there teaching, speaking, and keynoting:

Monday 9am “PostMortem Facilitation: Theory and Practice of “New View” Debriefings” – John Allspaw

Monday 1:30pm “Building a Device Lab” – Lara Swanson and Destiny Montague

Tuesday 1:15pm “A Woman’s Place is at the Command Prompt” – Lara Swanson (moderator), Katherine Daniels (Etsy), Jennifer Davis (Chef), Bridget Kromhout (DramaFever), Samantha Thomas (UMN)

Tuesday 3:30pm “It’s 3AM, Do You Know Why You Got Paged?” – Ryan Frantz

Tuesday 5pm “Etsy’s Journey to Building a Continuous Integration Infrastructure for Mobile Apps” – Nassim Kammah

Wednesday 11:20am “Unpacking the Black Box: Benchmarking JS Parsing and Execution on Mobile Devices” – Daniel Espeset

Holding office hours

Nassim Kammah of our continuous integration team, the Autobot

Ryan Frantz, our sleep- and alert-design-obsessed operations engineer

John Allspaw, he should look familiar to you by now

Signing books

Jon Cowie will be signing his book “Customizing Chef”
Lara Swanson will be signing galleys of “Designing for Performance”
John Allspaw will be signing “Web Operations” and “Capacity Planning”


Teaching Testing: Our Testing 101 Materials

Posted by on August 20, 2014 / 2 Comments

Etsy engineers have a wide variety of backgrounds, strengths, and weaknesses, so there are no engineering skills we can take for granted. And there are things you can’t just assume engineers will learn for themselves because you throw a codebase and a workflow at them.

I work on Etsy’s continuous deployment team, which advises on automated testing of our code, and I felt that we could use some stronger means of teaching (and establishing as a conscious value) the skills of testing and design in code. To that end, I recently wrote two “Testing 101” materials for use by all Etsy engineers. They’re now both on our public GitHub: the Etsy Testing Best Practices Guide, and our Testing 101 Code Lab for hands-on practice applying its ideas. Both use PHP and PHPUnit.

We called it the “Testing Best Practices Guide” because we love misnomers. It’s more about design than testing, it describes few concrete practices, and we don’t call any of them “best”.

Within Etsy, we supplement mere documents with activities like team rotations (“bootcamps”) for new hires, technical talks, and dojos (collaborative coding exercises) to practice and have fun with coding topics as a group. And of course, we do code review.

Deeper technical considerations are often invisible in code, so you have to find a way, whether by process, tooling, or teaching, to make them visible.


Q2 2014 Site Performance Report

Posted by on August 1, 2014 / 3 Comments

As the summer really starts to heat up, it’s time to update you on how our site performed in Q2. The methodology for this report is identical to the Q1 report. Overall it’s a mixed bag: some pages are faster and some are slower. We have context for all of it, so let’s take a look.

Server Side Performance

Here are the median and 95th percentile load times for signed in users on our core pages on Wednesday, July 16th:

[Figure: Server Side Performance]

A few things stand out in this data:

Median homepage performance improved, but the 95th percentile got slower. This is due to a specific variant we are testing which is slower than the current page. We made some code changes that improved load time for the majority of pageviews, but the one slower test variant brings up the higher percentiles.

The listing page saw a fairly large increase in both median and 95th percentile load time. There isn’t a single smoking gun for this, but rather a number of small changes that each added a little load time over the last three months.

Search saw a significant decrease in load time across the board. This is due to a dedicated memcached cluster that we rolled out to cache “listing cards” on the search results page. This brought our cache hit rate for listing-related data up to 100%, since we automatically refresh the cache when the data changes. This was a nice win that will be sustainable over the long term.

The shop page saw a big jump at the 95th percentile. This is again due to experiments we are running on this page. A few of the variants we are testing for a shop redesign are slower than the existing page, which has a big impact on the higher percentiles. It remains to be seen which of these variants will win, and which version of the page we will end up with.

Overall we saw more increases than decreases on the backend, but we had a couple of performance wins from code/architecture changes, which is always nice to see. Looking ahead, we are planning on replacing the hardware in our memcached cluster in the next couple of months, and tests show that this should have a positive performance impact across the entire site.
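
To make the search change above more concrete, here is a minimal Scala sketch of the general technique of refreshing a cache on every write, so that reads rarely have to fall through to the database. It is not our production code: ListingCard, ListingStore, CardCache, and ListingService are hypothetical names, and plain in-memory maps stand in for both the database and memcached.

```scala
import scala.collection.mutable

// Hypothetical listing-card record; the fields are illustrative only.
final case class ListingCard(id: Long, title: String, priceCents: Long)

// Stand-in for the database of record.
final class ListingStore {
  private val rows = mutable.Map.empty[Long, ListingCard]
  def save(card: ListingCard): Unit = rows.update(card.id, card)
  def load(id: Long): Option[ListingCard] = rows.get(id)
}

// Stand-in for a memcached client: an in-memory map with hit/miss counters.
final class CardCache {
  private val entries = mutable.Map.empty[Long, ListingCard]
  var hits = 0
  var misses = 0
  def put(card: ListingCard): Unit = entries.update(card.id, card)
  def get(id: Long): Option[ListingCard] = {
    val found = entries.get(id)
    if (found.isDefined) hits += 1 else misses += 1
    found
  }
}

final class ListingService(store: ListingStore, cache: CardCache) {
  // Every write saves to the store and then refreshes the cached card,
  // so readers do not have to wait for a miss to repopulate it.
  def update(card: ListingCard): Unit = {
    store.save(card)
    cache.put(card)
  }

  // Reads try the cache first and fall back to the store on a miss.
  def card(id: Long): Option[ListingCard] =
    cache.get(id).orElse {
      val loaded = store.load(id)
      loaded.foreach(cache.put)
      loaded
    }
}
```

With this shape, any card that has been written through update is served from the cache on its next read; misses only happen for IDs that were never written through the service.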

Synthetic Front-end Performance

As a reminder, these tests are run with Catchpoint. They use IE9, and they run from New York, London, Chicago, Seattle, and Miami every two hours. The “Webpage Response” metric is defined as the time it took from the request being issued to receiving the last byte of the final element on the page. Here is that data:

[Figure: Synthetic Performance]

The render start metrics are pretty much the same across the board, with a couple of small decreases that aren’t really worth calling out due to rounding error and network variability. The “webpage response” numbers, on the other hand, are up significantly across the board. This is easily explained: we recently rolled out full site TLS, and changed our synthetic tests to hit https URLs. The added TLS negotiation time for all assets on the page bumped up the overall page load time everywhere. One thing we noticed with this change is that due to most browsers making six TCP connections per domain, we pay this TLS negotiation cost many times per page. We are actively investigating SPDY with the goal of sending all of our assets over one connection and only doing this negotiation once.

Real User Front-end Performance

As always, these numbers come from mPulse, and are measured via JavaScript running in real users’ browsers:

[Figure: RUM Data]

One change here is that we are showing an extra significant figure in the RUM data. We increased the number of beacons that we send to mPulse, and our error margin dropped to 0.00 seconds, so we feel confident showing 10ms resolution. We see the expected drop in search load time because of the backend improvement, and everything else is pretty much neutral. The homepage ticked up slightly, which is expected due to the experiment that I mentioned in the server side load time section.

One obvious question is: “Why did the synthetic numbers change so much while the RUM data is pretty much neutral?”. Remember, the synthetic numbers changed primarily because of a change to the tests themselves. The switch to https caused a step change in our synthetic monitoring, but for real users the rollout was gradual. In addition, real users that see more than one page have some site resources in their browser cache, mitigating some of the extra TLS negotiations. Our synthetic tests always operate with an empty cache, which is a bit unrealistic. This is one of the reasons why we have both synthetic and RUM metrics: if one of them looks a little wonky we can verify the difference with other data. Here’s a brief comparison of the two, showing where each one excels:

Synthetic Monitoring                                      | Real User Monitoring
Browser instrumentation                                   | Navigation Timing API
Consistent trending over time                             | Can be highly variable as browsers and networks change
Largely in your control                                   | Last mile difficulties
Great for identifying regressions                         | Great for comparing across geographies/browsers
Not super realistic from an absolute number point of view | “Real User Monitoring”
A/B tests can show outsized results due to empty caches   | A/B tests will show the real world impact


This report had some highs and some lows, but at the end of the day our RUM data shows that our members are getting roughly the same experience they were a few months ago performance wise, with faster search results pages. We’re optimistic that upgrading our memcached cluster will put a dent in our backend numbers for the next report, and hopefully some of our experiments will have concluded with positive results as well. Look for another report from us as the holiday season kicks off!


Just Culture resources

Posted by on July 18, 2014 / 2 Comments

This is a (very) incomplete list of resources that may be useful to those wanting to find out more about the ideas behind Just Culture, blamelessness, complex systems, and related topics. It was created to support my DevOps Days Minneapolis talk Fallible Humans.

Human error and sources of failure

Just Culture

Postmortems and blameless reviews

Complex systems and complex system failure


Calendar Hacks

Posted by on July 15, 2014 / 7 Comments

As an engineering manager, there’s one major realization you have: managers go to lots of meetings. After chatting with a bunch of fellow engineering managers at Etsy, I realized that people have awesome hacks for managing their calendars and time. Here are some of the best ones from a recent poll of Etsy engineering managers! We’ll cover tips on how to block out time, change your defaults, rely on apps and automation, do routine cleanup, think ahead, and create separate calendars.

To access any of the Labs settings/apps:


Block out time

Create big unscheduled blocks every week. It allows for flexibility in your schedule. Some block out 3 afternoons/week as “office hours — don’t schedule unless you’re on my team”. It creates uninterrupted time when I’m *not* in meetings and available on IRC. Some book time to work on specific projects, and mark the event as private. They’ll try to strategize the blocking to prevent calendar fragmentation.

Automatically decline events (Labs): Lets you block off times in your calendar when you are unavailable. Invitations sent for any events during this period will be automatically declined. After you enable this feature, you’ll find a “Busy (decline invitations)” option in the “Show me as” field.

Office Hours: Blocks for office hours allow you to easily say, “yes, I want to talk to you, but can we schedule it for a time I’ve already set aside?” Better than saying “No, I cannot talk to you, my calendar is too full.” (Which also has to happen from time to time.) When you create a new event on your calendar, choose the “Appointment slots” link in the top of the pop-up:


Then follow the instructions to create a bookable slot on your calendar. You’ll be able to share a link to a calendar with your bookable time slots:


Declining meetings: Decline meetings with a note. Unless you work with the organizer often, do this automatically if there’s no context for the meeting. One manager has declined 1 hour meetings, apologized for missing them, asked for a follow up, and found that he really wasn’t needed in the first place. Hours saved!

Change your defaults

Shorter meetings: Don’t end meetings on the hour; use 25/50 minute blocks. You can set this in Calendars>My Calendars>Settings>General>Default Meeting Length. If you set your calendar invites to 25 minutes or 50 minutes, you need to assert that socially at the beginning of the meeting, and then explicitly do a time check at 20 minutes (“we have 5 minutes left in this meeting”). If possible, start with the 25 (or 30) minute defaults rather than the hour-long ones.


Visible vs private: Some make their calendar visible, not private, by default. This lets other people see what they’re doing so they have a sense of whether they can schedule against something or not: a team meeting, maybe; a Mad Men discussion group, no way.

Custom view: Create a custom 5- or 7-day view or use Week view.


Color coding:

Gentle Reminders (Labs): This feature replaces Calendar’s pop-ups: when you get a reminder, the title of the Google Calendar window or tab will happily blink in the background and you will hear a pleasant sound.

Event dimming: dim past events, and/or recurring future events, so you can focus.


Editability: Make all meetings modifiable and include this in the invite description, so as to avoid emails about rescheduling (“you have modifying rights — reschedule as needed, my schedule is updated”), and ask reports and your manager to do the same when they send invites. This can result in fewer emails back and forth to reschedule.

Rely on apps and automation

Sunrise: Sunrise for iOS does the right thing and doesn’t show declined events, which weaponizes the auto-decline block.

World Clock (Labs) helps you keep track of the time around the world. Plus: when you click an event, you’ll see the start time in each time zone as well. You could alternatively add an additional timezone (settings -> general -> Your Time Zone -> Additional Timezone).

Free or busy (Labs): See which of your friends are free or busy right now (requires that friends share their Google Calendars with you).

Next meeting (Labs): See what’s coming up next in your calendar.

Soft timer: Set a timer on your phone by default at most meetings, group or 1:1s, and tell folks you’re setting an alarm to notify everyone when you have 5 minutes left. People love this, especially in intense 1:1s, because they don’t have to worry about the time. It can go really well as a reminder to end in “Speedy Meeting” time.

Do routine cleanup

Calendar review first thing in the morning. Review and clean up next three days. Always delete the “broadcast” invites you’re not going to attend.

Block off some empty timeslots first thing in the morning to make sure you have some breaks during the day—using Busy / Auto Decline technology.

People don’t seem to notice when you note all day events at the top of your calendar. If you’re going offsite,  book an event that’s 9am-7pm that will note your location.

Think ahead (long-term)

Book recurring reminders: For instance, do a team comp review every three months. Or if there is a candidate we lost out on, make appointments to follow up with them in a few months.

Limit Monday recurring meetings: Holiday weekends always screw you up when you have to reschedule all of those for later in the week.

Track high-energy hours: “I tracked my high energy hours against my bright-mind-needed tasks and lo-and-behold, realized that mornings is when I need a lot of time to do low-urgency/high importance things that otherwise I wasn’t making time for or I was doing in a harried state. I thus time-blocked 3 mornings a week where from 8am to 11am I am working on these things (planning ahead, staring at the wall thinking through a system issue, etc). It requires a new 10pm to 6am sleeping habit, but it has been fully worth it, I feel like I gained a day in my way. This means I no longer work past 6pm most days, which was actually what was most draining to me.”

Create separate calendars

Have a shared calendar for the team for PTO tracking, bootcampers, time out of the office, team standups etc.

Subscribe to the Holidays in the United States Calendar so as to not be surprised by holidays: Calendar ID:

Subscribe to the conference room calendars if that’s something your organization has.

Create a secondary calendar for non-critical events, so they stay visible but don’t block your calendar. If there’s an event you are interested in, but haven’t decided on going or not, and don’t want other people to feel the need to schedule around it, you can go to the event and copy it. Then remove it from your primary calendar. You can toggle the secondary calendar off via the side-panel, and if someone needs to set something up with you, you’ll be available.

When using multiple calendars, there may be events that are important to have on multiple calendars, but this takes up a lot of screen real estate. In these cases, we use Etsy engineer Amy Ciavolino’s Event Merge for Google Calendar Chrome extension. It makes it easy to see those events without them taking up precious screen space.

And, for a little break from the office, be sure to check out Etsy analyst Hilary Parker’s Sunsets in Google Calendar (using R!).