Juggling Multiple Elasticsearch Instances on a Single Host

Posted on December 4, 2014 / 5 Comments

Elasticsearch is a distributed search engine built on top of Apache Lucene. At Etsy we use Elasticsearch in a number of different configurations: for Logstash, for powering user-facing search on some large indexes, for analytics, and for many internal applications.

Typically, it is assumed that there is a 1:1 relationship between ES instances and machines. This is straightforward and makes sense if your instance requirements line up well with the host – whether physical, virtualized or containerized. We run our clusters on bare metal, and for some of them we have more ES instances than physical hosts. We have good reasons for doing this, and here I’ll share some of the rationale, and the configuration options that we’ve found to be worth tuning.


Managing JVMs with large heaps is scary business due to garbage collection run times. 31Gb is the magic threshold beyond which a JVM loses the ability to use CompressedOops. In our experience, it is better to have even smaller heaps: not only do GC pause times stay low, but it’s easier to capture and analyze heap dumps!

To get optimal Lucene performance, it is also important to have sufficient RAM available for OS file caching of the index files.

At the same time, we are running on server-class hardware with plenty of CPU cores and RAM. Our newest search machines are Ivy Bridge with 20 physical (40 virtual) cores, 128Gb of RAM, and of course SSDs. If we ran a single node with a small heap on this hardware, we would be wasting both CPU and RAM, because a small-heap instance can only support a correspondingly small amount of shard data.

We currently run 4 ES JVMs per machine with 8Gb of heap each. This works out great for us: GC has not been a concern and we are utilizing our hardware effectively.

The settings


processors

Elasticsearch uses this setting to configure thread pool and queue sizing. It defaults to Runtime.getRuntime().availableProcessors(). With multiple instances per host, it is better to spread the CPU resources across them.

We set this to ($(nproc) / $nodes_per_host). So if we are running 4 nodes per host on 40-core machines, each of them will configure thread pools and queues as if there were 10 cores.
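In a startup script this calculation might look like the following sketch (variable names are illustrative, and we hard-code the core count in place of $(nproc) for clarity):

```shell
# Split the host's CPU resources evenly across the ES instances on it.
nodes_per_host=4
cores=40                                  # stands in for $(nproc) on a 40-core host
processors=$((cores / nodes_per_host))    # each instance sizes pools as if 10 cores
echo "-Des.processors=$processors"
```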


node.name

The default for this setting is to pick a random Marvel comic character at startup. In production, we want something that lets us find the node we want with as little thought and effort as possible. We set this to $hostname-$nodeIdx (which results in names like “search64-es0” – less whimsical, but far more practical when you’re trying to get to the bottom of an issue).

http.port, transport.port

If these ports are not specified, ES tries to pick the next available port at startup, starting at a base of 9200 for HTTP and 9300 for its internal transport.

We prefer to be explicit and assign ports as $basePort+$nodeIdx from the startup script. This prevents surprises such as an instance that you expect to be down still being bound to its port, causing the ‘next available’ port to be higher than expected.
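Rendered out, an instance’s settings might look like this (illustrative values for instance index 2 on a host named search64, using the setting names from this post):

```yaml
# elasticsearch.yml fragment for the third instance (index 2) on search64
node.name: search64-es2
http.port: 9202        # 9200 + 2
transport.port: 9302   # 9300 + 2
```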


cluster.routing.allocation.awareness.attributes

A key way to achieve failure tolerance with ES is to use replicas, so that if one host goes down, the affected shards stay available. If you’re running multiple instances on each physical host, it’s entirely possible for all replicas of a shard to be automatically allocated to the same host, which isn’t going to help you! Thankfully this is avoidable with shard allocation awareness: you can set the hostname as a node attribute on each instance and use that attribute as a factor in shard assignments.

ES_JAVA_OPTS="$ES_JAVA_OPTS -Des.cluster.routing.allocation.awareness.attributes=host"
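The host attribute itself can be set the same way from the startup script. A sketch, assuming the attribute is named “host” to match the awareness setting above:

```shell
# Tag each instance with the physical host it runs on, so replica
# allocation can avoid placing copies of a shard on the same machine.
host_attr=$(hostname -s)
ES_JAVA_OPTS="$ES_JAVA_OPTS -Des.node.host=$host_attr"
echo "$ES_JAVA_OPTS"
```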


path.logs

Without a dedicated log directory for each instance, you would end up with multiple JVMs trying to write to the same log files.

An alternative, which we rely on, is to prefix the filenames in logging.yml with the property ${node.name} so that each node’s logs are labelled by host and node ID. Another reason to be explicit about node naming!
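For example, the file appender path in logging.yml can be changed from the cluster-name default to a per-node name (a sketch, assuming the stock file appender layout; the file line is the point):

```yaml
# logging.yml fragment: one log file per instance, e.g. search64-es0.log
file:
  type: dailyRollingFile
  file: ${path.logs}/${node.name}.log
```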

Minimal Viable Configuration

Elasticsearch has lots of knobs, and it’s worth trying to minimize configuration. That said, a lot of ES is optimized for cloud environments so we occasionally find things worth adjusting, like allocation and recovery throttling. What do you end up tweaking?

You can follow Shikhar on Twitter at @shikhrr


Personalized Recommendations at Etsy

Posted on November 17, 2014 / 4 Comments

Providing personalized recommendations is important to our online marketplace.  It benefits both buyers and sellers: buyers are shown interesting products that they might not have found on their own, and products get more exposure beyond the seller’s own marketing efforts.  In this post we review some of the methods we use for making recommendations at Etsy.  The MapReduce implementations of all these methods are now included in our open-source machine learning package “Conjecture” which was described in a previous post.

Computing recommendations basically consists of two stages.  In the first stage we build a model of users’ interests based on a matrix of historic data, for example, their past purchases or their favorite listings (those unfamiliar with matrices and linear algebra can see, e.g., this review).  The model provides vector representations of users and items, and their inner products give an estimate of the level of interest a user will have in an item (higher values denote a greater degree of estimated interest).  In the second stage, we compute recommendations by finding, for each user, a set of items which approximately maximizes the estimated interest.

The model of users and items can be also used in other ways, such as finding users with similar interests, items which are similar from a “taste” perspective, items which complement each other and could be purchased together, etc.

Matrix Factorization

The first stage in producing recommendations is to fit a model of users and items to the data.  At Etsy, we deal with “implicit feedback” data, where we observe only indicators of users’ interactions with items (e.g., favorites or purchases).  This is in contrast to “explicit feedback,” where users give ratings (e.g., 3 of 5 stars) to items they’ve experienced.  We represent this implicit feedback data as a binary matrix whose elements are one where the user liked the item (i.e., favorited it) and zero where they did not.  The zeros do not necessarily indicate that the user is not interested in that item, only that they have not expressed an interest so far.  This may be due to disinterest or indifference, or simply because the user has not yet seen that item while browsing.



An implicit feedback dataset in which a set of users have “favorited” various items.  Note that we do not observe explicit dislikes, only the presence or absence of favorites.


The underpinning assumption that matrix factorization models make is that the affinity between a user and an item is explained by a low-dimensional linear model.  This means that each item and each user corresponds to an unobserved real vector of some small dimension.  The coordinates of the space correspond to latent features of the items (these could be things like: whether the item is clothing, whether it has chevrons, whether the background of the picture is brown, etc.), and the elements of the user vector describe the user’s preferences for those features.  We may stack these vectors into matrices, one for users and one for items; the observed data is then, in theory, generated by taking the product of these two unknown matrices and adding noise:


The underpinning low dimensional model from which the observed implicit feedback data is generated, “d” is the dimension of the model.


We therefore find a vector representation for each user and each item.  We compute these vectors so that the inner product between a user vector and item vector will approximate the observed value in the implicit feedback matrix (i.e., it will be close to one in the case the user favorited that item and close to zero if they didn’t).


The results of fitting a two-dimensional model to the above dataset; in this small example the first discovered feature roughly corresponds to whether the item is a shelf, and the second to whether it is in a “geometric” style.


Since the zeros in the matrix do not necessarily indicate disinterest in the item, we don’t want to force the model to fit them exactly: the user may actually be interested in some of those items.  Therefore we find the decomposition which minimizes a weighted error function, where the weights for nonzero entries in the data matrix are higher than those for the zero entries.  This follows the paper which originally suggested the method.  How to set these weights depends on how sparse the matrix is; good values could be found through some form of cross validation.
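In symbols, one common form of the weighted objective looks like the following (our notation, not the post’s; the regularization term is a standard addition that the post does not spell out). Here p_{ui} is the binary preference, w_{ui} the per-entry weight, and x_u, y_i the user and item vectors:

```latex
\min_{x, y} \; \sum_{u,i} w_{ui} \left( p_{ui} - x_u^\top y_i \right)^2
  \; + \; \lambda \left( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \right)
```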

When we optimize the weighted loss function described above, the reconstructed matrix (the product of the two factors) will often have positive elements where the input matrix has zeros, since we don’t force the model to fit those as tightly as the non-zeros.  These are the items which the user may be interested in but has not seen yet.  The reason this happens is that, in order for the model to fit well, users who have shown interest in overlapping sets of items will have similar vectors, and likewise for items.  Therefore the unexplored items which are liked by other users with similar interests will often have a high value in the reconstructed matrix.

Alternating Least Squares

To optimize the model, we alternate between computing item matrix and user matrix, and at each stage we minimize the weighted squared error, holding the other matrix fixed (hence the name “alternating least squares”).  At each stage, we can compute the exact minimizer of the weighted square error, since an analytic solution is available.  This means that each iteration is guaranteed not to increase the total error, and to decrease it unless the two matrices already constitute a local minimum of the error function.  Therefore the entire procedure gradually decreases the error until a local minimum is reached.  The quality of these minima can vary, so it may be a reasonable idea to repeat the procedure and select the best one, although we do not do this.  A demo of this method in R is available here.
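As a toy illustration of the alternating updates (a numpy sketch with made-up names; Etsy’s production implementation is the MapReduce code in Conjecture):

```python
import numpy as np

def weighted_als(P, w_pos=10.0, d=2, reg=0.1, iters=20, seed=0):
    """Toy weighted ALS for implicit feedback.

    Nonzero entries of the binary matrix P get confidence weight w_pos,
    zeros get weight 1, so the model is not forced to fit the zeros hard.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = P.shape
    X = rng.normal(scale=0.1, size=(n_users, d))  # user vectors
    Y = rng.normal(scale=0.1, size=(n_items, d))  # item vectors
    W = 1.0 + (w_pos - 1.0) * (P > 0)             # per-entry weights
    for _ in range(iters):
        # Each half-step solves its weighted least-squares problem exactly,
        # so the total weighted error never increases.
        for u in range(n_users):
            Wu = np.diag(W[u])
            X[u] = np.linalg.solve(Y.T @ Wu @ Y + reg * np.eye(d),
                                   Y.T @ Wu @ P[u])
        for i in range(n_items):
            Wi = np.diag(W[:, i])
            Y[i] = np.linalg.solve(X.T @ Wi @ X + reg * np.eye(d),
                                   X.T @ Wi @ P[:, i])
    return X, Y

# Tiny favorites matrix: users 0 and 1 like items 0-1, user 2 likes items 2-3.
P = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)
X, Y = weighted_als(P)
R = X @ Y.T  # reconstruction: high entries at zeros suggest unexplored items
```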

This computation lends itself very naturally to a MapReduce implementation: for example, when updating the vector for a user, all that is needed are the vectors for the items they have interacted with, and the small square matrix formed by multiplying the item matrix by its own transpose.  This way the computation for each user can typically be done even with limited memory, and each user may be updated in parallel; likewise for updating items.  Some users favorite huge numbers of items, and some items are favorited by many users, and those computations require more memory.  In these cases we can sub-sample the input matrix, either by filtering out such items, or by taking only the most recent favorites for each user.

After we are satisfied with the model, we can continue to update it as we observe more information, by repeating a few steps of the alternating least squares every night, as more items, users, and favorites come online.  New items and users can be folded into the model easily, so long as there are sufficiently many interactions between them and existing users and items in the model respectively.  Productionizable MapReduce code for this method is available here.

Stochastic SVD

The alternating least squares described above gives us an easy way to factorize the matrix of user preferences in MapReduce. However, this technique has the disadvantage of requiring several iterations, sometimes taking a long time to converge to a quality solution. An attractive alternative is the Stochastic SVD.  This is a recent method which approximates the well-known Singular Value Decomposition of a large matrix, and which admits a non-iterative MapReduce implementation.  We implement this as a function which can be called from any Scalding Hadoop MapReduce job.

A fundamental result in linear algebra is that the matrix formed by truncating the singular value decomposition after some number of dimensions is the best approximation to that matrix (in terms of square error) among all matrices of that rank.  However, with this method we cannot apply the same “weighting” to the error as we did when optimizing via alternating least squares.  Nevertheless, for datasets where the zeros do not completely overwhelm the non-zeros, this method is viable.  For example, we use it to build a model from favorites, whereas it fails to provide a useful model from purchases, which are much more sparse and where the weighting is necessary.

An advantage of this method is that it produces matrices with a nice orthonormal structure, which makes it easy to construct the vectors for new users on the fly (outside of a nightly recomputation of the whole model), since no matrix inversions are required.  We also use this method to produce vector representations of other lists of items besides those a user favorited, for example treasuries and other user-curated lists on Etsy.  This way we may suggest other relevant items for those lists.

Producing Recommendations

Once we have a model of users and items we use it to build product recommendations.  This is a step which seems to be mostly overlooked in the research literature.  For example, we cannot hope to compute the product of the user and item matrices, and then find the best unexplored items for each user, since this requires time proportional to the product of the number of items and the number of users, both of which are in the hundreds of millions.

One research paper suggests using a tree data structure to allow for a non-exhaustive search of the space, by pruning away entire sets of items where the inner products would be too small.  However, we observed this method not to work well in practice, possibly due to the curse of dimensionality with the type of models we were using (with hundreds of dimensions).

Therefore we use approximate methods to compute the recommendations.  The idea is to first produce a candidate set of items, then rank them according to the inner products and take the highest ones.  There are a few ways to produce candidates, for example the listings from a user’s favorite shops, or those textually similar to their existing favorites.  However, the main method we use is “locality sensitive hashing” (LSH), where we divide the space of user and item vectors into several hash bins and then take, for each user, the set of items mapped to the same bin.

Locality Sensitive Hashing

Locality sensitive hashing is a technique used to find approximate nearest neighbors in large datasets.  There are several variants, but we focus on one designed to handle real-valued data and to approximate the nearest neighbors in the Euclidean distance.

The idea of the method is to partition the space into a set of hash buckets, so that points which are near to each other in space are likely to fall into the same bucket.  The way we do this is by constructing some number “p” of planes in the space so that they all pass through the origin.  This divides the space up into 2^p convex cones, each of which constitutes a hash bucket.

In practice, we implement this by representing each plane by its normal vector.  The side of the plane that a point falls on is then determined by the sign of the inner product between the point and the normal vector (if the planes are random then the inner products are non-zero almost surely; in principle, ties could be assigned arbitrarily to one side or the other).  To generate these normal vectors we just need directions uniformly at random in space, and it is well known that draws from an isotropic Gaussian distribution have this property.

We number the hash buckets so that the i^th bit of the hash-code is 1 if the inner product between a point and the i^th plane is positive, and 0 otherwise.  This means that each plane is responsible for a bit of the hash code.
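The steps above can be sketched in a few lines of numpy (names are ours, not Conjecture’s):

```python
import numpy as np

def lsh_codes(V, p=8, seed=0):
    """Map each row of V to one of 2^p hash buckets using p random
    hyperplanes through the origin."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(p, V.shape[1]))  # isotropic Gaussian normals
    bits = (V @ planes.T) > 0                  # which side of each plane
    # The i-th bit of the code is 1 iff the inner product with plane i is > 0.
    return (bits @ (1 << np.arange(p))).astype(int)

vectors = np.random.default_rng(1).normal(size=(1000, 16))
codes = lsh_codes(vectors, p=8)
# Each point lands in one of 2^8 = 256 convex-cone buckets; nearby points
# tend to share a bucket, which is what makes the candidate search cheap.
```

Note that because the cones pass through the origin, positively scaling a vector never changes its bucket; only its direction matters.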

After we map each point to its respective hash bucket, we can compute approximate nearest neighbors, or equivalently, approximate recommendations, by examining only the vectors in the bucket.  On average the number of points in each bucket will be 2^{-p} times the total, so using more planes makes the procedure very efficient.  However, it also reduces the accuracy of the approximation, since it reduces the chance that the points near any target point fall into the same bucket as it.  Therefore, to achieve a good tradeoff between efficiency and quality, we repeat the hashing procedure multiple times and then combine the outputs.  Finally, to keep the computational demands of the procedure under control, we throw away any hash bins which are too large to allow efficient computation of the nearest neighbors.  This is implemented in Conjecture here.

Other Thoughts

Above are the basic techniques for generating personalized recommendations.  Over the course of developing these recommender systems, we found a few modifications we could make to improve the quality of the recommendations.


In summary, we described how to build recommender systems for e-commerce based on implicit feedback data.  We built a system which computes recommendations on Hadoop, which is now part of our open-source machine learning package “Conjecture.”  Finally, we shared some additional tweaks that can potentially improve the quality of recommendations.


Building A Better Build: Our Transition From Ant To SBT

Posted on September 30, 2014 / 7 Comments

A build tool is fundamental to any non-trivial software project.  A good build tool should be as unobtrusive as possible, letting you focus on the code instead of the mechanics of compilation.  At Etsy we had been using Ant to build our big data stack.  While Ant did handle building our projects adequately, it was a common source of questions and frustration for new and experienced users alike.  When we analyzed the problems users were having with the build process, we decided to replace Ant with SBT, as it was a better fit for our use cases.  In this post I’ll discuss the reasons we chose SBT as well as some of the details of the actual process of switching.

Why Did We Switch?

There were two perspectives we considered when choosing a replacement for Ant.  The first is that of a user of our big data stack.  The build tool should stay out of the way of these users as much as possible.  No one should ever feel it is preventing them from being productive, but instead that it is making them more productive.  SBT has a number of advantages in this regard:

  1. Built-in incremental compiler: We used the stand-alone incremental compiler zinc for our Ant build.  However, this required a custom Ant task, and both zinc and that task needed to be installed properly before users could start building.  This was a common source of questions for users new to our big data stack.  SBT ships with the incremental compiler and uses it automatically.
  2. Better environment handling: Because of the size of our code base, we need to tune certain JVM options when compiling it.  With Ant, these options had to be set in the ANT_OPTS environment variable.  Not having ANT_OPTS set properly was another common source of problems for users.  There is an existing community-supported SBT wrapper that solves this problem.  The JVM options we need are set in a .jvmopts file that is checked in with the code.
  3. Triggered execution: If you prefix any SBT command or sequence of commands with a tilde, it will automatically run that command every time a source file is modified.  This is a very powerful tool for getting immediate feedback on changes.  Users can compile code, run tests, or even launch a Scalding job automatically as they work.
  4. Build console: SBT provides an interactive console for running build commands.  Unlike running commands from the shell, this console supports tab-completing arguments to commands and allows temporarily modifying settings.

The other perspective is that of someone modifying the build.  It should be straightforward to make changes.  Furthermore, it should be easy to take advantage of existing extensions from the community.  SBT is also compelling in this regard:

  1. Build definition in Scala: The majority of the code for our big data stack is Scala.  With Ant, modifying the build requires a context switch to its XML-based definition.  An SBT build is defined using Scala, so no context switching is necessary.  Using Scala also provides much more power when defining build tasks.  We were able to replace several Ant tasks that invoked external shell scripts with pure Scala implementations.  Defining the build in Scala does introduce more opportunities for bugs, but you can use scripted to test parts of your build definition.
  2. Plugin system: To extend Ant, you have to give it a JAR file, either on the command line or by placing it in certain directories.  These JAR files then need to be made available to everyone using the build.  With SBT, all you need to do is add a line like
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

    to the build definition.  SBT will then automatically download that plugin and its dependencies.  By default SBT will download dependencies from Maven Central, but you can configure it to use only an internal repository:

    resolvers += "internal-repo" at "<repository URL>"
    externalResolvers <<= resolvers map { rs =>
        Resolver.withDefaultResolvers(rs, mavenCentral = false)
    }
  3. Ease of inspecting the build: SBT has an “inspect” command that will provide a lot of information about a setting or task: the type, any related settings or tasks, and forward and reverse dependencies.  This is invaluable when trying to understand how some part of the build works.
  4. Debugging support: Most SBT tasks produce verbose debug logs that are not normally displayed.  The “last” command will output those logs, making tracking down the source of an error much easier.

This is not to say that SBT is perfect.  There are some aspects of SBT that are less than ideal:

  1. Terse syntax: The SBT build definition DSL is full of hieroglyphic symbols, such as the various assignment operators: <<=, <+=, and <++=.  This terse syntax can be intimidating for those who are not used to it.
  2. Support for compiling a subset of files: The Scala compiler can be slow, so our Ant build supported compiling only a subset of files.  This was very easy to add in Ant.  SBT required a custom plugin that delves into SBT’s internal APIs.  This does work, but we have already had to do some minor refactoring between SBT 0.13.1 and 0.13.2, and we have to accept that such incompatibility is likely to recur between future minor versions.
  3. Less mature: SBT is a relatively new tool, and the plugin ecosystem is still growing.  As such, there were several times we had to write custom plugins for functionality that would likely already be supported by another tool.  We also experienced a couple of bugs in SBT itself during the switch.

At this point it’s fair to ask why SBT, and not Maven, Gradle, or some other tool.  You can certainly get features like these in other tools.  We did not do a detailed comparison of multiple tools to replace Ant.  SBT addressed our pain points with the Ant build, and it was already popular with users of our big data stack.

The Transition Process

It may sound cliché, but the initial hurdle in switching to SBT was gaining an in-depth understanding of the tool.  SBT’s terminology and model of a build are very different from Ant’s.  SBT’s documentation, however, is very good and very thorough.  Starting the switch without having read through it would have resulted in a lot of wasted time.  It’s a lot easier to find answers online when you know how SBT names things!

The primary goal when switching was to have the SBT build be a drop-in replacement for the Ant build.  This removed the need to modify any external processes that use the build artifacts.  It also allowed us to have the Ant and SBT builds in parallel for a short time in case of bugs in the SBT build.  Unfortunately, our project layout did not conform to SBT’s conventions — but SBT is fully configurable in this regard.  SBT’s “inspect” command was helpful here for discovering which settings to tweak.  It was also good that SBT supports using an external ivy.xml file for dependency management.  We were already using Ivy with our Ant build, so we were able to define our dependencies in one place while running both builds in parallel.

With the configuration to match our project layout taken care of, actually defining the build with SBT was straightforward.  Unlike Ant, SBT has built-in tasks for common operations like compiling, running tests, and creating a JAR.  SBT’s multi-project build support also cut down on the amount of configuration necessary.  We have multiple projects building together, and with Ant every single task had to be written to build them in the correct order.  Once we configured the dependencies between the projects in SBT, it guaranteed this automatically.

We also took advantage of SBT’s aliasing feature to make the switch easier on users.  The names of SBT’s built-in tasks did not always align with the names we had picked when defining the corresponding task in Ant.  It’s very easy to alias commands, however:

addCommandAlias("jar", "package")

With such aliases users were able to start using SBT just as they had used Ant, without needing to learn a whole set of new commands.  Aliases also make it easy to define sequences of commands without needing to create an entirely new task, such as

addCommandAlias("rebuild", ";clean; compile; package")

The process of switching was not entirely smooth, however.  We did run into two bugs in SBT itself.  The first is triggered when you define an alias that is the prefix of a task in the alias definition.  Tab-completing or executing that alias causes SBT to hang for some time and eventually die with a StackOverflowError.  This bug is easy to avoid if you define your alias names appropriately, and it is fixed in SBT 0.13.2.  The other bug only comes up with multi-module projects.  Even though our modules have many dependencies in common, SBT will re-resolve these dependencies for each module.  There is now a fix for this in SBT 0.13.6, the most recently released version, that can be enabled by adding

updateOptions := updateOptions.value.withConsolidatedResolution(true)

to your build definition.  We saw about a 25% decrease in time spent in dependency resolution as a result.

Custom Plugins

As previously mentioned, we had to write several custom plugins during the process of switching to SBT to reproduce all the functionality of our Ant build.  We are now open-sourcing two of these SBT plugins.  The first is sbt-checkstyle-plugin.  This plugin allows running Checkstyle over Java sources with the checkstyle task.  The second is sbt-compile-quick-plugin.  This plugin allows you to compile and package a single file with the compileQuick and packageQuick tasks, respectively.  Both of these plugins are available in Maven Central.


Switching build tools isn’t a trivial task, but it has paid off.  Switching to SBT has allowed us to address multiple pain points with our previous Ant build.  We’ve been using SBT for several weeks now.  As with any new tool, there was a need for some initial training to get everyone started.  Overall though, the switch has been a success!  The number of issues encountered with the build has dropped.  As users learn SBT they are taking advantage of features like triggered execution to increase their productivity.


Expanding Diversity Efforts with Hacker School

Posted on September 25, 2014 / 2 Comments

Today we’re proud to announce that Etsy will provide $210,000 in Etsy Hacker Grants to Hacker School applicants in the coming year. These grants extend our support of Hacker School’s diversity initiatives, which first focused on the gender imbalance of applicants to their program and in the wider industry, and will now expand to support applicants with racial backgrounds that are underrepresented in software engineering.

The grants will provide up to $7,000 in support for at least 35 accepted applicants in the next three batches of Hacker School, and are designed to help with a student’s living expenses during their three-month curriculum in New York City.

Diversity and opportunity lie at the heart of what Etsy is about, a place where anyone can start a business for $0.20. Today we wrote a post talking about how we think about diversity at Etsy: More Than Just Numbers. As an engineering team, diversity and opportunity are core values for us as well — we believe a diverse environment is a resilient one. We love what we do and want to extend that opportunity to anyone who wants it.

This doesn’t mean the work to support diversity is easy or that we’ve mastered it — but we are committed to continuing to improve. Over the years, we’ve focused on educational programs looking at unconscious bias, bringing in speakers from NCWIT, and building out our internal leadership courses to support a broad swath of new leaders.

Hacker School grants have been one of our favorite and most effective programs since sponsoring our first batch of students in summer 2012. We’ve even given talks about how well it went. Hacker School’s selection process and environment combine to create a group of students diverse across a number of axes, including gender, race, experience and technical background, but that are also culturally and technically aligned with Etsy’s engineering team. Hacker School’s welcoming “programming retreat” approach produces the sort of broad, deep, tool-building and curious system engineers that work well in our distributed, iterative, transparent and scaled environment. We have Hacker School alums across almost every team in Etsy engineering and at nearly every level, from just starting their first job to very senior engineers.

We know that racial diversity is a complicated issue, and we are by no means the experts. But we believe that together with Hacker School we are making a small step in the right direction.

And we need your help. This program only works if qualified applicants hear that it’s happening, and know that we really want them to apply. If you know someone who you think would be great, please let them know, and encourage them to apply to an upcoming batch!


Come find Etsy at Velocity NY 2014

Posted on September 10, 2014

Velocity is our kind of conference, and Velocity NY happens in our backyard, in that funny little borough on the other side of the river (Manhattan). Make sure to come find us; we’ll be there teaching, speaking, and keynoting:

Monday 9am “PostMortem Facilitation: Theory and Practice of “New View” Debriefings” – John Allspaw

Monday 1:30pm “Building a Device Lab” – Lara Swanson and Destiny Montague

Tuesday 1:15pm “A Woman’s Place is at the Command Prompt” – Lara Swanson (moderator), Katherine Daniels (Etsy), Jennifer Davis (Chef), Bridget Kromhout (DramaFever), Samantha Thomas (UMN)

Tuesday 3:30pm “It’s 3AM, Do You Know Why You Got Paged?” – Ryan Frantz

Tuesday 5pm “Etsy’s Journey to Building a Continuous Integration Infrastructure for Mobile Apps” – Nassim Kammah

Wednesday 11:20am “Unpacking the Black Box: Benchmarking JS Parsing and Execution on Mobile Devices” – Daniel Espeset

Holding office hours

Nassim Kammah of our continuous integration team, the Autobot

Ryan Frantz our sleep and alert design obsessed operations engineer

John Allspaw, he should look familiar to you by now

Signing books

Jon Cowie will be signing his book “Customizing Chef”
Lara Swanson will be signing galleys of “Designing for Performance”
John Allspaw will be signing “Web Operations” and “Capacity Planning”

No Comments

Teaching Testing: Our Testing 101 Materials

Posted by on August 20, 2014 / 2 Comments

Etsy engineers have a wide variety of backgrounds, strengths, and weaknesses, so there are no engineering skills we can take for granted. And there are things you can’t assume engineers will pick up on their own just because you’ve thrown a codebase and a workflow at them.

I work on Etsy’s continuous deployment team, which advises on automated testing of our code, and I felt that we could use some stronger means of teaching (and establishing as a conscious value) the skills of testing and design in code. To that end, I recently wrote two “Testing 101” materials for use by all Etsy engineers. They’re now both on our public Github: the Etsy Testing Best Practices Guide, and our Testing 101 Code Lab for hands-on practice applying its ideas. Both use PHP and PHPUnit.

We called it the “Testing Best Practices Guide” because we love misnomers. It’s more about design than testing, it describes few concrete practices, and we don’t call any of them “best”.

Within Etsy, we supplement mere documents with activities like team rotations (“bootcamps”) for new hires, technical talks, and dojos (collaborative coding exercises) to practice and have fun with coding topics as a group. And of course, we do code review.

Deeper technical considerations are often invisible in code, so you have to find a way, whether by process, tooling, or teaching, to make them visible.


Q2 2014 Site Performance Report

Posted by on August 1, 2014 / 3 Comments

As the summer really starts to heat up, it’s time to update you on how our site performed in Q2. The methodology for this report is identical to that of the Q1 report. Overall it’s a mixed bag: some pages are faster and some are slower. We have context for all of it, so let’s take a look.

Server Side Performance

Here are the median and 95th percentile load times for signed in users on our core pages on Wednesday, July 16th:

Server Side Performance

A few things stand out in this data:

Median homepage performance improved, but the 95th percentile got slower. This is due to a specific variant we are testing which is slower than the current page. We made some code changes that improved load time for the majority of pageviews, but the one slower test variant brings up the higher percentiles.

The listing page saw a fairly large increase in both median and 95th percentile load time. There isn’t a single smoking gun for this, but rather a number of small changes over the last three months that each added a little load time.

Search saw a significant decrease across the board. This is due to a dedicated memcached cluster that we rolled out to cache “listing cards” on the search results page. This brought our cache hit rate for listing related data up to 100%, since we automatically refresh the cache when the data changes. This was a nice win that will be sustainable over the long term.
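
The mechanism described above is a write-through refresh: rather than merely invalidating cached listing cards on change, the write path re-populates them, so reads effectively always hit. Here is a minimal sketch of that pattern; the class, method, and key names are illustrative, not Etsy’s actual code.

```python
class InMemoryStore:
    """Stand-in for a memcached client (get/set interface)."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


class ListingCardCache:
    def __init__(self, store):
        self.store = store

    def _key(self, listing_id):
        return "listing_card:%d" % listing_id

    def get(self, listing_id, build):
        card = self.store.get(self._key(listing_id))
        if card is None:
            card = build(listing_id)  # miss: rebuild from the database
            self.store.set(self._key(listing_id), card)
        return card

    def refresh(self, listing_id, build):
        # Called from the write path whenever listing data changes,
        # keeping the cache hot instead of merely invalidating it.
        self.store.set(self._key(listing_id), build(listing_id))
```

Because `refresh` runs on every data change, a read only misses if the entry was never populated, which is how a near-100% hit rate becomes sustainable.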

The shop page saw a big jump at the 95th percentile. This is again due to experiments we are running on this page. A few of the variants we are testing for a shop redesign are slower than the existing page, which has a big impact on the higher percentiles. It remains to be seen which of these variants will win, and which version of the page we will end up with.

Overall we saw more increases than decreases on the backend, but we had a couple of performance wins from code/architecture changes, which is always nice to see. Looking ahead, we are planning on replacing the hardware in our memcached cluster in the next couple of months, and tests show that this should have a positive performance impact across the entire site.

Synthetic Front-end Performance

As a reminder, these tests are run with Catchpoint. They use IE9, and they run from New York, London, Chicago, Seattle, and Miami every two hours. The “Webpage Response” metric is defined as the time it took from the request being issued to receiving the last byte of the final element on the page. Here is that data:

Synthetic Performance

The render start metrics are pretty much the same across the board, with a couple of small decreases that aren’t really worth calling out due to rounding error and network variability. The “webpage response” numbers, on the other hand, are up significantly across the board. This is easily explained: we recently rolled out full site TLS, and changed our synthetic tests to hit https URLs. The added TLS negotiation time for all assets on the page bumped up the overall page load time everywhere. One thing we noticed with this change is that due to most browsers making six TCP connections per domain, we pay this TLS negotiation cost many times per page. We are actively investigating SPDY with the goal of sending all of our assets over one connection and only doing this negotiation once.

Real User Front-end Performance

As always, these numbers come from mPulse, and are measured via JavaScript running in real users’ browsers:

RUM Data

One change here is that we are showing an extra significant figure in the RUM data. We increased the number of beacons that we send to mPulse, and our error margin dropped to 0.00 seconds, so we feel confident showing 10ms resolution. We see the expected drop in search load time because of the backend improvement, and everything else is pretty much neutral. The homepage ticked up slightly, which is expected due to the experiment that I mentioned in the server side load time section.

One obvious question is: “Why did the synthetic numbers change so much while the RUM data is pretty much neutral?”. Remember, the synthetic numbers changed primarily because of a change to the tests themselves. The switch to https caused a step change in our synthetic monitoring, but for real users the rollout was gradual. In addition, real users that see more than one page have some site resources in their browser cache, mitigating some of the extra TLS negotiations. Our synthetic tests always operate with an empty cache, which is a bit unrealistic. This is one of the reasons why we have both synthetic and RUM metrics: if one of them looks a little wonky we can verify the difference with other data. Here’s a brief comparison of the two, showing where each one excels:

Synthetic Monitoring                                      | Real User Monitoring
Browser instrumentation                                   | Navigation Timing API
Consistent; good for trending over time                   | Can be highly variable as browsers and networks change
Largely in your control                                   | Subject to last-mile difficulties
Great for identifying regressions                         | Great for comparing across geographies/browsers
Not super realistic from an absolute-number point of view | It is, after all, “Real User Monitoring”
A/B tests can show outsized results due to empty caches   | A/B tests will show the real-world impact


This report had some highs and some lows, but at the end of the day our RUM data shows that our members are getting roughly the same experience they were a few months ago performance wise, with faster search results pages. We’re optimistic that upgrading our memcached cluster will put a dent in our backend numbers for the next report, and hopefully some of our experiments will have concluded with positive results as well. Look for another report from us as the holiday season kicks off!


Just Culture resources

Posted by on July 18, 2014 / 2 Comments

This is a (very) incomplete list of resources that may be useful to those wanting to find out more about the ideas behind Just Culture, blamelessness, complex systems, and related topics. It was created to support my DevOps Days Minneapolis talk Fallible Humans.

Human error and sources of failure

Just Culture

Postmortems and blameless reviews

Complex systems and complex system failure


Calendar Hacks

Posted by on July 15, 2014 / 7 Comments

As an engineering manager, there’s one major realization you have: managers go to lots of meetings. After chatting with a bunch of fellow engineering managers at Etsy, I realized that people have awesome hacks for managing their calendars and time. Here are some of the best ones from a recent poll of Etsy engineering managers! We’ll cover tips on how to:

To access any of the Labs settings/apps:


Block out time

Create big unscheduled blocks every week; this allows for flexibility in your schedule. Some managers block out three afternoons a week as “office hours — don’t schedule unless you’re on my team,” creating uninterrupted time when they’re *not* in meetings and are available on IRC. Others book time to work on specific projects and mark the event as private, and try to place the blocks strategically to prevent calendar fragmentation.

Automatically decline events (Labs): Lets you block off times in your calendar when you are unavailable. Invitations sent for any events during this period will be automatically declined. After you enable this feature, you’ll find a “Busy (decline invitations)” option in the “Show me as” field.

Office Hours: Blocks for office hours allow you to easily say, “yes, I want to talk to you, but can we schedule it for a time I’ve already set aside?” Better than saying “No, I cannot talk to you, my calendar is too full.” (Which also has to happen from time to time.) When you create a new event on your calendar, choose the “Appointment slots” link in the top of the pop-up:


Then follow the instructions to create a bookable slot on your calendar. You’ll be able to share a link to a calendar with your bookable time slots:


Declining meetings: Decline meetings with a note. Unless you work with the organizer often, do this automatically if there’s no context for the meeting. One manager has declined hour-long meetings, apologized for missing them, asked for a follow-up, and found that he really wasn’t needed in the first place. Hours saved!

Change your defaults

Shorter meetings: Don’t end meetings on the hour; use 25/50 minute blocks. You can set this in Calendars>My Calendars>Settings>General>Default Meeting Length. If you set your calendar invites to 25 minutes or 55 minutes, you need to assert that socially at the beginning of the meeting, and then explicitly do a time check at 20 minutes (“we have 5 minutes left in this meeting”). If possible, start with the 25 (or 30) minute defaults rather than the hour-long ones.


Visible vs. private: Some make their calendars visible rather than private by default. This lets other people see what they’re doing, so they have a sense of whether they can schedule against something: a team meeting, maybe; a Mad Men discussion group, no way.

Custom view: Create a custom 5- or 7-day view or use Week view.


Color coding:

Gentle Reminders (Labs): This feature replaces Calendar’s pop-ups: when you get a reminder, the title of the Google Calendar window or tab will happily blink in the background and you will hear a pleasant sound.

Event dimming: dim past events, and/or recurring future events, so you can focus.


Editability: Make all meetings modifiable and include this in the invite description to avoid emails about rescheduling (“you have modifying rights — reschedule as needed, my schedule is updated”), and ask reports and your manager to do the same when they send invites. This can result in fewer emails back and forth to reschedule.

Rely on apps and automation

Sunrise: Sunrise for iOS does the right thing and doesn’t show declined events, which weaponizes the auto-decline block.

World Clock (Labs) helps you keep track of the time around the world. Plus: when you click an event, you’ll see the start time in each time zone as well. You could alternatively add an additional time zone (Settings -> General -> Your Time Zone -> Additional Timezone).

Free or busy (Labs): See which of your friends are free or busy right now (requires that friends share their Google Calendars with you).

Next meeting (Labs): See what’s coming up next in your calendar.

Soft timer: Set a timer on your phone by default at most meetings, group or 1:1, and tell folks you’re setting an alarm to notify everyone when you have 5 minutes left. People love this, especially in intense 1:1s, because they don’t have to worry about the time. It works well as a reminder to end in “Speedy Meeting” time.

Do routine cleanup

Do a calendar review first thing in the morning: review and clean up the next three days. Always delete the “broadcast” invites you’re not going to attend.

Block off some empty timeslots first thing in the morning to make sure you have some breaks during the day—using Busy / Auto Decline technology.

People don’t seem to notice all-day events noted at the top of your calendar, so if you’re going offsite, book a 9am–7pm event that notes your location instead.

Think ahead (long-term)

Book recurring reminders: For instance, do a team comp review every three months. Or if there’s a candidate you lost out on, make an appointment to follow up with them in a few months.

Limit Monday recurring meetings: Holiday weekends always screw you up when you have to reschedule all of those for later in the week.

Track high-energy hours: “I tracked my high energy hours against my bright-mind-needed tasks and lo-and-behold, realized that mornings is when I need a lot of time to do low-urgency/high importance things that otherwise I wasn’t making time for or I was doing in a harried state. I thus time-blocked 3 mornings a week where from 8am to 11am I am working on these things (planning ahead, staring at the wall thinking through a system issue, etc). It requires a new 10pm to 6am sleeping habit, but it has been fully worth it, I feel like I gained a day in my way. This means I no longer work past 6pm most days, which was actually what was most draining to me.”

Create separate calendars

Have a shared calendar for the team for PTO tracking, bootcampers, time out of the office, team standups etc.

Subscribe to the Holidays in the United States Calendar so as to not be surprised by holidays: Calendar ID: en.usa#holiday@group.v.calendar.google.com

Subscribe to the conference room calendars if that’s something your organization has.

Create a secondary calendar for non-critical events, so they stay visible but don’t block your calendar. If there’s an event you’re interested in but haven’t decided on attending, and you don’t want other people to feel the need to schedule around it, you can go to the event, copy it, and then remove it from your primary calendar. You can toggle the secondary calendar off via the side panel, and if someone needs to set something up with you, you’ll be available.

When using multiple calendars, there may be events that are important to have on several of them, but this takes up a lot of screen real estate. In these cases, we use Etsy engineer Amy Ciavolino’s Event Merge for Google Calendar Chrome extension. It makes it easy to see those events without them taking up precious screen space.

And, for a little break from the office, be sure to check out Etsy analyst Hilary Parker’s Sunsets in Google Calendar (using R!).


Threat Modeling for Marketing Campaigns

Posted by on July 7, 2014 / 1 Comment

Marketing campaigns and referrals programs can reliably drive growth. But whenever you offer something of value on the Internet, some people will take advantage of your campaign. How can you balance driving growth with preventing financial loss from unchecked fraud?

In this blog post, I want to share what we learned about how to discourage and respond to fraud when building marketing campaigns for e-commerce websites. I found there weren’t many resources in the security community focused on the specific challenges marketing campaigns face, which motivated me to provide more information on the topic for others. Hopefully you find this helpful!


Since our experience came from developing a referrals program at Etsy, I want to describe our program and how we created terms of service to discourage fraud. Then rather than getting into very specific details about what we do to respond in real-time to fraud, I want to outline useful questions for anyone building these kinds of programs to ask themselves.

Our Referrals Program

We encouraged people to invite their friends to shop on the site. When the friend created an account on Etsy, we gave them $5.00 to spend on their first purchase. We also wanted to reward the person who invited them, as a thank-you gift.

Invite Your Friends - Recipient Promo

Of course, we knew that any program that gives out free money is an attractive one to try and hack!

Invite Your Friends - Sender Promo

Discourage Bad Behavior from the Start

First, we wanted to make our program sustainable through proactive defenses. When we designed the program we tried to bake in rules to make the program less attractive to attackers. However, we didn’t want these rules to introduce roadblocks in the product that made the program less valuable from users’ perspectives, or financially unsustainable from a business perspective.

In the end, we decided on the following restrictions. We wanted there to be a minimum spend requirement for buyers to apply their discount to a purchase. We hoped that requiring buyers to put in some of their own money to get the promotion would attract more genuine shoppers and discourage fraud.
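
A minimum-spend rule of this kind reduces to a simple check on the checkout path. The sketch below is purely illustrative; the $15 threshold and $5 credit are made-up numbers, not Etsy’s actual values.

```python
# Hypothetical values for illustration only.
MIN_SPEND_CENTS = 1500   # buyer must spend at least this much
CREDIT_CENTS = 500       # referral credit on offer

def applicable_credit(order_total_cents):
    """Return the promo credit (in cents) that may apply to this order."""
    if order_total_cents < MIN_SPEND_CENTS:
        # Below the minimum: the credit cannot be used, so the buyer
        # must put in some of their own money to benefit at all.
        return 0
    return min(CREDIT_CENTS, order_total_cents)
```

Requiring real money alongside the credit is what makes purely self-referred orders unprofitable for a fraudster.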

We also put limits on the maximum amount we’d pay out in rewards to the person inviting their friends (though we’re keeping an eye out for any particularly successful inviters, so we can congratulate them personally). And we are currently only disbursing rewards to inviters after two of their invited friends make a purchase.

Model All Possible Scenarios

A key principle of Etsy’s approach to security is to model out all possible avenues of attack, identify the ones that are easiest to accomplish or that yield the biggest payouts, and work on making them more difficult, so that those attacks become less economical for fraudsters.

In constructing these plans, we first tried to differentiate the ways our program could be attacked and by what kind of users. Doing this, we quickly realized that we wanted to respond differently to users with a record of good behavior on the site from users who didn’t have a good history or who were redeeming many referrals, possibly with automated scripts. We also wanted to have the ability to adjust the boundaries of the two categories over time.

So the second half of our defense against fraud consisted of plans on how to monitor and react to suspected fraud as well as how to prevent attackers in the worst-case scenario from redeeming large amounts of money.

Steps to Develop a Mitigation Plan

I have to admit upfront, I’m being a little ambiguous about what we’ve actually implemented, but I believe it doesn’t really matter since each situation will differ in the particulars. That being said, here are the questions that guided the development of our program; they could guide your thinking, too.


# 1. How can you determine whether two user accounts represent the same person in real life?

This question is really key. In order to detect fraud on a referrals program, you need to be able to tell if the same person is signing up for multiple accounts.

In our program, we kept track of many measures of similarity. One important kind of relatedness was what we called “invite relatedness.” A person looking to get multiple credits is likely to have generated multiple invites off of one original account. To check for this and other cases, we had to keep a graph data structure recording which users had invited which others via referrals, and do a breadth-first search to determine related accounts.
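
The invite-relatedness walk can be sketched in a few lines: treat each referral as an edge, make the edges undirected so an account is linked both to its inviter and its invitees, and breadth-first search outward. The data layout and function names here are illustrative, not our production code.

```python
from collections import deque

def related_accounts(invites, start):
    """invites: dict mapping inviter -> list of invited accounts.
    Returns every account reachable from `start` via invite edges."""
    # Build an undirected adjacency list so we can walk invites both ways.
    graph = {}
    for inviter, invitees in invites.items():
        for invitee in invitees:
            graph.setdefault(inviter, set()).add(invitee)
            graph.setdefault(invitee, set()).add(inviter)

    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen - {start}
```

On a chain of referrals (A invites B, B invites C), the BFS correctly attributes the whole tree back to the original account rather than just its direct invitees.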

Another important type of connection between accounts is often called “fingerprint relatedness.” We keep track of fingerprints (unique identifying characteristics of an account) and user to fingerprint relationships, so we can look up what accounts share fingerprints. There are a lot of resources available about how to fingerprint accounts in the security community that I would highly recommend researching!
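
Fingerprint relatedness amounts to a reverse index: map each fingerprint to the accounts that have presented it, then union across an account’s fingerprints to find its neighbors. This toy version uses opaque fingerprint strings; real fingerprinting signals are deliberately not specified here.

```python
from collections import defaultdict

class FingerprintIndex:
    """Toy index of account <-> fingerprint relationships."""
    def __init__(self):
        self._accounts_by_fp = defaultdict(set)
        self._fps_by_account = defaultdict(set)

    def record(self, account, fingerprint):
        # Store the relationship in both directions for cheap lookups.
        self._accounts_by_fp[fingerprint].add(account)
        self._fps_by_account[account].add(fingerprint)

    def related(self, account):
        """Accounts sharing at least one fingerprint with `account`."""
        out = set()
        for fp in self._fps_by_account[account]:
            out |= self._accounts_by_fp[fp]
        return out - {account}
```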

Here’s an example of a very simple invite graph of usernames, colored by fingerprint relatedness. As you can see, the root user SarahJ28 might have invited three people, but two of them are related to her via other identifying characteristics.

Simple Invite Graph


# 2. At what point in time are all the different signals of identity discussed in the previous question available to you?

You don’t know everything about a user from the instant they land on your site. At that point, you might only have a little information about their IP address and browser. You start to learn a bit more from their email address if they sign up for an account, and you certainly have more substantive information for analysis once they perform other actions on the site, like making purchases.

Generally, our ability to detect identity gets stronger the more engaged someone is with the site, or the closer they move towards making a purchase. However, if someone has a credit on their account that you don’t want them to have, you need to identify them before they complete their purchase. The level of control you have over the process of purchasing will depend on how you process credit card transactions on your site.


# 3. What are the different actions you could take on user accounts if you discover that one user has invited many related accounts to a referrals program?

There’s generally a range of different actions that can be taken against user accounts on a site. Actions that were relevant to us included: banning user accounts, taking away currently earned promotional credits, blocking promotional credits from being applied to a purchase transaction, and taking away the ability to refer more people using our program.

Signals of identity become stronger with more site engagement


# 4. Do any actions need to be reviewed by a person, or can they be automatic? What’s the cost of doing a manual review? On the other hand, how would a user feel if an automated action was taken against their account based on a false positive?

We knew that in some cases we would feel comfortable programming automated consequences to user accounts, while in other cases we wanted manual review of the suspected account first. It was really helpful for us to work with the teams who would be reviewing the suspected accounts on this from the beginning. They had seen lots of types of fraud in the past and helped us calibrate our expectations around false positives from each type of signal.

Luckily for us at Etsy, it’s quite easy to code the creation of tasks for our support team to review from any part of our web stack. I highly recommend architecting this ability if you don’t already have it for your site because it’s useful in many situations besides this one. Of course, we had to be very mindful that the real cost would come from continually reviewing the tasks over time.


# 5. How can you combine what you know about identity and user behavior (at each point in time) with your range of possible actions to come up with checks that you feel comfortable with? Do these checks need to block the action a user is trying to take?

We talked about each of these points where we evaluated a user’s identity and past behavior as a “check” that could either implement one of these actions I described or create a task for a person to review.

This meant we also had to decide whether the check needed to block the user from completing an action, like getting a promotional credit on their account, or applying a promotional credit to a transaction, and how that action was currently implemented on the site.

It’s important to note that if the user is trying to accomplish something in the scope of a single web request, there is a tradeoff between how thoroughly you can investigate someone’s identity and how quickly you can return a response to the user. After all, there are over 30 million user accounts on Etsy and we could potentially need to compare fingerprints across all of them. To solve this problem, we had to figure out how to kick off asynchronously running jobs at key points (especially checkout) that could do the analysis offline, but would nevertheless provide the protection we wanted.
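
The request/worker split described above can be sketched as follows. The in-process `Queue` stands in for a real job system, and every name here is illustrative; the point is only that the web request enqueues a cheap job while the expensive identity comparison runs offline.

```python
import queue

fraud_jobs = queue.Queue()   # stand-in for a real async job queue
review_tasks = []            # stand-in for the manual-review task system

def on_checkout(user_id, order_id):
    # Fast path: never block the web request on fingerprint comparisons
    # across tens of millions of accounts; just record that a check is due.
    fraud_jobs.put({"user_id": user_id, "order_id": order_id})
    return "checkout accepted"

def run_fraud_worker(looks_fraudulent):
    # Offline path: drain queued jobs and run the expensive analysis.
    # `looks_fraudulent` is whatever identity/relatedness check applies.
    while not fraud_jobs.empty():
        job = fraud_jobs.get()
        if looks_fraudulent(job["user_id"]):
            review_tasks.append(job)  # open a task for human review
```

The tradeoff is latency: an offline check cannot stop the very request it was triggered by, so the consequences (revoking a credit, opening a review task) have to be applicable after the fact.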


# 6. How visible are account statuses throughout your internal systems?

Once you’ve taken an automated or manual action against an account, is that clearly marked on that user account throughout your company’s internal tools? A last point of attack may be that someone writes in complaining their promo code didn’t work. If this happens, it’s important for the person answering that email to know the account has been flagged as under investigation or deemed fraudulent.


# 7.  Do you have reviews of your system scheduled for concrete dates in the future? Can you filter fraudulent accounts from your metrics when you want to?

If your referrals campaign doesn’t have a concrete end date, then it’s easy to forget about how it’s performing, not just in terms of meeting business goals but in terms of fraud. It’s important to have an easy way to filter out duplicate user accounts to calculate true growth, as well as how much of an effect unchecked fraud would have had on the program and how much was spent on what was deemed an acceptable, remaining risk. If we had discovered that too much was being spent on allowing good users to get away with a few duplicate referrals, we could have tightened our guidelines and started taking away the ability to send referrals from accounts more frequently.
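
Segmenting the program’s metrics by fraud status is straightforward once accounts carry a flag. A minimal sketch, with made-up field names:

```python
def program_metrics(signups):
    """signups: list of dicts with 'credited_cents' and 'flagged' keys,
    where 'flagged' marks accounts deemed duplicate or fraudulent."""
    genuine = [s for s in signups if not s["flagged"]]
    flagged = [s for s in signups if s["flagged"]]
    return {
        "true_growth": len(genuine),                # growth net of fraud
        "fraud_rate": len(flagged) / len(signups) if signups else 0.0,
        "fraud_spend_cents": sum(s["credited_cents"] for s in flagged),
    }
```

Watching `fraud_spend_cents` over time is what tells you whether the remaining, tolerated risk is still an acceptable cost.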

We found that when we opened up tasks for manual review, the team member reviewing them marked them as accurate 75% of the time. This was pretty good relative to other types of tasks we review. We were also pretty generous in the end in trusting that multiple members of a household might be using the same credit card info.


Our project revealed that some fraud is catastrophic and should absolutely be prevented, while other types of fraud, like a certain level of duplicate redemptions in marketing campaigns, are less dangerous and require a gentler response or even a degree of tolerance.

We have found it useful to review what could happen, design the program with rules that discourage all kinds of fraud while keeping value to the user in mind, have automated checks as well as manual reviews, and monitor in a way that lets us segment the performance of our program by fraud rate.

Many thanks to everyone at Etsy, but especially our referrals team, the risk and integrity teams, the payments team, and the security team for lots of awesome collaboration on this project!

1 Comment