Q1 2016 Site Performance Report

Posted on April 28, 2016

Spring has sprung and we’re back to share how Etsy’s performance fared in Q1 2016. In order to analyze how the site’s performance has changed over the quarter, we collected data from a week in March and compared it to a week’s worth of data from December. A common trend emerged from both our back-end and front-end testing this quarter: we saw significant site-wide performance improvements across the board.

Several members of Etsy’s performance team have joined forces to chronicle the highlights of this quarter’s report. Moishe Lettvin will start us off with a recap of the server-side performance, Natalya Hoota will discuss the synthetic front-end changes and Allison McKnight will address the real user monitoring portion of the report. Let’s take a look at the numbers and all the juicy context.

Server-Side Performance

Server-side time measures how long it takes our servers to build pages. These measurements don’t include any client-side time — this measurement is the amount of time it takes from receiving an HTTP request for a page to returning the response.

As always, we start with these metrics because they represent the absolute lowest bound for how long it could take to show a page to the user, and changes in these metrics will be reflected in all our other metrics. This data is calculated from a random sample of our webserver logs.
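
As a rough illustration of how these numbers can be derived from sampled logs, here is a minimal sketch (the log filename, record format, and field names are hypothetical, not our actual pipeline):

<?php
// Sketch: compute median and 95th percentile server-side time per page from a
// sample of webserver log lines. Assumes one JSON record per line with
// hypothetical "page" and "ms" fields.
function percentile(array $sortedValues, float $p): float {
    $idx = (int) floor($p * (count($sortedValues) - 1));
    return $sortedValues[$idx];
}

$timesByPage = [];
foreach (file('access_log.sample') ?: [] as $line) {
    $entry = json_decode($line, true);
    if ($entry === null) {
        continue;
    }
    $timesByPage[$entry['page']][] = (float) $entry['ms'];
}

foreach ($timesByPage as $page => $times) {
    sort($times);
    printf("%-12s median=%.0fms p95=%.0fms n=%d\n",
        $page, percentile($times, 0.5), percentile($times, 0.95), count($times));
}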

Happily, this quarter we saw site-wide performance improvements, due to our upgrade to PHP7. While our server-side time isn’t spent solely running PHP code (we make calls to services like memcache, MySQL, Redis, etc.), we saw significant performance gains on all our pages.

[Figure: box plots of server-side time by page, Q4 2015 vs. Q1 2016]

One of the primary ways that PHP7 increases performance is by decreasing memory usage. In the graph below, the blue & green lines represent 95th percentile of memory usage on a set of our PHP5 web servers, while the red line indicates the memory usage on a set of our PHP7 web servers during our switch-over from PHP5 to PHP7. This is two days’ worth of data; the variance in memory usage (more visible in the PHP5 data) is due to daily variation in server load. Note also that the y-axis origin is 0 on this graph — the decrease shown in this graph is to scale!

[Figure: 95th percentile memory usage on PHP5 vs. PHP7 web servers over two days]

Another interesting thing that’s visible in the box plots above is how the homepage server-side timing changed. It sped up across the board, but the distribution widened significantly. The reason for this is that the homepage gets many signed-out views as well as signed-in views, as contrasted with, for instance, the cart view page. In addition, the homepage is much more customized for signed-in users than, say, the listing page. This level of customization requires calls to our storage systems, which weren’t sped up by the PHP7 upgrade. Therefore, the signed-in requests didn’t speed up as much as the signed-out requests, which are constrained almost entirely by PHP speed. In the density plot below, you can see that the bimodal distribution of the homepage timing became more distinct in Q1 (red) vs Q4 of last year (blue):

[Figure: density plot of homepage server-side time, Q1 2016 (red) vs. Q4 2015 (blue)]

The Baseline page gives us the clearest view into the gains from PHP7 — the page does very little outside of running PHP code, and you can see in the first chart above that the biggest impact was on the Baseline page. The median time went from 80ms to 29ms, and the variance decreased.

Synthetic Start Render

We collect synthetic (i.e., obtained by web browser emulation of scripted web visits) monitoring results in order to cross-check our findings against our two other sets of data. We use a third-party provider, Catchpoint, to record start render time — the moment when a user first sees content appear on the screen — as a reference point.

Synthetic tests showed that all pages got significantly faster in Q1.

[Figure: box plots of synthetic start render time by page, Q4 2015 vs. Q1 2016]

It is worth noting that the data for synthetic tests is collected for signed-out web requests. As previously mentioned, this type of request involves fetching less state from storage, which highlights PHP7 wins.

I noticed unusually far-reaching outliers on the shop and search pages and decided to investigate further. First, I generated scatterplots for the two pages in question, which made it clear that the outliers on their own did not have a story to tell — no clustering patterns or high volumes on any particular day. A better data sanitization process would have eliminated the points in question altogether.

In order to get a better visual for the synthetic data we used, I took comparative scatterplots of all six pages we monitor. I noticed a reduction of failed tests (marked by red diamonds) that happened sometime between Q4 and present day. Remarkably, we have never isolated that data set before. A closer look revealed an important source of many current failures: fetching third party resources was taking longer than the maximum time allotted by Catchpoint. That encouraged us to consider a new metric for monitoring the impact of third party resources.

[Figure: comparative scatterplots of synthetic start render times for the six monitored pages, with failed tests marked as red diamonds]

Real User Page Load Time

Gathering front-end timing metrics from real users allows us to see the full range of our pages’ performance in the wild. As in past reports, we’re using page load metrics collected by mPulse. The data used to generate these box plots is the median page load time for each minute of one week in each quarter.

[Figure: box plots of real user page load time by page, Q4 2015 vs. Q1 2016]

We see that for the most part, page load times have gone down for each page.  The boxes and whiskers of each plot have moved down, and the outliers tend to be faster as well. But it seems that we’ve also picked up some very slow outliers: each page has a single outlier of over 10 seconds, while the rest of the data points for all pages are far under that. Where did these slow load times come from?

Because each of our RUM data points represents one minute in the week and there is exactly one extremely slow outlier for each page, it makes sense that these points might be all from the same time. Sure enough, looking at the raw numbers shows that all of the extreme outliers are from the same minute!

This slow period happened when a set of boxes that queue Gearman jobs were taken out to be patched for the glibc vulnerability that surfaced this quarter. We use Gearman to run a number of asynchronous jobs that are used to generate our pages, so when the boxes were taken out and fewer resources were available for these jobs, back-end times suffered.

One interesting thing to note is that we actually didn’t notice when this happened. The server-side times (and therefore also the front-end times) for most of our pages suffered an extreme regression, but we weren’t notified. This is actually by design!

Sometimes we experience a blip in page performance that recovers very quickly (as was the case with this short-lived regression). It makes little sense to scramble to understand an issue that will automatically resolve within a few minutes — by the time we’ve read and digested the alert, it may have already resolved itself — so we have a delay built into our alerts. We are only alerted to a regression if a certain amount of time has passed and the problem hasn’t been resolved (for individual pages, this is 40 minutes; when a majority of our pages’ performance degrades all at once, the delay is much shorter).
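
A minimal sketch of that kind of delayed check, assuming a hypothetical feed of per-minute median load times (this is not our actual alerting code):

<?php
// Sketch: only alert when a page has been over its threshold for a sustained
// window, e.g. 40 consecutive minutes. $recentMedians holds the per-minute
// median load time in milliseconds, most recent minute last.
function shouldAlert(array $recentMedians, float $thresholdMs, int $windowMinutes = 40): bool {
    if (count($recentMedians) < $windowMinutes) {
        return false;
    }
    $window = array_slice($recentMedians, -$windowMinutes);
    // Fire only if every minute in the window breached the threshold, so a
    // short-lived blip that recovers on its own never pages anyone.
    return min($window) > $thresholdMs;
}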

This delay ensures that we won’t scramble to respond to an alert if the problem will immediately fix itself, but it does mean that we don’t have any insight into extreme but short-lived regressions like this one, and that means we’re missing out on some important information about our pages’ performance. If something like this happens once a week, is it still something that we can ignore? Maybe something like this happens once a day — without tracking short-lived regressions, we don’t know. Going forward, we will investigate different ways of tracking short-lived regressions like this one.

You may also have noticed that while the slowdown that produced these outliers originated on the server side, the outliers are missing from both the server-side and synthetic data. This is because of the different collection methods that we use for each type of data. Our server-side and synthetic front-end datasets each contain 1,000 data points sampled from visits throughout the week (with the exception of the baseline page, which has a smaller server-side dataset because it receives fewer visits than other pages). This averages to only 142 data points per day — far under one datapoint per minute — and so it’s likely that no data from the short regression made it into the synthetic or server-side datasets at all. Our front-end RUM data, on the other hand, has one datapoint — the median page load time — for every minute, so it was guaranteed that the regression would be represented in that dataset as long as at least 50% of views were affected.
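
A quick back-of-the-envelope check makes that plausible (this treats the 1,000 samples as independent and uniform over the week, which is only an approximation):

<?php
// Probability that a 1,000-point random sample spread over a week (10,080
// minutes) contains no data at all from one particular bad minute.
$minutesPerWeek = 7 * 24 * 60;   // 10,080
$samples        = 1000;
$pMiss = pow(1 - 1 / $minutesPerWeek, $samples);
printf("Chance the sample misses the bad minute entirely: %.1f%%\n", 100 * $pMiss);
// Roughly 90%, so it is not surprising that the regression left no trace in
// the server-side or synthetic datasets.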

The nuances in the ways that we collect each metric are certainly very interesting, and each method has its pros and cons (for example, heavily sampling our server-side and synthetic front-end data leads to a narrower view of our pages’ performance, but collecting medians for our front-end RUM data to display in box plots is perhaps statistically unsound). We plan to continue iterating on this process to make it more appropriate to the report and more uniform across our different monitoring stacks.

Conclusion

In the first quarter of 2016, performance improvements resulting from the server-side upgrade to PHP7 trickled down through all our data sets, and the faster back-end times translated to speedups in page load times for users. As always, the process of analyzing the data for the sections above uncovered some interesting stories and patterns that we may have otherwise overlooked. It is important to remember that the smaller stories and patterns are just as valuable learning opportunities as the big wins and losses.


How Etsy Formats Currency

Posted on April 19, 2016

Imagine how you would feel if you went into a grocery store, and the prices were gibberish (“1,00.21 $” or “$100.A”). Would you feel confident buying from this store?

Etsy does business in more than 200 regions and 9 languages. It’s important that our member experience is consistent and credible in all regions, which means we have to format prices correctly for all members.

In this post, I’ll cover:

  1. Examples of bad currency formatting
  2. How to format currency correctly
  3. Practical implementation decisions
  4. Consistent formatting across platforms

In order to follow along, you need to know one important thing: Currency formatting depends on three attributes: the currency, the member’s location, and the member’s language.

Examples of bad currency formatting

Here are some examples of bad currency formatting:

  1. A member browsing in German sees an item for sale for “1,000.21 €”
  2. A Japanese member sees an item selling for “¥ 847,809.34”
  3. A Canadian member sees “$1.00”

If you don’t know why the examples above are confusing, read on.

What’s wrong with: A member browsing in German goes to your site and sees an item for sale for “1,000.21 €”?

The first example is the easiest. If a member is browsing in German, the commas and decimals in a price should be flipped. So “1,000.21 €” should really be formatted as “1.000,21 €”. This isn’t very confusing (as a German member, you can figure out what the price is *supposed* to be), but it is a bad experience.

By the way, if you are in Germany, using Euros, but browsing in English, what would you expect to see? Answer: “€1,000.21”. The separators and symbol position are based on language here, not region.

What’s wrong with: A Japanese member sees an item selling for “¥ 847,809.34”?

Japanese Yen doesn’t have a fractional part. There’s no such thing as half a Yen. So “¥ 847,809.34” could mean “¥ 847,809”, or “¥ 84,780,934” or something else entirely.

What’s wrong with: A Canadian member sees “$1.00”?

If your site is US-based, this can be confusing. Does “$” mean Canadian dollar or US dollar here? A simple fix is to add the currency code at the end: “$1.00 USD”.

How to format currency correctly

Etsy’s locale settings picker

Formatting currency for international members is hard. Etsy supports browsing in 9 languages, 23 currencies, and hundreds of regions. Luckily, we don’t have to figure out the right way to format in all of these combinations, because the nice folks at CLDR have done it for us. CLDR is a massive database of formatting styles that gets updated twice a year. The data gets packaged up into a portable library called libicu. Libicu is available everywhere, including mobile phones. If you want to format currency, you can use CLDR data to do it.

For each language + region + currency combination, CLDR gives you:

  1. A formatting pattern
  2. A decimal separator
  3. A grouping separator
  4. A currency symbol

A typical pattern looks like this:

#,##0.00 ¤

A CLDR pattern (¤ is a placeholder for the currency symbol)

This is the pattern for German + Germany + Euros. It tells you:

  1. Show at least one digit before the decimal separator (“0”)
  2. Group the integer digits in threes (“#,##”)
  3. Always show two digits after the decimal separator (“00”)
  4. Put the currency symbol after the value, separated by a space (“¤”)

NOTE: the pattern does *not* tell you what the decimal and grouping separators are. CLDR gives you those separately, they are not a part of the pattern.

Now you can use this information to format a value:

#,##0.## translates to 1.000,21

If you want to format prices using CLDR, your language might have libraries to do it for you already. PHP has NumberFormatter, for example. JavaScript has Intl.NumberFormat.
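
For instance, a quick sketch using PHP's intl extension (the locales and amounts are illustrative, and the exact output depends on the ICU/CLDR version installed):

<?php
// Format amounts for different language + region + currency combinations,
// letting CLDR (via ICU) decide separators, symbol placement, and fraction digits.
$cases = [
    ['de_DE', 1000.21, 'EUR'],   // German in Germany:  "1.000,21 €"
    ['en_DE', 1000.21, 'EUR'],   // English in Germany: "€1,000.21"
    ['ja_JP', 847809,  'JPY'],   // Yen has no fractional digits
];

foreach ($cases as [$locale, $amount, $currency]) {
    $fmt = new NumberFormatter($locale, NumberFormatter::CURRENCY);
    echo $locale . ': ' . $fmt->formatCurrency($amount, $currency) . "\n";
}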

Practical implementation decisions

CLDR is great, but it is not the ultimate authority. It is a collaborative project, which means that anyone can add currency data to CLDR, and then everyone votes on whether the data looks correct or not. People can also vote to change existing currency data.

CLDR data is not set in stone; it is fluid and changing. Sometimes you need to customize CLDR for your use case. Here are the customizations we made.

The problem with currencies that use a dollar sign ($)

We use CLDR to format currency at Etsy, but we’ve made some changes to it. One issue in particular has really bugged us. Dollar currencies are really hard to work with. The symbol for CAD (Canadian dollars) is “$” in Canada, but it is “CA$” in the US and everywhere else to avoid confusion with US Dollars. So if we followed CLDR, Canadian members would see “$1.00”. But our Canadian members might know that Etsy is a US-based company, in which case “$” would be ambiguous to them — it could mean either Canadian dollars or US dollars. Here is how we choose a currency symbol to avoid confusion while still meeting member expectations:

What symbol does Etsy use for dollar-based currencies?

Here is the value “1000.21” formatted in different currency + region combinations:

[Table: the value “1000.21” formatted in different currency + region combinations]
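
As a hypothetical sketch of the general approach (the rules below are simplified for illustration; Etsy's actual choices are summarized in the chart and table above):

<?php
// Hypothetical: pick a symbol for dollar-based currencies so that a bare "$"
// is never ambiguous. These rules are illustrative, not Etsy's exact table.
function dollarSymbol(string $currency, string $region): string {
    if ($currency === 'USD') {
        // A plain "$" is only unambiguous for members in the US.
        return $region === 'US' ? '$' : 'US$';
    }
    // For other dollar currencies (CAD, AUD, HKD, ...), always qualify the
    // symbol, e.g. "CA$", even in that currency's home region.
    return substr($currency, 0, 2) . '$';
}

echo dollarSymbol('CAD', 'CA') . "\n"; // CA$
echo dollarSymbol('USD', 'CA') . "\n"; // US$
echo dollarSymbol('USD', 'US') . "\n"; // $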

You might be wondering, why not just add the currency code to the end of the price? For example, it could be “$1,000.21 USD” for US dollars, and “$1,000.21 CAD” for Canadian dollars. This is also explicit, and it avoids the complicated logic of changing the currency symbol. But this approach has another issue: redundancy.

Suppose we did add the currency code at the end everywhere to address the CAD problem. Euros would get formatted as “1.000,21 € EUR”, but the “€ EUR” is redundant. Even worse, the Swiss franc doesn’t have a currency symbol, so CLDR recommends using the currency code as the symbol, which means Swiss members would see “1.000,21 CHF CHF”, which is definitely redundant.

Adding the currency code at the end is explicit, but doesn’t meet member expectations. Our German members said they didn’t like how “1.000,21 € EUR” looked.

In the end Etsy decided not to show the currency code. Instead, we change the currency symbol as needed to avoid confusion.

 

Listing price with settings English / Canada / Canadian dollars

Overriding CLDR data

Here’s a simple case where we overrode CLDR formatting. We are a website, so of course we want our prices to be wrapped in html tags so that they can be styled appropriately. For example, on our listings manager, we want to format price input boxes correctly based on locale:

[Screenshots: the price input box before and after locale-aware formatting]

It’s hard to wrap a price in html tags *after* you have done the formatting: sometimes the symbol is at the end, sometimes there’s a space between the symbol and value, and sometimes there isn’t, and so on. To make this work, the html tags need to be a part of the pattern, so we need to be able to override the CLDR patterns directly.

Ultimately we ended up overriding a lot of the default CLDR data: currency symbols, decimal and grouping separators, and the patterns themselves.

Different libraries offered different levels of support for this. PHP’s NumberFormatter lets you override the pattern and symbol. JavaScript’s Intl.NumberFormat lets you override neither. None of the libraries had support for wrapping html tags around the output. In the end, we wrote our own JavaScript library and added wrappers for the rest.
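
For concreteness, here is a small sketch of the kind of override PHP's NumberFormatter does allow (values are illustrative; our real overrides live in the shared data file described in the next section, and output can vary slightly by ICU version):

<?php
// Override the currency symbol and the pattern on a CLDR-backed formatter.
$fmt = new NumberFormatter('en_CA', NumberFormatter::CURRENCY);

// Show "CA$" instead of a bare "$" for Canadian dollars.
$fmt->setSymbol(NumberFormatter::CURRENCY_SYMBOL, 'CA$');

// Override the pattern; "¤" is the ICU placeholder for the currency symbol.
$fmt->setPattern("¤#,##0.00");

// format() uses the locale's default currency (CAD for en_CA).
echo $fmt->format(1000.21) . "\n"; // typically CA$1,000.21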

Consistent formatting across platforms

We had to format currency in PHP, JavaScript, and in our iOS and Android apps. PHP, JavaScript, iOS and Android all had different versions of libicu, and so they had different CLDR data. How do we format consistently across these platforms? We went with a dual plan of attack: write tests that are the same across platforms, and make sure all CLDR overrides get shared between platforms.

We wrote a script that would export all our CLDR overrides as JSON / XML / plist. Every time the overrides change, we run the script to generate new data for all platforms. Here’s what our JSON file looks like right now (excerpt):

{
    "de_AU": {
        "symbol": {
            "AUD": "AU$",
            "BRL": "R$",
            "CAD": "CA$"
        },
        "decimal_separator": ",",
        "grouping_separator": ".",
        "pattern": {
            "AUD": "#,##0.00 \u00a4",
            "BRL": "#,##0.00 \u00a4",
            "CAD": "#,##0.00 \u00a4"
...

We wrote another script to generate test fixtures, which look like this (excerpt):

"test_symbol&&!code&&!html": {
    "de": {
        "DE": {
            "EUR": {
                "100000": "1.000,00 \u20ac",
                "100021": "1.000,21 \u20ac"
            }
        },
        "US": {
            "EUR": {
                "100000": "1.000,00 \u20ac",
                "100021": "1.000,21 \u20ac"
            },
            "USD": {
                "100000": "1.000,00 $",
                "100021": "1.000,21 $"
            }
        }
    }
}

This test says that given these settings:

  1. Language: German (“de”)
  2. Region: United States (“US”)
  3. Currency: US dollars (“USD”)
  4. Amount: 100021, i.e. 1,000.21 in the currency’s smallest unit
  5. Options: show the symbol, hide the currency code, plain text (no html)

the formatter should output “1.000,21 $”.

We have hundreds of tests in total to check every combination of language/region/currency code with symbol shown vs. hidden, formatted as text vs. html, etc. These expected values get checked against the output of the currency formatters on all platforms, so we know that they all format currency correctly and consistently. Any time an override changes (for example, changing the symbol for CAD to be “CA$” in all regions), we update the CLDR data file so that the new override gets spread to all platforms. Then we update the test fixtures and re-run the tests to make sure the override worked on all platforms.
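
On the PHP side, consuming those fixtures might look roughly like this (a sketch; the test name and the CurrencyFormatter class are hypothetical stand-ins for our real code):

<?php
// Hypothetical fixture-driven test: walk the shared JSON fixtures
// (language -> region -> currency -> amount => expected string) and compare
// each expectation against the formatter's output.
use PHPUnit\Framework\TestCase;

class CurrencyFixtureTest extends TestCase
{
    public function testSymbolNoCodeNoHtml(): void
    {
        $fixtures = json_decode(file_get_contents(__DIR__ . '/fixtures.json'), true);

        foreach ($fixtures['test_symbol&&!code&&!html'] as $language => $regions) {
            foreach ($regions as $region => $currencies) {
                foreach ($currencies as $currency => $cases) {
                    foreach ($cases as $amount => $expected) {
                        $actual = CurrencyFormatter::format((int) $amount, $currency, $language, $region);
                        $this->assertSame($expected, $actual, "$language/$region/$currency/$amount");
                    }
                }
            }
        }
    }
}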

Conclusion

No more “¥ 847,809.34”! Formatting currency is hard. If you want to do it correctly, use the CLDR data, but make sure that you override it when necessary based on your unique circumstances. I hope our changes lead to a better experience for international members. Thanks for reading!


Building a Translation Memory to Improve Machine Translation Coverage and Quality

Posted on March 22, 2016

Machine Translation at Etsy

At Etsy, it is important our global member base can communicate with one another, even when they don’t speak the same language. Whether users are browsing listings, messaging other users, or posting comments in the forums, machine translation is a valuable tool for facilitating multilingual interactions on our site and in our apps.

Listing descriptions account for the bulk of text we machine translate. With over 35 million active listings at an average length of nearly 1,000 characters, and 10 supported site languages, we need to translate a lot of content—and that’s just for listings. We also provide machine translation for listing reviews, forum posts, and conversations (messaging between members). We send text we need to translate to a third party machine translation service, and given the associated cost, there is a limit to the number of characters we can translate per month.

An example listing review translation.

While a user can request a listing translation if we don’t already have one (we call this on-demand translation), translating a listing beforehand and showing a visitor the translation automatically (we call this pre-translation) provides a more fluid browsing experience. Pre-translation also allows listings to surface in search results in multiple languages, both for searches on Etsy and on external search engines like Google.

The Benefits of a Translation Memory

Many of the strings we machine translate from one language to another are text segments we’ve seen before. Our most common segments are used in millions of listings, with a relatively small subset of distinct segments  accounting for a very large proportion of the content. For example, the sentence “Thanks for looking!” appears in around 500,000 active listings on Etsy, and has appeared in over 3 million now inactive listings.

Frequency and rank of text segments (titles, tags, and description paragraphs) appearing in listings on Etsy. The distribution of segments roughly conforms to a classic Zipfian shape, where a string’s rank is inversely proportional to its frequency.

In the past, a single text segment that appeared in thousands of listings on Etsy would be re-translated once for every listing. It would also be re-translated any time a seller edited a listing. This was a problem: our translation budget was being spent on millions of repeat translations that would be better used to translate unique content into more languages.

To solve this problem, we built a translation memory. At its simplest, a translation memory stores a text segment in one language and a corresponding translation of that segment in another language. Storing strings in a translation memory allows us to serve translations for these strings from our own databases, rather than making repeated requests to the translation service.

Storing these translations for later reuse has two main benefits:

  1. Coverage: By storing common translations in the translation memory and serving them ourselves, we drastically reduce the number of duplicate segments we send to the translation service. This process lets us translate seven times more content for the same cost.

  2. Quality: We’re able to see which text segments are most commonly used on Etsy and have these segments human translated. Overriding these common segments with human translations improves the overall quality of our translations.

Initial Considerations

We had two main concerns when planning the translation memory architecture:

  1. Capacity: The more text segments we store in the translation memory, the greater our coverage. However, storing every paragraph from each of our more than 35 million active listings, and a translation of that paragraph for each of our 10 supported languages, would mean a huge database table. Historically, Etsy has rarely had tables exceeding a few billion rows, and we wanted to stay within that limit here.

  2. Deletions: The translation service’s quality is continually improving, and to take full advantage of these improvements we need to periodically refresh entries in the translation memory by deleting older translations. We wanted to be able to delete several hundred million rows on a monthly basis without straining system resources.

The Translation Memory Architecture

Our Translation Memory consists of several separate services, each handling different tasks. A full diagram of the pipeline is below:

[Diagram: overview of the translation memory pipeline]

* The external translation service is Microsoft Translator.

A brief overview of each step:

  1. Splitting into segments: The first step of the translation pipeline is splitting blocks of text into individual segments. The two main choices here were splitting by sentence or splitting by paragraph. We chose the latter for a few reasons. Splitting by sentence gave us more granularity, but our estimated Translation Memory hit rate was only 5% higher with sentences versus paragraphs. The increased hit rate wasn’t high enough to warrant the extra logic needed to split by sentence, nor the multi-fold database size increase to store every sentence, instead of just every paragraph. Moreover, although automatic sentence boundary detection systems can be quite good, a recent study evaluated the most popular systems on user-generated content and found that accuracy peaked at around 95%. In contrast, using newline characters to split paragraphs is straightforward and an almost error-free way to segment text.

  2. Excluder: The excluder is the first service we run translations through. It removes any text we don’t want to translate. For now this means lines containing only links, numbers, or special characters.

  3. Human Translation Memory (HTM): Before looking for a machine translation, we check first for an existing human translation. Human translations are provided by Etsy’s professional translators (the same people who translate Etsy’s static site content). These strings are stored in a separate table from the Machine Translation Memory and are updated using an internal tool we built, pictured below.

[Screenshot: the internal tool for managing human translation memory entries]

  4. Machine Translation Memory (MTM): We use sharded MySQL tables to store our machine translation entries. Sharded tables are a well-established pattern at Etsy, and the system works especially well for handling the large row count needed to accommodate all the text segments. As mentioned earlier, we periodically want to delete older entries in the MTM to clear out unused translations, and make way for improved translations from the translation service. We partition the MTM table by date to accommodate these bulk deletions. Partitioning allows us to drop all the translations from a certain month without worrying about straining system resources or causing lag in our master-master pairs.

  5. External Translation Service: If there is new translatable content that doesn’t exist in either our HTM or MTM, we send it to the translation service. Once translated, we store the segment in the MTM so it can be used again later.

  6. Re-stitching segments: Once each of the segments has passed through one of our four services, we stitch them all back together in the proper order. (A simplified sketch of how these steps fit together follows below.)
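
In this sketch, the class names (HumanTranslationMemory, MachineTranslationMemory, ExternalTranslator) and the exclusion rules are hypothetical stand-ins for the real services:

<?php
// Simplified sketch of the translation pipeline described above.
function shouldExclude(string $segment): bool {
    // Excluder: skip empty lines and lines with only links, numbers, or
    // special characters.
    $trimmed = trim($segment);
    return $trimmed === ''
        || preg_match('#^(https?://\S+|[\d\s[:punct:]]+)$#u', $trimmed) === 1;
}

function translateListing(string $text, string $targetLang): string {
    $segments   = explode("\n", $text);   // 1. split on paragraphs (newlines)
    $translated = [];

    foreach ($segments as $segment) {
        if (shouldExclude($segment)) {    // 2. excluder
            $translated[] = $segment;
            continue;
        }
        $result = HumanTranslationMemory::lookup($segment, $targetLang)      // 3. HTM
              ??  MachineTranslationMemory::lookup($segment, $targetLang);   // 4. MTM
        if ($result === null) {
            $result = ExternalTranslator::translate($segment, $targetLang);  // 5. external service
            MachineTranslationMemory::store($segment, $targetLang, $result); // cache for next time
        }
        $translated[] = $result;
    }

    return implode("\n", $translated);    // 6. re-stitch in the original order
}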

The Results

We implemented the Excluder, HTM, and MTM in that order. Implementing the Excluder first allowed us to refine the text splitting, restitching, and monitoring aspects of the pipeline before worrying about data access. Next we built the HTM and populated it with several hundred translations of the most common terms on Etsy. Finally, at the end of November 2015, we began storing and serving translations from the MTM.

[Graphs: translation memory ramp-up, showing the share of translations served internally vs. sent to the external service]

Coverage: As you can see from the graphs above, we now only send out 14% of our translations to the translation service, and the rest we can handle internally. Practically, this means we can pre-translate over seven times more text on the same budget. Prior to implementing the translation memory, we pre-translated all non-English listings into English, and a majority of the rest of our listings into French and German. With the translation memory in place, we are pre-translating all eligible listings into English, French, German, Italian, and Japanese, with plans to scale to additional languages.

Quality: Around 1% of our translations (by character count) are now served by the human translation memory. These HTM segments are mostly listing tags. These tags are important for search results and are easily mis-translated by an MT system because they lack the context a human translator can infer more easily. Additionally, human translators are better at conveying the colloquial tone often used by sellers in their listing descriptions. With the HTM in place, the most common paragraph on Etsy, “Thanks for looking!” is human translated into the friendlier, “Merci pour la visite !” rather than the awkward, “Merci pour la recherche !” The English equivalent of this difference would be, “Thanks for visiting!” versus “Thanks for researching!”

Monitoring: Since a majority of our translation requests are now routed to the MTM rather than the third-party translation service, we monitor our translations to make sure they are sufficiently similar to those served by the translation service. To do this, we sample 0.1% of the translations served from the MTM and send an asynchronous call to the translation service to provide a reference translation of the string. Then we log the similarity (the percentage of characters in common) and Levenshtein distance (also known as edit distance) between the two translations. As shown in the graph below, we track these metrics to ensure the stored MTM translations don’t drift too far from the original third party translations.
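
PHP happens to ship both comparisons as built-ins, so a minimal sketch of what we log for each sampled translation could look like this (illustrative only; note that the built-in levenshtein() is byte-based and limited to 255-character inputs, so long paragraphs need something more robust):

<?php
// Compare a stored MTM translation against a fresh reference translation from
// the external service, logging percent similarity and edit distance.
function compareTranslations(string $stored, string $reference): array {
    similar_text($stored, $reference, $percentInCommon);
    return [
        'similarity_pct' => round($percentInCommon, 1),
        'edit_distance'  => levenshtein($stored, $reference),
    ];
}

// Example with the two French renderings mentioned below.
print_r(compareTranslations(
    'Merci pour la visite !',
    'Merci pour la recherche !'
));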

[Graph: similarity and Levenshtein distance between sampled MTM translations and reference translations from the translation service]

For comparison, as you can see below, the similarity for HTM translations is not as high, reflecting the fact that these translations were not originally drawn from the third party translation service.

[Graph: similarity between HTM translations and reference translations from the translation service]

Additional Benefits

Correcting mis-translations: Machine translation engines are trained on large amounts of data, and sometimes this data contains mistakes. The translation memory gives us more granular control over the translated content we serve, allowing us to override incorrect translations while the translation service we use works on a fix. Below is an example where “Realistic bird” is mis-translated into German as “Islamicrevolutionservice.”

[Screenshot: a listing titled “Realistic bird” mis-translated into German]

With the translation memory, we can easily correct problematic translations like this by adding an entry to the human translation memory with the original listing title and the correct German translation.

Respecting sellers’ paragraph choices: Handling paragraph splitting ourselves had the additional benefit of improving the quality of translation for many of our listings. Etsy sellers frequently include lists of attributes and other information without punctuation in their listings. For example:

Dimensioni 24×18 cm
Spedizione in una scatola protettiva in legno
Verrà fornito il codice di monitoraggio (tracking code)

The translation service often combines these lists into a single sentence, producing a translation like this:

Size 24 x 18 cm in a Shipping box wooden protective supplies the tracking code (tracking code)

By splitting on paragraphs, our sellers’ choice of where to put line breaks is now always retained in the translated output, generating a more accurate (and visually appealing) translation like this:

Size 24 x 18 cm
Shipping in a protective wooden box
You will be given the tracking code (tracking code)

Splitting on paragraphs prior to sending strings out for translation is an improvement we could have made independent of the translation memory, but it came automatically with the infrastructure needed to build the project.

Conclusion

Greater accuracy for listing translations means buyers can find the items they’re looking for more easily, and sellers’ listings are more faithfully represented when translated. To continue improving quality, over the next month we are rolling out a machine translation engine trained on top of the translation service’s generic engine. A machine translation engine customized with Etsy-specific data, in conjunction with more human translated content, will produce higher-quality translations that more closely reflect the colloquialisms of our sellers.

Building a community-centric, global marketplace is a core tenet of Etsy’s mission. Machine translation is far from perfect, but it can be a valuable tool when fostering an online community built around human interaction. The translation memory allows us to bring this internationalized Etsy experience to more users in more languages, making it easier to connect buyers and sellers from around the world.

Co-authored with Ben Russell


Putting the Dev in Devops: Bringing Software Engineering to Operations Infrastructure Tooling

Posted on February 22, 2016

At Etsy, the vast majority of our computing happens on physical servers that live in our own data centers. Since we don’t do much in the cloud, we’ve developed tools to automate away some of the most tedious aspects of managing physical infrastructure. This tooling helps us take new hardware from initial power-on to production-ready in a matter of minutes, saving time and energy for both data center technicians racking hardware and engineers who need to bring up new servers. It was only recently, however, that this toolset started getting the love and attention that really exemplifies the idea of code as craft.

The Indigo Tool Suite

The original idea for this set of tools came from a presentation on Scalable System Operations that a few members of the ops team saw at Velocity in 2012. Inspired by the Collins system that Tumblr had developed but disappointed that it wasn’t yet (at the time) open source or able to work out of the box with our particular stack of infrastructure tools, the Etsy ops team started writing our own. In homage to Tumblr’s Phil Collins tribute, we named the first ruby script of our own operations toolset after his bandmate Peter Gabriel. As that one script grew into many, that naming scheme continued, with the full suite and all its components eventually being named after Gabriel and his songs.

While many of the technical details of the architecture and design of the tool suite as it exists today are beyond the scope of this post, here is a brief overview of the different components that currently exist. These tools can be broken up into two general categories based on who uses them. The first are components used by our data center team, who handle things like unboxing and racking new servers as well as hardware maintenance and upgrades:

The other set of tools are primarily used by engineers working in the office, enabling them to take boxes that have already been set up by the data center team and sledgehammer and get them ready to be used for specific tasks:

The interface to install a new server with the Gabriel tool

 

While many of the details of the inner workings of this automation tooling could be a blog post in and of themselves, the key aspect of the system for this post is how interconnected it is. Sledgehammer’s unattended mode, which has saved our data center team hundreds—if not thousands—of hours of adding server information to RackTables by hand, depends on the sledgehammer payload, sledgehammer executor, API, and the shared libraries that all these tools use all working together perfectly. If any one part of that combination isn’t working with the others, the whole thing breaks, which gets in the way of people, especially our awesome data center team, getting their work done.

The Problem

Over the years, many features have been added to Indigo, and as members of the operations team worked to add those features, they tried to avoid breaking things in the process. But testing had never been high on Indigo’s list of priorities – when people started working on it, they thought of it more as a collection of ops scripts that “just work” rather than a software engineering project. Time constraints sometimes played a role as well – for example, sledgehammer’s unattended mode in all its complex glory was rolled out in one afternoon, because a large portion of our recent data center move was scheduled for the next day and it was more important at that point to get that feature out for the DC team to use than it was to write tests.

For years, the only way of testing Indigo’s functionality was to push changes to production and see what broke—certainly not an ideal process! A lack of visibility into what was being changed compounded the frustration with this process.

When I started working on Indigo, I was one of the first people with a formal computer science background to have touched that code, so one of the first things I thought of was adding unit tests, like we have for so much else of the code we write at Etsy. I soon discovered that, because the majority of the Indigo code had been written without testability in mind, I was going to have to do some significant refactoring to even get to the point where I could start writing unit tests, which meant we had to first lay some groundwork in order to be able to refactor without being too disruptive to other users of these tools. Refactoring first, without any way to test the impact of my changes on the data center team, was just asking for everyone involved to have a bad time.

Adding Tests (and Testability)

Some of the most impactful changes we’ve made recently have been around finding ways to test the previously untestable unattended sledgehammer components. Our biggest wins in this area have been:

testhost.yml

payload: "sledgehammer-payload-0.5-test-1.x86_64.rpm"
unattended: "true"
unattended_run_recipient: "testuser@etsy.com"
indigo_url: "http://testindigo.etsy.com:12345/api/v1"

With changes like these in place, we are able to have much more confidence that our changes won’t break the unattended sledgehammer tool that is so critical for our data center team. This enables us to more effectively refactor the Indigo codebase, whether that be to improve it in general or to make it more testable.

I gave a presentation at OpsSchool, our internal series of lectures designed to educate people on a variety of operations-related topics inspired by opsschool.org, on how to change the Indigo code to make it better suited to unit testing. Unit testing itself is beyond the scope of this post, but for us, this has meant things like changing method signatures so that objects that might be mocked or stubbed out can be passed in during tests, or splitting up large gnarly methods that grew organically along with the Indigo codebase over the past few years into smaller, more testable pieces. This way, other people on the team are able to help write unit tests for all of Indigo’s shared library code as well.
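
Indigo itself is Ruby, but the refactoring pattern is the same in any language; here is a minimal before-and-after sketch in PHP (all names are hypothetical), showing the kind of signature change that makes mocking possible:

<?php
// Hypothetical interface for whatever the code needs to talk to.
interface HostApi {
    public function addHost(string $hostname): bool;
}

// Before: the dependency was constructed inside the method, so a unit test
// couldn't substitute a fake and would end up hitting the real API.
//
//     function register(string $hostname): bool {
//         $api = new IndigoApiClient('http://indigo.example.com/api/v1');
//         return $api->addHost($hostname);
//     }

// After: the dependency is passed in, so tests can hand in a stub or mock.
class ServerProvisioner {
    public function __construct(private HostApi $api) {}

    public function register(string $hostname): bool {
        return $this->api->addHost($hostname);
    }
}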

Deploying, Monitoring, and Planning

As mentioned previously, one of the biggest headaches with this tooling had been keeping all the different moving pieces in sync when people were making changes. To fix this, we decided to leverage the work that had already been put into Deployinator by our dev tools team. We created an Indigo deployinator stack that, among other things, ensures that the shared libraries, API, command line tools, and sledgehammer payload are all deployed at the same time. It keeps these deploys in sync, handles the building of the payload RPM, and restarts all the Indigo services to make sure that we never again run into issues where the payload stops working because it didn’t get updated when one of its shared library files did or vice versa.

Deploying the various components of Indigo with Deployinator

Additionally, it automatically emails release notes to everyone who uses the Indigo toolset, including our data center team. These release notes, generated from the git commit logs for all the commits being pushed out with a given deploy, provide some much-needed visibility into how the tools are changing. Of course, this meant making sure everyone was on the same page with writing commit messages that will be useful in this context! This way the data center folks, geographically removed from the ops team making these changes, have a heads up when things might be changing with the tools they use.

Email showing release notes for Indigo generated by Deployinator

Finally, we’re changing how we approach the continued development and maintenance of this software going forward. Indigo started out as a single ruby script and evolved into a complex interconnected set of tools, but for a while the in-depth knowledge of all the tools and their interconnections existed solely in the heads of a couple people. Going forward, we’re documenting not only how to use the tools but how to develop and test them, and encouraging more members of the team to get involved with this work to avoid having any individuals be single points of knowledge. We’re keeping testability in mind as we write more code, so that we don’t end up with any more code that has to be refactored before it can even be tested. And we’re developing with an eye for the future, planning what features will be added and which bugs are highest priority to fix, and always keeping in mind how the work we do will impact the people who use these tools the most.

Conclusion

As operations engineers, we don’t think of ourselves as developers, but there’s a lot we can learn from our friends in the development world. Instead of always writing code willy-nilly as needed, we should be planning how to best develop the tooling we use, making sure to be considerate of future-us who will have to maintain and debug this code months or even years down the line.

Tools to provision hardware in a data center need tests and documentation just as much as consumer-facing product code. I’m excited to show that operations engineers can embrace the craftsmanship of software engineering to make our tools more robust and scalable.


Q4 2015 Site Performance Report

Posted on February 12, 2016

Happy New Year! It may be 2016, but we’re here to give you a full review of the site performance highlights for the fourth quarter of 2015. For this report, we’ve collected data from a full week in December that we will be comparing to the full week of data from September we gathered for our last report. Unlike the Q3 report, we did not uncover any prevailing trends that spanned every section of this report. In the fourth quarter, we saw a few improvements and a handful of regressions, but the majority of pages remained stable.

As in the past, this report is a collaborative effort. Mike Adler will discuss our server side performance, Kristyn Reith will report on the synthetic front-end data, and the real user monitoring update will come from Allison McKnight and Moishe Lettvin. So without further delay, let’s take a look at the numbers.

Server-Side Performance

We begin our report with the server-side latencies, which is how long it takes for our servers to build pages. This metric does not include any browser-side time. We calculate it by taking a random sample of our web server logs. One reason we start here is that changes in server-side metrics can explain some changes in synthetic and RUM metrics. As you can see below, most pages are executing on the server at about the same speed, though the cart page did get about 27% faster on average.

[Figure: box plots of server-side time by page, Q3 vs. Q4 2015]

The fourth quarter of the year obviously includes holiday shopping, which is the busiest time of year for our site. Each year our ops team plans to add capacity in anticipation of the rush, but still, it’s not unusual to discover new performance bottlenecks as traffic ramps up. Each year we have new code, new hardware and new usage patterns. To quote a translation of Heraclitus, “Ever-newer waters flow on those who step into the same rivers.”

In this quarter we discovered that we could improve our efficiency by running numad on our web servers. Our web servers do not use any virtualization, so one kernel is scheduling across 24 physical cores (hyper-threading disabled, for now). We noticed that some cores were significantly busier than others, and that was effectively limiting throughput and increasing latency. An Etsy engineer learned that by simply running numad, the CPU workload was more balanced, leading to better efficiency. In short, our server-side metrics no longer slowed down during busy times of the day.

Today’s servers are built with NUMA (non-uniform memory access) architectures, which creates an incentive to schedule tasks on CPUs that are “close” to the memory they need. Depending on many factors (hardware, workload, other settings), this scheduling challenge can result in suboptimal efficiency. We found that numad, a userland daemon that assigns processes to NUMA zones, is a simple and effective way to optimize for our current conditions.

We saw that our search page performance got a little slower on average, but we expected this due to launching some more computationally expensive products (such as our Improved Category Pages).

Synthetic Start Render

For our synthetic testing, we’ve set up tests using a third party service which simulates actions taken by a user and then automatically reloads the test pages every ten minutes to generate the performance metrics. As mentioned in the last report, due to recent product improvements, we have decided to retire “Webpage Response” as a metric used for this report, so we will be focusing on the “Start Render” metric in the boxplots below. Overall, we did not see any major changes in start render times this quarter that would have impacted user experience.

[Figure: box plots of synthetic start render time by page, Q3 vs. Q4 2015]

You may notice that the metrics for Q3 differ from the last report and that the start render times have significantly wider ranges. In the last two reports the plotted data points have represented median measurements, which limited our analysis to the median metrics. In this report, the boxplots are constructed using raw data, thereby providing a more accurate and holistic representation of each page’s performance. Since the raw data captures the full distribution of start render times, we are seeing more outliers than we have in the past.

As you can see above, the start render time has remained fairly stable this quarter. Though nearly all the pages experienced a very minor increase in median start render time, the homepage saw the greatest impact. This slowdown can be attributed to the introduction of a new font. Though this change added font bytes to all the pages, we discovered that the scope of the regression varied significantly depending on the browser. The data displayed in the boxplots is from tests run in IE9. The graphs below show the days surrounding the font change.

[Graphs: homepage start render time around the font change, IE9 vs. Chrome]

While the font change resulted in a noticeable jump in start render time in IE, start render performance in Chrome remained relatively unaffected. The difference in the font file formats used by the browsers is partially responsible for the disparity in performance. The font format selected by Chrome (woff2) uses a compression algorithm that reduces the font file by roughly 30%, resulting in a substantially smaller file size when compared to other formats. Additionally, the IE browser running the synthetic test has compatibility view enabled, meaning that although it’s effectively using IE9, the browser is rendering the pages with an even older version of IE. Therefore, the browser downloads all the font files included in the page’s corresponding CSS file, regardless of whether or not they are used on the page.

Since Chrome is more commonly used among Etsy users than IE9, we have set up new synthetic tests in Chrome. For all future reports we will use the data generated by the new Chrome tests to populate the synthetic boxplots. We feel that this change, coupled with the continued use of raw data will provide a more realistic representation of what our users experience.

Real User Page Load Time

“Real User” data can show us variations in the actual speed that our users experience, depending on their geographic location, what browser they’re using, the internet provider they’re using, and so on. Sometimes the richness of this data lets us see trends that we couldn’t see in our other two monitoring methods; other times, the variety of data makes trends harder to see.

[Figure: box plots of real user page load time by page, Q3 vs. Q4 2015]

This quarter, we mostly saw small or no change in our metrics. The only delta that’s noticeable is the fifth percentile in the Shop Home page, which increased from 2.3 seconds to 2.6 seconds. A time series graph of the fifth percentile shows an uptick corresponding to our new page font. This parallels what we found in the synthetic tests, as discussed above.

Because shop_home is one of our most-consistently fast pages, it tends to be more sensitive to changes in overall site load time. That is, it shows deltas that might get “lost in the noise” on pages with higher variance.

With this context, it can be interesting to look at the data day-by-day in addition to the week vs. week comparison that the box plot above shows us too. You can see below that even with the fairly large difference in the fifth percentile seen in the box plot, on the last day of the comparison weeks the slower and faster lines actually trade positions.

[Graph: day-by-day fifth percentile page load time for the shop home page during the two comparison weeks]

Conclusion

Despite the fourth quarter being the busiest time of the year, users did not experience degraded performance and most of the fluctuations in page speed were negligible. For this report we improved the data quality in the synthetic section by switching from median measurements to raw data. In the coming 2016 reports, we will strive to make more improvements to our methodology. If you have any thoughts or questions, please leave a comment!


Quality Matters: The Benefits of QA-Focused Retros

Posted on February 8, 2016

Retrospectives (retros) are meetings that take place after every major project milestone to help product teams at Etsy improve processes and outcomes in the next iteration of a project. The insights gained from project retros are invaluable to proactively mitigating problems in future projects while promoting continuous learning.

I am one of the managers on the Product Quality (PQ) team, which is Etsy’s centralized resource for manual QA. When manual QA was first introduced at Etsy, testers joined a team for the duration of the project but had limited opportunity  to get objective feedback. Consequently, testers were kept from seeing the importance of their contributions and from viewing themselves as true members of the product team they worked with. This lack of communication and continuous learning left our testers with less job satisfaction and feelings of empowerment than their peers in product engineering.

We decided to try QA-focused retros to surface feedback that would help us identify repeatable behaviors that contribute to consistently successful product launches. We were also interested in empowering Product Quality team members to understand how their specific contributions impact product launches and allow them to take more ownership of their responsibilities.

Regularly scheduled QA retros have helped to promote mindfulness and accountability on the PQ team. Over time, they have solidified relationships with product owners, designers, and engineers and reinforced the sense that we are all working toward the same goal.

Here’s some information to help you run a QA-focused retro:

Meeting Goals

Sample Agenda

How Does It Work?

The retro should be scheduled after the initial product launch or any milestone that required a significant QA effort. Participants should receive the agenda ahead of time and be asked to come prepared with their thoughts on the main questions.  The QA engineer who led the testing effort should facilitate the retro by prompting attendees, in a round robin fashion, to weigh in on each agenda item.

The facilitator also participates by giving insights to the product team they partnered with for launch. This 30-minute activity is best if scheduled near the end of the day and can be augmented with snacks and beverages to create a relaxed atmosphere. The facilitator should record insights and report any interesting findings back to the QA team and the product team they worked with on the project.

Who Should Attend?

Participants should be limited to those who directly interacted with QA during development. These are usually:

  1. The QA engineer who led the testing effort
  2. The product manager or product owner
  3. The designers and engineers who worked directly with QA during development

What Happens Next?

Everyone on the Product Quality team reviews the takeaways from the retro and clarifies any questions with those who participated in the retro. We then make the applicable adjustments to processes prior to the next iteration of a project. All changes made to the PQ process are communicated to team members and product teams as needed.

Conclusions

QA-focused retros have empowered our team to approach testing as a craft that is constantly honed and developed. The meetings help product managers learn how to get the most out of working with the QA team and provide opportunities for product teams to participate in streamlining the QA process.


Leveling Up Your Organization With System Reviews

Posted on December 21, 2015

One of Etsy’s core engineering philosophies has been to push decisions to the edges wherever possible. Rather than making dictatorial-style decisions, we enable people to make their own choices given the feedback of a group. We use the system review format to tackle organizational challenges that surface from a quickly growing company. A system review is a model that enables a large organization to take action on organizational and cultural issues before they become larger problems. They’re modeled after shared governance meetings – a disciplined yet inclusive format that enables organizational progress and high levels of transparency. This form of leadership values continued hard work, open communication, trust and respect.

This idea was introduced by our Learning and Development team, who among other things run our manager training programs, our feedback system, and our dens program (a vehicle for confidential, small group discussions about leadership and challenges at work). A few years ago, we started bringing the engineering leadership group together on a recurring basis, but the agenda and outcome of these meetings were unclear. We always had something to talk about and discuss, but it was difficult to move forward and address issues. We were looking for something that was better facilitated, and for ways to bring our engineering leadership team together to provide the clear outcome of helping solve some of our organizational challenges. L&D provided us with facilitation training and an overview of different meeting formats to use in different situations. Over time we’ve tested out some of these new meeting types, and the system review is one of the many formats we’ve learned to apply. They coached us through facilitating the first series of these new formats in our meetings over the first several months. We’re extremely fortunate to have a team of smart, focused individuals who have the background in providing solutions for these types of problems. We are sharing these insights here for the benefit of anyone interested in the topic.

These meetings can work well for anything from small groups of 20 up to large groups of 300 people. They may take a few tries to get the hang of, but once you get into a rhythm it becomes an efficient format for surveying a large group and taking action on important issues. They should be held at a regular cadence, such as monthly or quarterly, so that they create a feedback loop: new problems are proactively raised while findings and potential solutions for previously discussed topics are reported back.

When we are reviewing system issues we are looking into the following things:

System review meetings are based around a specific prompt taken from these areas, for example “In what area of the engineering org do you feel there is currently a high degree of confusion or frustration?”.

What does the agenda look like?

This type of meeting needs to be timed and facilitated. The agenda is pretty straightforward; it’s important that the group observe the timer that the facilitator maintains and respect that the facilitator is moving the meeting along within the confines of a one-hour timeframe.

Facilitation is really an art in itself, and there are a lot of resources out there to help with improving the technique. Generally the facilitator should not be emotionally invested in the topic, so they can remove themselves from the conversation and focus on moving the meeting forward. They should set and communicate the agenda, and make sure the room is set up to support the format and that the appropriate materials are provided. They should set the stage for the meeting: let people know why they are there, give an overview of what they’ll be doing, and explain why it’s important. They should manage the agenda and timing. The timing is somewhat flexible within the greater time frame and should be adjusted as necessary based on the discussions taking place. It’s possible that a conversation runs deeper than its time slot allows but the facilitator decides on the fly that it is important enough to cut time from another part of the meeting. However, every time the timer is ignored, the group slides away from healthy order and towards bad meeting anarchy, so it’s the facilitator’s job to keep this in check. To do this effectively, the facilitator needs to be comfortable interrupting the group to keep the meeting on track. Lastly, the facilitator should close the meeting on time and summarize any key actions, agreements and next steps.

Below is the agenda for the high-level format of the meeting. After presenting the prompt chosen for the meeting, the facilitator should divide the attendees into groups of approximately five people. Each member of the group individually generates ideas about the prompt (sticky notes work great for collecting these, one idea per sticky). Within these small groups, everyone shares their top two issues and the impact they feel each has. After everyone has shared, the group votes and tallies its top three issues.

The facilitator will then bring everyone back together as the larger group. Each of the subgroups shares the three issues that they upvoted. After each round, the larger group can ask clarifying questions. It’s a good idea for the facilitator to maintain a spreadsheet with all of the ideas so that everyone can refer back to it; it comes in handy because the next phase is for everyone to vote on the issues. Take the top 3–5 issues forward for investigation.

Sample Agenda

Prompt:  In what area of the engineering org do you feel there is currently a high degree of confusion or frustration?

Small Groups (20 mins timed by facilitator)

  1. Solo brainstorm (2 mins full group)
  2. Round-robin: share top 2 issues + their impact (2 mins per person)
  3. Group vote: vote and tally top three (5 mins)

Full Group (30 mins)

  1. Each group shares their three upvoted issues (2 mins per group)
  2. Clarifying questions asked (3 mins per round)
  3. Full vote: write 3 votes on post-its (3 mins)
  4. Drivers volunteer

Next Steps

After we have settled on the top issues, we need people in the group who are interested in investigating them and bringing information back to the group at a later date. Hopefully at least one person is passionate enough about each topic to look into it further; otherwise it would not have been voted a top issue. Create a spreadsheet to track each topic, its driver, and the date by which they propose to bring information back to the group.

Each of the drivers should report back on these questions to help the organization begin to understand the issue, share the answers they’ve acquired, and decide on next steps. This follow-up can happen at the beginning of future meetings.

Conclusion

System reviews are just one format that we can use to build communication, respect, and trust across our team and organization. The purpose can be to surface possible glitches in the system, but also to achieve alignment on which problems are most important for the group to spend its energy solving, and to reach clarity around them. They can also be used to feed a topic into another format called the Decisions Roundtable, a similar type of meeting used to drive forward a proposal to make a change. Similar to post mortems and retrospectives, system reviews can be used to level up the organization and foster a learning culture. Some topics that we’ve explored in the past include how we think about hiring managers, why we drifted into using different tools to plan work, how we document things and where that information should live, clarity around career paths, and how we can address diversity and inclusivity in tech management. System reviews help explore topics that may be difficult for some of the people in the group, but as long as the process is handled sensitively, we all come out with better understanding, more empathy for the experiences of others, and a stronger organization as a whole.

Introducing Arbiter: A Utility for Generating Oozie Workflows

Posted by on December 16, 2015 / 8 Comments

At Etsy we have been using Apache Oozie for managing our production workflows on Hadoop for several years. We’ve even recently started using Oozie for managing our ad hoc Hadoop jobs as well. Oozie has worked very well for us and we currently have several dozen distinct workflows running in production. However, writing these workflows by hand has been a pain point for us. To address this, we have created Arbiter, a utility for generating Oozie workflows.

The Problem

Oozie workflows are specified in XML. The Oozie documentation has an extensive overview of writing workflows, but there are a few things that are helpful to know. A workflow begins with the start node:

<start to="fork-2"/>

Each job or other task in the workflow is an action node within a workflow. There are some built-in actions for running MapReduce jobs, standard Java main classes, etc. and you can also define custom action types. This is an example action:

<action name="transactional_lifecycle_email_stats">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapreduce.job.queuename</name>
          <value>rollups</value>
        </property>
      </configuration>
      <main-class>com.etsy.db.VerticaRollupRunner</main-class>
      <arg>--file</arg>
      <arg>transactional_lifecycle_email_stats.sql</arg>
      <arg>--frequency</arg>
      <arg>daily</arg>
      <arg>--category</arg>
      <arg>regular</arg>
      <arg>--env</arg>
      <arg>${cluster_env}</arg>
    </java>
    <ok to="join-2"/>
    <error to="join-2"/>
</action>

Each action defines a transition to take upon success and a (possibly) different transition to take upon failure:

<ok to="join-2"/>
<error to="join-2"/>

To have actions run in parallel, a fork node can be used. All the actions specified in the fork will be run in parallel:

<fork name="fork-2">
    <path start="fork-0"/>
    <path start="transactional_lifecycle_email_stats"/>
</fork>

After these actions there must be a join node to wait for all the forked actions to finish:

<join name="join-2" to="screamapillar"/>

Finally, a workflow ends by transitioning to either the end or kill nodes, for a successful or unsuccessful result, respectively:

<kill name="kill">
    <message>Workflow email-rollups has failed with msg: [${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>

Here is a complete example of one of our shorter workflows:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<workflow-app xmlns="uri:oozie:workflow:0.2" name="email-rollups">
  <start to="fork-2"/>
  <fork name="fork-2">
    <path start="fork-0"/>
    <path start="transactional_lifecycle_email_stats"/>
  </fork>
  <action name="transactional_lifecycle_email_stats">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapreduce.job.queuename</name>
          <value>rollups</value>
        </property>
      </configuration>
      <main-class>com.etsy.db.VerticaRollupRunner</main-class>
      <arg>--file</arg>
      <arg>transactional_lifecycle_email_stats.sql</arg>
      <arg>--frequency</arg>
      <arg>daily</arg>
      <arg>--category</arg>
      <arg>regular</arg>
      <arg>--env</arg>
      <arg>${cluster_env}</arg>
    </java>
    <ok to="join-2"/>
    <error to="join-2"/>
  </action>
  <join name="join-2" to="screamapillar"/>
  <action name="screamapillar">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapreduce.job.queuename</name>
          <value>${queueName}</value>
        </property>
        <property>
          <name>mapreduce.map.output.compress</name>
          <value>true</value>
        </property>
      </configuration>
      <main-class>com.etsy.oozie.Screamapillar</main-class>
      <arg>--workflow-id</arg>
      <arg>${wf:id()}</arg>
      <arg>--recipient</arg>
      <arg>fake_email</arg>
      <arg>--sender</arg>
      <arg>fake_email</arg>
      <arg>--env</arg>
      <arg>${cluster_env}</arg>
    </java>
    <ok to="end"/>
    <error to="kill"/>
  </action>
  <fork name="fork-0">
    <path start="email_campaign_stats"/>
    <path start="user_language"/>
  </fork>
  <action name="user_language">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapreduce.job.queuename</name>
          <value>rollups</value>
        </property>
      </configuration>
      <main-class>com.etsy.db.VerticaRollupRunner</main-class>
      <arg>--file</arg>
      <arg>user_language.sql</arg>
      <arg>--frequency</arg>
      <arg>daily</arg>
      <arg>--category</arg>
      <arg>regular</arg>
      <arg>--env</arg>
      <arg>${cluster_env}</arg>
    </java>
    <ok to="join-0"/>
    <error to="join-0"/>
  </action>
  <join name="join-0" to="fork-1"/>
  <fork name="fork-1">
    <path start="email_overview"/>
    <path start="trans_email_overview"/>
  </fork>
  <action name="trans_email_overview">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapreduce.job.queuename</name>
          <value>rollups</value>
        </property>
      </configuration>
      <main-class>com.etsy.db.VerticaRollupRunner</main-class>
      <arg>--file</arg>
      <arg>trans_email_overview.sql</arg>
      <arg>--frequency</arg>
      <arg>daily</arg>
      <arg>--category</arg>
      <arg>regular</arg>
      <arg>--env</arg>
      <arg>${cluster_env}</arg>
    </java>
    <ok to="join-1"/>
    <error to="join-1"/>
  </action>
  <join name="join-1" to="join-2"/>
  <action name="email_overview">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapreduce.job.queuename</name>
          <value>rollups</value>
        </property>
      </configuration>
      <main-class>com.etsy.db.VerticaRollupRunner</main-class>
      <arg>--file</arg>
      <arg>zz_email_overview.sql</arg>
      <arg>--frequency</arg>
      <arg>daily</arg>
      <arg>--category</arg>
      <arg>regular</arg>
      <arg>--env</arg>
      <arg>${cluster_env}</arg>
    </java>
    <ok to="join-1"/>
    <error to="join-1"/>
  </action>
  <action name="email_campaign_stats">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapreduce.job.queuename</name>
          <value>rollups</value>
        </property>
      </configuration>
      <main-class>com.etsy.db.VerticaRollupRunner</main-class>
      <arg>--file</arg>
      <arg>zz_email_campaign_stats.sql</arg>
      <arg>--frequency</arg>
      <arg>daily</arg>
      <arg>--category</arg>
      <arg>regular</arg>
      <arg>--env</arg>
      <arg>${cluster_env}</arg>
    </java>
    <ok to="join-0"/>
    <error to="join-0"/>
  </action>
  <kill name="kill">
    <message>Workflow email-rollups has failed with msg: [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>

Having the workflows defined in XML has been very helpful. We have several validation and visualization tools in multiple languages that can parse the XML and produce useful results without being tightly coupled to Oozie itself. However, the XML is not as useful for the people who work with it. First, it is very verbose. Each new action adds about 20 lines of XML to the workflow, much of which is boilerplate. As a result, our workflows average around 200 lines and the largest is almost 1800 lines long. This also makes it hard to read a workflow and understand what it does and how execution flows.

Next, defining the flow of execution can be tricky. It is natural to think about the dependencies between actions. Oozie workflows, however, are not specified in terms of these dependencies. The workflow author must satisfy them by configuring the workflow to run the actions in the proper order. For simple workflows this may not be a problem, but it can quickly become complex. Moreover, the author must manually manage parallelism by inserting forks and joins, which makes modifying the workflow more complex. We have found that it’s easy to miss adding an action to a fork, resulting in an orphaned action that never gets run. Another common problem we’ve had with forks is that Oozie considers a single-action fork invalid, which means removing the second-to-last action from a fork requires removing the fork and join entirely.

Introducing Arbiter

Arbiter was created to solve these problems. XML is very amenable to being produced automatically, so there is an opportunity to write the workflows in another format and generate the final workflow definition from it. We considered several options but ultimately settled on YAML: there are robust YAML parsers in many languages, and we find it easier for people to read than JSON. We also considered a Scala-based DSL, but we wanted to stick with a markup language to keep parsing language-agnostic.

Writing Workflows

Here is the same example workflow from above written in Arbiter’s YAML format:

---
name: email-rollups
errorHandler:
  name: screamapillar
  type: screamapillar
  recipients: fake_email
  sender: fake_email
actions:
  - name: email_campaign_stats
    type: rollup
    rollup_file: zz_email_campaign_stats.sql
    category: regular
    dependencies: []
  - name: trans_email_overview
    type: rollup
    rollup_file: trans_email_overview.sql
    category: regular
    dependencies: [email_campaign_stats, user_language]
  - name: email_overview
    type: rollup
    rollup_file: zz_email_overview.sql
    category: regular
    dependencies: [email_campaign_stats, user_language]
  - name: user_language
    type: rollup
    rollup_file: user_language.sql
    category: regular
    dependencies: []
  - name: transactional_lifecycle_email_stats
    type: rollup
    rollup_file: transactional_lifecycle_email_stats.sql
    category: regular
    dependencies: []

The translation of the YAML to XML is highly dependent on the configuration given to Arbiter, which we will cover in the next section. However, there are several points to consider now. First, the YAML definition is only about 20% of the length of the XML. Since the workflow definition is much shorter, it’s easier for someone to read it and understand what the workflow does. In addition, none of the flow control nodes need to be manually specified. Arbiter will insert the start, end, and kill nodes in the correct locations. Forks and joins will also be inserted when actions can be run in parallel.

Most importantly, however, the workflow author can directly specify the dependencies between actions, instead of the order of execution. Arbiter will handle ordering the actions in such a way to satisfy all the given dependencies.

In addition to the standard workflow actions, Arbiter allows you to define an “error handler” action. It will automatically insert this action before any transitions to the end or kill nodes in the workflow. We use this to send an email alert with details about the success or failure of the workflow actions. If no error handler is defined, the workflow will transition directly to the end or kill nodes as appropriate.

Configuration

The mapping between a YAML workflow definition and the final XML is controlled by configuration files. These are also specified in YAML. Here is an example configuration file to accompany the example workflow given above:

---
killName: kill
killMessage: "Workflow $$name$$ has failed with msg: [${wf:errorMessage(wf:lastErrorNode())}]"
actionTypes:
  - tag: java
    name: rollup
    configurationPosition: 2
    properties: {"mapreduce.job.queuename": "rollups"}
    defaultArgs: {
      job-tracker: ["${jobTracker}"],
      name-node: ["${nameNode}"],
      main-class: ["com.etsy.db.VerticaRollupRunner"],
      arg: ["--file", "$$rollup_file$$", "--frequency", "daily", "--category", "$$category$$", "--env", "${cluster_env}"]
    }
  - tag: sub-workflow
    name: sub-workflow
    defaultArgs: {
      app-path: ["$$workflowPath$$"],
      propagate-configuration: []
    }
  - tag: java
    name: screamapillar
    configurationPosition: 2
    properties: {"mapreduce.job.queuename": "${queueName}", "mapreduce.map.output.compress": "true"}
    defaultArgs: {
      job-tracker: ["${jobTracker}"],
      name-node: ["${nameNode}"],
      main-class: ["com.etsy.oozie.Screamapillar"],
      arg: ["--workflow-id", "${wf:id()}", "--recipient", "$$recipients$$", "--sender", "$$sender$$", "--env", "${cluster_env}"]
    }

The key part of the configuration file is the actionTypes setting. Each action type will map to a certain action type in the XML workflow. However, multiple Arbiter action types can map to the same Oozie action type, such as the screamapillar and rollup action types both mapping to the Oozie java action type. This allows you to have meaningful action types in the YAML workflow definitions without the overhead of actually creating custom Oozie action types. Let’s review the parts of an action type definition:

  - tag: java
    name: rollup
    configurationPosition: 2
    properties: {"mapreduce.job.queuename": "rollups"}
    defaultArgs: {
      job-tracker: ["${jobTracker}"],
      name-node: ["${nameNode}"],
      main-class: ["com.etsy.db.VerticaRollupRunner"],
      arg: ["--file", "$$rollup_file$$", "--frequency", "daily", "--category", "$$category$$", "--env", "${cluster_env}"]
    }

The tag key defines the action type tag in the workflow XML. This can be one of the built-in action types like java, or a custom Oozie action type. Arbiter does not need to be made aware of custom Oozie action types. The name key defines the name of this action type, which will be used to set the type of actions in the workflow definition. If the Oozie action type accepts configuration properties from the workflow XML, these are controlled by the configurationPosition and properties keys. properties defines the actual configuration properties that will be applied to every action of this type, and configurationPosition defines where in the generated XML for the action the configuration tag should be placed. The defaultArgs key defines the default elements of the generated XML for actions of this type. The keys are the names of the XML tags, and the values are lists of the values for that tag. Even tags that can appear only once must be specified as a list.

You can also define properties to be populated from values set in the workflow definition. Any string surrounded by $$ will be interpolated in this way. $$rollup_file$$ and $$category$$ are examples of doing so in this configuration file. These will be populated with the values of the rollup_file and category keys from a rollup action in the workflow definition.

Using this configuration file, we could write an action like the following in the YAML workflow definition:

  - name: email_campaign_stats
    type: rollup
    rollup_file: zz_email_campaign_stats.sql
    category: regular
    dependencies: []

Arbiter would then translate this action to the following XML:

<action name="email_campaign_stats">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapreduce.job.queuename</name>
          <value>rollups</value>
        </property>
      </configuration>
      <main-class>com.etsy.db.VerticaRollupRunner</main-class>
      <arg>--file</arg>
      <arg>zz_email_campaign_stats.sql</arg>
      <arg>--frequency</arg>
      <arg>daily</arg>
      <arg>--category</arg>
      <arg>regular</arg>
      <arg>--env</arg>
      <arg>${cluster_env}</arg>
    </java>
    <ok to="join-0"/>
    <error to="join-0"/>
</action>

Arbiter also allows you to specify the name of the kill node and the message it logs with the killName and killMessage properties.

How Arbiter Generates Workflows

Arbiter builds a directed graph of all the actions from the workflow definition it is processing. The vertices of the graph are the actions and the edges are dependencies. The direction of the edge is from the dependency to the dependent action to represent the desired flow of execution. Oozie workflows are required to be acyclic, so if a cycle is detected Arbiter will throw an exception.

The directed graph that Arbiter builds will be made up of one or more weakly connected components. This is the graph from the example workflow above, which has two such components:
An example of a graph that is input to Arbiter
Each of these components is processed independently. First, any vertices with no incoming edges are removed from the graph and inserted into a new result graph. If more than one vertex is removed, Arbiter will also insert a fork/join pair to run them in parallel. Having removed those vertices, the original component will now have been split into one or more new weakly connected components. Each of these components is then recursively processed in this same way.

Once every component has been processed, Arbiter combines these independent components until it has produced a complete graph. Since these components were not connected to begin with, they can be run in parallel, so if there is more than one component, Arbiter will insert a fork/join pair. This results in the following graph for the example workflow, showing the ok transitions between nodes:

An example workflow graph produced by Arbiter
This algorithm biases Arbiter towards satisfying the dependencies between actions over achieving optimal parallelism. In general this algorithm still produces good parallelism, but in certain cases (such as a workflow with one action that depends on every other action), it can degenerate to a fairly linear flow. While it is a conservative choice, this algorithm has still worked out well for most of our workflows and has the advantage of being straightforward to follow in case the generated workflow is incorrect or unusual.

Once this process has finished, all the flow control nodes will be present in the workflow graph. Arbiter can then translate it into XML using the provided configuration files.
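
To make this more concrete, here is a deliberately simplified sketch of the idea in PHP rather than Arbiter’s own code: it repeatedly peels off the actions whose dependencies have already been scheduled and wraps each batch of more than one action in a fork/join. It is a level-by-level ordering rather than Arbiter’s component-based recursion, and all names in it are illustrative:

<?php
// Simplified sketch only: order actions so dependencies are satisfied,
// inserting a fork/join whenever several actions can run in parallel.
// $deps maps each action name to the list of action names it depends on.
function orderActions(array $deps): array {
    $plan = [];
    $remaining = $deps;
    while (!empty($remaining)) {
        // An action is ready once none of its dependencies are still unscheduled
        $ready = [];
        foreach ($remaining as $action => $actionDeps) {
            if (empty(array_intersect($actionDeps, array_keys($remaining)))) {
                $ready[] = $action;
            }
        }
        if (empty($ready)) {
            // Nothing is ready but actions remain: the dependency graph has a cycle
            throw new RuntimeException('Workflow dependencies contain a cycle');
        }
        // Several ready actions become a fork/join pair; a single one runs alone
        $plan[] = count($ready) > 1 ? ['fork' => $ready] : ['action' => $ready[0]];
        foreach ($ready as $action) {
            unset($remaining[$action]);
        }
    }
    return $plan;
}

// The dependencies from the email-rollups YAML definition above
print_r(orderActions([
    'email_campaign_stats'                => [],
    'user_language'                       => [],
    'transactional_lifecycle_email_stats' => [],
    'email_overview'                      => ['email_campaign_stats', 'user_language'],
    'trans_email_overview'                => ['email_campaign_stats', 'user_language'],
]));

Run against the example, this produces a fork of the three independent rollups followed by a fork of email_overview and trans_email_overview, which satisfies the same dependencies even though Arbiter’s component-based approach groups the actions slightly differently.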

Get Arbiter

Arbiter is now available on Github! We’ve been using Arbiter internally already and it’s been very useful for us. If you’re using Oozie we hope Arbiter will be similarly useful for you and welcome any feedback or contributions you have!

Crunching Apple Pay tokens in PHP

Posted by on November 20, 2015 / 2 Comments

Etsy is always trying to make it easier for members to buy and sell unique goods. And with 60% of our traffic now coming from mobile devices, making it easier to buy things on phones and tablets is a top priority. So when Apple Pay launched last year, we knew right away we wanted to offer it to our iOS users, and we shipped it in April. Today we’re open sourcing part of our server-side solution, applepay-php, a PHP extension that verifies and decrypts Apple Pay payment tokens.

Integrating with Apple Pay comes down to two main areas of development: device side and payment-processing side. On the device side, at a high level, your app uses the PassKit framework to obtain an encrypted payment token which represents a user’s credit card info. On the payment-processing side, the goal is to make funds move between bank accounts. The first step here is to decrypt the payment token.

Many payment processors offer APIs to decrypt Apple Pay tokens on your behalf, but in our case, we wanted the flexibility of reading the tokens in-house. It turns out that doing this properly is pretty involved (to get an idea of the complexity, our solution defines 63 unique error codes), so we set out to find a pre-existing solution. Our search yielded a couple of open source projects, but none that fully complied with Apple’s spec. Notably, we couldn’t find any examples of verifying the chain of trust between Apple’s root CA and the payment signature, a critical component in guarding against forged payment tokens. We also couldn’t find any examples written in PHP (our primary language) or C (which could serve as the basis for a PHP extension). To meet our needs, we wrote a custom PHP extension on top of OpenSSL that exposes just two functions: applepay_verify_and_decrypt and applepay_last_error. This solution has worked really well for us over the past six months, so we figured we’d share it to make life easier for anyone else in a similar position.
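
To give a feel for the API surface, here is a rough usage sketch. The two function names come straight from the extension, but the parameter list and return-value handling shown below are assumptions for illustration only; the extension’s README documents the actual signature:

<?php
// Rough sketch only: the parameters and return handling here are assumed for
// illustration; see the applepay-php README for the real signature.
$payment_json = file_get_contents('php://input');            // PKPaymentToken JSON sent by the app
$int_cert     = file_get_contents('apple_intermediate.pem');  // Apple intermediate CA cert (assumed)
$leaf_cert    = file_get_contents('payment_processing.pem');  // merchant payment processing cert (assumed)
$priv_key     = file_get_contents('merchant_private_key.pem');// merchant private key (assumed)

$decrypted = applepay_verify_and_decrypt($payment_json, $int_cert, $leaf_cert, $priv_key);

if ($decrypted === false) {
    // Assuming a falsy return on failure; applepay_last_error() reports
    // which error condition was hit during verification or decryption
    error_log('Apple Pay token rejected: ' . applepay_last_error());
} else {
    // $decrypted now holds the card data needed to continue payment processing
}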

Before releasing the code, we asked Syndis, a security consultancy based out of Iceland, to perform an external code review in addition to our everyday in-house code reviews. Syndis surveyed the code for both design flaws and implementation flaws. They found a few minor bugs but no actual vulnerabilities. Knowing that we wouldn’t be exposing users to undue risk gave us greater confidence to publish the code.

We’ve committed to using the open source version internally to avoid divergence, so expect to see future development on Github. Future work includes splitting off a generalized libapplepay (making it easier to write wrapper libraries for other languages), PHP7 compatibility, and an HHVM port. (By the way, if any of this sounds fun to you, we’d love for you to come work with us.)

We hope this release provides merchants with a solid solution for handling Apple Pay tokens. We also hope it inspires other organizations to consider open sourcing parts of their payment infrastructure.

You can follow Adam on Github @adsr.

Special thanks to Stephen Buckley, Keyur Govande, and Rasmus Lerdorf.

Q3 2015 Site Performance Report

Posted by on November 10, 2015 / No Responses

Sadly, the summer has come to an end here in Brooklyn, but the changing of the leaves signifies one thing—it’s time to release our Q3 site performance report! For this report, we’ve collected data from a full week in September that we will be comparing to a full week of data from May. Similar to last quarter’s report, we will be using box plots to better visualize the data and the changes we’ve seen.

While we love to share stories of our wins, we find it equally important to report on the challenges we face. The prevailing pattern you will notice across all sections of this report is increased latency. Kristyn Reith will provide an update on backend server-side performance, and Mike Adler, one of the newest members of the Performance team, will report on the synthetic frontend and real user monitoring sections of this report.

Server-Side Performance

The server-side data below reflects the time seen by real users, both signed-in and signed-out. As a reminder, we are randomly sampling our data for all pages during the specified weeks in each quarter.

server-side
You can see that with the exception of the homepage, all of our pages have gotten slower on the backend. The performance team kicked off this quarter by hosting a post mortem for a site-wide performance degradation that occurred at the end of Q2. At that time, we had migrated a portion of our web servers to new, faster hardware; however, the way the workload was initially distributed was overworking the old hardware, leading to poor performance at the 95th percentile. Increasing the weighting of the new hardware in the load balancer helped mitigate this. While medians did not see a significant impact over the course of the hardware change, it caused higher highs and lower lows for the 95th percentile. As a heavier page, the signed-in homepage saw the greatest improvement once the weights were adjusted, which contributed to its overall improvement this quarter. The other significant changes seen on the server side can be attributed to two new initiatives launched this quarter: Project Arizona and Category Navigation.

Arizona is a read-only key/value system that serves product recommendations and other generated datasets on a massive scale. It replaces a previous, in-memory system that we had outgrown; Arizona instead uses SSDs to allow for more and varied datasets. This quarter we launched the first phase of the project, which resulted in some expected performance regressions compared with the previous memory-backed system. The first phase focused on correctness, ensuring data remained consistent between the two systems. Future phases will focus on optimizing lookup speed to be comparable to the previous system while offering much greater scalability and availability.

In the beginning of August, our checkout team noticed two separate regressions on the cart page that had occurred over the course of the prior month. We had not been alerted to these slowdowns because, at the end of Q2, the checkout team had launched cart pagination, which improved the performance of the cart page by limiting the number of items loaded, and we had not adjusted our alerting thresholds to match this new normal. Luckily, the checkout team noticed the change in performance and we were able to trace the cause back to testing for Arizona.

While in the midst of testing for Arizona, we also launched a new site navigation bar that appears under the search bar on every page and features eight of the main shopping categories. Not only does the navigation bar make it easier for shoppers to find items on the site, but we also believe that the new navigation will positively affect Search Engine Optimization, driving more traffic to shops. While testing the feature we noticed some performance impacts, so when it launched at the end of August we were watching closely, expecting a performance degradation due to the amount of HTML being generated. The performance impact was felt across the majority of our pages, though it was more noticeable on some pages than others depending on the weight of the page. For example, lighter pages such as the Baseline page appear harder hit because the navigation bar accounts for a significant amount of the page’s overall weight.

In an awesome win, the buyer experience engineering team responded to the anticipated performance hit by ramping up client-side rendering for this new feature, which cut down rendering time on buyer-side pages by caching the HTML output and shipping less to the client.

In addition to the hardware change, Project Arizona, and the new site navigation feature, we have also been investigating a slow, gradual regression across several pages that began in the first half of Q3. Extensive investigation and testing revealed that the regression was the result of limited CPU resources. We are currently adding CPU capacity and anticipate that the affected pages will get faster in the current quarter.

Synthetic Start Render

Let’s move on to our synthetic tests, where instrumented browsers load pages automatically every 10 minutes from several locations. This expands the scope of analysis to include browser-side measurements along with server-side ones. The strength of synthetic measurements is that we can get consistent, highly detailed metrics about typical browser scenarios. We can look at “start render” to estimate when most people first see our pages loading.

Synthetic Start Render
The predominant observation is that our median render-start times across most pages have increased by about 300ms compared to last quarter. You might expect a performance team to feel bummed out about a distinctly slower result, but we actually care more about the overall user experience than about page speed measurements in any given week. The goal of our Performance team is not just to make a fast site, but to encourage discussions that accurately consider performance as one important concern among several.

This particular slowdown was caused by broader use of our new CSS toolkit, which adds 35KB of CSS to every page. We expect the toolkit to be a net win eventually, but we have to pay a temporary penalty while we work on eliminating non-standard styles. Several teams gathered to discuss the impact of this change, which gave us confidence that Etsy’s culture of performance is continuing to mature, despite this particular batch of measurements.

The median render-start time for our search page appears to have increased by 800ms, following a similar degradation last quarter, but we found this to be misleading. We isolated the problem to IE versions 10 and older, which represent only a tiny fraction of Etsy users. The search page renders much faster (around 1100ms) in Chrome, which is far more popular, and that result is consistent with all our other pages across IE and Chrome.

Synthetic checks are vulnerable to this type of misleading measurement because it’s really difficult to build comprehensive labs that match the true diversity of browsers in the wild. RUM measurements are better suited to that task. We are currently discussing how to improve the browsers we use in our synthetic tests.

What was once a convenient metric for estimating experience may become less meaningful as the way a site loads changes fundamentally. We feel it is important to adapt our monitoring to the new realities of our product. We always want to be aligned with our product teams, helping them build the best experience, rather than spending precious time optimizing for metrics that were more useful in the past.

As it happens, we recently made a few product improvements around site navigation (mentioned in the section above). As we optimized the new version, we focused on end-user experience, and it became clear that ‘Webpage Response’ was becoming less and less connected to that experience. Webpage Response includes the time for ALL assets loaded on the page, even requests that are hidden from the end-user, such as deferred beacons.

We are evaluating alternative ways to estimate end-user experience in the future.

Real User Page Load Time

Real user monitoring gives us insight into actual page loads experienced by end-users. Notably, it accounts for the real-world diversity of network conditions, browser versions, and internationalization.

RUM
We can see across-the-board increases, which is in line with our other types of measurements. By looking at the daily summaries of these numbers, we confirmed that the RUM metrics regressed when we launched our revamped site navigation (first mentioned in the server-side section). Engineers at Etsy worked to optimize this feature over the next couple of weeks and made progress, though one optimization ended up causing a regression on some browsers that showed up only in our RUM data. We have a plan to speed this up during the fourth quarter.

Conclusion

In the third quarter, we had our ups and downs with site performance, due to both product and infrastructure changes. It is important to remember that performance cannot be reduced merely to page speed; it is a balancing act of many factors. Performance is a piece of the overall user experience and we are constantly improving our ability to evaluate performance and make wiser trade-offs to build the best experience. The slowdowns we saw this quarter have only reinforced our commitment to helping our engineering teams monitor and understand the impact of the new features and infrastructure changes they implement. We have several great optimizations and tools in the pipeline and we look forward to sharing the impact of these in the next report.
