This is the second post in a series of three about Etsy’s API, the abstract interface to our logic and data. The previous post is about concurrency in Etsy’s API infrastructure. This post covers the operational side of the API infrastructure.
Operations: Architecture Implications
How do the decisions for developing Etsy’s API that we discussed in the first post relate to Etsy’s general architecture? We’re all about Do It Yourself at Etsy. A cloud is just other people’s computers, and not in the spirit of DIY; that’s why we would rather run our own datacenter with our own hardware.
Also, Etsy kicks it old school and runs on a LAMP stack. Linux, Apache, MySQL, PHP. We’ve already talked about PHP being a strictly sequential, single-threaded, shared-nothing environment, leading to our choice of parallel cURL. In the PHP world, everything runs through a front controller, for example index.php. In that file, we have to include other PHP files if we need them, and to make that easier, we usually use an autoloader to help with dependencies.
Every web request gets a new PHP environment in its own instance of the PHP interpreter. The process of setting up that environment is called bootstrap. This bootstrapping process is a fixed cost in terms of CPU work, regardless of the actual work required by the request. By enabling multiple, concurrent HTTP sub-requests to fetch data for a single client request, this fixed cost was multiplied. Additionally, this concurrency encouraged more work to be done within the same wall clock time. Developers built more diverse features and experiences, but at the cost of using more back-end resources. We had a problem.
Problem: PHP time to request + racking servers D:
As more teams adopted the new approach to build features in our apps and on the web, more and more backend resources were being consumed, primarily in terms of CPU usage from PHP. In response, we added more compute capacity, over time growing the API to four times the number of servers prior to API v3. Continuing down this path we would have exhausted space and power in our datacenters. This was not a long-term solution.
To solve this, we tried several strategies at once. First, we skipped some work by allowing developers to mark some subrequests as optional. This approach was abandoned because people used it as a graceful error recovery mechanism, triggering an alternate subrequest, rather than for optional data fetches. This didn’t help us reduce the overall work required for a given client request.
Also, we spent some time optimizing the bootstrap process. The bootstrap tax is paid by all requests and subrequests, making it a good place to focus our efforts. This initially showed benefit with some low hanging fruit, but it was a moving target in a changing codebase, requiring constant work to maintain a low bootstrap tax. We needed other ways of doing less work.
A big step forward was the introduction of HTTP response caching. We had to add caching quickly, and first tried the same cache we use for image serving, Apache Traffic Server. While it is great for caching large image files, it didn’t work as well for smaller, latency-sensitive API responses. We settled on Varnish, which is fast and easy to configure for our needs. Not all endpoints are cached, but for cached ones, Varnish will serve the same response many times. We accept staleness for a short period of 10 to 15 minutes, drastically reducing the amount of work required for these requests. For the cacheable case, Varnish handles thousands of requests per second with an 80% hit rate. Because the API framework requires input parameters to be explicit in the HTTP request, this meshed well with introducing the caching tier. The framework also transparently handles locale, passing the user’s language, currency and region with every subrequest, which Varnish uses to manage variants.
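To make the interplay of explicit parameters, locale variants and accepted staleness concrete, here is a minimal Python sketch. It is not our Varnish configuration: the names `cache_key` and `TtlCache`, the key layout, and the 600-second TTL in the usage note are assumptions chosen purely for illustration.

```python
import time
from urllib.parse import urlencode


def cache_key(path, params, locale):
    """Build a deterministic cache key from route, explicit inputs and locale."""
    # Sorting makes the key independent of the order of the query parameters.
    query = urlencode(sorted(params.items()))
    # Language, currency and region select the variant of the response.
    variant = "|".join(locale[name] for name in ("language", "currency", "region"))
    return f"{path}?{query}#{variant}"


class TtlCache:
    """Tiny in-memory cache that serves an entry until it is older than the TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.store = {}

    def fetch(self, key, compute):
        now = self.clock()
        entry = self.store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]  # hit: possibly stale, but no backend work
        value = compute()  # miss or expired: do the real work once
        self.store[key] = (now, value)
        return value
```

With `TtlCache(ttl_seconds=600)`, two requests for the same route, parameters and locale share one computed response for up to ten minutes, while a request with a different currency gets its own entry.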
The biggest step forward came from a courageous experiment. Dan from the core team looked at bigger organizations that faced the same problem, and tried out Facebook’s hhvm on our API cluster. And got a rocketship. We could do the same work, but faster, solving this issue for us entirely. The performance gain from hhvm was a catalyst for performance improvements that made it into PHP7. We are now completely switched over to PHP7 everywhere, but it’s unclear what we would have done without hhvm back in the day.
In conclusion, concurrency proved to be great for logical aggregation of components, and not so great for performance optimization. Improving database access would do more for that.
Problem: Balancing the load
If we have a tree-like request with sub-requests, a simple solution would be to route this initial request via a load balancer into a pool, and then run all subrequests on the same machine. This leads to a lumpy distribution of work. The next step from here is one uniform pool, and routing the subrequests back into that pool again, to have a good balance across the cluster. Over time (and because we experimented with hhvm), we created three pools that correspond to the three logical tiers of endpoints. In this way, we can monitor and scale each class of endpoints separately, even though each node in all three clusters works the same way.
Where would this not work?
If we sit back and think about this for a bit – how is this architecture specific to Etsy’s ecosystem? Where wouldn’t it work? What are the known problems?
The most obvious gaping hole is that we have no API versioning. How do we even get away with that? We solve this by keeping our public API small and our internal API very very fluid. Since we control both ends of the internal API via client generation and meta-endpoints, the intermediate language of domain objects is free to evolve. It’s tied into our continuous deployment system, moving along with up to 60 deploys per day for etsy.com. And the client is constantly in flux for the internal API.
And as long as it’s internal at Etsy, even the outside layer of bespoke AJAX endpoints is very malleable and matures over time.
Of course this is different for the Apps and the third party API, but those branch off after maturing on the internal API service over several years. Software development companies who focus on an extensive public API or even have that as the main service could not work in this way. They would need an internal place to let the API endpoints mature, which we do on the internal API service that is powering Etsy.
Let’s talk about tooling that we built to learn more about the behavior of our code in practice. Most of the tools that we developed for API v3 are around monitoring the new distributed system.
CrossStitch: Distributed tracing
As we know, with API v2, we had the problem that almost an arbitrary amount of single-threaded work could be generated based on the query parameters, and this was really hard to monitor. Moving from the single-threaded execution model to a concurrent model triggering multiple API requests was even more challenging to monitor. You can still profile individual requests with the usual logging and metrics, but it’s hard to get the entire picture. Child requests need to be tied back to the original request that triggered them, but they themselves might be executed elsewhere on the cluster.
To visualize this, we built a tool for distributed tracing of requests, called CrossStitch. It’s a waterfall diagram of the time spent on different tasks when loading a page, such as HTTP requests, cache queries, database queries, and so on. In darker purple, you can see different HTTP requests being kicked off for a shop’s homepage; for example, the request for the shop’s about page runs in parallel with requests for other API components.
Fanout Manager: Feedback on fanout limit exhaustion for developers
Bespoke API calls can create HTTP request fanout to concurrent components, which in turn can create fanout to atomic component endpoints. This can create a strain on the API and database servers that is not easy for an endpoint developer to be aware of when building something in the development environment or rolling out a change to a small percentage of users.
The fanout manager aims to put a ceiling on the overall resource requests that are in flight, and we are doing this in the scheduler part of the curl callback orchestrator by keeping track of sub-requests in memcached. When a new request hits the API server, a key based on the unique user ID of that root request is put into memcached. This key works as a counter of parallel in-flight requests for that specific root request. The key is passed on to the concurrent and component endpoint subrequests. When the scheduler runs a subrequest, it increments the counter for that key. When the subrequest receives a response and its slot is freed in the scheduler, the counter for the key is decremented. So we always know how many total subrequests are in flight for one root request at the same time.
In a distributed system like this, multiple requests can be competing for the same slot. We have a problem that requires a lock.
To avoid the lock overhead, we circumvent the distributed locking problem by relying on memcached’s atomic increment and decrement operations. We optimistically first increment the memcached key counter, and then check whether the operation was valid and we actually got the slot. Sometimes we have to decrement again because this optimistic assumption is wrong, but in that case we are waiting for other requests to finish anyway and the extra operation makes no difference.
If an endpoint has too many sub-requests in flight, it just waits before being able to make the next request. This provides good feedback for our developers about the complexity of the work before the endpoint goes into production. Also, the fanout limit can be hand-tweaked for specific cases in production, where we absolutely need to fetch a lot of data, and a higher number of parallel requests speeds up that fetching process.
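The optimistic increment-then-check pattern can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our scheduler code: `FakeMemcached` is an in-process stand-in that only mimics the atomicity of memcached’s increment and decrement, and `try_acquire_slot`, `release_slot` and the key name are hypothetical.

```python
import threading


class FakeMemcached:
    """In-process stand-in: incr/decr are atomic, as they are in memcached."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def add(self, key, value):
        with self._lock:
            self._data.setdefault(key, value)

    def incr(self, key, delta=1):
        with self._lock:
            self._data[key] += delta
            return self._data[key]

    def decr(self, key, delta=1):
        with self._lock:
            # Like memcached, never decrement below zero.
            self._data[key] = max(0, self._data[key] - delta)
            return self._data[key]


def try_acquire_slot(cache, key, limit):
    """Increment first, then check whether we actually got a slot."""
    if cache.incr(key) <= limit:
        return True
    # The optimistic assumption was wrong: undo the increment and wait.
    cache.decr(key)
    return False


def release_slot(cache, key):
    """Called when a subrequest has received its response."""
    cache.decr(key)
```

A caller that gets `False` back simply waits and retries later, which is the waiting behavior described above. No lock is ever held across the network, because the only shared state is the counter itself.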
Automated documentation of endpoints: datafindr
We also have a tool for automated documentation of new endpoints. It is called datafindr. It shows endpoints and typed resources, and example calls to them, based on a nightly snapshot of the API landscape.
Wanted: Endpoint decommission tool
Writing new endpoints is easy in our framework, but decommissioning existing endpoints is hard. How can we find out whether an existing endpoint is still being used?
Right now we don’t have such a tool, and to decommission an existing endpoint, we have to explicitly log whether that specific endpoint is called, and wait for an uncertain period of time, until we feel confident enough to decide that no one is using it any more. However, in theory it should be possible to develop a tool that monitors which endpoints become inactive, and how long we have to wait to gain a statistically significant confidence of it being out of use and safe to remove.
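As a rough idea of how such a waiting period could be estimated, here is a small Python sketch. It rests on an assumption that is ours for illustration only: calls to an endpoint arrive as a Poisson process, so that an endpoint still used at a rate of `calls_per_day` stays silent for t days with probability exp(-calls_per_day * t). The function name `quiet_days_needed` and the default threshold are hypothetical.

```python
import math


def quiet_days_needed(calls_per_day, alpha=0.05):
    """Days of silence needed before a usage rate of calls_per_day is implausible.

    Solves exp(-calls_per_day * t) <= alpha for t, rounded up to whole days.
    """
    if calls_per_day <= 0 or not 0 < alpha < 1:
        raise ValueError("calls_per_day must be positive and alpha in (0, 1)")
    return math.ceil(math.log(1 / alpha) / calls_per_day)
```

For example, to rule out an endpoint that is still called about once a week, `quiet_days_needed(1 / 7)` asks for 21 silent days at the 5% level. Real traffic is burstier than a Poisson process, so this would be a lower bound at best.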
The next post will be published in two weeks, and will cover the adoption process of the new API platform among Etsy’s developers. How do you make an organization switch to a new technology and how did this work in the case of Etsy’s API transformation?
Back in 2014, Etsy started using the ELK (Elasticsearch, Logstash & Kibana) stack. We’ve previously written about how we use saved searches as a reactive security mechanism. When we made the transition to ELK, we noticed there was no way to automatically schedule searches and be notified on the results. Today, we’re introducing our open source solution to this problem: 411.
411 is a query scheduler: it executes saved Elasticsearch queries against your cluster, formats the results, and sends them to you as an alert. Our motivation behind creating 411 was to enable us to easily create customizable alerts to enhance our ability to react to important security events.
As a part of that customizability, 411 gives you lots of options for applying filters to the results that are returned from a search. This includes things like removing duplicate alerts, throttling the number of alerts, as well as the ability to forward the alerts on to other systems like JIRA or a webhook. 411 also provides a robust way to handle responding to multiple alerts as well as an audit log to help keep track of changes to searches and alerts.
In addition to the default search functionality for ELK, we’re also including some additional search types out of the box. HTTP is a lightweight Nagios alternative that you can use to alert on service outages, while Graphite allows you to query a Graphite server and alert when thresholds are exceeded (more information on Etsy’s Graphite setup can be found here).
We aimed to make the code as modular as possible to make it easy to extend. We hope the examples we’ve provided are useful and informative for developing your own extensions to 411. If you do create new extensions for 411, we’d be more than happy to review your pull request to integrate new functionality into 411!
Etsy’s commitment to Open Source means we use the same version of 411 as what’s available on GitHub, so you can expect regular project updates. We’ve heard a lot of different ideas regarding different functionality/search types people would like to see based off our talk at Defcon, so we’d appreciate your feedback regarding how you plan on using 411 and features you’d like to see!
This post was written as a collaboration between Ken Lee and Kai Zhong. You can follow Ken on Twitter at @kennysan and you can follow Kai on Twitter at @sixhundredns.
At Etsy we have been doing some pioneering work with our Web APIs. We switched to API-first design, have experimented with concurrency handling in our composition layer, introduced strong typing into our API design, experimented with code generation, and built distributed tracing tools for API as part of this project.
We faced a common challenge: much of our logic was implemented twice. All of the code that was built for the website then had to be rebuilt in our API to be used by our iOS and Android apps.
Problem: repeated logic between platforms
We wanted an approach where we built everything on reusable API components that could be shared between the web and apps. Unfortunately our existing API framework couldn’t support this shared approach. The solution we settled on was to abandon the existing framework and rebuild it from scratch.
Follow along this case study of building an API First architecture, in which functional changes are expressed on the API level before integrating them into the website. Hear what problems prompted this drastic change. Learn which new tools we had to build to be able to work with the new system and what mistakes we made along the way. Finally, how did it end? How did the team adopt the new system and have we succeeded in our goals of API First?
This post will be the first post in a series about our current API infrastructure, which we call version 3. The series is based on a talk at QCon New York. The first post will cover concurrency, the second post will cover operations and the third post the human aspects of our API transition.
If we look into the future, it comes with lots of devices. Mainframes became desktop computers, which became portable laptops and tablets, smart phones and watches.
This trend has been going on for a while, and in order to not reinvent the world on each different device, we started sharing data via an internal API years ago.
The first version of Etsy’s API was a gateway for Flash widgets. The second one was a JSON RESTful API for third parties and internal use. It was tightly coupled to the underlying database schema, and it empowered clients to make customized complex requests. It was so powerful that when we introduced our first iPad App, we did not need to write any new endpoints, and could build it solely on existing ones. Clients could request multiple resources at once, for example request shop data and also include listing data from that shop, and they could specify fields to trim down the response to just the required data. Very powerful.
Second Problem: Performance & complexity control
With great power comes great responsibility, and this approach had some drawbacks. The server code was simple, but we did not know the incoming parameters. We gave the clients control over the complexity of the request via the request parameters. This obviously had implications on server-side performance. And measuring the performance was difficult, because it was not clear if an increased response time was due to the performance of our backend, or because the client requested more resources.
Third Problem: Repetition & inconsistency
Years of changing patterns and an evolving complex codebase with MVC architecture led to bad habits: data fetch during template rendering, and logic in the templates. Our API was for AJAX, whereas the backend code was in PHP. We did not have the logic in one place that was reusable for both the Web and API. This led to inconsistencies between API and pre-API web.
The schema of the API resource was a snapshot of the data model at the time of exposing it via the endpoint. This one-to-one mapping caused problems with data migrations, as the API resource was “frozen in time”. Should it change with the model? How long should the old resource structure be supported?
Requirements for API-first
We re-discussed the requirements for our API. If performance, manifesting for the user as latency from request to response, was a problem, what was the bottleneck?
The first problem is the time to glass: the time until we see something on our device’s screen, as Ilya Grigorik calls it in his talk “breaking the 1000 milliseconds time to glass”. He states that, due to mobile network speed, we have only 100 milliseconds on the server side if we want to stay in budget. The second problem is that we, at Etsy, come from a sequential, shared-nothing PHP world. No built-in concurrency. How can we parallelize and reuse our work, while still keeping the network footprint low?
API v2: repeated logic between platforms; API v3: reusable components
Another requirement was to rethink caching. The previous version of the API was memcached only, caching calls including parameters, which led to a granularity problem. And one last requirement was to solve the problem starting from what we know and what we’re good at – building our own solutions in PHP.
Shaping our mental model
Based on these learnings, we piece-by-piece architected a new version, called API Version 3. REST resources worked well for both mobile apps and traditional web, so that was a keeper. A new idea was to decouple the endpoints from the framework that hosts them. Minimize the endpoints’ responsibilities to:
- declaring the route
- declaring the input expectations and the output guarantees
- implementing what happens in the endpoint.
.. and that’s about it.
We have one very simple, declarative file for each endpoint.
Everything else is architected away on purpose: StatsD error monitoring, endpoint input and output type checks, and the compilation of the full routes are all handled by the framework. Authentication and access control are also handled there, based on the class of endpoint that the developer has chosen.
Enter the meta-endpoint
We picked up the industry ideas from Netflix and eBay’s ql.io of server-side composition of resources into device-view-specific resources. Or in other words: allowing a second layer of endpoints that are consumers of our own API, requesting and aggregating other endpoints. This means the server itself is also a client of the API, making the server more complex, while giving it more control with an extra layer for code execution. This improves performance for the client, because it only needs to make a single request, and network round trips are the biggest bottleneck if we want to have a responsive mobile interface!
These requests used our generated PHP client, and they used cURL. cURL? Let’s talk about this for a bit. And let’s take a step back. The interesting question is how to bring concurrency into the single-threaded world of PHP.
cURL is cool
We’re in an HTTP context, so what about making additional HTTP requests for concurrency? We examined whether this could be done with cURL.
Some time in 2013, Paul tweeted
“curl_multi_info_read() is my new event loop.”
In a hack week project, Paul and Matt from Etsy’s core team figured out that we could in fact achieve concurrency in the HTTP layer, through parallel cURL calls with curl_multi_info_read. The HTTP layer is an interesting layer for this, since there are many existing solutions for routing, load balancing and caching.
In addition to cURL, we added logic to establish dependencies on requests to other endpoints, which we call proxies. We are running the requests when the corresponding proxy becomes unblocked, similar to an event loop, which you might know from NodeJS. The whole concurrency dependency analysis and scheduling is encapsulated within one piece of software, which we call the curl callback orchestrator.
This is great, because from the endpoint author’s point of view the code looks sequential and single-threaded and is just a list of proxy calls to other endpoints. We’re getting closer to a declarative style, expressing our intent, and the orchestrator figures out how to schedule the calls that are necessary for the complete result.
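The idea of running each request as soon as its proxy becomes unblocked can be sketched in Python. Note the swap: this uses `asyncio` as the event loop rather than parallel cURL calls, and it is not the curl callback orchestrator itself. The function `run_proxies` and the shape of the `proxies` mapping are hypothetical, and the sketch assumes the dependencies contain no cycles.

```python
import asyncio


async def run_proxies(proxies):
    """Run every proxy as soon as the proxies it depends on have finished.

    proxies maps a name to (dependency_names, fetch), where fetch is an
    async function that receives the results of its dependencies as a dict.
    """
    tasks = {}

    async def resolve(name):
        dependencies, fetch = proxies[name]
        # Blocked until all dependencies have results; unrelated proxies
        # keep running on the event loop in the meantime.
        dependency_results = await asyncio.gather(
            *(tasks[dependency] for dependency in dependencies)
        )
        return await fetch(dict(zip(dependencies, dependency_results)))

    # Create all tasks up front; none of them starts running before the
    # first await below, so every lookup in tasks succeeds.
    for name in proxies:
        tasks[name] = asyncio.create_task(resolve(name))

    results = await asyncio.gather(*tasks.values())
    return dict(zip(tasks.keys(), results))
```

The caller only declares what depends on what. A call such as `asyncio.run(run_proxies({...}))` then works out the order and overlaps everything that is independent, which is the declarative flavor described above.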
Ok, so we had some good observations about the previous versions of our API, and we have a working prototype for concurrency via cURL.
How did we grow an entire new API framework from here?
Perspectives and Services
Two concepts are special about Etsy’s API v3: perspectives and services.
Perspectives clarify data access rules and give us security hints on what code is permitted for each perspective. They express on whose behalf an API call is being made. So, for example, the Public perspective shows data that a logged-out user would be able to see on Etsy.com.
The Member perspective is for calls made on behalf of a particular Etsy member. The user ID is determined via the user cookie or OAuth token, dependent on the Service, which we will talk about below. The Shop perspective is similar to the Member perspective but is for a shop. The framework will verify that the given shop is owned by the authenticated user. The Admin perspective is like the Member perspective but for Etsy Admin. We occasionally want to take actions from our own servers that may not fit the other perspectives. For this we have the Infrastructure perspective. It is only available on the private internal API and can be used for things such as dataset loading. The Application perspective is for calls made on behalf of a particular API application. It contains the application data for the verified API key.
While perspectives express on whose behalf a call is being made, the service indicates from where the call is being made. A service can also be thought of as the entry point into the API framework. Each service has its own requirements regarding authentication. Endpoints are included in some services by default. Other services are opt-in, and each endpoint has to declare whether it wants to be exposed on those opt-in services.
An example API call
Let’s look at an example request to the etsy.com homepage. We know what the homepage looks like: sections of information that might be interesting for me, as a potential buyer. Up at the top are the listings that I favorited, then some picks that Etsy’s recommendation algorithms picked for me, new items from my favorite shops, activity from my friends, and so on. I think about it as something like this.
If we look at the data in more detail, we see even more structure. It’s like a tree, growing from left to right.
Our setup of network and servers is mirroring the structure of the API call. It starts with an HTTP request from my browser to Etsy’s web server. From there, a bespoke API request is being made to our API server, requesting a personalized version of the homepage data. Internally, this request consists of multiple concurrent components, which are themselves fetched via API requests. My favorites, for example, are a concurrent component, because they are a large number of listing cards that can be fetched in parallel.
So we can imagine an API request as a multi-level tree, kicking off other API requests and constructing an overall result from the results of those subrequests.
Domain specific language of API endpoints
The project that got me started diving deep into Etsy’s API v3 framework was striving to unify the syntax of API endpoints. This was really fun and involved big, automated changes to unify the API codebase. In the past, there were multiple styles in which endpoints could be written. To unify them, we carved out a language of endpoint building blocks.
Some building blocks are mandatory for each endpoint. Each endpoint needs to declare its route, so we know where it should be found on the web. Also, it needs a human readable description, and a resultType.
The result type describes what type of data the endpoint returns. All data we return is JSON encoded, but here we can say that we return a primitive data type, such as a string or a boolean inside that encoding. Or we could return what we call “a typed resource” – a compound type that refers to a specific component of the Etsy application domain, such as a ListingCard.
And then there is the handle function. In there, every endpoint runs the code that it needs to run, to build its response.
Optional building blocks of an API endpoint are also possible. declareInput is only necessary if the endpoint does actually need input parameters. If it doesn’t, the function can be left out.
The includedServices function allows an endpoint to opt into specific services. The EtsyApps service is opt-in for example, so if you want to make your endpoint available on the apps, you have to opt into the EtsyApps service via this function.
And then there is the cacheTtlSeconds function, which allows you to specify whether an endpoint should be cached, and what its time to live should be.
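The split into mandatory and optional building blocks can be pictured with a small Python sketch. Our real endpoints are PHP files, so everything here is an illustrative assumption: the base class `Endpoint`, the example `ShopAbout`, its route and its values are hypothetical, and only the block names mirror the ones described above.

```python
class Endpoint:
    """Hypothetical base class: mandatory blocks raise, optional ones default."""

    def route(self):
        raise NotImplementedError  # mandatory

    def description(self):
        raise NotImplementedError  # mandatory

    def result_type(self):
        raise NotImplementedError  # mandatory

    def handle(self, endpoint_input):
        raise NotImplementedError  # mandatory

    def declare_input(self):
        return {}  # optional: no input parameters by default

    def included_services(self):
        return []  # optional: no opt-in services by default

    def cache_ttl_seconds(self):
        return None  # optional: None means the endpoint is not cached


class ShopAbout(Endpoint):
    """Made-up example endpoint that uses every building block."""

    def route(self):
        return "/shops/:shop_id/about"

    def description(self):
        return "Returns the about text for a shop."

    def result_type(self):
        return "string"

    def declare_input(self):
        return {"shop_id": "shop_id"}

    def included_services(self):
        return ["EtsyApps"]

    def cache_ttl_seconds(self):
        return 600

    def handle(self, endpoint_input):
        return f"About shop {endpoint_input['shop_id']}"
```

An endpoint without input, opt-in services or caching would only override the four mandatory methods and inherit the defaults for the rest.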
Input and output: Typed parameters, typed result
The first step when a request is being routed to the endpoint, is the setup of the input parameters. We create an input object based on the request’s URL and the endpoint’s declareInput function.
The input declaration tells us how to check for optional or mandatory input parameters, which are parsed according to a pattern in the route. If a parameter is missing or of the wrong type, the framework returns an HTTP error code and message. The input declaration specifies a type for each parameter, such as a string or a user ID. The types are Etsy-specific, and each one comes with its own validation function which is being run by the framework. According to the perspective, information about the logged in user, the logged in admin, shop, or authenticated app is being checked as well, and added to the input object.
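Here is a minimal Python sketch of that input setup. It is an illustration rather than the framework’s code: `build_input`, `InputError`, the `:name` route syntax, the declaration format and the two example types are all assumptions, and the perspective checks are left out.

```python
import re


class InputError(Exception):
    """Carries the HTTP status code and message to return to the client."""

    def __init__(self, status, message):
        super().__init__(message)
        self.status = status


def parse_user_id(raw):
    if not raw.isdigit() or int(raw) == 0:
        raise ValueError("not a valid user ID")
    return int(raw)


def parse_string(raw):
    return raw


# Each type comes with its own validation function.
TYPES = {"user_id": parse_user_id, "string": parse_string}


def build_input(route, path, query, declaration):
    """declaration maps a parameter name to (type_name, required)."""
    # Turn "/users/:user_id/favorites" into a regex with named groups.
    pattern = "/".join(
        f"(?P<{part[1:]}>[^/]+)" if part.startswith(":") else re.escape(part)
        for part in route.split("/")
    )
    match = re.fullmatch(pattern, path)
    if match is None:
        raise InputError(404, "no such route")
    raw = {**query, **match.groupdict()}
    parsed = {}
    for name, (type_name, required) in declaration.items():
        if name not in raw:
            if required:
                raise InputError(400, f"missing parameter: {name}")
            continue
        try:
            parsed[name] = TYPES[type_name](raw[name])
        except ValueError as error:
            raise InputError(400, f"invalid parameter {name}: {error}")
    return parsed
```

Because validation lives in the type functions, an endpoint’s `handle` code in this sketch would only ever see parameters that are present when required and already converted to the declared type.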
Each endpoint specifies its own output type via the resultType function. Currently, those types are optional and vary in their level of detail. We encourage developers to either return a primitive datatype, or to build a compound type, called typed resource, corresponding to the shape of the data that their endpoint returns. Type guarantees are useful for the API clients, and bring us one step closer to having guarantees on our data from the browser input field to the database record.
Tooling: API compiler
We need two more pieces of software, which we can automatically compile based on the endpoint declaration files. This is the job of the API compiler. Initially, this was a script that took the routes from the endpoint declarations, together with the service and perspective information, and compiled these into full routes for Apache by modifying the .htaccess files. Performance concerns were alleviated by splitting up the work and files by perspective.
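The route compilation step might look roughly like the following Python sketch. The URL layout of service, then perspective, then route is an assumption made up for this example, as are `compile_routes` and the dictionary format of the declarations; the actual compiler wrote Apache rewrite rules.

```python
from collections import defaultdict


def compile_routes(endpoints):
    """Expand endpoint declarations into full routes, grouped by perspective."""
    by_perspective = defaultdict(list)
    for endpoint in endpoints:
        for service in endpoint["services"]:
            # Hypothetical layout: /<service>/<perspective><route>
            full_route = f"/{service}/{endpoint['perspective']}{endpoint['route']}"
            by_perspective[endpoint["perspective"]].append(
                (full_route, endpoint["handler"])
            )
    # One sorted list per perspective, mirroring the split into separate files.
    return {name: sorted(routes) for name, routes in by_perspective.items()}
```

Grouping the output by perspective keeps each generated list small, which is the same idea as splitting the work and files by perspective.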
This is the first post in a series of three about Etsy’s API, the abstract interface to our logic and data. The next post covers the operational side of Etsy’s API.
Etsy believes in the power of diversity. We believe that having diverse perspectives will help us make better decisions and build better products. We also know that it’s not enough to just recruit diverse talent: we’ve got to retain it!
A key to retaining diverse talent is fostering a supportive work environment. There are a lot of major organizational changes that can help (flexible work arrangements, equal pay, and opportunities for growth and leadership to name a few), but what can you—the individual—really do to help?
It sounds like you want to be an ally! An ally is a person in a position of privilege who offers to share the power, access, and authority that come with that privilege with members of a non-privileged group.
Diversity is intersectional, not limited to gender, race, or any other single axis of identity. Great news: Allyship is intersectional as well! If you’re a man, you can serve as an ally to women. If you’re white, you can serve as an ally to people of color. If you can see, you can serve as an ally to people with vision loss. Anyone can use their privilege to create opportunities for people more marginalized than themselves.
On August 11 in Dublin, Etsy software engineers Toria Gibbs and Ian Malpass will be running a workshop on being an effective male ally to people who identify as women and other underrepresented populations in tech.
One important strategy for being an effective ally is self-education. Women are frequently expected to teach introductory feminism and entertain discussions on “being a woman in tech” with anyone who asks. It’s a great burden to shoulder and frankly a waste of their time. You wouldn’t ask Rasmus to teach you how to write a Hello World program in PHP, right? No! You would go out and find the articles, tutorials, and forum threads that already exist for beginners.
With that, we introduce our list of recommended reading for allies.
Why do we need feminism? Analogies on privilege
Opinion pieces, personal experiences
Other fun stuff
While this list is not exhaustive, it should be more than enough to get you started on your journey. Happy learning!
If you’re interested in hosting your own event to promote male allyship, we recommend checking out NCWIT’s Male Allies and Advocates Toolkit or Ada Initiative’s Ally Skills Workshop.
You can read more about Etsy’s diversity in our latest annual Diversity and Equality Progress Report.
Update 08/12/2016: Slides from Toria and Ian’s presentation are now available on Speaker Deck!
Spring has sprung and we’re back to share how Etsy’s performance fared in Q1 2016. In order to analyze how the site’s performance has changed over the quarter, we collected data from a week in March and compared it to a week’s worth of data from December. A common trend emerged from both our back-end and front-end testing this quarter: we saw significant site-wide performance improvements across the board.
Several members of Etsy’s performance team have joined forces to chronicle the highlights of this quarter’s report. Moishe Lettvin will start us off with a recap of the server-side performance, Natalya Hoota will discuss the synthetic front-end changes and Allison McKnight will address the real user monitoring portion of the report. Let’s take a look at the numbers and all the juicy context.
Server-side time measures how long it takes our servers to build pages. These measurements don’t include any client-side time — this measurement is the amount of time it takes from receiving an HTTP request for a page to returning the response.
As always, we start with these metrics because they represent the absolute lowest bound for how long it could take to show a page to the user, and changes in these metrics will be reflected in all our other metrics. This data is calculated from a random sample of our webserver logs.
Happily, this quarter we saw site-wide performance improvements, due to our upgrade to PHP7. While our server-side time isn’t spent solely running PHP code (we make calls to services like memcache, MySQL, Redis, etc.), we saw significant performance gains on all our pages.
One of the primary ways that PHP7 increases performance is by decreasing memory usage. In the graph below, the blue & green lines represent the 95th percentile of memory usage on a set of our PHP5 web servers, while the red line indicates the memory usage on a set of our PHP7 web servers during our switch-over from PHP5 to PHP7. This is two days’ worth of data; the variance in memory usage (more visible in the PHP5 data) is due to daily variation in server load. Note also that the y-axis origin is 0 on this graph, so the decrease shown is to scale!
Another interesting thing that’s visible in the box plots above is how the homepage server-side timing changed. It sped up across the board, but the distribution widened significantly. The reason for this is that the homepage gets many signed-out views as well as signed-in views, as contrasted with, for instance, the cart view page. In addition, the homepage is much more customized for signed-in users than, say, the listing page. This level of customization requires calls to our storage systems, which weren’t sped up by the PHP7 upgrade. Therefore, the signed-in requests didn’t speed up as much as the signed-out requests, which are constrained almost entirely by PHP speed. In the density plot below, you can see that the bimodal distribution of the homepage timing became more distinct in Q1 (red) vs Q4 of last year (blue):
The Baseline page gives us the clearest view into the gains from PHP7 — the page does very little outside of running PHP code, and you can see in the first chart above that the biggest impact was on the Baseline page. The median time went from 80ms to 29ms, and the variance decreased.
Synthetic Start Render
We collect synthetic monitoring results (i.e., results obtained by web browser emulation of scripted web visits) in order to cross-check our findings from the two other sets of data. We use a third party provider, Catchpoint, to record start render time (the moment when a user first sees content appearing on the screen) as a reference point.
Synthetic tests showed that all pages got significantly faster in Q1.
It is worth noting that the data for synthetic tests is collected for signed-out web requests. As previously mentioned, this type of request involves fetching less state from storage, which highlights PHP7 wins.
I noticed the unusually far-reaching outliers on the shop page and search pages and decided to investigate further. First, I generated scatterplots for the two pages in question, which clarified that the outliers on their own did not have a story to tell: there were no clustering patterns or high volumes on any particular day. Having a better data sanitization process would have eliminated the points in question altogether.
In order to get a better visual for the synthetic data we used, I took comparative scatterplots of all six pages we monitor. I noticed a reduction of failed tests (marked by red diamonds) that happened sometime between Q4 and present day. Remarkably, we have never isolated that data set before. A closer look revealed an important source of many current failures: fetching third party resources was taking longer than the maximum time allotted by Catchpoint. That encouraged us to consider a new metric for monitoring the impact of third party resources.
Real User Page Load Time
Gathering front-end timing metrics from real users allows us to see the full range of our pages’ performance in the wild. As in past reports, we’re using page load metrics collected by mPulse. The data used to generate these box plots is the median page load time for each minute of one week in each quarter.
We see that for the most part, page load times have gone down for each page. The boxes and whiskers of each plot have moved down, and the outliers tend to be faster as well. But it seems that we’ve also picked up some very slow outliers: each page has a single outlier of over 10 seconds, while the rest of the data points for all pages are far under that. Where did these slow load times come from?
Because each of our RUM data points represents one minute in the week and there is exactly one extremely slow outlier for each page, it makes sense that these points might be all from the same time. Sure enough, looking at the raw numbers shows that all of the extreme outliers are from the same minute!
This slow period happened when a set of boxes that queue Gearman jobs were taken out to be patched for the glibc vulnerability that surfaced this quarter. We use Gearman to run a number of asynchronous jobs that are used to generate our pages, so when the boxes were taken out and fewer resources were available for these jobs, back-end times suffered.
One interesting thing to note is that we actually didn’t notice when this happened. The server-side times (and therefore also the front-end times) for most of our pages suffered an extreme regression, but we weren’t notified. This is actually by design!
Sometimes we experience a blip in page performance that recovers very quickly (as was the case with this short-lived regression). It makes little sense to scramble to understand an issue that will automatically resolve within a few minutes — by the time we’ve read and digested the alert, it may have already resolved itself — so we have a delay built in to our alerts. We are only alerted to a regression if a certain amount of time has passed and the problem hasn’t been resolved (for individual pages, this is 40 minutes; when a majority of our pages’ performance degrades all at once, the delay is much shorter).
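A minimal sketch of this kind of delayed alert, assuming one timing value per minute. The function name `should_alert`, the threshold, and the sample values are hypothetical; only the 40-minute delay for individual pages comes from the description above.

```python
from typing import Sequence


def should_alert(per_minute_ms: Sequence[float],
                 threshold_ms: float,
                 delay_minutes: int = 40) -> bool:
    """Return True only if the most recent `delay_minutes` samples are all
    above `threshold_ms`, i.e. the regression has lasted the full delay.

    `per_minute_ms` is assumed to hold one value per minute, oldest first.
    """
    if len(per_minute_ms) < delay_minutes:
        return False
    recent = per_minute_ms[-delay_minutes:]
    return all(value > threshold_ms for value in recent)


# Made-up data: a one-minute spike versus a regression that persists.
blip = [300.0] * 60 + [12000.0] + [300.0] * 5
sustained = [300.0] * 20 + [900.0] * 40

print(should_alert(blip, threshold_ms=500.0))       # False
print(should_alert(sustained, threshold_ms=500.0))  # True
```

The one-minute spike never triggers the alert, which is exactly the trade-off discussed next.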
This delay ensures that we won’t scramble to respond to an alert if the problem will immediately fix itself, but it does mean that we don’t have any insight into extreme but short-lived regressions like this one, and that means we’re missing out on some important information about our pages’ performance. If something like this happens once a week, is it still something that we can ignore? Maybe something like this happens once a day — without tracking short-lived regressions, we don’t know. Going forward, we will investigate different ways of tracking short-lived regressions like this one.
You may also have noticed that while the slowdown that produced these outliers originated on the server side, the outliers are missing from both the server-side and synthetic data. This is because of the different collection methods that we use for each type of data. Our server-side and synthetic front-end datasets each contain 1,000 data points sampled from visits throughout the week (with the exception of the baseline page, which has a smaller server-side dataset because it receives fewer visits than other pages). This averages to only about 143 data points per day, or roughly one data point every ten minutes, so it’s likely that no data from the short regression made it into the synthetic or server-side datasets at all. Our front-end RUM data, on the other hand, has one data point (the median page load time) for every minute, so the regression was guaranteed to be represented in that dataset as long as at least 50% of views were affected.
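To put a rough number on "likely", here is a back-of-the-envelope sketch. It assumes each of the 1,000 points is drawn independently and uniformly over the week, which real traffic is not, so treat the result as an order-of-magnitude estimate. The function name is made up for illustration.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080 one-minute buckets
SAMPLE_SIZE = 1000              # data points per page per week, from the text


def chance_minute_is_sampled(sample_size: int = SAMPLE_SIZE,
                             minutes: int = MINUTES_PER_WEEK) -> float:
    """Probability that at least one sampled point lands in one given minute,
    assuming independent, uniform sampling over the week."""
    return 1.0 - (1.0 - 1.0 / minutes) ** sample_size


print(f"samples per day: {SAMPLE_SIZE / 7:.1f}")
print(f"chance a given minute is sampled: {chance_minute_is_sampled():.1%}")
```

Under these assumptions the chance that a single bad minute shows up in a sampled dataset is a bit under 10%.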
The nuances in the ways that we collect each metric are certainly very interesting, and each method has its pros and cons (for example, heavily sampling our server-side and synthetic front-end data leads to a narrower view of our pages’ performance, but collecting medians for our front-end RUM data to display in box plots is perhaps statistically unsound). We plan to continue iterating on this process to make it more appropriate to the report and more uniform across our different monitoring stacks.
In the first quarter of 2016, performance improvements resulting from the server-side upgrade to PHP7 trickled down through all our data sets and the faster back-end times translated to speedups in page load times for users. As always, the process of analyzing the data for the sections above uncovered some interesting stories and patterns that we may have otherwise overlooked. It is important to remember that the smaller stories and patterns are just as valuable as learning opportunities as the big wins and losses.
Imagine how you would feel if you went into a grocery store, and the prices were gibberish (“1,00.21 $” or “$100.A”). Would you feel confident buying from this store?
Etsy does business in more than 200 regions and 9 languages. It’s important that our member experience is consistent and credible in all regions, which means we have to format prices correctly for all members.
In this post, I’ll cover:
- Examples of bad currency formatting
- How you can format currency correctly
- Practical implementation decisions we made along the way
In order to follow along, you need to know one important thing: Currency formatting depends on three attributes: the currency, the member’s location, and the member’s language.
Examples of bad currency formatting
Here are some examples of bad currency formatting:
- A member browsing in German goes to your site and sees an item for sale for “1,000.21 €”.
- A Japanese member sees an item selling for “¥ 847,809.34”.
- A Canadian member sees “$1.00”.
If you don’t know why the examples above are confusing, read on.
What’s wrong with: A member browsing in German goes to your site and sees an item for sale for “1,000.21 €”?
The first example is the easiest. If a member is browsing in German, the commas and periods in a price should be swapped. So “1,000.21 €” should really be formatted as “1.000,21 €”. This isn’t very confusing (as a German member, you can figure out what the price is *supposed* to be), but it is a bad experience.
By the way, if you are in Germany, using Euros, but browsing in English, what would you expect to see? Answer: “€1,000.21”. The separators and symbol position are based on language here, not region.
What’s wrong with: A Japanese member sees an item selling for “¥ 847,809.34”?
Japanese Yen doesn’t have a fractional part. There’s no such thing as half a Yen. So “¥ 847,809.34” could mean “¥ 847,809”, or “¥ 84,780,934” or something else entirely.
What’s wrong with: A Canadian member sees “$1.00”?
If your site is US-based, this can be confusing. Does “$” mean Canadian dollar or US dollar here? A simple fix is to add the currency code at the end: “$1.00 USD”.
How to format currency correctly
Etsy’s locale settings picker
Formatting currency for international members is hard. Etsy supports browsing in 9 languages, 23 currencies, and hundreds of regions. Luckily, we don’t have to figure out the right way to format in all of these combinations, because the nice folks at CLDR have done it for us. CLDR is a massive database of formatting styles that gets updated twice a year. The data gets packaged up into a portable library called libicu. Libicu is available everywhere, including mobile phones. If you want to format currency, you can use CLDR data to do it.
For each language + region + currency combination, CLDR gives you:
- The currency symbol
- The currency code
- The decimal and grouping separators
- The format pattern.
A typical pattern looks like this:
A CLDR pattern: “#,##0.00 ¤”
This is the pattern for German + Germany + Euros. It tells you:
- The currency symbol goes at the end
- There’s a space between the value and the currency symbol
- Euros are grouped in sets of three, like 1,000,000 (vs. something like rupees, which are grouped in sets of two after the first three digits: 1,00,00,000. The pattern for Hindi + India + Rupees is “¤ #,##,##0.###”)
- This is a fractional currency, formatted up to a precision of 2 (vs. Japanese Yen, which is not a fractional currency, and uses the format “¤ #,##0”).
NOTE: the pattern does *not* tell you what the decimal and grouping separators are. CLDR gives you those separately; they are not a part of the pattern.
Now you can use this information to format a value:
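As a rough illustration, here is a simplified formatter. It does not parse CLDR patterns or read CLDR data; the caller passes in the symbol, separators, precision, and symbol placement that CLDR would supply. The names `format_currency` and `group_digits` and the half-up rounding mode are assumptions made for this sketch. It only handles uniform groups of three (so not the rupee pattern) and rejects negative amounts.

```python
from decimal import Decimal, ROUND_HALF_UP


def group_digits(digits: str, separator: str, size: int = 3) -> str:
    """Insert `separator` between groups of `size` digits, counting from the right."""
    groups = []
    while len(digits) > size:
        groups.insert(0, digits[-size:])
        digits = digits[:-size]
    groups.insert(0, digits)
    return separator.join(groups)


def format_currency(value, symbol, decimal_sep, group_sep,
                    fraction_digits=2, symbol_first=False, space=True):
    """Format `value` using locale details supplied by the caller."""
    quantum = Decimal(1).scaleb(-fraction_digits)  # 2 digits -> Decimal('0.01')
    rounded = Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP)
    if rounded < 0:
        raise ValueError("negative amounts are out of scope for this sketch")
    whole, _, fraction = f"{rounded:f}".partition(".")
    number = group_digits(whole, group_sep)
    if fraction_digits > 0:
        number += decimal_sep + fraction
    gap = " " if space else ""
    return f"{symbol}{gap}{number}" if symbol_first else f"{number}{gap}{symbol}"


# German + Germany + Euros
print(format_currency("1000.21", "€", decimal_sep=",", group_sep="."))
# -> 1.000,21 €

# English + Germany + Euros
print(format_currency("1000.21", "€", decimal_sep=".", group_sep=",",
                      symbol_first=True, space=False))
# -> €1,000.21

# Japanese Yen has no fractional part
print(format_currency("847809.34", "¥", decimal_sep=".", group_sep=",",
                      fraction_digits=0, symbol_first=True))
# -> ¥ 847,809
```

In practice you would let an ICU-backed library do this work; the sketch only shows how the separate pieces (pattern shape, separators, symbol) combine.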
Practical implementation decisions
CLDR is great, but it is not the ultimate authority. It is a collaborative project, which means that anyone can add currency data to CLDR, and then everyone votes on whether the data looks correct or not. People can also vote to change existing currency data.
CLDR data is not fixed; it is fluid and changing. Sometimes you need to customize CLDR for your use case. Here are the customizations we made.
The problem with currencies that use a dollar sign ($)
We use CLDR to format currency at Etsy, but we’ve made some changes to it. One issue in particular has really bugged us. Dollar currencies are really hard to work with. The symbol for CAD (Canadian dollars) is “$” in Canada, but it is “CA$” in the US and everywhere else to avoid confusion with US Dollars. So if we followed CLDR, Canadian members would see “$1.00”. But our Canadian members might know that Etsy is a US-based company, in which case “$” would be ambiguous to them — it could mean either Canadian dollars or US dollars. Here is how we choose a currency symbol to avoid confusion while still meeting member expectations:
What symbol does Etsy use for dollar-based currencies?
Here is the value “1000.21” formatted in different currency + region combinations:
You might be wondering, why not just add the currency code to the end of the price? For example, it could be “$1,000.21 USD” for US dollars, and “$1,000.21 CAD” for Canadian dollars. This is also explicit, and we wouldn’t need complicated logic to change the currency symbol. But this approach has another issue: redundancy.
Suppose we did add the currency code at the end everywhere to address the CAD problem. Euros would get formatted as “1.000,21 € EUR”, but the “€ EUR” is redundant. Even worse, the Swiss franc doesn’t have a currency symbol, so CLDR recommends using the currency code as the currency symbol. That means Swiss members would see “1.000,21 CHF CHF”, which is definitely redundant:
Adding the currency code at the end is explicit, but doesn’t meet member expectations. Our German members said they didn’t like how “1.000,21 € EUR” looked.
In the end Etsy decided not to show the currency code. Instead, we change the currency symbol as needed to avoid confusion.
Listing price with settings English / Canada / Canadian dollars
Overriding CLDR data
Here’s a simple case where we overrode CLDR formatting. We are a website, so of course we want our prices to be wrapped in html tags so that they can be styled appropriately. For example, on our listings manager, we want to format price input boxes correctly based on locale:
It’s hard to wrap a price in html tags *after* you have done the formatting: sometimes the symbol is at the end, sometimes there’s a space between the symbol and value, sometimes there isn’t, and so on. To make this work, the html tags need to be a part of the pattern, so we need to be able to override the CLDR patterns directly.
Ultimately we ended up overriding a lot of the default CLDR data:
- pattern for negative formatting
- adding html tags
Consistent formatting across platforms
We wrote a script that would export all our CLDR overrides as JSON / XML / plist. Every time the overrides change, we run the script to generate new data for all platforms. Here’s what our JSON file looks like right now (excerpt):
"AUD": "#,##0.00 \u00a4",
"BRL": "#,##0.00 \u00a4",
"CAD": "#,##0.00 \u00a4"
We wrote another script to generate test fixtures, which look like this (excerpt):
"100000": "1.000,00 \u20ac",
"100021": "1.000,21 \u20ac"
"100000": "1.000,00 \u20ac",
"100021": "1.000,21 \u20ac"
"100000": "1.000,00 $",
"100021": "1.000,21 $"
This test says that given these settings:
- Show the currency symbol
- Hide the currency code
- Format as plain text, not html
- For the settings de/US/USD
- The value 100021 should be formatted as “1.000,21 $”
We have hundreds of tests in total to check every combination of language/region/currency code with symbol shown vs. hidden, formatted as text vs. html, etc. These expected values get checked against the output of the currency formatters on all platforms, so we know that they all format currency correctly and consistently. Any time an override changes (for example, changing the symbol for CAD to be “CA$” in all regions), we update the CLDR data file so that the new override gets spread to all platforms. Then we update the test fixtures and re-run the tests to make sure the override worked on all platforms.
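A fixture check along these lines can be sketched as follows. The fixture layout (a settings key such as `de/US/USD` mapping values to expected strings), the function names, and the stand-in formatter are all hypothetical; only the value and expected-string pairs mirror the excerpt above. The sketch also assumes the fixture values are in minor units (cents), which is what 100021 mapping to “1.000,21 $” suggests.

```python
import json

# Hypothetical fixture layout; the real files may be structured differently.
FIXTURES = json.loads("""
{
  "de/US/USD": {"100000": "1.000,00 $", "100021": "1.000,21 $"}
}
""")


def check_fixtures(fixtures, formatter):
    """Return a list of (settings, value, expected, actual) mismatches."""
    failures = []
    for settings, cases in fixtures.items():
        language, region, currency = settings.split("/")
        for minor_units, expected in cases.items():
            actual = formatter(int(minor_units), language, region, currency)
            if actual != expected:
                failures.append((settings, minor_units, expected, actual))
    return failures


def toy_formatter(minor_units, language, region, currency):
    """Stand-in for a platform formatter; it only knows the de/US/USD case."""
    whole, cents = divmod(minor_units, 100)
    grouped = f"{whole:,}".replace(",", ".")
    return f"{grouped},{cents:02d} $"


# An empty list means every fixture passed.
print(check_fixtures(FIXTURES, toy_formatter))
```

Running the same fixture file against each platform's formatter is what keeps the outputs consistent; the harness itself stays tiny.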
No more “¥ 847,809.34”! Formatting currency is hard. If you want to do it correctly, use the CLDR data, but make sure that you override it when necessary based on your unique circumstances. I hope our changes lead to a better experience for international members. Thanks for reading!
Machine Translation at Etsy
At Etsy, it is important our global member base can communicate with one another, even when they don’t speak the same language. Whether users are browsing listings, messaging other users, or posting comments in the forums, machine translation is a valuable tool for facilitating multilingual interactions on our site and in our apps.
Listing descriptions account for the bulk of text we machine translate. With over 35 million active listings at an average length of nearly 1,000 characters, and 10 supported site languages, we need to translate a lot of content—and that’s just for listings. We also provide machine translation for listing reviews, forum posts, and conversations (messaging between members). We send text we need to translate to a third party machine translation service, and given the associated cost, there is a limit to the number of characters we can translate per month.
An example listing review translation.
While a user can request a listing translation if we don’t already have one (we call this on-demand translation), translating a listing beforehand and showing a visitor the translation automatically (we call this pre-translation) provides a more fluid browsing experience. Pre-translation also allows listings to surface in search results in multiple languages, both for searches on Etsy and on external search engines like Google.
The Benefits of a Translation Memory
Many of the strings we machine translate from one language to another are text segments we’ve seen before. Our most common segments are used in millions of listings, with a relatively small subset of distinct segments accounting for a very large proportion of the content. For example, the sentence “Thanks for looking!” appears in around 500,000 active listings on Etsy, and has appeared in over 3 million now inactive listings.
Frequency and rank of text segments (titles, tags, and description paragraphs) appearing in listings on Etsy. The distribution of segments roughly conforms to a classic Zipfian shape, where a string’s rank is inversely proportional to its frequency.
In the past, a single text segment that appeared in thousands of listings on Etsy would be re-translated once for every listing. It would also be re-translated any time a seller edited a listing. This was a problem: our translation budget was being spent on millions of repeat translations when it would be better used to translate unique content into more languages.
To solve this problem, we built a translation memory. At its simplest, a translation memory stores a text segment in one language and a corresponding translation of that segment in another language. Storing strings in a translation memory allows us to serve translations for these strings from our own databases, rather than making repeated requests to the translation service.
Storing these translations for later reuse has two main benefits:
- Coverage: By storing common translations in the translation memory and serving them ourselves, we drastically reduce the number of duplicate segments we send to the translation service. This process lets us translate seven times more content for the same cost.
- Quality: We’re able to see which text segments are most commonly used on Etsy and have these segments human translated. Overriding these common segments with human translations improves the overall quality of our translations.
We had two main concerns when planning the translation memory architecture:
- Capacity: The more text segments we store in the translation memory, the greater our coverage. However, storing every paragraph from each of our more than 35 million active listings, and a translation of that paragraph for each of our 10 supported languages, would mean a huge database table. Historically, Etsy has rarely had tables exceeding a few billion rows, and we wanted to keep that maximum limit here.
- Deletions: The translation service’s quality is continually improving, and to take full advantage of these improvements we need to periodically refresh entries in the translation memory by deleting older translations. We wanted to be able to delete several hundred million rows on a monthly basis without straining system resources.
The Translation Memory Architecture
Our Translation Memory consists of several separate services, each handling different tasks. A full diagram of the pipeline is below:
A brief overview of each step:
- Splitting into segments: The first step of the translation pipeline is splitting blocks of text into individual segments. The two main choices here were splitting by sentence or splitting by paragraph. We chose the latter for a few reasons. Splitting by sentence gave us more granularity, but our estimated Translation Memory hit rate was only 5% higher with sentences versus paragraphs. The increased hit rate wasn’t high enough to warrant the extra logic needed to split by sentence, nor the multi-fold database size increase to store every sentence, instead of just every paragraph. Moreover, although automatic sentence boundary detection systems can be quite good, a recent study evaluated the most popular systems on user-generated content and found that accuracy peaked at around 95%. In contrast, using newline characters to split paragraphs is straightforward and an almost error-free way to segment text.
- Excluder: The excluder is the first service we run translations through. It removes any text we don’t want to translate. For now this means lines containing only links, numbers, or special characters.
- Human Translation Memory (HTM): Before looking for a machine translation, we check first for an existing human translation. Human translations are provided by Etsy’s professional translators (the same people who translate Etsy’s static site content). These strings are stored in a separate table from the Machine Translation Memory and are updated using an internal tool we built, pictured below.
- Machine Translation Memory (MTM): We use sharded MySQL tables to store our machine translation entries. Sharded tables are a well-established pattern at Etsy, and the system works especially well for handling the large row count needed to accommodate all the text segments. As mentioned earlier, we periodically want to delete older entries in the MTM to clear out unused translations, and make way for improved translations from the translation service. We partition the MTM table by date to accommodate these bulk deletions. Partitioning allows us to drop all the translations from a certain month without worrying about straining system resources or causing lag in our master-master pairs.
- External Translation Service: If there is new translatable content that doesn’t exist in either our HTM or MTM, we send it to the translation service. Once translated, we store the segment in the MTM so it can be used again later.
- Re-stitching segments: Once each of the segments has passed through one of our four services, we stitch them all back together in the proper order.
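The steps above can be sketched in a few lines. This is an illustrative toy, not the production pipeline: plain dictionaries stand in for the sharded HTM and MTM tables, `external` stands in for the third party translation service, and the excluder regex is a guess at a rule for "only links, numbers, or special characters". All names are made up for the example.

```python
import re
from typing import Callable, Dict, Tuple

Key = Tuple[str, str]  # (source segment, target language)

# Illustrative excluder rule: a bare link, or a line with no letters at all.
EXCLUDE = re.compile(r"^(https?://\S+|[\W\d_]*)$")


def translate_text(text: str, target: str,
                   htm: Dict[Key, str], mtm: Dict[Key, str],
                   external: Callable[[str, str], str]) -> str:
    out = []
    for segment in text.split("\n"):          # 1. split on newlines (paragraphs)
        key = (segment, target)
        if EXCLUDE.match(segment.strip()):    # 2. excluder: pass through untouched
            out.append(segment)
        elif key in htm:                      # 3. a human translation wins
            out.append(htm[key])
        elif key in mtm:                      # 4. stored machine translation
            out.append(mtm[key])
        else:                                 # 5. call out, then store the result
            mtm[key] = external(segment, target)
            out.append(mtm[key])
    return "\n".join(out)                     # 6. re-stitch in the original order


calls = []

def fake_service(segment: str, target: str) -> str:
    """Pretend translation service that records what it was asked to translate."""
    calls.append(segment)
    return f"[{target}] {segment}"


htm = {("Thanks for looking!", "fr"): "Merci pour la visite !"}
mtm: Dict[Key, str] = {}
listing = "Handmade mug\nhttps://example.com\nThanks for looking!"

print(translate_text(listing, "fr", htm, mtm, fake_service))
print(translate_text(listing, "fr", htm, mtm, fake_service))
print(calls)  # ['Handmade mug']: the second run was served from the MTM
```

Only one segment ever reaches the fake service: the link is excluded, the closing line comes from the HTM, and the repeat request is answered from the MTM.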
We implemented the Excluder, HTM, and MTM in that order. Implementing the Excluder first allowed us to refine the text splitting, restitching, and monitoring aspects of the pipeline before worrying about data access. Next we built the HTM and populated it with several hundred translations of the most common terms on Etsy. Finally, at the end of November 2015, we began storing and serving translations from the MTM.
Coverage: As you can see from the graphs above, we now only send out 14% of our translations to the translation service, and the rest we can handle internally. Practically, this means we can pre-translate over seven times more text on the same budget (1 / 0.14 ≈ 7.1). Prior to implementing the translation memory, we pre-translated all non-English listings into English, and a majority of the rest of our listings into French and German. With the translation memory in place, we are pre-translating all eligible listings into English, French, German, Italian, and Japanese, with plans to scale to additional languages.
Quality: Around 1% of our translations (by character count) are now served by the human translation memory. These HTM segments are mostly listing tags. These tags are important for search results and are easily mis-translated by an MT system because they lack the context a human translator can infer more easily. Additionally, human translators are better at conveying the colloquial tone often used by sellers in their listing descriptions. With the HTM in place, the most common paragraph on Etsy, “Thanks for looking!” is human translated into the friendlier, “Merci pour la visite !” rather than the awkward, “Merci pour la recherche !” The English equivalent of this difference would be, “Thanks for visiting!” versus “Thanks for researching!”
Monitoring: Since a majority of our translation requests are now routed to the MTM rather than the third-party translation service, we monitor our translations to make sure they are sufficiently similar to those served by the translation service. To do this, we sample 0.1% of the translations served from the MTM and send an asynchronous call to the translation service to provide a reference translation of the string. Then we log the similarity (the percentage of characters in common) and Levenshtein distance (also known as edit distance) between the two translations. As shown in the graph below, we track these metrics to ensure the stored MTM translations don’t drift too far from the original third party translations.
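For readers unfamiliar with the two metrics, here is a small sketch of how they could be computed. The Levenshtein function is the standard dynamic-programming algorithm. The exact similarity formula is not given above, so `difflib.SequenceMatcher.ratio()` from the Python standard library is used as a stand-in; it is an assumption, not necessarily the measure behind the graph.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Share of matching characters, from 0.0 to 1.0 (difflib's ratio as a stand-in)."""
    return SequenceMatcher(None, a, b).ratio()


stored = "Merci pour la visite !"
reference = "Merci pour la recherche !"
print(levenshtein(stored, reference), round(similarity(stored, reference), 2))
```

Identical strings give a distance of 0 and a similarity of 1.0, so a stored translation that still matches the service's output sits at the top of the similarity graph.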
For comparison, as you can see below, the similarity for HTM translations is not as high, reflecting the fact that these translations were not originally drawn from the third party translation service.
Correcting mis-translations: Machine translation engines are trained on large amounts of data, and sometimes this data contains mistakes. The translation memory gives us more granular control over the translated content we serve, allowing us to override incorrect translations while the translation service we use works on a fix. Below is an example where “Realistic bird” is mis-translated into German as “Islamicrevolutionservice.”
With the translation memory, we can easily correct problematic translations like this by adding an entry to the human translation memory with the original listing title and the correct German translation.
Respecting sellers’ paragraph choices: Handling paragraph splitting ourselves had the additional benefit of improving the quality of translation for many of our listings. Etsy sellers frequently include lists of attributes and other information without punctuation in their listings. For example:
Dimensioni 24×18 cm
Spedizione in una scatola protettiva in legno
Verrà fornito il codice di monitoraggio (tracking code)
The translation service often combines these lists into a single sentence, producing a translation like this:
Size 24 x 18 cm in a Shipping box wooden protective supplies the tracking code (tracking code)
By splitting on paragraphs, our sellers’ choice of where to put line breaks is now always retained in the translated output, generating a more accurate (and visually appealing) translation like this:
Size 24 x 18 cm
Shipping in a protective wooden box
You will be given the tracking code (tracking code)
Splitting on paragraphs prior to sending strings out for translation is an improvement we could have made independent of the translation memory, but it came automatically with the infrastructure needed to build the project.
Greater accuracy for listing translations means buyers can find the items they’re looking for more easily, and sellers’ listings are more faithfully represented when translated. To continue improving quality, over the next month we are rolling out a machine translation engine trained on top of the translation service’s generic engine. A machine translation engine customized with Etsy-specific data, in conjunction with more human translated content, will produce higher-quality translations that more closely reflect the colloquialisms of our sellers.
Building a community-centric, global marketplace is a core tenet of Etsy’s mission. Machine translation is far from perfect, but it can be a valuable tool when fostering an online community built around human interaction. The translation memory allows us to bring this internationalized Etsy experience to more users in more languages, making it easier to connect buyers and sellers from around the world.
Co-authored with Ben Russell
At Etsy, the vast majority of our computing happens on physical servers that live in our own data centers. Since we don’t do much in the cloud, we’ve developed tools to automate away some of the most tedious aspects of managing physical infrastructure. This tooling helps us take new hardware from initial power on to being production-ready in a matter of minutes, saving time and energy for both data center technicians racking hardware and engineers who need to bring up new servers. It was only recently, however, that this toolset started getting the love and attention that really exemplifies the idea of code as craft.
The Indigo Tool Suite
The original idea for this set of tools came from a presentation on Scalable System Operations that a few members of the ops team saw at Velocity in 2012. Inspired by the Collins system that Tumblr had developed, but disappointed that it wasn’t yet (at the time) open source or able to work out of the box with our particular stack of infrastructure tools, the Etsy ops team started writing our own. In homage to Tumblr’s Phil Collins tribute, we named the first Ruby script of our own operations toolset after his bandmate Peter Gabriel. As that one script grew into many, that naming scheme continued, with the full suite and all its components eventually being named after Gabriel and his songs.
While many of the technical details of the architecture and design of the tool suite as it exists today are beyond the scope of this post, here is a brief overview of the different components that currently exist. These tools can be broken up into two general categories based on who uses them. The first are components used by our data center team, who handle things like unboxing and racking new servers as well as hardware maintenance and upgrades:
- sledgehammer: a command-line tool for getting a machine set up in RackTables (our datacenter asset management system) and formatting/partitioning disks as needed, such as creating RAID arrays, and configuring the out-of-band management interface. It consists of a default Cobbler profile that machines will boot to if they don’t already have an operating system, so the data center team can power on boxes and have them start installing automatically – we call this sledgehammer’s unattended mode.
- This default Cobbler profile loads a bootable disk image that runs what we call the sledgehammer executor, which performs some setup steps such as loading various configuration files and then installs and runs another software package we create that takes over for the rest of the setup steps.
- This package is called the sledgehammer payload, which consists of some shared Indigo libraries and executable code that actually handles setting up the out-of-band management interface, configuring networking, and saving hardware details back to RackTables. This is built as a separate software package to avoid the friction of rebuilding the entire disk image as often.
The other set of tools are primarily used by engineers working in the office, enabling them to take boxes that have already been set up by the data center team and sledgehammer and get them ready to be used for specific tasks:
- gabriel: a command-line tool for installing an operating system on a machine and getting it configured with Chef
- zaar: a command-line tool for decommissioning boxes and removing them from production
- indigo-sweeper: a daemon that makes sure out-of-band management commands sent to servers and Cobbler syncs aren’t duplicated between multiple users and multiple runs of the tools
- indigo-tailer: a daemon that allows for live-tailing of remote server build logs on the command line and on the web
- indigo-web: a web frontend and API. The web frontend provides a friendlier interface for non-ops engineers who might not know the ins and outs of the command line tool, by providing easier-to-understand form fields rather than relying on a series of command line arguments. The API provides various functionality including an endpoint with a mutex lock to prevent multiple simultaneous builds getting the same IP address assigned to them – this allows multiple boxes to be powered on and provisioned at once without worrying about race conditions.
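To make the race condition concrete, here is a minimal sketch of handing out IP addresses behind a lock. It is not Indigo's actual code (Indigo itself is not shown in this post); the `IpAllocator` class, the `provision_many` helper, and the address pool are all hypothetical, and an in-process `threading.Lock` stands in for whatever locking the real API endpoint uses.

```python
import ipaddress
import threading


class IpAllocator:
    """Hypothetical allocator: hands out each address in a pool at most once."""

    def __init__(self, cidr):
        # hosts() yields the usable addresses of the network, in order.
        self._free = [str(ip) for ip in ipaddress.ip_network(cidr).hosts()]
        self._assigned = {}
        self._lock = threading.Lock()

    def allocate(self, hostname):
        # The check and the assignment happen under one lock, so two
        # simultaneous builds can never both take the same address.
        with self._lock:
            if hostname in self._assigned:
                return self._assigned[hostname]
            if not self._free:
                raise RuntimeError("address pool exhausted")
            ip = self._free.pop(0)
            self._assigned[hostname] = ip
            return ip


def provision_many(allocator, hostnames):
    """Simulate several boxes being built at once, one thread per box."""
    results = {}

    def build(name):
        results[name] = allocator.allocate(name)

    threads = [threading.Thread(target=build, args=(n,)) for n in hostnames]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Without the lock, two threads could both read the same first free address before either removed it from the pool, which is exactly the kind of duplicate assignment the endpoint is there to prevent.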
While many of the details of the inner workings of this automation tooling could be a blog post in and of themselves, the key aspect of the system for this post is how interconnected it is. Sledgehammer’s unattended mode, which has saved our data center team hundreds—if not thousands—of hours of adding server information to RackTables by hand, depends on the sledgehammer payload, sledgehammer executor, API, and the shared libraries that all these tools use all working together perfectly. If any one part of that combination isn’t working with the others, the whole thing breaks, which gets in the way of people, especially our awesome data center team, getting their work done.
Over the years, many many features have been added to Indigo, and as members of the operations team worked to add those features, they tried to avoid breaking things in the process. But testing had never been high on Indigo’s list of priorities – when people started working on it, they thought of it more as a collection of ops scripts that “just work” rather than a software engineering project. Time constraints sometimes played a role as well – for example, sledgehammer’s unattended mode in all its complex glory was rolled out in one afternoon, because a large portion of our recent data center move was scheduled for the next day and it was more important at that point to get that feature out for the DC team to use than it was to write tests.
For years, the only way of testing Indigo’s functionality was to push changes to production and see what broke—certainly not an ideal process! A lack of visibility into what was being changed compounded the frustration with this process.
When I started working on Indigo, I was one of the first people with a formal computer science background to have touched that code, so one of the first things I thought of was adding unit tests, like we have for so much else of the code we write at Etsy. I soon discovered that, because the majority of the Indigo code had been written without testability in mind, I was going to have to do some significant refactoring to even get to the point where I could start writing unit tests, which meant we had to first lay some groundwork in order to be able to refactor without being too disruptive to other users of these tools. Refactoring first without any way to test the impact of my changes on the data center team was just asking for everyone involved to have a bad time.
Adding Tests (and Testability)
Some of the most impactful changes we’ve made recently have been around finding ways to test the previously untestable unattended sledgehammer components. Our biggest wins in this area have been:
- Adding a test mode for unattended sledgehammer. The command-line sledgehammer tool has always created a configuration file specific to the server being built with that run of the command, and if the config file wasn’t present, the sledgehammer payload would run in unattended mode. By adding a flag to the command-line version, I was able to force the code to run in this unattended mode easily. This also allowed me to set up other options for testing with our existing command line tools, such as…
- Adding versioning to the sledgehammer payload. Previously, there was only ever one version of the payload rpm—the build script was hard-coded to just always use 0.1. I added an option to build a different version number so that people could test that it worked before creating a new production version. Running command-line sledgehammer in unattended mode could then be told to use this new version instead.
- Adding a test version of the sledgehammer disk image/Cobbler profile. By creating a new cobbler profile and making a few minor changes to the script that we use to build the sledgehammer executor, disk image, and payload, we added the ability to test changes to this part of the process as well. Though the executor code only changed very infrequently, if it broke it would significantly get in the way of the data center team getting their work done, so being able to test it makes life much better for them.
- Adding an option to run against a different API URL. Using the server-specific configuration files mentioned previously, I was able to allow people to specify the URL for the Indigo API they wanted to use. This allowed people to run a local version of the API to test changes to it and then test all the build tools against that, a much better solution than just deploying to production and hoping things worked!
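The options described in the list above can be sketched roughly as follows. This is an illustration only: the JSON config format, the key names (`api_url`, `payload_version`), the default URL, and the function names are all assumptions made for the example, not Indigo's real configuration. The only details taken from this post are that a missing config file means unattended mode, that a flag can force that mode, and that the payload version used to be hard-coded to 0.1.

```python
import json
import os

# Hypothetical defaults; the real values are not given in this post.
DEFAULT_API_URL = "https://indigo.example.com/api"
DEFAULT_PAYLOAD_VERSION = "0.1"


def read_config(path):
    """Return the per-server config as a dict, or None if no file exists."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def resolve_settings(config, force_unattended=False):
    """Work out the run mode, API URL and payload version for one build."""
    # No config file at all means unattended mode; the test flag forces
    # the same code path even when a config file is present.
    unattended = force_unattended or config is None
    options = config or {}
    return {
        "mode": "unattended" if unattended else "attended",
        "api_url": options.get("api_url", DEFAULT_API_URL),
        "payload_version": options.get("payload_version", DEFAULT_PAYLOAD_VERSION),
    }
```

With something like this, a test run can point at a locally running API and a not-yet-released payload, for example `resolve_settings({"api_url": "http://localhost:8080", "payload_version": "0.2"}, force_unattended=True)`, while a box with no config file still gets the production defaults.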
With changes like these in place, we are able to have much more confidence that our changes won’t break the unattended sledgehammer tool that is so critical for our data center team. This enables us to more effectively refactor the Indigo codebase, whether that be to improve it in general or to make it more testable.
At OpsSchool, our internal lecture series on a variety of operations-related topics (inspired by opsschool.org), I gave a presentation on how to change the Indigo code to make it better suited to unit testing. Unit testing itself is beyond the scope of this post, but for us, this has meant things like changing method signatures so that objects that might be mocked or stubbed out can be passed in during tests, or splitting up large gnarly methods that grew organically along with the Indigo codebase over the past few years into smaller, more testable pieces. This way, other people on the team are able to help write unit tests for all of Indigo’s shared library code as well.
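As a rough illustration of the signature change described above, the sketch below passes the external client in as an argument so that a test can substitute a fake. The names here (`RackTablesClient`, `record_hardware`, `FakeClient`, and the `save_hardware` method) are hypothetical and are not taken from the Indigo codebase.

```python
class RackTablesClient:
    """Stand-in for a real asset-database client (hypothetical interface)."""

    def save_hardware(self, hostname, details):
        raise NotImplementedError("a real client would talk to the service here")


def record_hardware(hostname, details, client=None):
    """Save hardware details using whichever client the caller passes in."""
    # Accepting the client as a parameter is the testability change:
    # production code can rely on the default, tests can inject a fake.
    if client is None:
        client = RackTablesClient()
    # Drop fields that could not be detected before saving.
    cleaned = {key: value for key, value in details.items() if value is not None}
    client.save_hardware(hostname, cleaned)
    return cleaned


class FakeClient:
    """Test double that records calls instead of touching the network."""

    def __init__(self):
        self.calls = []

    def save_hardware(self, hostname, details):
        self.calls.append((hostname, details))
```

A unit test can then call `record_hardware("web1", {"cpu": 24, "raid": None}, client=FakeClient())` and inspect the fake's `calls` list, with no real asset database involved.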
Deploying, Monitoring, and Planning
As mentioned previously, one of the biggest headaches with this tooling had been keeping all the different moving pieces in sync when people were making changes. To fix this, we decided to leverage the work that had already been put into Deployinator by our dev tools team. We created an Indigo deployinator stack that, among other things, ensures that the shared libraries, API, command line tools, and sledgehammer payload are all deployed at the same time. It keeps these deploys in sync, handles the building of the payload RPM, and restarts all the Indigo services to make sure that we never again run into issues where the payload stops working because it didn’t get updated when one of its shared library files did or vice versa.
Additionally, it automatically emails release notes to everyone who uses the Indigo toolset, including our data center team. These release notes, generated from the git commit logs for all the commits being pushed out with a given deploy, provide some much-needed visibility into how the tools are changing. Of course, this meant making sure everyone was on the same page with writing commit messages that will be useful in this context! This way the data center folks, geographically removed from the ops team making these changes, have a heads up when things might be changing with the tools they use.
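A minimal sketch of turning commit logs into release notes might look like the following. It is not the Deployinator stack's actual implementation; the function names and the layout of the notes are assumptions, and the only external command used is the standard `git log` with a revision range.

```python
import subprocess


def commit_subjects(old_rev, new_rev, repo_dir="."):
    """Return 'short-hash subject' for each commit in old_rev..new_rev."""
    out = subprocess.run(
        ["git", "log", "--pretty=format:%h %s", f"{old_rev}..{new_rev}"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return [line for line in out.splitlines() if line.strip()]


def format_release_notes(stack, subjects):
    """Build a plain-text body suitable for a release-notes email."""
    lines = [f"{stack} deploy: {len(subjects)} change(s)", ""]
    lines += [f"- {subject}" for subject in subjects] or ["- (no commits)"]
    return "\n".join(lines)
```

Because the notes are just the commit subjects, they are only as useful as the commit messages themselves, which is why agreeing on how to write those messages matters.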
Finally, we’re changing how we approach the continued development and maintenance of this software going forward. Indigo started out as a single ruby script and evolved into a complex interconnected set of tools, but for a while the in-depth knowledge of all the tools and their interconnections existed solely in the heads of a couple people. Going forward, we’re documenting not only how to use the tools but how to develop and test them, and encouraging more members of the team to get involved with this work to avoid having any individuals be single points of knowledge. We’re keeping testability in mind as we write more code, so that we don’t end up with any more code that has to be refactored before it can even be tested. And we’re developing with an eye for the future, planning what features will be added and which bugs are highest priority to fix, and always keeping in mind how the work we do will impact the people who use these tools the most.
As operations engineers, we don’t tend to think of ourselves as developers, but there’s a lot we can learn from our friends in the development world. Instead of always writing code willy-nilly as needed, we should be planning how to best develop the tooling we use, making sure to be considerate of future-us who will have to maintain and debug this code months or even years down the line.
Tools to provision hardware in a data center need tests and documentation just as much as consumer-facing product code. I’m excited to show that operations engineers can embrace the craftsmanship of software engineering to make our tools more robust and scalable.
Happy New Year! It may be 2016, but we’re here to give you a full review of the site performance highlights for the fourth quarter of 2015. For this report, we’ve collected data from a full week in December that we will be comparing to the full week of data from September we gathered for our last report. Unlike the Q3 report, we did not uncover any prevailing trends that spanned every section of this report. In the fourth quarter, we saw a few improvements and a handful of regressions, but the majority of pages remained stable.
As in the past, this report is a collaborative effort. Mike Adler will discuss our server side performance, Kristyn Reith will report on the synthetic front-end data, and the real user monitoring update will come from Allison McKnight and Moishe Lettvin. So without further delay, let’s take a look at the numbers.
We begin our report with server-side latency, which is how long it takes for our servers to build pages. This metric does not include any browser-side time. We calculate it by taking a random sample of our web server logs. One reason we start here is that changes in server-side metrics can explain some changes in synthetic and RUM metrics. As you can see below, most pages are executing on the server at about the same speed, though the cart page did get about 27% faster on average.
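For readers who want to try a similar calculation on their own logs, here is a small sketch of sampling log lines and comparing averages. The log format (a page name followed by a server time in milliseconds) and the function names are assumptions made for the example; they are not our real log format or tooling.

```python
import random
import statistics


def sample_latencies(log_lines, sample_size, seed=None):
    """Randomly sample log lines and group server times (ms) by page."""
    rng = random.Random(seed)
    sample = rng.sample(log_lines, min(sample_size, len(log_lines)))
    by_page = {}
    for line in sample:
        # Assumed format: "<page> <milliseconds>", e.g. "cart 146".
        page, millis = line.split()
        by_page.setdefault(page, []).append(float(millis))
    return by_page


def mean_by_page(by_page):
    """Average server-side time per page, in milliseconds."""
    return {page: statistics.mean(times) for page, times in by_page.items()}


def percent_change(before, after):
    """Relative change in percent; a negative value means the page got faster."""
    return (after - before) / before * 100.0
```

As a made-up example, a page whose average went from 200 ms to 146 ms would give `percent_change(200.0, 146.0)` of about -27, that is, roughly 27% faster.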
The fourth quarter of the year obviously includes holiday shopping, which is the busiest time of year for our site. Each year our ops team plans to add capacity in anticipation of the rush, but still, it’s not unusual to discover new performance bottlenecks as traffic ramps up. Each year we have new code, new hardware and new usage patterns. To quote a translation of Heraclitus, “Ever-newer waters flow on those who step into the same rivers.”
In this quarter we discovered that we could improve our efficiency by running numad on our web servers. Our web servers do not use any virtualization, so one kernel is scheduling across 24 physical cores (hyper-threading disabled, for now). We noticed that some cores were significantly busier than others, which was effectively limiting throughput and increasing latency. An Etsy engineer found that simply running numad balanced the CPU workload more evenly, leading to better efficiency. In short, our server-side metrics no longer slowed down during busy times of the day.
Today’s servers are built with NUMA (non-uniform memory access) architectures, which create an incentive to schedule tasks on CPUs that are “close” to the memory they need. Depending on many factors (hardware, workload, other settings), this scheduling challenge can result in suboptimal efficiency. We found that numad, a userland daemon that assigns processes to NUMA zones, is a simple and effective way to optimize for our current conditions.
We saw that our search page performance got a little slower on average, but we expected this due to launching some more computationally expensive products (such as our Improved Category Pages).
Synthetic Start Render
For our synthetic testing, we’ve set up tests using a third party service which simulates actions taken by a user and then automatically reloads the test pages every ten minutes to generate the performance metrics. As mentioned in the last report, due to recent product improvements, we have decided to retire “Webpage Response” as a metric used for this report, so we will be focusing on the “Start Render” metric in the boxplots below. Overall, we did not see any major changes in start render times this quarter that would have impacted user experience.
You may notice that the metrics for Q3 differ from the last report and that the start render times have significantly wider ranges. In the last two reports the plotted data points have represented median measurements, which limited our analysis to the median metrics. In this report, the boxplots are constructed using raw data, thereby providing a more accurate and holistic representation of each page’s performance. Since the raw data captures the full distribution of start render times, we are seeing more outliers than we have in the past.
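The effect of switching from medians to raw data can be shown with a small sketch. The numbers below are invented for illustration, the function names are hypothetical, and the outlier rule used is the common 1.5 × IQR convention, which may differ from how our charting tool draws its whiskers.

```python
import statistics


def box_stats(values):
    """Quartiles plus the points a 1.5 * IQR boxplot would flag as outliers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low or v > high]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


def daily_medians(runs_by_day):
    """Collapse each day's raw measurements into a single median."""
    return [statistics.median(runs) for runs in runs_by_day]


# Invented start render times in seconds, five test runs per day.
runs_by_day = [
    [1.0, 1.1, 1.2, 1.1, 4.0],
    [1.0, 1.2, 1.1, 1.3, 1.2],
]
raw = [t for runs in runs_by_day for t in runs]

from_medians = box_stats(daily_medians(runs_by_day))
from_raw = box_stats(raw)
```

Here the single slow run of 4.0 seconds shows up as an outlier in `from_raw` but disappears entirely in `from_medians`, because each day has already been reduced to one number before plotting.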
As you can see above, the start render time has remained fairly stable this quarter. Though nearly all the pages experienced a very minor increase in median start render time, the homepage saw the greatest impact. This slowdown can be attributed to the introduction of a new font. Though this change added font bytes to all the pages, we discovered that the scope of the regression varied significantly depending on the browser. The data displayed in the boxplots is from tests run in IE9. The graphs below show the days surrounding the font change.
While the font change resulted in a noticeable jump in start render time in IE, start render performance in Chrome remained relatively unaffected. The difference in font file formats displayed by the browsers is partially responsible for the disparity in performance. The font format selected by Chrome (woff2) uses a compression algorithm that reduces the font file by roughly 30%, resulting in a substantially smaller file size when compared to other formats. Additionally, the IE browser running the synthetic test has the compatibility view enabled, meaning that although it’s effectively using IE9, the browser is rendering the pages with an even older version of IE. Therefore, the browser is downloading all the font files included in the page’s corresponding CSS file regardless of whether or not they are used on the page.
Since Chrome is more commonly used among Etsy users than IE9, we have set up new synthetic tests in Chrome. For all future reports we will use the data generated by the new Chrome tests to populate the synthetic boxplots. We feel that this change, coupled with the continued use of raw data, will provide a more realistic representation of what our users experience.
Real User Page Load Time
“Real User” data can show us variations in the actual speed that our users experience, depending on their geographic location, what browser they’re using, the internet provider they’re using, and so on. Sometimes the richness of this data lets us see trends that we couldn’t see in our other two monitoring methods; other times, the variety of data makes trends harder to see.
This quarter, we mostly saw small or no change in our metrics. The only delta that’s noticeable is the fifth percentile in the Shop Home page, which increased from 2.3 seconds to 2.6 seconds. A time series graph of the fifth percentile shows an uptick corresponding to our new page font. This parallels what we found in the synthetic tests, as discussed above.
Because shop_home is one of our most-consistently fast pages, it tends to be more sensitive to changes in overall site load time. That is, it shows deltas that might get “lost in the noise” on pages with higher variance.
With this context, it can be interesting to look at the data day-by-day in addition to the week vs. week comparison that the box plot above shows us. You can see below that even with the fairly large difference in the fifth percentile seen in the box plot, on the last day of the comparison weeks the slower and faster lines actually trade positions.
Despite the fourth quarter being the busiest time of the year, users did not experience degraded performance and most of the fluctuations in page speed were negligible. For this report we improved the data quality in the synthetic section by switching from median measurements to raw data. In the coming 2016 reports, we will strive to make more improvements to our methodology. If you have any thoughts or questions, please leave a comment!
Retrospectives (retros) are meetings that take place after every major project milestone to help product teams at Etsy improve processes and outcomes in the next iteration of a project. The insights gained from project retros are invaluable to proactively mitigating problems in future projects while promoting continuous learning.
I am one of the managers on the Product Quality (PQ) team, which is Etsy’s centralized resource for manual QA. When manual QA was first introduced at Etsy, testers joined a team for the duration of the project but had limited opportunity to get objective feedback. Consequently, testers were kept from seeing the importance of their contributions and from viewing themselves as true members of the product team they worked with. This lack of communication and continuous learning left our testers with less job satisfaction and a weaker sense of empowerment than their peers in product engineering.
We decided to try QA-focused retros to surface feedback that would help us identify repeatable behaviors that contribute to consistently successful product launches. We were also interested in empowering Product Quality team members to understand how their specific contributions impact product launches and allow them to take more ownership of their responsibilities.
Regularly scheduled QA retros have helped to promote mindfulness and accountability on the PQ team. Over time, they have solidified relationships with product owners, designers, and engineers and reinforced the sense that we are all working toward the same goal.
Here’s some information to help you run a QA-focused retro:
Goals
- Identify tacit, repeatable behaviors that contributed to increased confidence before launch
- Promote accountability within the team
- Foster self-efficacy among team members
- Identify fruitless behaviors and actions that can be removed from the QA process
Agenda
- What went well?
- What could have gone better?
- Takeaways? What patterns do we want to avoid in the future? What patterns do we want to repeat in other projects?
How Does It Work?
The retro should be scheduled after the initial product launch or any milestone that required a significant QA effort. Participants should receive the agenda ahead of time and be asked to come prepared with their thoughts on the main questions. The QA engineer who led the testing effort should facilitate the retro by prompting attendees, in round-robin fashion, to weigh in on each agenda item.
The facilitator also participates by giving insights to the product team they partnered with for launch. This 30-minute activity is best if scheduled near the end of the day and can be augmented with snacks and beverages to create a relaxed atmosphere. The facilitator should record insights and report any interesting findings back to the QA team and the product team they worked with on the project.
Who Should Attend?
Participants should be limited to those who directly interacted with QA during development. These are usually:
- Product Managers
- Software Engineers
- QA Testers
- Product Designers
What Happens Next?
Everyone on the Product Quality team reviews the takeaways from the retro and clarifies any questions with those who participated in the retro. We then make the applicable adjustments to processes prior to the next iteration of a project. All changes made to the PQ process are communicated to team members and product teams as needed.
QA-focused retros have empowered our team to approach testing as a craft that is constantly honed and developed. The meetings help product managers learn how to get the most out of working with the QA team and provide opportunities for product teams to participate in streamlining the QA process.