Q1 2016 Site Performance Report
Spring has sprung and we’re back to share how Etsy’s performance fared in Q1 2016. In order to analyze how the site’s performance has changed over the quarter, we collected data from a week in March and compared it to a week’s worth of data from December. A common trend emerged from both our back-end and front-end testing this quarter: we saw significant site-wide performance improvements across the board.
Several members of Etsy’s performance team have joined forces to chronicle the highlights of this quarter’s report. Moishe Lettvin will start us off with a recap of the server-side performance, Natalya Hoota will discuss the synthetic front-end changes and Allison McKnight will address the real user monitoring portion of the report. Let’s take a look at the numbers and all the juicy context.
Server-side time measures how long it takes our servers to build pages. These measurements don’t include any client-side time: they cover only the time from receiving an HTTP request for a page to returning the response.
As always, we start with these metrics because they represent the absolute lower bound for how long it could take to show a page to the user, and changes in these metrics will be reflected in all our other metrics. This data is calculated from a random sample of our webserver logs.
Happily, this quarter we saw site-wide performance improvements, due to our upgrade to PHP7. While our server-side time isn’t spent solely running PHP code (we make calls to services like memcache, MySQL, Redis, etc.), we saw significant performance gains on all our pages.
One of the primary ways that PHP7 increases performance is by decreasing memory usage. In the graph below, the blue & green lines represent 95th percentile of memory usage on a set of our PHP5 web servers, while the red line indicates the memory usage on a set of our PHP7 web servers during our switch-over from PHP5 to PHP7. This is two days’ worth of data; the variance in memory usage (more visible in the PHP5 data) is due to daily variation in server load. Note also that the y-axis origin is 0 on this graph — the decrease shown in this graph is to scale!
Another interesting thing that’s visible in the box plots above is how the homepage server-side timing changed. It sped up across the board, but the distribution widened significantly. The reason for this is that the homepage gets many signed-out views as well as signed-in views, as contrasted with, for instance, the cart view page. In addition, the homepage is much more customized for signed-in users than, say, the listing page. This level of customization requires calls to our storage systems, which weren’t sped up by the PHP7 upgrade. Therefore, the signed-in requests didn’t speed up as much as the signed-out requests, which are constrained almost entirely by PHP speed. In the density plot below, you can see that the bimodal distribution of the homepage timing became more distinct in Q1 (red) vs Q4 of last year (blue):
The Baseline page gives us the clearest view into the gains from PHP7 — the page does very little outside of running PHP code, and you can see in the first chart above that the biggest impact was on the Baseline page. The median time went from 80ms to 29ms, and the variance decreased.
Synthetic Start Render
We collect synthetic monitoring results (i.e., results obtained by web browser emulation of scripted web visits) in order to cross-check the findings from our two other data sets. We use a third-party provider, Catchpoint, to record start render time, the moment when a user first sees content appearing on the screen, as a reference point.
Synthetic tests showed that all pages got significantly faster in Q1.
It is worth noting that the data for synthetic tests is collected for signed-out web requests. As previously mentioned, this type of request involves fetching less state from storage, which highlights PHP7 wins.
I noticed the unusually far-reaching outliers on the shop and search pages and decided to investigate further. First, I generated scatterplots for the two pages in question, which clarified that the outliers on their own did not have a story to tell: there were no clustering patterns or high volumes on any particular day. A better data sanitization process would have eliminated the points in question altogether.
In order to get a better visual for the synthetic data we used, I took comparative scatterplots of all six pages we monitor. I noticed a reduction of failed tests (marked by red diamonds) that happened sometime between Q4 and present day. Remarkably, we have never isolated that data set before. A closer look revealed an important source of many current failures: fetching third party resources was taking longer than the maximum time allotted by Catchpoint. That encouraged us to consider a new metric for monitoring the impact of third party resources.
Real User Page Load Time
Gathering front-end timing metrics from real users allows us to see the full range of our pages’ performance in the wild. As in past reports, we’re using page load metrics collected by mPulse. The data used to generate these box plots is the median page load time for each minute of one week in each quarter.
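To make the shape of this dataset concrete, here is a minimal sketch of reducing raw page load samples to one median per minute. It is only an illustration: the `per_minute_medians` function and the `(timestamp, load time)` input format are assumptions made for this example, not a description of how mPulse aggregates data.

```python
from collections import defaultdict
from statistics import median


def per_minute_medians(samples):
    """Reduce raw samples to one median page load time per minute.

    samples: iterable of (unix_timestamp_seconds, load_time_ms) pairs.
    Returns a dict mapping the start of each minute (in seconds) to the
    median load time observed during that minute.
    """
    buckets = defaultdict(list)
    for timestamp, load_ms in samples:
        # Integer-divide by 60 to group samples into one-minute buckets.
        buckets[int(timestamp) // 60].append(load_ms)
    return {minute * 60: median(values) for minute, values in sorted(buckets.items())}


if __name__ == "__main__":
    # Hypothetical samples: three views in the first minute, two in the second.
    samples = [(0, 1200), (10, 1500), (59, 900), (60, 2000), (75, 2200)]
    print(per_minute_medians(samples))
```

A full week reduced this way yields at most 10,080 points per page (7 days of 1,440 minutes each), and those per-minute medians are what the box plots summarize.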
We see that for the most part, page load times have gone down for each page. The boxes and whiskers of each plot have moved down, and the outliers tend to be faster as well. But it seems that we’ve also picked up some very slow outliers: each page has a single outlier of over 10 seconds, while the rest of the data points for all pages are far under that. Where did these slow load times come from?
Because each of our RUM data points represents one minute in the week and there is exactly one extremely slow outlier for each page, it makes sense that these points might be all from the same time. Sure enough, looking at the raw numbers shows that all of the extreme outliers are from the same minute!
This slow period happened when a set of boxes that queue Gearman jobs were taken out to be patched for the glibc vulnerability that surfaced this quarter. We use Gearman to run a number of asynchronous jobs that are used to generate our pages, so when the boxes were taken out and fewer resources were available for these jobs, back-end times suffered.
One interesting thing to note is that we actually didn’t notice when this happened. The server-side times (and therefore also the front-end times) for most of our pages suffered an extreme regression, but we weren’t notified. This is actually by design!
Sometimes we experience a blip in page performance that recovers very quickly (as was the case with this short-lived regression). It makes little sense to scramble to understand an issue that will automatically resolve within a few minutes — by the time we’ve read and digested the alert, it may have already resolved itself — so we have a delay built in to our alerts. We are only alerted to a regression if a certain amount of time has passed and the problem hasn’t been resolved (for individual pages, this is 40 minutes; when a majority of our pages’ performance degrades all at once, the delay is much shorter).
This delay ensures that we won’t scramble to respond to an alert if the problem will immediately fix itself, but it does mean that we don’t have any insight into extreme but short-lived regressions like this one, and that means we’re missing out on some important information about our pages’ performance. If something like this happens once a week, is it still something that we can ignore? Maybe something like this happens once a day — without tracking short-lived regressions, we don’t know. Going forward, we will investigate different ways of tracking short-lived regressions like this one.
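The delayed alerting described above can be sketched roughly as follows. This is a simplified illustration, not our actual alerting code: the `DelayedAlert` class, its `observe` method, and the one-check-per-minute cadence are all assumptions made for the example. Only the 40-minute delay for individual pages comes from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DelayedAlert:
    """Fire only once a regression has persisted for `delay_minutes`."""

    delay_minutes: int
    regression_started: Optional[int] = None  # minute the current regression began

    def observe(self, minute: int, is_regressed: bool) -> bool:
        """Record one per-minute check; return True if an alert should fire."""
        if not is_regressed:
            # The page recovered, so forget the regression and stay quiet.
            self.regression_started = None
            return False
        if self.regression_started is None:
            self.regression_started = minute
        return minute - self.regression_started >= self.delay_minutes


if __name__ == "__main__":
    # A three-minute blip never reaches the 40-minute threshold.
    blip = DelayedAlert(delay_minutes=40)
    print(any(blip.observe(m, is_regressed=(m < 3)) for m in range(60)))

    # A sustained regression first fires once 40 minutes have elapsed.
    sustained = DelayedAlert(delay_minutes=40)
    print(next(m for m in range(60) if sustained.observe(m, is_regressed=True)))
```

The sketch also shows the blind spot discussed above: because recovery simply resets the state, a short-lived regression leaves no trace unless it is recorded separately.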
You may also have noticed that while the slowdown that produced these outliers originated on the server side, the outliers are missing from both the server-side and synthetic data. This is because of the different collection methods that we use for each type of data. Our server-side and synthetic front-end datasets each contain 1,000 data points sampled from visits throughout the week (with the exception of the baseline page, which has a smaller server-side dataset because it receives fewer visits than other pages). This averages to only about 143 data points per day, far under one data point per minute (roughly one every ten minutes), so it’s likely that no data from the short regression made it into the synthetic or server-side datasets at all. Our front-end RUM data, on the other hand, has one data point (the median page load time) for every minute, so the regression was guaranteed to be represented in that dataset as long as at least 50% of views were affected.
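The sampling arithmetic can be checked with a few lines of Python. This is a back-of-the-envelope sketch that assumes the 1,000 points are drawn independently and uniformly across the minutes of the week. Real traffic is not uniform, so the result is only a rough estimate, and the function names are made up for this example.

```python
MINUTES_PER_DAY = 24 * 60
MINUTES_PER_WEEK = 7 * MINUTES_PER_DAY


def points_per_day(sample_size: int) -> float:
    """Average number of sampled data points per day over a one-week window."""
    return sample_size / 7


def prob_minute_missed(sample_size: int) -> float:
    """Probability that no sampled point falls in one specific minute.

    Assumes each point lands in a uniformly random minute of the week,
    independently of the other points.
    """
    return (1 - 1 / MINUTES_PER_WEEK) ** sample_size


if __name__ == "__main__":
    per_day = points_per_day(1000)
    print(f"points per day: {per_day:.1f}")
    print(f"points per minute: {per_day / MINUTES_PER_DAY:.3f}")
    print(f"chance a given minute is missed: {prob_minute_missed(1000):.1%}")
```

Under these assumptions, a regression lasting a single minute has roughly a 90% chance of being absent from a 1,000-point sample, which is consistent with the outliers appearing only in the RUM data.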
The nuances in the ways that we collect each metric are certainly very interesting, and each method has its pros and cons (for example, heavily sampling our server-side and synthetic front-end data leads to a narrower view of our pages’ performance, but collecting medians for our front-end RUM data to display in box plots is perhaps statistically unsound). We plan to continue iterating on this process to make it more appropriate to the report and more uniform across our different monitoring stacks.
In the first quarter of 2016, performance improvements resulting from the server-side upgrade to PHP7 trickled down through all our data sets, and the faster back-end times translated to speedups in page load times for users. As always, the process of analyzing the data for the sections above uncovered some interesting stories and patterns that we might otherwise have overlooked. It is important to remember that the smaller stories and patterns are just as valuable learning opportunities as the big wins and losses.