Q3 2014 Site Performance Report
We’re well into the fourth quarter of 2014, and it’s time to update you on how we did in Q3. This is either a really late performance report or an early Christmas present – you decide! The short version is that server side performance improved due to infrastructure changes, while front-end performance got worse because of third party content and increased amounts of CSS/JS across the site. In most cases the additional CSS/JS came from redesigns and new features that were part of winning variants that we tested on site. We’ve also learned that we need some better front-end monitoring and analysis tools to detect regressions and figure out what caused them.
We are also trying something new for this post – the entire performance team is involved in writing it! Allison McKnight is authoring the server side performance section, Natalya Hoota is updating us on the real user monitoring numbers, and Jonathan Klein is doing synthetic monitoring, this introduction, and the conclusion. Let’s get right to it – enter Allison:
Server Side Performance – From Allison McKnight
Here are median and 95th percentile times for signed-in users on October 22nd:
The homepage saw a significant jump in median load times as well as an increase in 95th percentile load times. This is due to our launch of the new signed-in homepage in September. The new homepage shows the activity feed of the signed-in user and requires more back-end time to gather that data. While the new personalization on the homepage did create a performance regression, the redesign had a significant increase in business and engagement metrics. The redesign also features items from a more diverse set of shops on the homepage and displays content that is more relevant to our users. The new version of the homepage is a better user experience, and we feel that this justifies the increase in load time. We are planning to revisit the homepage to further tune its performance in the beginning of 2015.
The rest of the pages saw a nice decrease in both median and 95th percentile back-end times. This is due to a rollout of some new hardware – we upgraded our production memcached boxes.
We use memcached for a lot of things at Etsy. In addition to caching the results of queries to the database, we use memcached to cache things like a user’s searches or activity feed. If you read our last site performance report, you may remember that in June we rolled out a dedicated memcached cluster for our listing cards, landing a 100% cache hit rate for listing data along with a nice drop in search page server side times.
Our old memcached boxes had 1Gb networking and 48GB of memory. We traded them out for new machines with 10Gb networking, 128GB of memory, faster clocks, and a newer architecture. After the switch we saw a drop in memcached latency. This resulted in a visible drop in both the median and 95th percentile back-end times of our pages, especially during our peak traffic times. This graph compares the profile page’s median and 95th percentile server side times on the day after the new hardware was installed and on the same day a week before the switch:
All of the pages in this report saw a similar speedup.
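For readers who want to run the same kind of before/after comparison on their own timing data, here is a minimal Python sketch. The sample timings and the function names are hypothetical. The nearest-rank percentile method is also an assumption, since this report does not say how our tooling computes percentiles.

```python
import math
from statistics import median


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(samples):
    """Return the two statistics used throughout this report."""
    return {"median": median(samples), "p95": percentile(samples, 95)}


def compare(before, after):
    """Change in each statistic; negative values mean the page got faster."""
    b, a = summarize(before), summarize(after)
    return {name: a[name] - b[name] for name in b}


# Hypothetical server side timings in milliseconds (not real Etsy data).
before_ms = [120, 135, 150, 160, 180, 210, 260, 340, 480, 900]
after_ms = [100, 110, 125, 130, 150, 170, 200, 260, 350, 600]

print("before:", summarize(before_ms))
print("after: ", summarize(after_ms))
print("change:", compare(before_ms, after_ms))
```

Comparing the same weekday a week apart, as the graph above does, keeps weekly traffic patterns from skewing the result.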
At this point, you may be wondering why the search page has seen such a small performance improvement since July if it was also affected by the new memcached machines. In fact, the search page saw about the same speedup from the new boxes as the rest of the pages did, but this speedup was balanced out by other changes to the page throughout the past three months. A lot of work has been going on with the search page recently, and it has proven hard to track down any specific changes that have caused regressions in performance. All told, in the past three months we’ve seen a small improvement in the search page’s performance and many changes and improvements to the page from the Search team. Since this is one of our faster pages, we’re happy with a small performance improvement along with the progress we’ve made towards our business goals in the last quarter.
Synthetic Front-End Performance – From Jonathan Klein
As a reminder, these tests are run with Catchpoint. They use IE9, and they run from New York, London, Chicago, Seattle, and Miami every 30 minutes (so 240 runs per page per day). The “Webpage Response” metric is defined as the time it took from the request being issued to receiving the last byte of the final element on the page. These numbers are medians, and here is the data for October 15th, 2014.
Unfortunately, load time increased across all pages, with a larger increase on the search page. The overall increase is due to some additional assets that are served on every page. Specifically, we added some CSS/JS for header/footer changes that we are experimenting with, as well as some new third party assets.
The larger increase on the search page is due to a redesign of that page, which increased the overall amount of CSS/JS on the page significantly. This pushed up both start render (extra CSS) and Webpage Response (additional page weight, more JS, etc.).
Seeing an increase of 100ms across the board isn’t a huge cause for concern, but the large increase on search warrants a closer look. Our search team is looking into ways to retain the new functionality while reducing the amount of CSS/JS that we have to serve.
Real User Front-End Performance – From Natalya Hoota
So, what are our users experiencing? Let us look at the RUM (Real User Monitoring) data collected by mPulse. First, median and 95th percentile total page load time, measured in seconds.
As you see, real user data in Q3 showed an overall, and quite significant, performance regression on all pages.
Page load time can be viewed as a sum of back-end time (time until the browser receives the first byte of the response) and front-end time (time from the first byte until the page is finished rendering). Since we had a significant discrepancy between our RUM and synthetic data, we need a more detailed analysis.
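The split described above can be sketched in a few lines of Python. The field names follow the W3C Navigation Timing attributes (`navigationStart`, `responseStart`, `loadEventEnd`). Treating `loadEventEnd` as the moment the page is "finished" is an assumption made for illustration, and the `Beacon` class is hypothetical; this is not a description of how mPulse computes its metrics.

```python
from dataclasses import dataclass


@dataclass
class Beacon:
    """One hypothetical RUM beacon; all values are epoch milliseconds."""
    navigation_start: int
    response_start: int   # first byte of the response received
    load_event_end: int   # assumed stand-in for "page finished"


def split_load_time(beacon):
    """Return (back_end_ms, front_end_ms, total_ms) for one beacon."""
    back_end = beacon.response_start - beacon.navigation_start
    front_end = beacon.load_event_end - beacon.response_start
    total = beacon.load_event_end - beacon.navigation_start
    return back_end, front_end, total


# Hypothetical beacon: 400 ms to first byte, 2500 ms from first byte to load.
sample = Beacon(navigation_start=1000, response_start=1400, load_event_end=3900)
back_end_ms, front_end_ms, total_ms = split_load_time(sample)
print(back_end_ms, front_end_ms, total_ms)
```

Because the two parts always add up to the total, a change in total load time that does not show up in back-end time has to come from the front-end portion.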
Back-End Time – RUM Data
Our RUM metrics for server-side performance, defined as time to the first byte in mPulse, did not show a significant change since Q2, which matched closely with both our server-side performance analysis and synthetic data. The differences are small enough to be almost within rounding error, so there was effectively no regression.
Front-End Time – RUM Data
While synthetic data showed a slight increase in load time, our real user numbers showed this regression amplified. It could be a number of things: internal changes such as our experiments with the header and footer, an increased number of assets and their handling on the client side, or external factors. To figure out the reasons behind our RUM numbers, we asked ourselves a number of questions:
Q. Was our Q3 RUM data significantly larger than Q2 due to an increase in site traffic?
A. No. We found no immediate relation between performance degradation and a change in beacon volume. We double-checked with our internally recorded traffic patterns and found that beacon count is consistent with it, so we crossed this hypothesis off the list.
Q. Has our user base breakdown changed?
A. Not according to the RUM data breakdown. We did a detailed breakdown of the dataset by region and operating system and, yet again, found that our traffic patterns did not change geographically. It also appears that we had more RUM signals coming from desktop in Q3 than in Q2.
What remains is taking a closer look at internal factors.
One clue was our profile page metrics. From the design perspective, the only change on that page was the global header/footer experiments. For the profile page alone, those resulted in a 100% and 45% increase in the number of page CSS files and JS assets, respectively. The profile page suffered one of the highest increases in median and 95th percentile load time.
We already know that the global header and footer experiments, along with the additional CSS assets served, affected all pages to some degree. It is possible that on other pages the change was balanced out by architecture and design improvements, while on the profile page we are seeing the isolated impact of this change.
To prove or dismiss this, we’ve learned that we need better analysis tooling than we currently have. Our synthetic tests run only on signed-out pages, so it would be more accurate to compare them to a similar RUM data set. However, we are currently unable to filter signed-in and signed-out RUM user data per page. We are planning to add this feature early in 2015, which will give us the ability to better connect a particular performance regression to the page experiment that caused it.
Lastly, for much of the quarter we were experiencing a persistent issue where assets failed to be served in a gzipped format from one of our CDNs. This amplified the impact of the asset growth, causing a significant performance regression in the front-end timing. This issue has been resolved, but it was present on the day when the data for this report was pulled.
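To give a rough feel for why missing gzip compression matters, here is a small Python sketch that compresses a block of CSS with the standard library's `gzip` module and compares sizes. The stylesheet is synthetic and highly repetitive (an assumption made to keep the example short), so the savings it prints will overstate what real assets see. It is not a measurement of Etsy's CSS or JS.

```python
import gzip

# Synthetic stylesheet: 500 near-identical rules with hypothetical class names.
css_text = "".join(
    f".listing-card-{i} {{ margin: 0; padding: 4px; color: #333; }}\n"
    for i in range(500)
)
raw = css_text.encode("utf-8")

# What a CDN would send when gzip is working, versus the raw bytes when it is not.
compressed = gzip.compress(raw)

saved_pct = 100 * (1 - len(compressed) / len(raw))
print(f"uncompressed: {len(raw)} bytes")
print(f"gzipped:      {len(compressed)} bytes")
print(f"saved:        {saved_pct:.1f}%")
```

Text assets such as CSS and JS compress well, so serving them uncompressed means every added file costs noticeably more bytes on the wire than it would otherwise.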
Conclusion – From Jonathan Klein
The summary of this report is “faster back-end, slower front-end”. As expected, the memcached upgrade was a boon for server side numbers, and the continued trend of larger web pages and more requests didn’t miss Etsy this quarter. These results are unsurprising – server hardware continues to get faster and we can reap the benefits of these upgrades. On the front-end, we have to keep pushing the envelope from a feature and design point of view, and our users respond positively to these changes, even if our pages don’t respond quite as quickly. We also learned this quarter that it’s time to build better front-end performance monitoring and data analysis tools for our site, and we’ll be focusing on this early in 2015.
The continuing challenge for us is to deliver the feature rich, beautiful experiences that our users enjoy while still keeping load time in check. We have a few projects in the new year that will hopefully help here: testing HTTP/2, a more nuanced approach to responsive images, and some CSS/JS/font cleanup.