Note: This article was adapted from an internal Etsy newsletter earlier this year. As the holidays roll around, it seemed like a timely opportunity to share what we do with a larger audience.
As the calendar year draws to a close, people’s thoughts often turn to fun activities like spending time with family and friends and enjoying pumpkin or peppermint flavored treats. But for retailers, the holiday season is an intense and critically important period for the business.
The months of November and December compose nearly a fifth of all US retail sales and pretty much every retailer needs to undertake special measures during the holidays, from big sales promotions to stocking up on popular items, to hiring additional staff to stock inventory and reduce wait times at checkout.
A lot of these measures apply as well to digital retailers, with the added risk of the entire site running slowly or not at all. In 2015, Neiman Marcus experienced an extended outage on Black Friday and Target and PayPal were intermittently down on Cyber Monday.
Etsy is no stranger to this holiday intensity. This is the biggest shopping season of the year for us and we typically receive more site visits than at other times, which translate into more orders. Over the years, our product, engineering, and member-facing organizations have developed practices and approaches to support our community during the intensity of the holidays.
How Etsy Handles the Holidays
The increase in site traffic and transactions impacts many areas of the business. Inbound support emails and Non-Delivery Cases reach a peak in December, and the Trust and Safety team ramps down outside project work and hiring efforts to focus on providing exceptional support.
“Emotions tend to be more heightened around the holidays” says Corinne Haxton Pavlovic, head of the trust and safety team at Etsy, “There’s a lot on the line for everyone this time of year – highs can feel higher and lows can feel lower. We have to really dig into our EQ and make sure that we’re staying neutral and empathetic.”
For our sellers, the holidays can be an exciting but scary time: “the Etsy sales equivalent of Laird Hamilton surfing a 70-ft wave off Oahu’s North Shore” says Joseph Stanek, seller account manager. Stanek works with a portfolio of top Etsy sellers to advise and support their business growth. He’s found that many sellers spend an enormous amount of effort on holiday sales promotion, and are then hit with a record number of orders. They’re pushed to increase their shipping and fulfillment capabilities, which can serve “as a kind of graduation” as they level up to a new tier of business.
With huge numbers of buyers browsing for the perfect gift, and sellers working hard to manage orders, it’s critically important for Etsy’s platform to be as clear and reliable as possible. That’s why the period between mid-October through December 31st is a time to be exceptionally careful and more conservative than usual about how we make changes to the site. We affectionately refer to this period as “Slush.”
The Origins of Slush
The actual term “Slush” is a play on the phrase “code freeze,” which is when a piece of software is deemed finished and thus no new changes are being made to the code base. “Code freezes help to ensure that the system will continue to operate without disruptions,” says Robert Tekiela at CTO Insights. It’s a way to prevent new bugs from being created and and are “commonly used in the retail industry during holiday shopping season when systems load is at a peak.”
Since at Etsy, we still push code during the holidays, just more carefully, it’s not a true code freeze, but a cold, melty mixture of water and ice. Hence, Slush.
According to Jason Wong, Slush at Etsy got started sometime after current CEO Chad Dickerson became CTO in the fall of 2008. As an engineering director of infrastructure, Jason has been a key part of Etsy’s platform stability since joining in 2010. Back then, Etsy’s infrastructure was less robust and the team was still figuring out how to effectively support the already high levels of traffic on Etsy.com. There was not yet a process for managing code deployments during the holidays and the site experienced more crashes than it does today. .
Said Wong, “the question was: during the holiday season, a high traffic, high visibility time when we made a significant portion of our [gross merchandise sales], how do we stabilize the site? That’s where Slush got started.”
Here’s the slightly redacted email from Chad that kicked off the idea of Slush.
From: Chad Dickerson
Date: Fri, Oct 31, 2008 at 4:08 PM
Subject: holiday “slush” (i.e. freeze) — need your input
To: adminName and adminName
Cc: adminName, adminName, and adminName
adminName / adminName,
adminName, adminName, and I met recently to discuss a holiday “freeze” beginning at end of day on November 14. We’re calling it a “slush” because there are certain types of projects that we can still do without making critical changes to the database or introducing bugs. The goal with setting this freeze is to eliminate the distractions of any projects not on the must-do list so we can focus on the most important projects exclusively.
adminName, adminName, adminName, and I met earlier this week to mutually agree on what projects we need to complete before the freeze/slush beginning at the end of the day on 11/14. We came up with a list of “must do” projects that need to get done by 11/14, and “nice to have” projects if we complete the must-dos:
[ Link to Document Detailing Slush Plan ]
There are couple of projects that we’ve discussed already on the list, like Super Etsy Mini and BuyHandmade blog. I wanted to make sure that any “must do” projects in your worlds are reflected and prioritized. On Monday, we’re going to start doing daily standups at 11:30am to track progress against the agreed-upon list of projects leading up to 11/14. Since there will be 9.5 working days to execute, we need to freeze the list itself by Monday. Can you review the list and let us know if you have additions that we are missing that we should discuss? Thanks.
Learning to Build Safely
In the early days, Slush was far more strict, in part because Etsy’s infrastructure was not as robust and mature as it is today. We operated off of a federated database model, which in theory was meant to prevent one database crash from affecting another, but in practice, it was hard to keep clusters from affecting one another and site stability suffered as a result. This technical approach also made it hard to understand what went wrong and how the team could fix it.
Engineers went from deploying five times a day to once a day. Feature flags were tested thoroughly so that major features like convos or add to cart could be turned off without shutting down the site.
Over the past few years, a major effort was made to get Etsy’s one really big box called master onto a sharded database model. With a sharded system, data is distributed across a series of smaller active-active pairs so that if single database goes down, there is an exact replica with the data intact. This system is faster, more scalable, and resilient compared to the prior method of simply storing all the data in one really big box. In 2016, we successfully migrated all of our key databases, including receipts, transactions, and many others, “to the shards” and decommissioned the old master database.
Developing continuous deployment was also a major feat which allowed Etsy to develop A/B testing and feature flagging. These technical efforts, in conjunction with our culture of examining failures through blameless postmortems, have allowed allowed Etsy to get better at building safely. Today our engineering staff and systems are studied by organizations around the world.
Slush Today and Tomorrow
Within the engineering organization, there’s often a senior staff member who helps organize Slush. The role is an informal one, meant to share best practices and encourage the product org to be mindful of the higher stakes of the holidays. Tim Falzone, an engineering manager on Infrastructure, took on this role in 2015 and presented a few slides at the September Engineering All-Hands which highlight the way we handle Slush today.
Today, Slush means that major buyer and seller facing feature changes are put on hold, or pushed “dark,” where they are hidden behind config flags and not shown publicly. Additionally, engineers get more people to review their pull requests of code changes. These extra precautions are taken to ensure that the site runs quickly and with minimal errors or downtime even with the increased traffic.
Falzone says that now, Slush is less about not breaking the site and more about preventing disruptions for members. “You could easily make something that works flawlessly but tanks conversion or otherwise sends buyers away, or is really slow,” he explained. For sellers, managing a huge wave of orders means relying on muscle memory of how Etsy works, which means that the holidays is a bad time to change the workflow or otherwise add friction for our sellers, who often become profitable on their business for the year during this time.
As Etsy grows, Slush will continue to evolve. A more powerful platform also means more points of integration. More traffic means more pressure on parts of the platform.
Even as we work to secure more headroom for our infrastructure and develop tooling to stress-test our systems, we will always be challenged in new ways. Though we’ve come a long way, Slush will continue to be a helpful reminder to move safely during a critical time of the year for our members and our organization.
This is the third post in a series of three about Etsy’s API, the abstract interface to our logic and data.In the last posts we covered how we built a new API framework, and we clearly identified the gains in terms of performance and shared abstraction layer between languages and devices. But how did we make an entire engineering organization switch to the new framework?
How did we achieve the cultural transformation to API first? How do we avoid this being the new thing that everyone knows about, but no one has the time to try?
How did we “sell” it?
In our case, we had multiple strategies that worked together.
The first one was communication: It was simple to write new endpoints, and the gains were very clear: Both the performance gain through concurrent data fetch, and the possibility to share an endpoint with the apps, were huge selling points.
We partnered with the Product organisation to make clear that they needed to include a standard question in their project templates “Is this being built on APIv3? If not, why not?”, which enforced the company strategy to adopt Mobile First development.
We had pilot groups that tried out the new API framework for new product features, we partnered closely with the Activity Feed team to help them adopt APIv3. This resulted in early converts that were strong advocates for wider adoption. The benefits were clear to communicate but also so compelling that we didn’t need to do much to sell teams, they could see what the activity feed had accomplished.
We had evolving documentation about the architecture and how to use the framework, in the API handbook. We had a codelab, which engineers could use for self study. In the lab you learn to build endpoints by building a little library app.
We had workshops in which people did the code lab together with experienced API developers.
We had tooling to learn about the system, such as the distributed tracing tool CrossStitch, the fanout manager informing about the complexity of the fetch, and datafindr for general information about API endpoints and example calls and results.
Once it’s going, too fast to keep up
After the word was spreading, so many people were making new endpoints that it was almost impossible for us to keep up with it. Initially, we alerted on each new endpoint, but had to switch this off due to the rate at which new endpoints were being built.
The biggest motivator for adoption was that we communicated the gains. Everyone became interested in the performance gains and the cross-device-sharing through abstraction, which was a motivating incentive to switch. Also, there was a potential speedup through caching of endpoints.
White space gets filled with code
Immediately after adoption started, we saw some misunderstandings in the code that we did not foresee. Such as magic hiding in traits, inheritance between endpoints, while they contain just declarative static functions, or really complex code in bespoke endpoints, which should be library code. The minimalist code required to write an endpoint didn’t make explicit what is missing by accidental omission vs. deliberate design decisions.
This also lead to confusion about which building blocks of endpoints are mandatory, about how to opt into a service, naming conventions, the location in the file system, and the complete route of an endpoint.
This was a caused some cleanup work for the API team, but in fact it was a good thing, caused by fast and wide adoption of a new system, and is probably inevitable in that case.
Documentation is critical
We addressed the problems by improving our documentation, and creating an interactive endpoint generator that creates a file in the right place, with stub functions, and outputs the route.
Also, we added format checks for endpoints in Jenkins. It was helpful to have the API code lab as a narrative introduction to gain practical experience fast, while also developing the API handbook as a reference manual. Two goals, two documents.
Pockets of Patterns
When we internally announced the project in which we unified the format for how an API endpoint is written, we started getting emails about what developers wanted from the API. And also: Mails about patterns that had emerged within the API framework without our knowledge. People had found and implemented solutions for their specific use cases. Our discussions had opened up a space to evaluate those patterns, and share them with all API endpoint developers!
We’ve learned that we need to stay on top of new patterns as they emerge, and keep talking to developers about their needs. A subtle, but powerful paradigm shift:
We, as the API framework developers, listen to endpoint developers and address everyone’s needs by evolving the framework. Instead of having developers using our API framework as a service, we are serving them.
Also, we learned to trust our fellow developers even more! We underestimated their curiosity and willingness to try out new things, we underestimated the adoption rate, and we also underestimated how creative they would be with finding solutions for their specific use cases that we did not plan for. What an awesome surprise. 😀
Types too late + types too loose D:
Also, we underestimated everyone’s willingness to do the work of typing their endpoint result. In the beginning, we made specifying a resultType optional, because we feared making it mandatory would slow the adoption. And if a type was specified, we only sampled the type errors, to not make correct typing a hurdle during API endpoint development, but rather a “nice to have” hint for when things go wrong. Not a guarantee. In retrospect, we could have saved ourselves a lot of extra work if we had made the resultType mandatory, and if we had made the type errors more prominent from the beginning.
Etsy’s developers are generally happy about result types, they helped to implement a coarse grained gradual typing, and actively pushed us towards making the result types mandatory, to rely on them as a guarantee.
Work in progress
There remain four open problems we all touched upon:
- Active Cache Invalidation – it’s hard.
- Caching the API at different geographic locations (Edge)
- opening up v3 to 3rd party developers
A question that comes with the developer adoption question is: What are we thinking in terms of preparing for third parties? This is a valid question, and we did not fully answer it yet, because we switched to generated client code in the meantime.
Even though we started announcing the new API v3 on our developer mailing lists, all third party apps are still on version 2. The platform for v3 is ready to open up to 3rd parties, as soon as Etsy as a whole decides to make the switch.
What did we learn? How were we transformed?
This story is a case study of how the API first approach transformed the architecture, and also transformed how we work with and think about our API at Etsy. It covers a lot of ground and is influenced by our infrastructure and history. The story of architectural decisions and adoption is transferable to other systems though. So what did we learn from the decisions? Or from the decisions not being made yet? Were there surprises?
We learned that we can seamlessly grow a system from a hack day project to a live system. Over time, a domain specific language of endpoints evolved, and the system grew according to the endpoint developers needs.
Also, we learned a great deal about cURL or the according php extension! Not only does it allow to make parallel requests, but it also let’s us check on, control and modify the in-flight requests in a non-blocking way.
Another realization is that we should have thought about caching early on. We added it in a hurry and are still working on active cache invalidation. So far, we can only use timeouts, which limits the class of endpoints that can be cached. Also, we have to think some more about variations due to different locales, which might only vary for parts of the response.
A huge positive surprise was the HHVM experiment. By teeing traffic and trying out a completely new system, we solved our performance problems to this day.
The textbook approach says: design APIs contract first. As we have seen, we circumvent that part by using automated client generation. Components help us with abstraction, but even the bespoke layer is very malleable. This is an interesting trick.
Another surprising learning is that we should trust our developers not only to adopt and adapt to the new API framework, but also to go all the way and make typing of endpoints mandatory from the start.
Also, we should keep the conversation ongoing to find out how the API framework can serve developers and their needs, instead of just designing it as a rigid service. We even found solutions that they had created themselves, and the framework could officially adopt them for everyone’s benefit.
Did it work? How did it end? Did we succeed?
Let’s assess this with a quote from Etsy’s Andrew Morrison, from before all this work started:
“We desperately need to figure out a scheme for allowing concurrency or else we’re going to have performance problems forever.”
YUP! We solved this, and a few unexpected things along the way. Despite initial problems with the extra layer, we figured out unconventional solutions by experimenting and organically growing the system towards the developers needs.
What we are discussing now is how to shift some of the complexity back towards the client. Maybe GraphQL is an interesting approach? Right now, the clients don’t know how the queries will be executed. This makes sense if you have services and teams and clear cut boundaries and interfaces, and if you have a contract in form of an API. Our approach is currently not structured like that.
But could we compile an alternative, more knowledgeable PHP client, that lifts the composability from the HTTP layer into the API consumer code, in cases where we are our own consumer and create a website of the same tree structure? It’s clear in this case how many endpoints we are calling, and as long as it’s us, it’s safe to lift this control beyond the client, into the view.
This is the third post in a series of three about Etsy’s API, the abstract interface to our logic and data.
‘Crashcan’ (think trashcan, but for crashes) is Etsy’s internal application for mobile crash analytics. We use an external partner to collect and store crash and exception data from our mobile apps, which we then ingest via their API. Crashcan is a single-page web app that refines and better exposes the crash data we receive from our external partner.
Crashcan gives us extra analysis of our crashes on top of what our partner offers. We can make less cautious assumptions about our stack traces internally, which allows us to group more of them together. It connects crashes to our user data and internal tooling. It allows us to search for crashes in a range of versions. It’s provided a good balance between building our own solution from scratch and completely relying on an external partner. We get the ability to customize our mobile crash reporting without having to maintain an entire crash reporting infrastructure.
Error Reporting – The Web vs. Mobile
Unfortunately, collecting mobile crashes and exceptions is (of necessity) quite different from most error reporting on the web, in several ways:
- On the web (especially the desktop web), we can be fairly confident that a user is online – they’re less prone to flaky or slow connections. Plus, users don’t expect to be able to access the web when they don’t have a connection.
- ‘Crashes’ are very different on the web, so many exceptions and errors are less severe. Sure, users may need to refresh a page, but it’s rare that a web page will crash their browser.
- We can watch graphs and logs live as we deploy on the web (hooray, continuous deployment!) – and it’s clear if hundreds or thousands of exceptions start pouring in. With our mobile apps, however, we have to wait for users to install new versions after a release makes it to the App Store. Only then do we get to see exceptions.
- With mobile, when a crash occurs, we normally can’t send the crash until the app is launched again (with a data connection) – in some cases, this can be days or weeks.
- App crashes are costly to the user, as the app crashing loses the user’s state. On the web, even if a page breaks for some reason, the user keeps their browser history.
- With the web, there’s one version of Etsy at any point in time. It’s updated continuously, and every user is running the latest version, always. With the apps, we have to deal with multiple versions at once, and we have to wait for users to update.
With these differences, it’s been important to approach the analysis of crashes and exceptions differently than we do on the web.
A Crash Analytics Wish List
Many of the issues mentioned above were handled by our external partner. But while this external partner provides a good overview of our mobile crashes, there were still some bits of functionality that would make analyzing crashes significantly easier for us. Some of the functionality we really wanted was:
- An easy way to filter broad ranges of app versions – like being able to specify 4.0-4.4 to find all crashes for versions between 188.8.131.52 and 4.4.999.999.
- Links between users’ accounts and specific crashes – like “This user reported they experienced this crash… Let’s find it.” This coupling with our user database allows us to better determine who is experiencing a crash – is it just sellers? Just buyers?
- Better crash de-duplication, specifically handling different versions of our apps and different versions of mobile operating systems. For example, crash traces may be almost identical, but with different line numbers or method names depending on the app version. But if they originate in the same place, we want to group them all together.
- Crash categorization – such as NullPointerExceptions versus OutOfMemory errors on Android – because some types of crashes are fairly easy to fix, while others (like OutOfMemory errors) are often systemic and unactionable.
- Custom alerting with various criteria – like when we experience a new crash with this keyword, or when an old version of our app suddenly experiences new API errors.
It seemed like it’d be fairly straightforward to build our own application, using the data we were already collecting, to implement this functionality. We wanted to augment the data we receive from our external partner with data and couple it with our own internal tooling. We also wanted to provide any interested Etsy employees with a simple way to view the overall health of our apps. So that’s exactly what we chose to do.
Crashcan’s structure was a pretty wide-open space. All it really needed to do was provide crash ingestion from an API, process the crash a bit, and expose it via a simple user interface (it sounds a lot like many technologies, actually). So while the options for technologies and methodologies were open, we ultimately decided to keep it simple.
By using PHP, Etsy’s primary development language, we keep the barrier to entry for developers at Etsy low . We used as much modern PHP as possible, with Composer handling our dependency management. MySQL handles the data storage, with Doctrine ORM providing the interface to the database.
Ingesting the data was the first hurdle. Before handling anything else, we needed to make sure that we could actually (1) get the data we wanted and (2) keep up with the number of crashes that we wanted to ingest, without breaking down our system. After all, if you can’t get the data you want and you can’t do so reliably, there’s really no point.
After analyzing the API endpoints we had at our fingertips (yay, documentation!), we determined that we could get all the data we wanted. The architecture needed to allow us to:
- Determine whether we already have a crash (regardless of whether it has been deduplicated on our end)
- Keep track of deduplicated crashes, and link them to the originating crash from the external provider
- Run complex queries to combine data
- Analyze whether crashes are meeting specific thresholds, like whether a new crash has occurred at least n times
- Count crashes by category
- Filter everything by version range and other criteria
In the end, we developed a schema that allowed us to fulfill all those needs while remaining quick in response to queries:
To actually ingest the data from our external provider, we run a cron job every minute that checks for new crashes. This cron runs a simple loop – it loads new crashes from a paginated endpoint, looping through each page and each crash in turn. Each crash is added to a queue so that we can update it asynchronously.
We run a series of workers that run continuously, monitoring the queue for incoming crashes. As these workers run, they each pick a crash off the queue and processes it. This includes several steps, first checking whether we have the crash already, then updating it if we have it or creating a new crash if we don’t. We also go through each crash’s occurrences to make sure that we’re recording each one and tying it to an existing user if one exists. The flowchart below demonstrates how these workers process crashes.
Monitoring & Alerting
After building Crashcan’s initial design and getting crashes ingesting correctly, we quickly realized that we needed utilities to monitor the data ingestion and to alert us when something went wrong. Initially, we had to manually compare crash counts in Crashcan with those that our external provider offered in their user interface. Obviously, this was neither convenient nor sustainable, so we began integrating StatsD and Nagios. To check that we were still ingesting all our crashes, we also wrote a script to perform a sort of ‘spot-check’ of our data against our external provider’s – which fails if our data differs too much from theirs.
We created a simple dashboard, linked to StatsD, that allows us to see at-a-glance if the ingestion is going well – or if we’re encountering errors, slowness, or hitting our API limit. While we plan to improve our alerting infrastructure over time, this has been serving us well for now – though before we got our monitoring in a good state, we hit some big snags that kept us from being able to use Crashcan for weeks at a time. There’s an important lesson there: plan for monitoring and alerting from the beginning.
When deciding on Crashcan’s structure, we decided to focus first on building a stable, straightforward API. This would enable us to expose our crash data to both users and other applications – with one interface for accessing the data. This meant that it was simple for us to build Crashcan’s user interface as a Single Page Application. Very few of the disadvantages of single page applications applied in Crashcan’s case, since our users would only be other Etsy employees. Building a robust API also enabled us to share the data easily with other applications inside Etsy – most especially with our internal app release platform.
When an Etsy engineer accesses Crashcan, we aim to present them with the most broadly applicable information first – the overall health of an app. This is presented through an overview of frequent crashes, common crash categories, and new crashes, along with a graph showing all crashes for the app split out by version. This makes it much easier to spot big new crashes or problematic releases. The engineer then has the option to narrow the scope of their search and view the details of specific crashes.
While we’ve finished Crashcan v1 with much of the core functionality and gotten it in a stable enough state that we can depend on its data, there’s still quite a bit that we’d like to improve. For example, we haven’t even begun to implement a couple of the items we mentioned in our wish list, like custom alerting. Second, the user interface could do with some bugfixes and refinement. Right now, it’s in a mostly-usable state that other Etsy engineers can at least tolerate, but it’s not stable or refined enough that we’d be comfortable releasing it to a wider audience.
Additionally, our crash deduplication is still rudimentary. It only performs simple (and expensive) string comparisons to find similar crashes. We’d like to implement more advanced and more efficient crash deduplication using crash signature generation. This would give us a much more reliable way of determining when crashes are related, therefore providing a more accurate picture of how common each crash is.
Most of the pain points in Crashcan’s development weren’t new or especially unexpected, but they serve as a valuable reminder of some important considerations when building new applications.
- Build with monitoring and alerting in mind from the beginning. We could’ve avoided a several-week-long lapse in functionality had we focused on building in monitoring from the beginning.
- Don’t be afraid to consult with others on structural or technical decisions, and then just make a decision. It’s something that I’ve always struggled with – but getting blocked on making decisions or digging too deep into the minutiae of every decision is a great way to waste time.
- Document your assumptions – especially when dealing with APIs – as small assumptions can turn into big deals later on. This is what led to our biggest failure – we mistakenly assumed that crash timestamps were accurate. When a crash said it had occurred 4 days in the future, our app stopped updating, because it was checking only crashes occurring after that crash.
External search engines like Google and Bing are a major source of traffic for Etsy, especially for our longer-tail, harder to find items, and thus Search Engine Optimization (SEO) is important in driving efficient listing discovery on our platform.
We want to make sure that our SEO strategy is data-driven and that we can be highly confident that whatever changes we implement will bring about positive results. At Etsy, we constantly run experiments to optimize the user experience and discovery across our platform, and we therefore naturally turned to experimentation for improving our SEO performance. While it is relatively simple to set up an experiment on-site on our own pages and apps, running experiments with SEO required changing how Etsy’s pages appeared in search engine results, over which we did not have direct control.
To overcome this limitation, we designed a slightly modified experimental design framework that allows us to effectively test how changes to our pages affect our SEO performance. This post explains the methodology behind our SEO testing, the challenges we have come across, and how we have resolved them.
For one of our experiments, we hypothesized that changing the titles our pages displayed in search results (a.k.a. ‘title tags’) could increase their clickthrough rate. Etsy has millions of pages generated off of user generated content that were suitable for a test. Many of these pages also receive the majority of their traffic through SEO.
Below is an example of a template we used when setting up a recent SEO title tag experiment.
We were inspired by SEO tests at Pinterest and Thumbtack and decided to set up a similar experiment where we randomly assigned our pages into different groups and applied different title tag phrasings shown above. We would measure the success of each test group by how much traffic it drove relative to the control groups. In this experiment, we also set up two control groups to have a higher degree of confidence in our results and to be able to quality check our randomized sampling once the experiment began.
We took a small sample of pages of a similar type while ensuring that our sample was large enough to allow us to reach statistical significance within a reasonable amount of time.
Because visits to individual pages are highly volatile, with many outliers and fluctuations from day to day, we had to create relatively large groups of 1000 pages each to expect to reach significance quickly. Furthermore, because of the high degree of variance across our pages, simple random sampling of our pages into test groups was creating test groups different from each other in a statistically significant way even before the experiment began.
To ensure our test groups were more comparable to each other, we used stratified sampling, where we first ranked the the pages to be a part of the test by visits, broke them down into ntile groups and then randomly assigned the pages from each ntile group into one of the test groups, ensuring to take a page from each ntile group. This ensured that our test groups were consistently representative of the overall sample and more reliably similar to each other.
We then looked at the statistical metrics for each test group over the preceding time period, calculating the mean and standard deviation values by month and running t-tests to ensure the groups were not different from each other in a statistically significant way. All test groups passed this test.
Estimating Causal Impact
Although the test groups in our experiment were not different from each other at a statistically significant level before the experiment, there were small differences that prevented the estimation of the exact causal impact post treatment. For example, test group XYZ might see an increase relative to control B, but if Control B was slightly better than test groups XYZ even before the experiment began, simply taking the difference between of the two groups would not be the best estimate of the difference the treatment had effected.
One common approach to resolve this problem is to calculate the difference of differences between the test and control groups pre- and post-treatment.
While this approach would have worked well, it might have created two different estimated treatment effect sizes when comparing the test groups against the two different control groups. We decided that, instead, using Bayesian structural time series analysis to create a synthetic control group incorporating information from both the control groups would provide a cleaner analysis of the results.
In this approach, a machine learning model is trained using pre-treatment data to predict the performance of each test group based on its covariance relative to its predictors — in our case, the two control groups. Once the model is trained, it is used to generate the counterfactual, synthetic control groups for each of the test groups, simulating what would have happened had the treatment not been applied.
The causal impact analysis in this experiment was implemented using the CausalImpact package by Google.
We started seeing the effects of our test treatments as soon as a few days after the experiment start date. Even seemingly very subtle title tag changes resulted in large and statistically significant changes in traffic to our pages.
In some test groups, we saw significant gains in traffic.
While in others, we saw no change.
And in some others, we even saw a strong negative change in traffic.
The two control groups in this test showed no statistically significant difference compared to each other after the experiment. Although a slight change was detected, the effect did not reach significance.
Post-experiment rollout validation
Once we identified the best performing title tag, the treatment was rolled out across all test groups. The other groups experienced similar lifts in traffic and the variance across buckets disappeared, further validating our results.
The fact that our two control groups saw no change when compared to each other, and also the fact that the other buckets experienced the same improvement in performance once the best performing treatment was applied to them gave us strong basis for confidence in the validity of our results.
It appeared in our results that shorter title tags performed better than longer ones. This might be because for shorter, better targeted title tags, there is a higher probability of a percentage match (that could be calculated using a metric like the Levenshtein Distance between the search query and the title tag) against any given user’s search query on Google.
In a similar hypothesis, it might be that using well-targeted title tags that are more textually similar to common search terms helps to increase percentage match to Google search terms and therefore improves ranking.
However, it is likely that different strategies work well for different websites, and we would recommend rigorous testing to uncover the best SEO strategy tailored for each individual case.
- Have two control groups for A-A testing. This allowed us to have much greater confidence in our results.
- The CausalImpact package can be used to easily account for small differences in test vs. control groups and estimate the differences of treatments more accurately.
- For title tags, it is most likely a best practice to use phrasing and wording that would maximize the probability of a low Levenshtein distance match from popular target search queries on Google
Visualization of Stratified Sampling
This post is based on a talk and workshop that Toria and Ian gave at Etsy’s Dublin office in August.
Etsy has a strong set of beliefs that underpins our engineering culture. We believe in code as craft. We believe that if it moves, you should graph it. And we believe that when you’ve got some working code ready, you should “just ship” it.
This practice of “just shipping” is known as continuous deployment. We make small changes frequently, and we hide them behind “config flags” that let us test our work incrementally before a full feature launch. Etsy engineers collectively deploy code to the production site as many as 70 times per day.
Now imagine for a minute that you’re an engineer at an organization doing continuous deployment. You’ve got a small change ready to deploy. Your code is good. Tests pass. It’s all been reviewed. But every time you try to deploy, something goes wrong. This happens all the time, but only to you. Every time you try to deploy, you have to spend half an hour trying to fix the deploy system. No one else is motivated to fix anything because it works just fine for them. The deploy system is better for everyone because of your investigations, but fixing the deploy system isn’t part of your job. You just want to ship code!
What would be great is if some other engineers would pitch in and do the work too, so that you have more time to do your actual job. What you need are allies.
Surprise! That was a thinly-veiled metaphor for what it feels like to be a member of an underrepresented group trying to improve their work environment. Relying on members of minority groups to shoulder the burden of diversity issues is just as flawed as expecting one person to do all the work to fix a broken deploy system. You can’t excel at your job when you spend half your time dealing with other stuff. We need ways of spreading the load. We need allies. And we hope that’s why you’re reading this now.
So what is an ally? Let’s start by defining some important terms so that we’re all on the same page.
Women, men, and non-binary people
At Etsy, we recognize that gender is non-binary: it lies on a spectrum. When we use the term “men” here, we’re talking about anybody who identifies as a man and experiences the benefits of male privilege. When we say “women”, we’re talking about anybody who identifies as a woman. Some people don’t fall into either of these categories: they are non-binary. Gender discrimination impacts these people too, and as such you’ll see references to them throughout this post.
Much of the discrimination that people face depends on how society identifies their gender, rather than how they themselves identify. A person with a beard is likely to be treated like a man regardless of their chosen gender, but they still have to deal with bias and prejudice in their daily life.
“Feminism is the radical notion that women are people.” — Marie Shear
Of course women (and non-binary folks) are people. But while we think “of course they’re people”, we tend to overlook the countless ways in which society as a whole undervalues women and their work: lower wages for the same work or overall lower wages in industries dominated by women, portrayals of women as prizes to be won or objects to have, and the ignoring or ridiculing of problems faced by women, to name but a few.
As you learn more about feminism, another term you’ll see is “intersectional feminism”. Intersectionality is the recognition that people are complex beings with multiple axes of identity. Although we’re talking primarily about gender here, a person’s identity is not solely defined by their gender. Intersectional feminism acknowledges that we can’t solve problems for all women without considering that women have different experiences based on their race, religion, sexuality, gender expression, or able-bodiedness.
Good news! Allyship is also intersectional! If you’re white, you can serve as an ally to people of color. If you can see, you can serve as an ally to people with vision loss. If you’re a man, you can serve as an ally to women. If you’re cisgender, you can serve as an ally to folks who are trans, non-binary, or genderqueer.
Consider intersectionality throughout this post. Ask yourself how these techniques for allyship can be applied for other underrepresented groups.
The idea of privilege is often a massive stumbling block for people. We rebel against the idea that we have had an unfair advantage in life. “I had to work hard,” you’ll hear people claim. “I’ve struggled for everything I’ve got.”
Privilege does not mean you had it easy. It means you had it easier. If a man grows up in poverty, and drags himself out of it, that’s impressive. That’s hard. If he’d been a woman, he’d have had to do all the same things, while also fighting society’s expectations of what women can or should do. Privilege is what you don’t have to deal with.
In the opening example, everyone else ships more than you—not because they’re better than you, but because they don’t have to deal with the additional nonsense that you do.
Understanding privilege—and understanding and accepting your own privilege—is a vital part of becoming an effective ally. You’re not being asked to beat yourself up about it, you’re being asked to empathize with others who are less privileged so that you can do something about it.
Along with “privilege”, “patriarchy” is another term that trips people up. It brings to mind a shadowy cabal of men pulling strings and malevolently excluding women. This is… silly.
Instead, the term “patriarchy” refers to structural sexism and gender discrimination. We are raised in a society that historically and systematically favors men over women. This colors everything we do and everything we see. We’re surrounded by the fruits of this bias, steeped in it from birth. Just one example: studies show that, from an early age, girls are held to higher standards of politeness, while boys are expected to speak dominantly and assertively, producing power imbalances in conversations that continue through to our adult interactions.
Patriarchy perpetuates itself. Not through conscious malevolence (most of the time), but because male-dominated power structures tend to stack the deck against women gaining power, and so produce more male-dominated power structures.
The perpetuation of the patriarchy is rooted in unconscious bias. These are biases we don’t even realize we have, but which influence how we think and act. They are instilled in us over the years by repetitive stimuli from our environment.
Consider the following story: “A man and his son were in a car accident. The man died on the way to the hospital, but the boy was rushed into surgery. The surgeon said: ‘I can’t operate! That’s my son!’”
The first time most people are presented with this, they fail to realize the surgeon is the boy’s mother. Mental blind spots like this one show that we are all a little bit sexist. (As a side note, this thought experiment has been around for many years. In recent years, respondents have often thought the surgeon was the boy’s other father. They are more willing to accept a gay male couple than a female surgeon.)
Dr. Catherine Ashcraft from the National Center for Women and Information Technology (NCWIT) gave a lecture at Etsy on unconscious bias. She talked about some experiments for quantifying gender bias. The NCWIT staff took these tests, and all the participants were found to be unconsciously biased against women. To repeat: women, working for the National Center for Women and Information Technology, working to bring gender diversity to our sector, were all biased against women.
We are all trained, over time, to have these habitual, instinctive responses to situations. When these unconscious biases are challenged, we tend to react negatively. For example, women who adopt more traditionally male behaviors and speech patterns in the workplace are often perceived more negatively than women who fit society’s expectations.
What we can do, however, is make conscious corrections. We can actively try to overcome these unconscious biases.
The best way to combat unconscious bias is to recognize that it exists and identify when it’s happening. It’s easy to identify and “call out” overtly sexist behavior, but what about the more subtle and ambiguous stuff?
Casual phrases like “you’re really good at sports for a girl!” or “going out with the guys tonight; leaving the old ball and chain at home!”, using gendered phrases like “the ops guys”, speaking over women in meetings, repeating their ideas as your own, expecting them to do clerical work like note-taking, or standing over them at a desk in a dominant position: these are examples of microaggressions. They’re the “little things” that, examined individually, don’t always seem like a big enough deal to make a fuss over. “Maybe it was a joke?” “Maybe he didn’t mean it that way?” “It’s just an expression!”
But microaggressions are cumulative. Over time, these subtle comments build and reinforce traditional power structures by reminding women and non-binary individuals of their position in society.
We must notice these subtle, often unconscious microaggressions in others—and in ourselves—in order to correct them.
And that brings us to “ally”: the key part of this post. An ally is a member of a privileged group (in this case, men) who works to enable opportunity, access, and equality for members of a non-privileged group (in this case, women and non-binary people). They are using their privilege, their advantages, to bring about change.
How can allies help?
So… centuries—millennia!—of systematic discrimination against women. Biases baked into us from birth! Society fundamentally biased against women! This is an overwhelming problem. It’s hard to know where to start.
Like any large, complex problem, begin by breaking it down into smaller, more manageable parts. Start at your workplace. If you can make a difference there, you not only improve the lives of the less-privileged people you work with, but you also improve your working environment. Research shows that more diverse teams, with more diverse perspectives and experiences, make better decisions and build better products.
Start today. You’ve read this far, so you’re already interested in making a difference. Don’t wait until you’re an “expert” on feminist theory to start speaking up. Just start trying. And just like with continuous deployment, when you mess up (and everyone does), get feedback, listen, learn, fix the problem, and try again.
Now you’re ready to start, but as a member of a privileged group, what can you do? What do allies offer?
- Power and authority. In male-dominated power structures, men tend to have more powerful and influential positions and voices. Use those voices to speak on behalf of those with less privilege.
- Access. Our networks tend to look like ourselves, so men tend to have networks full of other (powerful) men. Provide access to these networks.
- Amplification. In addition to speaking up on behalf of women and non-binary individuals, men should amplify and endorse their words and achievements.
- Modeling. Allies model good behavior and interactions, such as talking openly with others about gender discrimination or being vocal about addressing their unconscious biases.
- Teaching. Expose other privileged people to the concepts you learn about. People from marginalized groups are often expected to do the teaching and allies can share the load.
Ten Steps to Being An Effective Ally
Being an ally is a constant learning experience. Being an ally isn’t a fixed state, it’s not a badge you earn (or take) and sew onto your sleeve and you’re an ally from then on. Being open to feedback and demonstrating that you’re willing to accept and learn from criticism is vital. More than anything, “ally” is a status accorded to you by those that you’re trying to help, based on your words and actions.
So, how do we ally?
1. Educate yourself
There are a ton of resources out there for you to learn from. Make the effort to educate yourself, rather than demanding that marginalized people explain things to you. You wouldn’t ask Rasmus Lerdorf, inventor of the PHP programming language, to explain basic PHP concepts. You would Google it. You would go out and find the articles, tutorials, and forum threads that already exist for beginners. There is already material for you to learn from: go out, find it, and read it. (We’ve created a reading list that would make a great starting point.)
While you’re reading, be aware that feminism isn’t a monolithic block of thought. There are a wide variety of viewpoints on the topic. Be sensitive to the possibility that what you’ve learned is just one viewpoint.
As an ally, you will never stop learning. Keep actively seeking out new writing and material so that you can deepen your understanding.
2. Expand your network
A great way to expand your understanding of feminism and gender issues is to expand and diversify your network. Make sure to follow your female and non-binary colleagues on social media. Then, make a habit of following the other folks they retweet or mention.
If you’d like to introduce yourself to a woman at your workplace or at a conference, do so! Just remember to keep the discussion technical and on-topic: talk to them because you’d like to know more about that new machine learning model they implemented, not because you need more diverse friends.
3. Listen and believe
Now that you have a good number of women and non-binary folks in your network, listen to them! Arguably the biggest thing you can do as an ally is to listen. Listen to the stories of the difficulties they’ve faced and the problems they’re experiencing in the workplace. When you hear their stories, especially ones that don’t fit with your mental model of your workplace or environment, believe them. No “aren’t you over-reacting?” No “I think you’ve misunderstood.” If they tell you there’s a problem, there’s a problem. So listen.
After listening, ask how you can help. Ask how you can support them in resolving the problem. It doesn’t have to be you doing the solving—your colleagues aren’t helpless damsels in distress—but your support can be invaluable.
One of the most difficult things to listen to is criticism of yourself and your actions. You still need to listen and believe and learn.
But just because you haven’t been told there’s a problem, that doesn’t mean there isn’t one. Speaking about experiences of discrimination is often very difficult, because it tends to be very, very risky. Marginalized people who report discrimination often find that doing so negatively impacts their careers. When they raise issues, they get labeled complainers or trouble-makers, while those they complain about see no consequences or repercussions for their actions.
Remember that you have no reason to expect that they will share their stories or concerns with you. These are not conversations that men, even well-meaning allies, should initiate. Don’t ask for these conversations, but when they happen, listen and believe.
4. Notice the small stuff
Your colleagues aren’t going to tell you about every bad experience; in fact, they won’t tell you about most of them. You can help by noticing problems by yourself and addressing them.
Microaggressions—the small stuff—are some of the most subtly toxic behaviors that women and non-binary people have to deal with. Microaggressions slowly eat away at their self-confidence and patience.
When you see some small inequity, mention it. If a colleague interrupts a woman, say, “I’d like to hear what __ was saying”. If a colleague assumes a woman will take notes, say, “I think __ could have some useful insights on this topic—could somebody else take notes so that she can participate more actively?” or possibly, “Have we considered a formal note-taking rotation to ensure that we’re not making gendered assumptions about who will do the clerical work?”
Try to also consider whether your comment will put the colleague who suffered the inequity in an uncomfortable position. If you’re not sure what to do, you should wait, talk to them privately, then defer to their decision on what further action should be taken. They may wish for you to speak with the person directly on their behalf or they may prefer for you to go to a manager. They may not want you to do anything at all (perhaps because they have plans to address this on their own). Sometimes the very recognition of the microaggression is enough! Remember: they don’t need you to save them, but your support and validation can be very valuable.
If you raise things like this with a colleague, it may feel “nitpicky”. It will certainly feel uncomfortable. But many productive and important conversations are uncomfortable! As an ally, you should be prepared to shoulder a bit of the discomfort and awkwardness that women and non-binary people experience every day.
You can also work on “anti-microaggressions”—small acts to nudge the culture in the opposite direction. Examples might include making sure a diverse range of people are featured in your illustrations, slide decks, user stories, etc., or that you pay attention to gendered language in your tools.
5. Teach others
Another way you can share the load is by teaching. Women and non-binary individuals are constantly expected to teach others about feminism and gender issues. It can be a great burden. Help them out by doing some of the teaching.
Have honest conversations with the people you work with, particularly if you observe behaviors that you know (or suspect) may have a discriminatory effect on your other colleagues. Remember that, most of the time, these behaviors are unconscious, or learned in different work environments. Talking about negative behaviors without blame and educating the men you work with helps them become better colleagues, and in the vast majority of cases it’ll be well-received.
In addition to educating other men, encourage them to speak up if they see instances of bias. The more men there are working on this, the easier it will be to make your workplace a more egalitarian environment.
6. Amplify and endorse
There is no point in having equality of numbers if there is no equality of influence. As such, we have to make sure that people from underrepresented groups are heard in meetings, that they have a chance to speak, and that their views are considered and respected. The frustration of not being able to contribute, or being ignored or belittled, is a fast track to quitting.
One type of unconscious bias is called “listener bias”. We are socialized to think that women talk more in general, and so tend to significantly overestimate the actual amount of time women spend talking in discussions, to the extent that we can think that women are dominating conversation when in fact men are doing most of the talking. As always, be aware of this unconscious bias. Correct for it by inviting your female and non-binary colleagues to offer their opinion in a meeting.
Make sure that women and non-binary individuals in your company have the opportunity to work on high-profile projects. If you make staffing decisions, pay attention to gender bias when considering who gets what role. If you’re not making those decisions, you can still advocate and lobby for them within your organization. Support and encourage them, but don’t micromanage them, or do all the work for them. Trust their expertise. You hired them, so they must be talented. If you don’t make use of their talents, not only do you lose out in the short term, but they’ll also eventually quit and you’ll lose out massively in the long term.
Also make sure they get credit for their accomplishments and contributions. Make sure they get to brag about what they’ve achieved. Approve of this behavior, rather than branding them as arrogant or conceited. Remember that society tends to consider modesty a virtue for women, but not men.
Amplify their voices outside of your workplace, too. If you’re invited to speak in public, ask yourself if there’s a woman or non-binary individual equally—or more!—qualified to speak on the topic. Pay attention to gender balance in panels and speaker line-ups at conferences you’re planning to participate in. Ask the organizer why their panel lacks diversity. Ask to see their Code of Conduct, and if they don’t have one, encourage them to change that. Consider not attending events without a code of conduct or refusing to sit on a panel that only includes men.
Social media is another excellent way to increase the visibility of underrepresented genders. If you’ve followed the advice earlier, you’re following women and non-binary folks on social media and have diversified your network, but consider also retweeting and promoting them. If they share a blog post, consider retweeting them instead of writing your own tweet with a link to the same content. Amplify their voices. Even small acts like retweeting can greatly increase their visibility and introduce your followers to more diverse opinions and ideas.
7. Recruit fairly
You know what else helps with gender diversity? Having more diverse people on staff! This might feel like it’s easier said than done, but there are concrete steps you can take to increase the gender diversity of your team.
The first step, which we’ve already addressed, is expanding your network. We tend to do a lot of recruitment from our personal networks, so having a diverse network can make a tremendous impact on the variety of candidates we can recruit.
Take the time and effort to review your job postings for gendered language: could your words make someone feel excluded or unqualified? Look at where your jobs are advertised: are you going to reach a diverse audience?
After you’ve established a diverse pool of applicants, you need to make sure the rest of the process is as fair and unbiased as possible.
When reviewing résumés, be explicitly aware of your unconscious biases to make sure you don’t filter candidates out for the wrong reasons. This doesn’t mean you’re purposely rejecting someone just because you think they’re female. Rather, you might reject someone because they haven’t described their accomplishments the way you might expect. Remember: women are conditioned to be modest and may under-report all the good stuff they’ve done.
There may be other reasons why they don’t conform to your preconceptions of the “ideal candidate”. For example, maybe you’d expect someone with their experience to have a long history of giving conference talks, but they haven’t been speaking at conferences because they perceive conferences as hostile environments.
When it comes to interview time, be mindful of the fact that there are a myriad of ways to be a successful employee and different candidates will excel in different environments. Using a diverse set of interview styles is beneficial for all candidates. Not everyone does well in aggressive “knowledge test”-style interviews. Some are better on a whiteboard, some are better at a keyboard, others respond well to discussion.
This is not to say that we should lower the bar for recruitment; rather, we should accept that we may be using the wrong measuring stick. Expecting everyone to act and respond in a particular way is the very opposite of recruiting for diverse viewpoints and experiences.
On the subject of recruiting women, it’s worth addressing “the pipeline problem”. This is the idea that we can’t hire more women because women aren’t studying computer science. This is somewhat correct, but entirely misleading. Women are not achieving computer science degrees at the same rate as men, it’s true, but the number of women active in the industry is much lower than the total number of women with relevant degrees (and that’s not counting the women who are capable self-taught programmers). Today, women earn 18% of CS degrees. In 1984, they earned 37% of CS degrees. These women are only in their 50s and still active in the industry. What happened to them? Clearly, the pipeline is not the only problem.
What good does it do us if we hire a load of great women and non-binary people, then they all quit because they arrive in a toxic work environment? What if the pipeline leads to a sewage plant?
8. Model and support sustainable work
In tech, particularly, women quit the industry completely with much higher frequency than men. They often leave not just because of sexist behaviors directly, but for a variety of complex reasons.
The expectations of the workplace can place an unreasonable load on all employees. Men are generally expected to meet those demands at the expense of family and personal life, while women are expected to do the opposite. The assumption that women will not have the time to meet these unreasonable demands is one way that society justifies the wage gap. Then, if a couple decides that one of them should stay home to care for the family, who do you think typically quits their job? The woman! Because we pay her less! But we pay her less because we expected her to leave!
In order to keep women in the industry, we need to pay them equally. More than that: we need to create a culture that supports sustainable work in a way that doesn’t pit employees’ personal and professional lives against each other. In doing so, companies invest in employees’ overall health, happiness, and engagement in their work. Your company may have unlimited vacation time or flexible working arrangements, but do your employees feel comfortable actually using those benefits?
Allies can help by actively participating in and supporting a sustainable work culture. They can normalize behaviors such as taking vacation, taking time for family, not working all hours, etc. Etsy’s CEO Chad Dickerson, for example, took full advantage of Etsy’s parental leave benefits (5 weeks at the time, now 26 weeks for all new parents) to help care for his family. More leaders should demonstrate that you can lead robust personal and professional lives that can enhance and support each other.
People of all genders should certainly still be able to opt out of the workplace to concentrate on their families, but that should be a choice, rather than an ultimatum.
9. Don’t lead, follow
Allies are there to share the load, not to take the lead. Allies simply haven’t lived the same experiences as those with whom they are allied. No amount of listening and learning will give you first-hand understanding of a person’s experiences!
Men are typically used to leading and taking charge, but women and non-binary individuals are perfectly capable of fighting their battles and defending themselves: they don’t need a man to step in to save them. What they need from men is support and understanding to make it easier, and for men to do their part so that eventually those battles don’t have to be fought in the first place.
10. Show up
Show up. Every day. Allyship isn’t something you can do in your spare time or only when it’s convenient for you. It’s effort, it’s work—often hard work. Show up, every day, and don’t let it slip.
Showing up includes a healthy dose of self-reflection and self-awareness. Think carefully about your own actions and behaviors—remember that unconscious bias is deeply entrenched and will rear up when you least expect it.
And don’t stop at supporting women and non-binary people at work. Learn about the issues faced by other underrepresented groups and how to apply your allyship skills to supporting them too.
Don’t expect a cookie, though. Actively working to correct injustices should be the baseline, not something special you deserve to be rewarded for. Do the work because the work matters, not because it looks good on your résumé, and give credit to those who helped you get there.
Being an ally is hard. It takes time and work and effort. Fundamentally, men could avoid this time and work and effort. Society doesn’t expect men to be allies. Men have the privilege of being able to ignore these problems if they want to. We hope this post has helped to persuade you that being an ally is important, but also achievable. You can make a difference—a huge difference—if you step up.
The material for this post was inspired (and immeasurably improved) by many women and non-binary people—at Etsy and beyond—who shared their knowledge and experience with us. We’re grateful for their time and effort.
We’d also like to acknowledge the contributions and feedback from men at Etsy who have reflected on their successes—and failures—as allies and shared what they’ve learned.
We also owe a debt to some of the resources made available by NCWIT and The Ada Initiative, as well as the countless people who have written books, blog posts, and talks that have helped us gain a better understanding of this complex topic.
This post references a number of external studies and articles on the research behind issues of diversity in tech and society in general, which are listed below. For more information on the business of allyship, check out our list of recommended reading for allies.
- Understanding the Gender Pay Gap, Payscale
- Argument Cultures and Unregulated Aggression, Heddleston
- Women bosses more likely to be called ‘bitchy’, ’emotional’ and ‘bossy’, Sheffield
- The abrasiveness trap, Snyder
- We’re Making the Wrong Case for Diversity in Silicon Valley, Pittinsky
- The Persistence of Retaliation Against Employees, Knezevich
- I’m a Slack designer, and my world changed when I made an emoji with brown skin like mine, Brito
- Save us, Princess!, Lettvin
- Women at the White House have started using a simple, clever trick to get heard, Werber
- Prattle of the sexes, Hammond
- Speaker sex and perceived apportionment of talk, Cutler, Scott
- Code of Conduct 101 + FAQ, Dryden
- Why Women Don’t Apply for Jobs Unless They’re 100% Qualified, Mohr
- If you think women in tech is just a pipeline problem, you haven’t been paying attention, Thomas
- Women computer science grads: The bump before the decline, Mitchell
- Women In Tech: The Facts, Ashcraft, McClain, Eger
- Why men fear paternity Leave, Paquette
- Strong Families, Strong Business: A Step Forward in Parental Leave at Etsy, Gorman
This is the second post in a series of three about Etsy’s API, the abstract interface to our logic and data. The previous post is about concurrency in Etsy’s API infrastructure. This post covers the operational side of the API infrastructure.
Operations: Architecture Implications
How do the decisions for developing Etsy’s API that we discussed in the first post relate to Etsy’s general architecture? We’re all about Do It Yourself at Etsy. A cloud is just other people’s computers, and not in the spirit of DIY; that’s why we rather run our own datacenter with our own hardware.
Also, Etsy kicks it old school and runs on a LAMP stack. Linux, Apache, MySQL, PHP. We’ve already talked about PHP being a strictly sequential, single-threaded, shared-nothing environment, leading to our choice of parallel cURL. In the PHP world, everything runs through a front controller, for example index.php. In that file, we have to include other PHP files if we need them, and to make that easier, we usually use an autoloader to help with dependencies.
Every web request gets a new PHP environment in its own instance of the PHP interpreter. The process of setting up that environment is called bootstrap. This bootstrapping process is a fixed cost in terms of CPU work, regardless of the actual work required by the request. By enabling multiple, concurrent HTTP sub-requests to fetch data for a single client request, this fixed cost was multiplied. Additionally, this concurrency encouraged more work to be done within the same wall clock time. Developers built more diverse features and experiences, but at the cost of using more back-end resources. We had a problem.
Problem: PHP time to request + racking servers D:
As more teams adopted the new approach to build features in our apps and on the web, more and more backend resources were being consumed, primarily in terms of CPU usage from PHP. In response, we added more compute capacity, over time growing the API to four times the number of servers prior to API v3. Continuing down this path we would have exhausted space and power in our datacenters. This was not a long term solution.
To solve this, we tried several strategies at once. First, we skipped some work by allowing to mark some subrequests as optional. This approach was abandoned because people used it as a graceful error recovery mechanism, triggering an alternate subrequest, rather than for optional data fetches. This didn’t help us reduce the overall work required for a given client request.
Also, we spent some time optimizing the bootstrap process. The bootstrap tax is paid by all requests and subrequests, making it a good place to focus our efforts. This initially showed benefit with some low hanging fruit, but it was a moving target in a changing codebase, requiring constant work to maintain a low bootstrap tax. We needed other ways of doing less work.
A big step forward was the introduction of HTTP response caching. We had to add caching quickly, and first tried the same cache we use for image serving, Apache Traffic Server. While being great for caching large image files, it didn’t work as well for smaller, latency sensitive API responses. We settled on Varnish, which is fast and easy to configure for our needs. Not all endpoints are being cached, but for cached ones, Varnish will serve the same response many times. We accept staleness for a small 10 – 15 minute period, drastically reducing the amount of work required for these requests. For the cacheable case, Varnish handles thousands of requests per second with a 80% hit rate. Because the API framework requires input parameters to be explicit in the HTTP request, this meshed well with introducing the caching tier. The framework also transparently handles locale, passing the user’s language, currency and region with every subrequest, which Varnish uses to manage variants.
The biggest step forward came from a courageous experiment. Dan from the core team looked at bigger organizations that faced the same problem, and tried out facebook’s hhvm on our API cluster. And got a rocketship. We could do the same work, but faster, solving this issue for us entirely. The performance gain from hhvm was a catalyst for performance improvements that made it into PHP7. We are now completely switched over to PHP7 everywhere, but it’s unclear what we would have done without hhvm back in the day.
In conclusion, concurrency proved to be great for logical aggregation of components, and not so great for performance optimization. Better database access would be better for that.
Problem: Balancing the load
If we have a tree-like request with sub-requests, a simple solution would be to route this initial request via a load balancer into a pool, and then run all subrequests on the same machine. This leads to a lumpy distribution of work. The next step from here is one uniform pool, and routing the subrequests back into that pool again, to have a good balance across the cluster. Over time (and because we experimented with hhvm), we created three pools that correspond to the three logical tiers of endpoints. In this way, we can monitor and scale each class of endpoints separately, even though each node in all three clusters works the same way.
Where would this not work?
If we sit back and think about this for a bit – how is this architecture specific to Etsy’s ecosystem? Where wouldn’t it work? What are the known problems?
The most obvious gaping hole is that we have no API versioning. How do we even get away with that? We solve this by keeping our public API small and our internal API very very fluid. Since we control both ends of the internal API via client generation and meta-endpoints, the intermediate language of domain objects is free to evolve. It’s tied into our continuous deployment system, moving along with up to 60 deploys per day for etsy.com. And the client is constantly in flux for the internal API.
And as long as it’s internal at Etsy, even the outside layer of bespoke AJAX endpoints is very malleable and matures over time.
Of course this is different for the Apps and the third party API, but those branch off after maturing on the internal API service over several years. Software development companies who focus on an extensive public API or even have that as the main service could not work in this way. They would need an internal place to let the API endpoints mature, which we do on the internal API service that is powering Etsy.
Let’s talk about tooling that we built to learn more about the behavior of our code in practice. Most of the tools that we developed for API v3 are around monitoring the new distributed system.
CrossStitch: Distributed tracing
As we know, with API v2, we had the problem that almost an arbitrary amount of single threaded work could be generated based on the query parameters, and this was really hard to monitor. Moving from the single-threaded execution model to a concurrent model triggering multiple API requests was even more challenging to monitor. You can still profile individual requests with the usual logging and metrics, but it’s hard to get the entire picture. Child requests need to be tied to back to the original request that triggered them, but they themselves might be executed elsewhere on the cluster.
To visualize this, we built a tool for distributed tracing of requests, called CrossStitch. It’s a waterfall diagram of the time spent on different tasks when loading a page, such as HTTP requests, cache queries, database queries, and so on. In darker purple, you can see different HTTP requests being kicked off for a shop’s homepage, for example you see the request for the shop’s about page is running in parallel with requests for other API components.
Fanout Manager: Feedback on fanout limit exhaustion for developers
Bespoke API calls can create HTTP request fanout to concurrent components, which in turn can create fanout to atomic component endpoints. This can create a strain on the API and database servers that is not easy for an endpoint developer to be aware of when building something in the development environment or rolling out a change to a small percentage of users.
The fanout manager aims to put a ceiling on the overall resource requests that are in flight, and we are doing this in the scheduler part of the curl callback orchestrator by keeping track of sub-requests in memcached. When a new request hits the API server, a key based on the unique user ID of that root request is put into memcached. This key works as a counter of parallel in-flight requests for that specific root request. The key is being passed on to the concurrent and component endpoint subrequests. When the scheduler runs a subrequest, it increments the counter for that key. When the request got a response and it’s slot is freed in the scheduler, the counter for the key is decremented. So we always know how many total subrequests are in-flight for one root request at the same time.
In a distributed system like this, multiple requests can be competing for the same slot. We have a problem that requires a lock.
To avoid the lock overhead, we circumvent the distributed locking problem by relying on memcached’s atomic increment and decrement operation. We optimistically first increment the memcached key counter, and then check whether the operation was valid and we actually got the slot. Sometimes we have to decrement again because this optimistic assumption is wrong, but in that case we are waiting for other requests to finish anyway and the extra operation makes no difference.
If an endpoint has too many sub-requests in flight, it just waits before being able to make the next request. This provides a good feedback for our developers about the complexity of the work before the endpoint goes into production. Also, the fanout limit can be hand-tweaked for specific cases in production, where we absolutely need to fetch a lot of data, and a higher number of parallel requests speeds up that fetching process.
Automated documentation of endpoints: datafindr
We also have a tool for automated documentation of new endpoints. It is called datafindr. It shows endpoints and typed resources, and example calls to them, based on a nightly snapshot of the API landscape.
Wanted: Endpoint decommission tool
Writing new endpoints is easy in our framework, but decommissioning existing endpoints is hard. How can we find out whether an existing endpoint is still being used?
Right now we don’t have such a tool, and to decommission an existing endpoint, we have to explicitly log whether that specific endpoint is called, and wait for an uncertain period of time, until we feel confident enough to decide that no one is using it any more. However, in theory it should be possible to develop a tool that monitors which endpoints become inactive, and how long we have to wait to gain a statistically significant confidence of it being out of use and safe to remove.
This is the second post in a series of three about Etsy’s API, the abstract interface to our logic and data. The next post will be published in two weeks, and will cover the adoption process of the new API platform among Etsy’s developers. How do you make an organization switch to a new technology and how did this work in the case of Etsy’s API transformation?
Back in 2014, Etsy started using the ELK (Elasticsearch, Logstash & Kibana) stack. We’ve previously written about how we use saved searches as a reactive security mechanism. When we made the transition to ELK, we noticed there was no way to automatically schedule searches and be notified on the results. Today, we’re introducing our open source solution to this problem: 411.
411 is a query scheduler: it executes saved Elasticsearch queries against your cluster, formats the results, and sends them to you as an alert. Our motivation behind creating 411 was to enable us to easily create customizable alerts to enhance our ability to react to important security events.
As a part of that customizability, 411 gives you lots of options for applying filters to the results that are returned from a search. This includes things like removing duplicate alerts, throttling the number of alerts, as well as the ability to forward the alerts on to other systems like JIRA or a webhook. 411 also provides a robust way to handle responding to multiple alerts as well as an audit log to help keep track of changes to searches and alerts.
In addition to the default search functionality to utilize ELK, we’re also including some additional searches types out the box. HTTP is a lightweight Nagios alternative that you can use to alert on service outages, while Graphite allows you to query a Graphite server and alert when thresholds are exceeded (more information on Etsy’s graphite setup can be found here).
We aimed to make the code as modular as possible to make it easy to extend. We hope that you find the examples we’ve provided to be useful and informative for others to develop their own extended functionality for 411. If you do create new extensions for 411, we more than welcome reviewing your pull request to integrate new functionality into 411!
Etsy’s commitment to Open Source means we use the same version of 411 as what’s available on Github, so you can expect regular project updates. We’ve heard a lot of different ideas regarding different functionality/search types people would like to see based off our talk at Defcon, so we’d appreciate your feedback regarding how you plan on using 411 and features you’d like to see!
This post was written as a collaboration between Ken Lee and Kai Zhong. You can follow Ken on Twitter at @kennysan and you can follow Kai on Twitter at @sixhundredns.
At Etsy we have been doing some pioneering work with our Web APIs. We switched to API-first design, have experimented with concurrency handling in our composition layer, introduced strong typing into our API design, experimented with code generation, and built distributed tracing tools for API as part of this project.
We faced a common challenge: much of our logic was implemented twice. All of the code that was built for the website then had to be rebuilt in our API to be used by our iOS and Android apps.
Problem: repeated logic between platforms
We wanted an approach where we built everything on reusable API components that could be shared between the web and apps. Unfortunately our existing API framework couldn’t support this shared approach. The solution we settled on was to abandon the existing framework and rebuild it from scratch.
Follow along this case study of building an API First architecture, in which functional changes are expressed on the API level before integrating them into the website. Hear what problems prompted this drastic change. Learn which new tools we had to build to be able to work with the new system and what mistakes we made along the way. Finally, how did it end? How did the team adopt the new system and have we succeeded in our goals of API First?
This post will be the first post in a series about our current API infrastructure, which we call version 3. The series is based on a talk at QCon New York. The first post will cover concurrency, the second post will cover operations and the third post the human aspects of our API transition.
If we look into the future, it comes with lots of devices. Mainframes became desktop computers, which became portable laptops and tablets, smart phones and watches.
This trend has been going on for a while, and in order to not reinvent the world on each different device, we started sharing data via an internal API years ago.
The first version of Etsy’s API was a gateway for flash widgets. And the second one was a JSON RESTful API for 3rd parties and internal use. It was tightly coupled to the underlying database schema, and it empowered clients to make customized complex requests. It was so powerful that when we introduced our first iPad App, we did not need to write any new endpoints, and could build it solely on existing ones. Clients could request multiple resources at once, for example request shop data and also include listing data from that shop, and they could specify fields to trim down the response to just the required data. Very powerful.
Second Problem: Performance & complexity control
With great power comes great responsibility, and this approach had some drawbacks. The server code was simple, but we did not know the incoming parameters. We gave the clients control over the complexity of the request via the request parameters. This obviously had implications on server-side performance. And measuring the performance was difficult, because it was not clear if an increased response time was due to the performance of our backend, or because the client requested more resources.
Third Problem: Repetition & inconsistency
Years of changing patterns and an evolving complex codebase with MVC architecture led to bad habits: data fetch during template rendering, and logic in the templates. Our API was for AJAX, whereas the backend code was in PHP. We did not have the logic in one place that was reusable for both the Web and API. This lead to inconsistencies between API and pre-API web.
The schema of the API resource was a snapshot of the data model at the time of exposing it via the endpoint. This one-to-one mapping caused problems with data migrations, as the API resource was “frozen in time”. Should it change with the model? How long should the old resource structure be supported?
Requirements for API-first
We re-discussed the requirements for our API. If performance, manifesting for the user as latency from request to response, was a problem, what was the bottleneck?
First, the time to glass, the time until we see something on our device’s screen, as Ilya Grigorik calls it in his talk “breaking the 1000 milliseconds time to glass”, and he states that due to mobile network speed, we have only 100 milliseconds on the server side if we want to stay in budget. The second problem is that we, at Etsy, come from a sequential-shared-nothing-php-world. No built-in concurrency. How can we parallelize and reuse our work, while still keeping the network footprint low?
API v2: repeated logic between platforms API v3: reusable components
Other requirements were how to think about caching. The previous version of the API was memcached only, caching calls including parameters, which lead to a granularity problem. And one last requirement was to solve the problem starting from what we know and what we’re good at – building our own solutions in PHP.
Shaping our mental model
Based on these learnings, we piece-by-piece architected a new version, called API Version 3. REST resources worked well for both mobile apps and traditional web, so that was a keeper. A new idea was to decouple the endpoints from the framework that hosts them. Minimize the endpoints’ responsibilities to:
- declaring the route
- declaring the input expectations and the output guarantees
- implementing what happens in the endpoint.
.. and that’s about it.
We have one very simple, declarative file for each endpoint.
Everything else is architected away on purpose: StatsD error monitoring, endpoint input and output type checks, and the compilation of the full routes — all of this is handled by the framework. Authentication and access control is also handled there, based on the class of endpoint that the developer has chosen.
Enter the meta-endpoint
We picked up the industry ideas from Netflix and eBay’s ql.io of server side composition of resources into device-view-specific resources. Or in other words: allowing a second layer of endpoints that are consumers of our own API, requesting and aggregating other endpoints. This means the server itself is also a client of the API, making the server more complex, while giving it more control with an extra layer for code execution. This improves performance of the client, because it only needs to make one single request – the biggest bottleneck if we want to have a responsive mobile interface!
These requests used our generated PHP client, and they used cURL. cURL? Let’s talk about this for a bit. And let’s take a step back. The interesting question is how to bring concurrency into the single-threaded world of PHP.
cURL is cool
We’re in an HTTP context, so what about making additional HTTP requests for concurrency? We examined whether this could be done with cURL.
Some time in 2013, Paul tweeted
“curl_multi_info_read() is my new event loop.”
In a hack week project, Paul and Matt from Etsy’s core team figured out that we could in fact achieve concurrency in the HTTP layer, through parallel cURL calls with curl_multi_info read. The HTTP layer is an interesting layer for this, since there are many existing solutions for routing, load balancing and caching.
In addition to cURL, we added logic to establish dependencies on requests to other endpoints, which we call proxies. We are running the requests when the corresponding proxy becomes unblocked, similar to an event loop, which you might know from NodeJS. The whole concurrency dependency analysis and scheduling is encapsulated within one piece of software, which we call the curl callback orchestrator.
This is great, because from the endpoint author’s point of view the code looks sequential and single-threaded and is just a list of proxy calls to other endpoints. We’re getting closer to a declarative style, expressing our intent, and the orchestrator figures out how to schedule the calls that are necessary for the complete result.
Ok, so we had some good observations about the previous versions of our API, and we have a working prototype for concurrency via cURL.
How did we grow an entire new API framework from here?
Perspectives and Services
Two concepts are special about Etsy’s API v3: perspectives and services.
Perspectives clarify data access rules and give us security hints on what code is permitted for each perspective. They express on whose behalf an API call is being made. So, for example, the Public perspective shows data that a logged-out user would be able to see on Etsy.com.
The Member perspective is for calls made on behalf of a particular Etsy member. The user ID is determined via the user cookie or OAuth token, dependent on the Service, which we will talk about below. The Shop perspective is similar to the member perspective but is for a shop. The framework will verify that the given shop is owned by the authenticated user. The Admin perspective is like the member perspective but for Etsy Admin. We occasionally want to take actions from our own servers that may not fit the other perspectives. For this we have the Infrastructure perspective. It is only available on the private internal API and can be used for things such as dataset loading. The application perspective is for calls made on behalf of a particular API application. It contains the application data for the verified API key.
While perspectives express on whose behalf a call is being made, the service indicates from where the call is being made. A service can also be thought of as the entry point into the API framework. Each service has its own requirements regarding authentication. Endpoints are included in some services by default. Other services are opt-in, and each endpoint has to declare whether it wants to be exposed on those opt-in services.
An example API call
Let’s look at an example request to the etsy.com homepage. We know what the homepage looks like: sections of information that might be interesting for me, as a potential buyer. Up at the top are the listings that I favorited, then some picks that Etsy’s recommendation algorithms picked for me, new items from my favorite shops, activity from my friends, and so on. I think about it as something like this.
If we look at the data in more detail, we see even more structure. It’s like a tree, growing from left to right.
Our setup of network and servers is mirroring the structure of the API call. It starts with an HTTP request from my browser to Etsy’s web server. From there, a bespoke API request is being made to our API server, requesting a personalized version of the homepage data. Internally, this request consists of multiple concurrent components. They themselves are fetched via API requests. Such as my favorites, which are a concurrent component, because they are a large number of listing cards that can be fetched in parallel.
So we can imagine an API request as a multi-level tree, kicking off other API requests and constructing an overall result from the results of those subrequests.
Domain specific language of API endpoints
The project that got me started diving deep into Etsy’s API v3 framework was striving to unify the syntax of API endpoints. This was really fun and involved big, automated changes to unify the API codebase. In the past, there were multiple styles in which endpoints could be written. To unify them, we carved out a language of endpoint building blocks.
Some building blocks are mandatory for each endpoint. Each endpoint needs to declare its route, so we know where it should be found on the web. Also, it needs a human readable description, and a resultType.
The result type describes what type of data the endpoint returns. All data we return is JSON encoded, but here we can say that we return a primitive data type, such as a string or a boolean inside that encoding. Or we could return what we call “a typed resource” – a compound type that refers to a specific component of the Etsy application domain, such as a ListingCard.
And then there is the handle function. In there, every endpoint runs the code that it needs to run, to build its response.
Optional building blocks of an API endpoint are also possible. declareInput is only necessary if the endpoint does actually need input parameters. If it doesn’t, the function can be left out.
The includedServices function allows an endpoint to opt into specific services. The EtsyApps service is opt-in for example, so if you want to make your endpoint available on the apps, you have to opt into the EtsyApps service via this function.
And then there is the cacheTtlSeconds function, which allows you to specify whether an endpoint should be cached, and what should be it’s time to live.
Input and output: Typed parameters, typed result
The first step when a request is being routed to the endpoint, is the setup of the input parameters. We create an input object based on the request’s URL and the endpoint’s declareInput function.
The input declaration tells us how to check for optional or mandatory input parameters, which are parsed according to a pattern in the route. If a parameter is missing or of the wrong type, the framework returns an HTTP error code and message. The input declaration specifies a type for each parameter, such as a string or a user ID. The types are Etsy-specific, and each one comes with its own validation function which is being run by the framework. According to the perspective, information about the logged in user, the logged in admin, shop, or authenticated app is being checked as well, and added to the input object.
Each endpoint specifies its own output type via the resultType function. Currently, those types are optional and of different level of detail. We encourage developers to either return a primitive datatype, or to build a compound type, called typed resource, corresponding to the shape of the data that their endpoint returns. Type guarantees are useful for the API clients, and bring us one step closer to having guarantees on our data from the browser input field to the the database record.
Tooling: API compiler
We need two more pieces of software, which we can automatically compile based on the endpoint declaration files. This is the job of the API compiler. Initially, this was a script that took the routes from the endpoint declarations, together with the service and perspective information, and compiled these into full routes for apache by modifying the .htaccess files. Performance concerns were alleviated by splitting up the work and files by perspective.
This is the first post in a series of three about Etsy’s API, the abstract interface to our logic and data. The next post covers the operational side of Etsy’s API.
Etsy believes in the power of diversity. We believe that having diverse perspectives will help us make better decisions and build better products. We also know that it’s not enough to just recruit diverse talent: we’ve got to retain it!
A key to retaining diverse talent is fostering a supportive work environment. There are a lot of major organizational changes that can help (flexible work arrangements, equal pay, and opportunities for growth and leadership to name a few), but what can you—the individual—really do to help?
It sounds like you want to be an ally! An ally is a person in a position of privilege who offers to share the power, access, and authority that come with that privilege with members of a non-privileged group.
Diversity is intersectional, not limited to gender, race, or any other single axis of identity. Great news: Allyship is intersectional as well! If you’re a man, you can serve as an ally to women. If you’re white, you can serve as an ally to people of color. If you can see, you can serve as an ally to people with vision loss. Anyone can use their privilege to create opportunities for people more marginalized than themselves.
On August 11 in Dublin, Etsy software engineers Toria Gibbs and Ian Malpass will be running a workshop on being an effective male ally to people who identify as women and other underrepresented populations in tech.
One important strategy for being an effective ally is self-education. Women are frequently expected to teach introductory feminism and entertain discussions on “being a woman in tech” with anyone who asks. It’s a great burden to shoulder and frankly a waste of their time. You wouldn’t ask Rasmus to teach you how to write a Hello World program in PHP, right? No! You would go out and find the articles, tutorials, and forum threads that already exist for beginners.
With that, we introduce our list of recommended reading for allies.
Why do we need feminism? Analogies on privilege
Opinion pieces, personal experiences
Other fun stuff
While this list is not exhaustive, it should be more than enough to get you started on your journey. Happy learning!
If you’re interested in hosting your own event to promote male allyship, we recommend checking out NCWIT’s Male Allies and Advocates Toolkit or Ada Initiative’s Ally Skills Workshop.
You can read more about Etsy’s diversity in our latest annual Diversity and Equality Progress Report.
Update 08/12/2016: Slides from Toria and Ian’s presentation are now available on Speaker Deck!
Spring has sprung and we’re back to share how Etsy’s performance fared in Q1 2016. In order to analyze how the site’s performance has changed over the quarter, we collected data from a week in March and compared it to a week’s worth of data from December. A common trend emerged from both our back-end and front-end testing this quarter: we saw significant site-wide performance improvements across the board.
Several members of Etsy’s performance team have joined forces to chronicle the highlights of this quarter’s report. Moishe Lettvin will start us off with a recap of the server-side performance, Natalya Hoota will discuss the synthetic front-end changes and Allison McKnight will address the real user monitoring portion of the report. Let’s take a look at the numbers and all the juicy context.
Server-side time measures how long it takes our servers to build pages. These measurements don’t include any client-side time — this measurement is the amount of time it takes from receiving an HTTP request for a page to returning the response.
As always, we start with these metrics because they represent the absolute lowest bound for how long it could take to show a page to the user, and changes in these metrics will be reflected in all our other metrics. This data is calculated by a random sample of our webserver logs.
Happily, this quarter we saw site-wide performance improvements, due to our upgrade to PHP7. While our server-side time isn’t spent solely running PHP code (we make calls to services like memcache, MySQL, Redis, etc.), we saw significant performance gains on all our pages.
One of the primary ways that PHP7 increases performance is by decreasing memory usage. In the graph below, the blue & green lines represent 95th percentile of memory usage on a set of our PHP5 web servers, while the red line indicates the memory usage on a set of our PHP7 web servers during our switch-over from PHP5 to PHP7. This is two days’ worth of data; the variance in memory usage (more visible in the PHP5 data) is due to daily variation in server load. Note also that the y-axis origin is 0 on this graph — the decrease shown in this graph is to scale!
Another interesting thing that’s visible in the box plots above is how the homepage server-side timing changed. It sped up across the board, but the distribution widened significantly. The reason for this is that the homepage gets many signed-out views as well as signed-in views, as contrasted with, for instance, the cart view page. In addition, the homepage is much more customized for signed-in users than, say, the listing page. This level of customization requires calls to our storage systems, which weren’t sped up by the PHP7 upgrade. Therefore, the signed-in requests didn’t speed up as much as the signed-out requests, which are constrained almost entirely by PHP speed. In the density plot below, you can see that the bimodal distribution of the homepage timing became more distinct in Q1 (red) vs Q4 of last year (blue):
The Baseline page gives us the clearest view into the gains from PHP7 — the page does very little outside of running PHP code, and you can see in the first chart above that the biggest impact was on the Baseline page. The median time went from 80ms to 29ms, and the variance decreased.
Synthetic Start Render
We collect synthetic (i.e., obtained by web browser emulation of scripted web visits) monitoring results in order to cross-check our findings for two other sets of data. We use a third party provider, Catchpoint, to record start render time — a moment when a user first sees content appearing on the screen — as a reference point.
Synthetic tests showed that all pages got significantly faster in Q1.
It is worth noting that the data for synthetic tests is collected for signed-out web requests. As previously mentioned, this type of request involves fetching less state from storage, which highlights PHP7 wins.
I noticed the unusually far-reaching outliers on the shop page and search pages and decided to investigate further. At first, I generated scatterplots for the two pages in question, which clarified that outliers on their own did not have a story to tell — no clustering patterns or high volumes on any particular day. Having a better data sanitation process would have eliminated the points in question altogether.
In order to get a better visual for the synthetic data we used, I took comparative scatterplots of all six pages we monitor. I noticed a reduction of failed tests (marked by red diamonds) that happened sometime between Q4 and present day. Remarkably, we have never isolated that data set before. A closer look revealed an important source of many current failures: fetching third party resources was taking longer than the maximum time allotted by Catchpoint. That encouraged us to consider a new metric for monitoring the impact of third party resources.
Real User Page Load Time
Gathering front-end timing metrics from real users allows us to see the full range of our pages’ performance in the wild. As in past reports, we’re using page load metrics collected by mPulse. The data used to generate these box plots is the median page load time for each minute of one week in each quarter.
We see that for the most part, page load times have gone down for each page. The boxes and whiskers of each plot have moved down, and the outliers tend to be faster as well. But it seems that we’ve also picked up some very slow outliers: each page has a single outlier of over 10 seconds, while the rest of the data points for all pages are far under that. Where did these slow load times come from?
Because each of our RUM data points represents one minute in the week and there is exactly one extremely slow outlier for each page, it makes sense that these points might be all from the same time. Sure enough, looking at the raw numbers shows that all of the extreme outliers are from the same minute!
This slow period happened when a set of boxes that queue Gearman jobs were taken out to be patched for the glibc vulnerability that surfaced this quarter. We use Gearman to run a number of asynchronous jobs that are used to generate our pages, so when the boxes were taken out and fewer resources were available for these jobs, back-end times suffered.
One interesting thing to note is that we actually didn’t notice when this happened. The server-side times (and therefore also the front-end times) for most of our pages suffered an extreme regression, but we weren’t notified. This is actually by design!
Sometimes we experience a blip in page performance that recovers very quickly (as was the case with this short-lived regression). It makes little sense to scramble to understand an issue that will automatically resolve within a few minutes — by the time we’ve read and digested the alert, it may have already resolved itself — so we have a delay built in to our alerts. We are only alerted to a regression if a certain amount of time has passed and the problem hasn’t been resolved (for individual pages, this is 40 minutes; when a majority of our pages’ performance degrades all at once, the delay is much shorter).
This delay ensures that we won’t scramble to respond to an alert if the problem will immediately fix itself, but it does mean that we don’t have any insight into extreme but short-lived regressions like this one, and that means we’re missing out on some important information about our pages’ performance. If something like this happens once a week, is it still something that we can ignore? Maybe something like this happens once a day — without tracking short-lived regressions, we don’t know. Going forward, we will investigate different ways of tracking short-lived regressions like this one.
You may also have noticed that while the slowdown that produced these outliers originated on the server side, the outliers are missing from both the server-side and synthetic data. This is because of the different collection methods that we use for each type of data. Our server-side and synthetic front-end datasets each contain 1,000 data points sampled from visits throughout the week (with the exception of the baseline page, which has a smaller server-side dataset because it receives fewer visits than other pages). This averages to only 142 data points per day — far under one datapoint per minute — and so it’s likely that no data from the short regression made it into the synthetic or server-side datasets at all. Our front-end RUM data, on the other hand, has one datapoint — the median page load time — for every minute, so it was guaranteed that the regression would be represented in that dataset as long as at least 50% of views were affected.
The nuances in the ways that we collect each metric are certainly very interesting, and each method has its pros and cons (for example, heavily sampling our server-side and synthetic front-end data leads to a narrower view of our pages’ performance, but collecting medians for our front-end RUM data to display in box plots is perhaps statistically unsound). We plan to continue iterating on this process to make it more appropriate to the report and more uniform across our different monitoring stacks.
In the first quarter of 2016, performance improvements resulting from the server-side upgrade to PHP7 trickled down through all our data sets and the faster back-end times translated to speed ups in page load times for users. As always, the process of analyzing the data for the sections above uncovered some interesting stories and patterns that we may have otherwise overlooked. It is important to remember that the smaller stories and patterns are just as valuable of learning opportunities as the big wins and losses.