SEO Title Tag Optimization at Etsy: Experiment Design and Causal Inference
External search engines like Google and Bing are a major source of traffic for Etsy, especially for our longer-tail, harder to find items, and thus Search Engine Optimization (SEO) is important in driving efficient listing discovery on our platform.
We want to make sure that our SEO strategy is data-driven and that we can be highly confident that whatever changes we implement will bring about positive results. At Etsy, we constantly run experiments to optimize the user experience and discovery across our platform, and we therefore naturally turned to experimentation for improving our SEO performance. While it is relatively simple to set up an experiment on-site on our own pages and apps, running experiments with SEO required changing how Etsy’s pages appeared in search engine results, over which we did not have direct control.
To overcome this limitation, we designed a slightly modified experimental design framework that allows us to effectively test how changes to our pages affect our SEO performance. This post explains the methodology behind our SEO testing, the challenges we have come across, and how we have resolved them.
For one of our experiments, we hypothesized that changing the titles our pages displayed in search results (a.k.a. ‘title tags’) could increase their clickthrough rate. Etsy has millions of pages generated off of user generated content that were suitable for a test. Many of these pages also receive the majority of their traffic through SEO.
Below is an example of a template we used when setting up a recent SEO title tag experiment.
We were inspired by SEO tests at Pinterest and Thumbtack and decided to set up a similar experiment where we randomly assigned our pages into different groups and applied different title tag phrasings shown above. We would measure the success of each test group by how much traffic it drove relative to the control groups. In this experiment, we also set up two control groups to have a higher degree of confidence in our results and to be able to quality check our randomized sampling once the experiment began.
We took a small sample of pages of a similar type while ensuring that our sample was large enough to allow us to reach statistical significance within a reasonable amount of time.
Because visits to individual pages are highly volatile, with many outliers and fluctuations from day to day, we had to create relatively large groups of 1000 pages each to expect to reach significance quickly. Furthermore, because of the high degree of variance across our pages, simple random sampling of our pages into test groups was creating test groups different from each other in a statistically significant way even before the experiment began.
To ensure our test groups were more comparable to each other, we used stratified sampling, where we first ranked the the pages to be a part of the test by visits, broke them down into ntile groups and then randomly assigned the pages from each ntile group into one of the test groups, ensuring to take a page from each ntile group. This ensured that our test groups were consistently representative of the overall sample and more reliably similar to each other.
We then looked at the statistical metrics for each test group over the preceding time period, calculating the mean and standard deviation values by month and running t-tests to ensure the groups were not different from each other in a statistically significant way. All test groups passed this test.
Estimating Causal Impact
Although the test groups in our experiment were not different from each other at a statistically significant level before the experiment, there were small differences that prevented the estimation of the exact causal impact post treatment. For example, test group XYZ might see an increase relative to control B, but if Control B was slightly better than test groups XYZ even before the experiment began, simply taking the difference between of the two groups would not be the best estimate of the difference the treatment had effected.
One common approach to resolve this problem is to calculate the difference of differences between the test and control groups pre- and post-treatment.
While this approach would have worked well, it might have created two different estimated treatment effect sizes when comparing the test groups against the two different control groups. We decided that, instead, using Bayesian structural time series analysis to create a synthetic control group incorporating information from both the control groups would provide a cleaner analysis of the results.
In this approach, a machine learning model is trained using pre-treatment data to predict the performance of each test group based on its covariance relative to its predictors — in our case, the two control groups. Once the model is trained, it is used to generate the counterfactual, synthetic control groups for each of the test groups, simulating what would have happened had the treatment not been applied.
The causal impact analysis in this experiment was implemented using the CausalImpact package by Google.
We started seeing the effects of our test treatments as soon as a few days after the experiment start date. Even seemingly very subtle title tag changes resulted in large and statistically significant changes in traffic to our pages.
In some test groups, we saw significant gains in traffic.
While in others, we saw no change.
And in some others, we even saw a strong negative change in traffic.
The two control groups in this test showed no statistically significant difference compared to each other after the experiment. Although a slight change was detected, the effect did not reach significance.
Post-experiment rollout validation
Once we identified the best performing title tag, the treatment was rolled out across all test groups. The other groups experienced similar lifts in traffic and the variance across buckets disappeared, further validating our results.
The fact that our two control groups saw no change when compared to each other, and also the fact that the other buckets experienced the same improvement in performance once the best performing treatment was applied to them gave us strong basis for confidence in the validity of our results.
It appeared in our results that shorter title tags performed better than longer ones. This might be because for shorter, better targeted title tags, there is a higher probability of a percentage match (that could be calculated using a metric like the Levenshtein Distance between the search query and the title tag) against any given user’s search query on Google.
In a similar hypothesis, it might be that using well-targeted title tags that are more textually similar to common search terms helps to increase percentage match to Google search terms and therefore improves ranking.
However, it is likely that different strategies work well for different websites, and we would recommend rigorous testing to uncover the best SEO strategy tailored for each individual case.
- Have two control groups for A-A testing. This allowed us to have much greater confidence in our results.
- The CausalImpact package can be used to easily account for small differences in test vs. control groups and estimate the differences of treatments more accurately.
- For title tags, it is most likely a best practice to use phrasing and wording that would maximize the probability of a low Levenshtein distance match from popular target search queries on Google