SEO Title Tag Optimization at Etsy: Experiment Design and Causal Inference

Posted on October 25, 2016

External search engines like Google and Bing are a major source of traffic for Etsy, especially for our longer-tail, harder-to-find items, and thus Search Engine Optimization (SEO) is important in driving efficient listing discovery on our platform.

We want to make sure that our SEO strategy is data-driven and that we can be highly confident that whatever changes we implement will bring about positive results. At Etsy, we constantly run experiments to optimize the user experience and discovery across our platform, and we therefore naturally turned to experimentation for improving our SEO performance. While it is relatively simple to set up an experiment on-site on our own pages and apps, running experiments with SEO required changing how Etsy’s pages appeared in search engine results, over which we did not have direct control.

To overcome this limitation, we designed a slightly modified experimental design framework that allows us to effectively test how changes to our pages affect our SEO performance. This post explains the methodology behind our SEO testing, the challenges we have come across, and how we have resolved them.

Experiment Methodology

For one of our experiments, we hypothesized that changing the titles our pages displayed in search results (a.k.a. ‘title tags’) could increase their clickthrough rate. Etsy has millions of pages generated from user-generated content that were suitable for a test. Many of these pages also receive the majority of their traffic through SEO.

Below is an example of a template we used when setting up a recent SEO title tag experiment.


We were inspired by SEO tests at Pinterest and Thumbtack and decided to set up a similar experiment where we randomly assigned our pages into different groups and applied different title tag phrasings shown above. We would measure the success of each test group by how much traffic it drove relative to the control groups. In this experiment, we also set up two control groups to have a higher degree of confidence in our results and to be able to quality check our randomized sampling once the experiment began.


We took a small sample of pages of a similar type while ensuring that our sample was large enough to allow us to reach statistical significance within a reasonable amount of time.


Because visits to individual pages are highly volatile, with many outliers and fluctuations from day to day, we had to create relatively large groups of 1,000 pages each in order to reach significance quickly. Furthermore, because of the high degree of variance across our pages, simple random sampling produced test groups that differed from each other in a statistically significant way even before the experiment began.

To ensure our test groups were more comparable to each other, we used stratified sampling: we first ranked the pages in the test by visits, broke them down into ntile groups, and then randomly assigned the pages from each ntile group to the test groups, making sure that every test group received a page from each ntile group. This ensured that our test groups were consistently representative of the overall sample and more reliably similar to each other.
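The assignment step described above can be sketched roughly as follows. This is a minimal illustration rather than the code used in the experiment: the function name, the page ids, the visit counts, and the choice to make each stratum exactly as large as the number of groups are all assumptions made for the example.

```python
import random


def stratified_assign(page_visits, group_names, seed=0):
    """Assign pages to groups so each group gets one page per visit-ranked stratum.

    page_visits: dict mapping a page id to its historical visit count.
    group_names: list of test/control group labels.
    """
    rng = random.Random(seed)
    k = len(group_names)
    # Rank pages from most to least visited.
    ranked = sorted(page_visits, key=page_visits.get, reverse=True)
    assignment = {}
    # Walk the ranking in consecutive strata of k pages each (an assumed stratum size).
    for start in range(0, len(ranked), k):
        stratum = ranked[start:start + k]
        shuffled_groups = group_names[:]
        rng.shuffle(shuffled_groups)
        # zip stops at the shorter sequence, so a final partial stratum is handled.
        for page, group in zip(stratum, shuffled_groups):
            assignment[page] = group
    return assignment


# Hypothetical data: 12 pages with heavy-tailed visit counts, 3 groups.
visits = {f"page_{i}": v for i, v in enumerate(
    [9000, 8500, 8000, 900, 850, 800, 90, 85, 80, 9, 8, 7])}
groups = ["control_a", "control_b", "test_1"]
assignment = stratified_assign(visits, groups)
```

Because every stratum contributes one page to every group, no group can end up holding all of the highest-traffic pages, which is the failure mode of simple random sampling on heavy-tailed data.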


We then looked at the statistical metrics for each test group over the preceding time period, calculating the mean and standard deviation values by month and running t-tests to ensure the groups were not different from each other in a statistically significant way. All test groups passed this test.
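A pre-experiment balance check of this kind might look like the sketch below. It assumes Welch's unequal-variance t-test and, because the groups are large, approximates the p-value with a normal distribution; the post does not say which t-test variant was used, and the visit data here is randomly generated purely for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and a two-sided p-value (normal approximation)."""
    mean_a, mean_b = mean(sample_a), mean(sample_b)
    # Standard error of the difference in means, without assuming equal variances.
    se = sqrt(variance(sample_a) / len(sample_a)
              + variance(sample_b) / len(sample_b))
    t = (mean_a - mean_b) / se
    # With roughly 1,000 pages per group the t distribution is close to normal.
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


# Hypothetical pre-period visits per page for two groups of 1,000 pages.
rng = random.Random(42)
group_a = [rng.lognormvariate(3, 1) for _ in range(1000)]
group_b = [rng.lognormvariate(3, 1) for _ in range(1000)]

t_stat, p_value = welch_t(group_a, group_b)
# Treat the groups as comparable if we cannot reject equal means at the 5% level.
groups_comparable = p_value >= 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, comparable = {groups_comparable}")
```

In practice the same check would be repeated for every pair of groups and for each pre-period month.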


Estimating Causal Impact

Although the test groups in our experiment were not different from each other at a statistically significant level before the experiment, there were small differences that prevented us from estimating the exact causal impact post-treatment. For example, test group XYZ might see an increase relative to control B, but if control B was slightly better than test group XYZ even before the experiment began, simply taking the difference between the two groups would not be the best estimate of the effect of the treatment.

One common approach to resolve this problem is to calculate the difference in differences between the test and control groups pre- and post-treatment, that is, the change in the test group minus the change in the control group over the same period.
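As a small worked example of that calculation, the sketch below uses made-up daily visit numbers in which the control group starts slightly ahead of the test group. The function name and all figures are hypothetical.

```python
from statistics import mean


def difference_in_differences(test_pre, test_post, control_pre, control_post):
    """Estimate the treatment effect as (test change) minus (control change)."""
    test_change = mean(test_post) - mean(test_pre)
    control_change = mean(control_post) - mean(control_pre)
    return test_change - control_change


# Hypothetical daily visits; the control group is slightly higher pre-treatment.
test_pre, test_post = [100, 102, 98], [115, 117, 113]
control_pre, control_post = [105, 107, 103], [110, 112, 108]

# Comparing only the post-treatment means ignores the pre-existing gap.
naive_effect = mean(test_post) - mean(control_post)
did_effect = difference_in_differences(test_pre, test_post, control_pre, control_post)
print(naive_effect, did_effect)  # the naive estimate is 5, the DiD estimate is 10
```

The naive comparison understates the lift because it counts the control group's head start against the treatment.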

While this approach would have worked well, it might have produced two different estimated treatment effect sizes when comparing the test groups against the two different control groups. We decided that, instead, using Bayesian structural time series analysis to create a synthetic control group incorporating information from both control groups would provide a cleaner analysis of the results.

In this approach, a machine learning model is trained using pre-treatment data to predict the performance of each test group based on its covariance relative to its predictors — in our case, the two control groups. Once the model is trained, it is used to generate the counterfactual, synthetic control groups for each of the test groups, simulating what would have happened had the treatment not been applied.

The causal impact analysis in this experiment was implemented using the CausalImpact package by Google.
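To make the idea of a synthetic control concrete, here is a deliberately simplified sketch. It is not CausalImpact and not a Bayesian structural time series model: it swaps in ordinary least squares, fitted on the pre-treatment period, to predict a test group from the two control groups, and it produces no credible intervals. All function names and visit numbers are hypothetical.

```python
def fit_ols(rows, targets):
    """Least-squares coefficients for targets ~ rows, via the normal equations."""
    n_features = len(rows[0])
    # Build the augmented matrix [X^T X | X^T y].
    aug = []
    for i in range(n_features):
        row = [sum(r[i] * r[j] for r in rows) for j in range(n_features)]
        row.append(sum(r[i] * y for r, y in zip(rows, targets)))
        aug.append(row)
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n_features):
        pivot = max(range(col, n_features), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n_features):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][-1] / aug[i][i] for i in range(n_features)]


def synthetic_control(control_a, control_b, test, n_pre):
    """Fit on the first n_pre points, then predict the untreated post-period."""
    design = [[1.0, a, b] for a, b in zip(control_a, control_b)]
    coefs = fit_ols(design[:n_pre], test[:n_pre])
    counterfactual = [sum(c * x for c, x in zip(coefs, row))
                      for row in design[n_pre:]]
    # Pointwise effect: observed visits minus predicted untreated visits.
    effects = [obs - pred for obs, pred in zip(test[n_pre:], counterfactual)]
    return counterfactual, effects


# Hypothetical daily visits: 6 pre-treatment days followed by 2 treated days.
control_a = [100, 110, 105, 120, 115, 125, 130, 128]
control_b = [90, 95, 100, 98, 110, 105, 115, 120]
# The test group tracks the controls exactly, then gains 10 visits per day.
test = [98, 105.5, 105.5, 112, 115.5, 118, 135.5, 137]

counterfactual, effects = synthetic_control(control_a, control_b, test, n_pre=6)
print([round(e, 2) for e in effects])
```

Using both control groups as predictors yields a single counterfactual per test group, which is what avoids the two competing estimates that a pairwise difference in differences would give.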


We started seeing the effects of our test treatments as soon as a few days after the experiment start date. Even seemingly very subtle title tag changes resulted in large and statistically significant changes in traffic to our pages.

In some test groups, we saw significant gains in traffic.


While in others, we saw no change.


And in some others, we even saw a strong negative change in traffic.


A-A Testing

The two control groups in this test showed no statistically significant difference compared to each other after the experiment. Although a slight change was detected, the effect did not reach significance.

Post-experiment rollout validation

Once we identified the best performing title tag, the treatment was rolled out across all test groups. The other groups experienced similar lifts in traffic and the variance across buckets disappeared, further validating our results.


The fact that our two control groups saw no change when compared to each other, and the fact that the other buckets experienced the same improvement in performance once the best performing treatment was applied to them, gave us a strong basis for confidence in the validity of our results.


It appeared in our results that shorter title tags performed better than longer ones. This might be because for shorter, better targeted title tags, there is a higher probability of a percentage match (that could be calculated using a metric like the Levenshtein Distance between the search query and the title tag) against any given user’s search query on Google.

In a similar hypothesis, it might be that using well-targeted title tags that are more textually similar to common search terms helps to increase percentage match to Google search terms and therefore improves ranking.
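The kind of percentage match mentioned above could be computed along the following lines. This is only a sketch of the hypothesis: how search engines actually score titles is not known to us, and normalizing the Levenshtein distance by the length of the longer string is an assumption made for the example, as are the query and titles.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits that turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def percentage_match(query, title):
    """Similarity in [0, 1]; 1.0 means identical strings (case-insensitive)."""
    query, title = query.lower(), title.lower()
    longest = max(len(query), len(title))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(query, title) / longest


# Hypothetical search query and two candidate title tags.
query = "silver ring"
short_title = "Silver Rings"
long_title = "Silver Rings - Handmade Jewelry for Every Occasion"

short_score = percentage_match(query, short_title)
long_score = percentage_match(query, long_title)
print(f"short: {short_score:.2f}, long: {long_score:.2f}")
```

Under this measure, every extra word in the title that is absent from the query adds to the edit distance, so a shorter, well-targeted title scores higher against the same query.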

However, it is likely that different strategies work well for different websites, and we would recommend rigorous testing to uncover the best SEO strategy tailored for each individual case.



Image Credits:

Visualization of Stratified Sampling

Why don’t you just ask the sellers instead of constantly using our shops as guinea pigs? We optimize and know what works but then you perform these tests and throw everything out of whack.
Some of us actually can figure it out on our own.
Searchers (potential buyers) now click on a link from an outside search like Google and instead of bringing them direct to that item, you bring up the market page search results where the item they were interested in half the time is not even appearing. One lost sale opportunity at a time.
Now you create a generic line: “Browse unique items from SARANTOS on Etsy, a global marketplace of handmade, vintage and creative goods” and display that on my shop link that appears in Google instead of my customized SEO related line that I formerly had from my announcement section. Mine was: “Modern jewelry designs in rose gold, yellow gold, sterling silver and precious gems handcrafted by independent contemporary studio jeweler Susan Sarantos.” Which of those do you think will invite more clicks for people looking for the items that I sell? Think how many individual shops on Etsy and their unique customized lines that you are losing that opportunity to attract potential buyers to the site. These are target customers that miss our shops because you use a single generic line instead. Think how many search words and search results you are missing out on using that generic phrase instead of the individual ones from each shop. I really believe these two changes are the major reason for drops in views.

Shorter title tags perform better because more people use them than the longer ones but the few people who do search for the longer title tags want what actually comes up to be what the longer title tag is, that is why they use the longer title tags. Those are the ones that will most likely purchase what they find.

By the way… Whatever Etsy was doing in March was working really well. After that not so good until a few days ago when things seemed to return to (my old) normal again.
Susan Sarantos

    Hello Susan,
    Thank you for your feedback! I totally agree that seller-generated title tags could perform better than generic tags. I also agree with your observation that longer title tags could in some cases be better. We will take into account your thoughts in our future work, and will do our best to run our tests mindfully so as not to disrupt the sellers on Etsy.
    Thank you,


Thanks for these A/A tests on SEO title tags!

“It appeared in our results that shorter title tags performed better than longer ones.” I agree with that. In the tests I ran on my own, I got a better CTR on shorter title tags than on longer ones.

Amazing! Nice content. SEO title tag optimization got to learn something for sure!
Very interesting A/B testing approach. Would you mind sharing which title structure won? 🙂

    We are not able to share the winning variants, but I think the post gives good pointers that will enable you to discover the best title tags for your website. Good luck!

      Just use the way back machine and look through the main category pages. Alternatively have a look at the increase in folder visibility using search metrics.

That’s a really interesting study, great that you have shared it in detail so others can benefit too – thanks!

Hi Bill,

thank you for this comprehensive analysis. Lots of interesting findings to keep in mind.


    Hi Christian, Glad you found the post helpful! 🙂

Because Etsy’s search is so literal, and rewards exact title/tag matches, I have been practicing “Title Smearing” (my own term), in which I overlap keyword phrases to maximize my exact title/tag matches.

For example, a title of mine might read;
‘rhinestone choker necklace black velvet choker with charm…,’ which allows me to exactly match “rhinestone choker” “choker necklace” “necklace black” “black velvet choker” “choker with charm….,” all of which are popular keyword phrases in search, and by linking these phrases with a shared keyword, I have been able to increase the number of exact title/tag matches in each listing.

It works well, however it is unreadable and ugly. Because of my success with “Title Smearing,” I hesitate to change. Your article mentions the Levenshtein Distance, and the test’s success with shorter titles. Would shorter titles, with a low Levenshtein distance be more effective, even though I would not always be able to get exact tag/title matches, than my current system?

    Our internal search within Etsy works differently from Google’s search engine, so I wouldn’t necessarily extrapolate these results to search within Etsy, but you are welcome to test out your hypothesis! 🙂

Well-documented post, Bill Ulammandakh. Interesting to see your use of the CausalImpact package by Google. I am both a bit surprised and pleased with the results of your study. Thanks for sharing it.

Thanks Bill for sharing this experiment. I appreciate that. We’ve also experimented and observed that “Shorter title tags performed better than longer ones” in most cases… and I also believe in having the “Brand Name” in the title tag.

Are you able to share the average character lengths of the before and after titles, to give a sense of the length changes? Thank you for sharing your findings, very insightful!

    Thank you, Alex! I’m not able to disclose the lengths, but we would at the longest only use title tags that will fit inside a standard Google search results page. The specific length that will work best for you will also depend on your content and your popular target search terms, so the optimal length will vary from case to case. Good luck!

Hey Bill,

Excellent read. I have a couple of questions:

1. Did you use the rga library as mentioned in the guide? Are there any specific settings used in your test?

2. Variant A has seen significant gains in traffic. Could you provide more details on how big these gains were? Percentage-wise?


    Thanks, Mihai!

    1. I’m not familiar with the lunametrics guide you have mentioned, but it looks like it’s related to Google Analytics? We only write code in SQL, R and PHP for these types of tests and might consider building out more internal tooling in the future.
    2. I’m not able to disclose the percentage gains, but I can say that even small changes can result in quite large effects!

I’m just glad there’s someone out there to do all this testing and write it up nice and neat so I don’t have to. Ha!

Nice work.

When you finally find the best schema for building titles, you can use automation to quickly optimize them on a bigger scale 🙂

    Nice article! Great to see your statements backed up by testing.

    How can that be (Chris)? Did you check their domain authority?

    The edge that SEO experts can offer over amateurs is the ability to take overwhelming amounts of data and make meaningful decisions from it.

Shorter title tags look better in search engines; being relevant and informative is a plus. *CMIIW

Geez could have told you this before the test…just 20 years of doin SEO told me that ages ago. The thing you didn’t mention was that short title is good in some places where a generic category is the target or a specific product. You also want to have pages with longer titles that can rank for longtail. But agree if you are trying to rank for used cars “Buy used cars” is the title you want. This also proves again that Title IS the most important HTML element on the page… even a single incoming link from an authority site won’t move the needle as much as a rocking good title and is a damn sight easier to do!

    Thank you for your comment! It’s good to see that our results agree with your experience! In this particular experiment, shorter title tags performed better, but I agree that it is definitely possible that longer titles can perform better in some cases as well. 🙂

Great testing analysis and thanks for sharing. Wondering if you can elaborate more on the testing metric(s). At the beginning you mention increasing CTR, yet in your results you are charting Visits. While visits could be an indicator of changes in organic click-through rate, CTR is also dependent upon Impressions and affected by organic rankings. While it’s clear the changes to title tags increased visits (assuming you segmented organic here), CTR may not be the only contributing factor here (if at all really). ~Thanks!

    Thank you! Glad you found the post helpful. That is a very good catch! The experiment was originally motivated by a goal to increase CTR, but we eventually ended up deciding on visits as our key metric as that was closer to our ultimate goal and was more straightforward to measure. As with most marketing experiments, we also analyzed the impact on conversions and revenue, although that data is not shown here.

Thanks for this Bill. Many of my clients have limited pages to experiment with. This insight will help me help them generate more and hopefully better traffic.


    Thank you, Lyndon! Glad you found the article helpful.

The only thing missing is the conversion information for each of the control groups and variants. What good does more traffic do if it drops conversion? This seems to be an uncompleted exercise. While traffic may be your goal, Susan Sarantos’s point is very valid: they are there for conversions.

    Hi Paul, Thank you for your comment! We couldn’t disclose conversions and sales due to legal restrictions, but we do look at those metrics for our experiments and would not roll out a change if it hurt conversions.

Thank you SO MUCH for sharing your experiment so eloquently with us! I often see reports and papers where the methods used to reach the SEO suggestions/ click bait are the very definition of vague.

In other words, this was a very refreshing read, and I look forward to hearing more from you!

    Thank you for your kind words! 🙂 We are looking forward to sharing our future findings as well!

That’s probably the most detailed Title test I’ve ever seen. Fantastic job guys and really interesting results.

OMG! This test is really wonderful and you guys described it so well. Keep it up. I am an SEO intern and this is really cool to understand. I think the title tag is one of the most important parts of on-page SEO, and getting it done precisely can surely make a difference. I usually follow a strategy for making a title: if the title has a long-tail search term/keyword in it, then I think we don’t need to add extra words/fillers just to increase the title length because, as you described, it would increase the Levenshtein distance, leading to lower ranking. Again, thanks for sharing!

    Thank you! Glad you found it helpful! 🙂

How does this apply to the recent change that Google implemented with the longer format of title and description? Great study, Thank you.

    Thank you! We don’t have data to share right now on how the results might be affected by the change you mentioned, but we will be sharing more of our learnings in the future, so stay tuned! 🙂