Multilingual User Generated Content and SEO

Posted on December 2, 2011

By: Lacy Rhoades & David Bernal

Etsy offers a lot of items from a lot of sellers worldwide. Now that we’ve started to better support our international members, Etsy also comes in a variety of languages. We rely heavily on search engines to bring people looking for something unique to our Sellers and their Etsy Shops. As luck would have it, search engines speak a variety of languages too.

Who speaks our language?

A persistent cookie, initially set through our language detection logic, determines what language we show users on Etsy. This makes for the best experience by allowing us to present what we refer to as a “unified marketplace” and avoid a segmented set of items for separate geographic regions. All content on Etsy, including all Etsy Shops, Listings and features (no matter what language), is found at Etsy.com and there is no segmentation. Piece of cake. Right?

What does this mean for search results?

Interesting question: What does a single unified multilingual marketplace mean for search providers like Google, Bing or Yahoo? As we came to find out, it means that when said providers come along to crawl and index our amazing content, they only browse the pages of Etsy in English. This is a problem!

We ask our users, “Hey! What language would you like to see?”

Unfortunately, search crawlers don’t care to answer such inquiries. Search engine bots, in their desire to see all content without prejudice, don’t send an Accept-Language header and aren’t signed-in users with profile locations, and it hardly makes sense to use geo-IP lookups to determine their location. Because of this, all content shows up in English for automated search crawlers, which means our foreign language search results and page ranks were extinct long before they could ever be created.
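To make the failure mode concrete, here is a minimal sketch of a detection chain in Python. The order of the signals, the `SUPPORTED` list, and the function names are all assumptions for illustration, not Etsy's actual logic, and geo-IP is left out. The point is only that a request carrying no cookie, no profile and no Accept-Language header falls straight through to the default.

```python
from typing import List, Optional

SUPPORTED = ("en", "de", "fr", "it", "nl")  # hypothetical set of site languages
DEFAULT = "en"


def parse_accept_language(header: str) -> List[str]:
    """Return primary language subtags ordered by q-value, highest first."""
    weighted = []
    for position, part in enumerate(header.split(",")):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0  # a tag without an explicit q-value counts as q=1
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        primary = tag.strip().split("-")[0].lower()
        # Sort by descending quality, then by original position.
        weighted.append((-quality, position, primary))
    return [lang for _, _, lang in sorted(weighted)]


def detect_language(cookie_lang: Optional[str] = None,
                    profile_lang: Optional[str] = None,
                    accept_language: Optional[str] = None) -> str:
    """Pick a display language; the signal order here is an assumption."""
    if cookie_lang in SUPPORTED:
        return cookie_lang
    if profile_lang in SUPPORTED:
        return profile_lang
    for lang in parse_accept_language(accept_language or ""):
        if lang in SUPPORTED:
            return lang
    # A crawler supplies none of the signals above and lands here.
    return DEFAULT
```

Calling `detect_language()` with no arguments, which is roughly what a crawler's request amounts to, returns the default language.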

Search providers must first establish what region or language the searcher is interested in, but they must also establish what region or language the search results are in. Our goal is to make it as easy as possible for search bots to do just that.

To simulate the way search engines crawl Etsy you can use curl to retrieve the contents of an Etsy URL. At this point in our work with multilingual content, there was no way to simply “curl” a URL on Etsy and ever receive anything but English content.
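The same check can be scripted. The sketch below uses Python's standard library instead of curl; the function names and the User-Agent string are made up for illustration. It builds a request the way a cookie-less crawler would: no Cookie header and no Accept-Language header.

```python
import urllib.request


def crawler_request(url: str) -> urllib.request.Request:
    """Build a GET request that carries only a User-Agent header."""
    # "example-crawler/0.1" is a placeholder, not a real crawler's identifier.
    return urllib.request.Request(url, headers={"User-Agent": "example-crawler/0.1"})


def fetch_as_crawler(url: str, timeout: float = 10.0) -> str:
    """Fetch a page without cookies or language hints and return its body."""
    with urllib.request.urlopen(crawler_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Fetching the same listing path from different hosts with `fetch_as_crawler` and comparing the bodies is one way to see which language a crawler would be served.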

A simple response to this is to publish our content at different addresses for different languages. The simple unified Etsy.com would remain as the marketplace for real users, but for search crawlers we would publish simple multilingual domains like de.etsy.com where all content would appear in German, or it.etsy.com where content would appear in Italian.
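As a rough sketch of the idea, the language can be read directly off the host name, so a crawler never has to answer any questions. The `LANGUAGE_SUBDOMAINS` set and the `language_for_host` helper are hypothetical names for illustration.

```python
from typing import Optional

LANGUAGE_SUBDOMAINS = {"de", "fr", "it", "nl"}  # hypothetical list of language hosts


def language_for_host(host: str) -> Optional[str]:
    """Return the language implied by a host such as de.etsy.com, else None."""
    hostname = host.split(":")[0].lower()  # drop any port
    labels = hostname.split(".")
    if len(labels) == 3 and labels[1:] == ["etsy", "com"] and labels[0] in LANGUAGE_SUBDOMAINS:
        return labels[0]
    # None means the unified site, where the usual detection applies.
    return None
```

A request to `de.etsy.com` yields `"de"`, while `www.etsy.com` yields `None` and falls back to cookie-based detection.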

Search engines start saying, “Hey! What’s the big idea!?”

The majority of our content published across these multiple domain names is still only in English and as it turns out, search engine providers really do not like it when you multiplex copies of identical content across multiple addresses and multiple domain names.

Thankfully this problem (like any other issue) has been seen before and there’s a well thought out, convenient solution. The major search providers have established a way to say (in your site’s HTML code) that some content is identical in part or in whole to other content elsewhere. This code also allows you to specify if content is intended for segments of the international audience speaking certain languages.

Now, who here speaks HTML? <rel alternate=”…”/>

Given a situation where an Etsy Listing is not translated into several languages, we would append this HTML code to the English Listing page (inside the <head> tags). This alleviates any confusion as to what language-speakers this content is intended for. It assists search providers in ignoring duplicate content elsewhere.

<link rel="canonical" href="http://www.etsy.com/listing/83154191/etsy-stickers-keeping-it-real-set-of-5" />
<link rel="alternate" hreflang="fr" href="http://fr.etsy.com/listing/83154191/etsy-stickers-keeping-it-real-set-of-5" />
<link rel="alternate" hreflang="nl" href="http://nl.etsy.com/listing/83154191/etsy-stickers-keeping-it-real-set-of-5" />

There are two things at play here:

  1. We specify a “canonical” URL, which is the one true URL to view this content, in whatever language it appears in on the page. This way, when search engines see the content elsewhere, they know to give its PageRank to the canonical URL, avoiding dilution.
  2. We specify “alternate” URLs, which tell search engines that there are similar-looking pages around that have the same content but different chrome. That is, the content is the same, but the site navigation and boilerplate are in the language specified by the hreflang attribute.

Given the Etsy Listing has been translated to German, this code would be appended to the German version of that same Etsy Listing:

<link rel="canonical" href="http://de.etsy.com/listing/83154191/etsy-autoaufkleber-5-stuck" />

Notice there are no alternates listed for the German content. That content can’t be found anywhere else. There is no ambiguity and so we have no worries of misrepresenting it as duplicate content.

Note that rel=”alternate” is only used for pages with the same content, but different languages for the navigation and boilerplate. Search engines use this information to send users to the version of the site they’re most likely to be able to move around in.
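The two cases above can be sketched as a single helper. This is an illustration under assumptions, not our production code: `head_links`, its parameters, and the `SITE_LANGUAGES` tuple are hypothetical names, and the paths are the ones from the examples above.

```python
from html import escape
from typing import List, Optional, Sequence

SITE_LANGUAGES = ("fr", "nl")  # hypothetical: language subdomains that only translate the chrome


def head_links(path: str,
               translated_lang: Optional[str] = None,
               site_languages: Sequence[str] = SITE_LANGUAGES) -> List[str]:
    """Build the <link> tags for a listing page's <head>."""
    if translated_lang is not None:
        # Translated content lives at exactly one address: canonical only.
        href = escape(f"http://{translated_lang}.etsy.com{path}", quote=True)
        return [f'<link rel="canonical" href="{href}" />']
    # Untranslated content: www is canonical, every language host is an alternate.
    canonical = escape(f"http://www.etsy.com{path}", quote=True)
    links = [f'<link rel="canonical" href="{canonical}" />']
    for lang in site_languages:
        href = escape(f"http://{lang}.etsy.com{path}", quote=True)
        links.append(f'<link rel="alternate" hreflang="{lang}" href="{href}" />')
    return links
```

Called with the English sticker listing's path and no `translated_lang`, this produces the canonical tag plus the `fr` and `nl` alternates. Called with the German path and `translated_lang="de"`, it produces the single canonical tag.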

We know how to say, “Stay out!” in every language.

Another way to be sure we don’t publish mislabeled English-only content is to make use of robots.txt and “robots” meta tags. Using combinations of these techniques we can suggest that search providers not index multilanguage content that will only be available in the future.

This might mean that robots.txt will list “disallow” directives for chunks of the site. Like this:

Disallow: /listings/*

Also we may say that if an Etsy Listing is not translated into French, viewing that Listing from fr.etsy.com will result in rendering meta tags like this:

<meta name="robots" content="noindex">
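Both signals are easy to generate per request. The sketch below is an assumption about how that could look; `robots_txt` and `robots_meta` are hypothetical helpers, and the `User-agent: *` line is added because a Disallow rule needs a group to belong to.

```python
from typing import Iterable, Set


def robots_txt(disallowed: Iterable[str]) -> str:
    """Render a robots.txt body that applies the given Disallow rules to all crawlers."""
    lines = ["User-agent: *"]
    lines.extend(f"Disallow: {path}" for path in disallowed)
    return "\n".join(lines) + "\n"


def robots_meta(subdomain_lang: str, listing_langs: Set[str]) -> str:
    """Return a noindex tag when the listing has no translation for this subdomain."""
    if subdomain_lang in listing_langs:
        return ""  # translated: let the page be indexed
    return '<meta name="robots" content="noindex">'
```

For an English-only listing viewed on the French subdomain, `robots_meta("fr", {"en"})` returns the noindex tag. Once a French translation exists it returns an empty string.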

Always know how to ask for directions!

One of the best ways to help yourself out with SEO is to provide a map for your search providers. This standard is known as the Sitemap protocol. This sitemap file is specific to the international subdomain or top level domain the request came from.

An Etsy Listing will show up in the sitemap.xml of any subdomain corresponding to the languages the listing is available in. For example, a listing available in Italian will show up in the sitemap.xml at it.etsy.com, while a listing only available in English would not. Any Etsy Listing is available to any user on Etsy, but targeting the sitemap.xml in this way allows us to indicate to search providers what the region and language are for a particular Etsy Listing.
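A minimal sketch of that filtering, using Python's standard library: the shape of the listing records (a `paths` dict keyed by language) and the `build_sitemap` name are assumptions for illustration, and the example listings are invented.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(subdomain_lang: str, listings: Iterable[Dict]) -> str:
    """Return sitemap XML listing only what is translated into subdomain_lang."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for listing in listings:
        path = listing["paths"].get(subdomain_lang)
        if path is None:
            continue  # not available in this language: leave it out
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = f"http://{subdomain_lang}.etsy.com{path}"
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical data: one listing translated into Italian, one English-only.
example_listings = [
    {"paths": {"en": "/listing/1/wool-scarf", "it": "/listing/1/sciarpa-di-lana"}},
    {"paths": {"en": "/listing/2/oak-bowl"}},
]
italian_sitemap = build_sitemap("it", example_listings)
```

Here `italian_sitemap` contains a single `<url>` entry for the translated scarf listing; the English-only bowl is omitted.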

All the data contained inside the sitemap.xml and robots.txt files is generated dynamically, querying only for content translated into that specific region and language. You can check out our SEO-related files below. (Be sure to clear any region- or language-specific cookies you might have set already.)

http://www.etsy.com/robots.txt
http://www.etsy.com/etsymap_listing_50m_sitemap_index.xml
http://de.etsy.com/robots.txt
http://de.etsy.com/etsymap_listing_de_sitemap.xml.gz

Keep a log of your journey!

At Etsy we use Splunk, a data management interface, to aggregate all of our web server logs. We can use it to run periodic reports and keep an eye on the results for us. We also run automated tasks daily, essentially mimicking the role of an Etsy user. The tasks execute a search query for Etsy content using a normal search engine, and then compare the results to what we’d expect to see for such a query.

Using this system we can keep an eye on:

If we see a surge in regional traffic without a commensurate rise in the data feeding that region or language, we can tell that search indexes are likely tainted and need attention.
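That comparison can be sketched in a few lines of plain Python. This is not a Splunk query; the data shapes, the `tolerance` threshold and the `suspicious_languages` name are assumptions for illustration, and the numbers in the usage note are invented.

```python
from typing import Dict, List, Tuple

Counts = Tuple[int, int]  # (previous period, current period)


def suspicious_languages(visits: Dict[str, Counts],
                         listings: Dict[str, Counts],
                         tolerance: float = 1.5) -> List[str]:
    """Flag languages whose traffic grew much faster than their translated content."""
    flagged = []
    for lang, (visits_before, visits_now) in visits.items():
        listings_before, listings_now = listings.get(lang, (0, 0))
        if visits_before <= 0 or listings_before <= 0:
            continue  # no baseline to compare against
        traffic_growth = visits_now / visits_before
        content_growth = listings_now / listings_before
        if traffic_growth > content_growth * tolerance:
            flagged.append(lang)
    return sorted(flagged)
```

For example, if German search traffic triples while the number of German listings grows by only a few percent, `suspicious_languages` returns `["de"]`, which would be a cue to look at what the search indexes hold for de.etsy.com.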

Now don’t wander off anywhere. Check back soon for more updates about our adventures in multilingual content.

Category: engineering, internationalization

2 Comments

There is a missed opportunity here for both monolingual search engines and content providers. It’s kind of sad. Having one unique URI that a French user can just paste to his Japanese friend is one of the wonderful things about content negotiation. How do we leverage this so that search engines are able to understand it?

Enter the never used 1998 RFC: “Transparent Content Negotiation in HTTP” (RFC 2295). http://tools.ietf.org/html/rfc2295

I wonder what it would take for Google to negotiate with content providers and offer this feature.

The solution is super simple. We were living in a world where sites were quite monolingual but this has changed since. 🙂