There and Back Again: Migrating Geolocation Data to GeoNames

Posted by John Marc on March 26, 2013

People are passionate about where they live. At Etsy we need to keep track of lots of different locations, such as buyers’ billing and shipping addresses and sellers’ shop locations. As Etsy continues to expand internationally, we wanted to provide better localization and translations for our place names. We determined that the best way to make this change was to move from a closed location API provider to internal services backed by the open GeoNames data set.

Before we could start using GeoNames as our data source, we had to map all our existing user locations onto the GeoNames data. There is no established method for creating this mapping, so we looked at the data we had and devised our own. For our existing user locations we had city, region, country, country code, latitude and longitude in a format like this:

tokyo             tokyo prefecture   japan            jp   35.670 139.740
luxembourg city   luxemburg          luxembourg       lu   49.609 6.129
newport           england            united kingdom   gb   50.700 -1.295
earth             texas              united states    us   34.277 -102.508

To execute our mapping, we created a PHP script that takes advantage of the much larger amount of data GeoNames provides and quickly determines whether old and new locations match up. This script is now available on Etsy’s GitHub page. The script relies on a MySQL database that contains the standard data tables distributed by GeoNames, with indexes on place names, country codes, and the latitude and longitude columns.

For each row in our source data we attempt a few different strategies to try to find a match to the GeoNames data.

Our first step is to look for exact text matches to our place name in the GeoNames data set. If there are one or more exact matches, we sort them by distance from the source latitude and longitude, and if the nearest place is less than 20km away we call it a match. 20km is a rather large radius, and we could easily have used a much lower limit, but because we already sort by distance, a large radius gave us more positive matches when our data sets disagreed on where the center of a large city should be.
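
As an illustration, the exact-match step could look something like the following Python sketch. This is not the PHP script mentioned above; the function names, the dictionary layout of the candidates, and the use of the haversine formula for distance are all assumptions made for the example. Only the 20km threshold comes from the description.

```python
import math

EARTH_RADIUS_KM = 6371.0
MAX_DISTANCE_KM = 20.0  # threshold described in the post


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (assumed metric; the post does not name one)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_exact_match(name, lat, lon, candidates, max_km=MAX_DISTANCE_KM):
    """Return the closest candidate with the same name, or None.

    `candidates` is a hypothetical list of dicts with "name", "lat" and "lon"
    keys standing in for rows from the GeoNames table.
    """
    exact = [c for c in candidates if c["name"].lower() == name.lower()]
    if not exact:
        return None
    # Sort by distance from the source point and keep only the nearest.
    best = min(exact, key=lambda c: haversine_km(lat, lon, c["lat"], c["lon"]))
    if haversine_km(lat, lon, best["lat"], best["lon"]) < max_km:
        return best
    return None
```

With two candidates named "Newport", one near the source coordinates and one a few hundred kilometers away, the function returns the nearer one; if every same-named candidate is 20km or more away it returns None and the next strategy would be tried.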

If we don’t find an exact text match we look twice more: once with wildcards around the place name from our source data, and once searching the “alternatenames” column in the GeoNames data set instead of the “name” column. This helps us find locations whose names may have changed or have alternate spellings. Results are sorted by distance and again we keep the closest if it falls within our threshold.
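
A rough sketch of these two wildcard lookups is shown below, in Python with an in-memory SQLite database standing in for MySQL. The table is a cut-down, assumed version of the GeoNames "geoname" table, the row and its ID are made up for the demo, and the helper names are hypothetical. Distance sorting and the threshold check are left out here to keep the example short.

```python
import sqlite3


def build_demo_db():
    """Create a tiny stand-in for the GeoNames table (illustrative data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE geoname (geonameid INTEGER PRIMARY KEY, name TEXT, "
        "alternatenames TEXT, latitude REAL, longitude REAL, country_code TEXT)"
    )
    conn.execute(
        "INSERT INTO geoname VALUES (?, ?, ?, ?, ?, ?)",
        (1, "Luxembourg", "Luxemburg,Luxembourg City,Letzebuerg", 49.61, 6.13, "LU"),
    )
    return conn


def fuzzy_candidates(conn, place_name, country_code):
    """Try a wildcard match on "name" first, then on "alternatenames"."""
    pattern = f"%{place_name}%"
    # The column name comes from this fixed tuple, never from user input.
    for column in ("name", "alternatenames"):
        rows = conn.execute(
            f"SELECT geonameid, name, latitude, longitude FROM geoname "
            f"WHERE country_code = ? AND {column} LIKE ?",
            (country_code.upper(), pattern),
        ).fetchall()
        if rows:
            return rows
    return []
```

Here a source name of "luxembourg city" misses the "name" column, which holds only "Luxembourg", but is found through "alternatenames".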

If we still have not found a match we take a substring of the first several letters of the name and do a wildcard search for that. (The length of the substring depends on the length of the source name and is generally about 35%.) This helps eliminate problems resulting from the inconsistent inclusion of words like ‘City’, ‘Township’ and ‘Borough’ in place names in our source data.
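
To make the substring idea concrete, here is a small Python sketch that builds a LIKE pattern from the leading part of a name. The 35% figure is from the description above; rounding up, the three-character minimum, and the trailing-only wildcard are assumptions for the example, and `prefix_pattern` is a hypothetical name.

```python
import math


def prefix_pattern(place_name, fraction=0.35, min_chars=3):
    """Build a SQL LIKE pattern from roughly the first 35% of a place name.

    The minimum length and the rounding rule are assumed; the post only says
    the length depends on the name and is "generally about 35%".
    """
    name = place_name.strip().lower()
    length = max(min_chars, math.ceil(len(name) * fraction))
    return name[:length] + "%"
```

For a source name such as "newport township" this produces a pattern beginning with "newpor", which would also match a GeoNames entry called simply "Newport".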

For this project it was important that we found a match for every location, so we added a final “match at all costs” step. If none of the previous steps have succeeded, we use the source location’s latitude and longitude to get a list of all locations in the source country that are within a few tenths of a degree. We then sort the results by distance and pick the closest. This was necessary for less than 1% of our data and allowed us to continue providing local search services even when we were unable to match a city exactly.

This was a very exploratory, iterative process as we discovered what worked and what did not, and where our existing data was incomplete or inaccurate. From tuning the radius of the search areas to dropping all the region data outside of the US (we found that it was contributing a lot of false positive matches, and the results were better without it), we revised and refined the logic until it gave us satisfactory results.
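
The final “match at all costs” step might be sketched like this in Python. The box size of 0.3 degrees is an assumed stand-in for “a few tenths of a degree”, the distance uses a simple equirectangular approximation chosen for brevity, and the function names and the dictionary layout are hypothetical.

```python
import math

BOX_DEGREES = 0.3  # assumed value for "a few tenths of a degree"


def approx_km(lat1, lon1, lat2, lon2):
    """Equirectangular distance approximation, adequate for short ranges."""
    x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = math.radians(lat2 - lat1)
    return 6371.0 * math.hypot(x, y)


def fallback_match(lat, lon, country_code, places, box=BOX_DEGREES):
    """Pick the closest place in the same country inside a lat/lon box.

    `places` is a hypothetical list of dicts with "country_code", "lat"
    and "lon" keys. No name comparison is done at this stage.
    """
    nearby = [
        p for p in places
        if p["country_code"] == country_code.upper()
        and abs(p["lat"] - lat) <= box
        and abs(p["lon"] - lon) <= box
    ]
    if not nearby:
        return None
    return min(nearby, key=lambda p: approx_km(lat, lon, p["lat"], p["lon"]))
```

Filtering on a latitude and longitude box first is what lets indexes on those columns do most of the work in a database; the exact distance only has to be computed for the handful of rows that survive the filter.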

After we established a mapping we added a new field to store each seller’s “GeoNameID” in our DB. Where previously we stored all the fields listed above, we now need only the GeoNameID. We still store the old data format as well, which has allowed us to make an easy and transparent transition from the old data to the new data. With access to the full breadth of data provided by GeoNames we have been able to speed up our Local Search feature. We now also localize spellings of place names for our international users, so someone searching in German can search near Köln while someone searching in English can look for Cologne.
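
As a simplified illustration of why a single ID is enough, the Python sketch below looks up a display name by ID and language. The dictionaries stand in for GeoNames name data, the ID is made up, and `localized_name` is a hypothetical helper, not part of Etsy’s services.

```python
def localized_name(geoname_id, language, alternate_names, default_names):
    """Return the place name for a language, falling back to the default name.

    `alternate_names` maps (geoname_id, language) to a localized spelling and
    `default_names` maps geoname_id to the primary name; both are assumed
    in-memory stand-ins for database lookups.
    """
    return alternate_names.get((geoname_id, language), default_names[geoname_id])


# Illustrative data only; the ID 1 is not a real GeoNameID.
DEMO_ALTERNATES = {(1, "de"): "Köln", (1, "en"): "Cologne"}
DEMO_DEFAULTS = {1: "Köln"}
```

Calling `localized_name(1, "en", DEMO_ALTERNATES, DEMO_DEFAULTS)` gives "Cologne", while a language with no entry falls back to the default spelling.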

I will be giving a talk on how we handle search and localized place name auto-suggest at Lucene/Solr Revolution in San Diego in May, and we’ll be publishing more Code as Craft blog posts on geolocation services here in the coming months.

You can follow John Marc on Twitter @thejohnmarc

