Atomic deploys at Etsy

Posted on July 1, 2013

A key part of Continuous Integration is being able to deploy quickly, safely and with minimal impact to production traffic. Sites use various deploy automation tools like Capistrano, Fabric and a large number of homegrown rsync-based ones. At Etsy we use a tool we built and open-sourced called Deployinator.

What all these tools have in common is that they get files onto multiple servers and are able to run commands on those servers. What ends up varying a lot is what those commands are. Do you clear your caches, graceful your web server, prime your caches, or even stagger your deploys to groups of servers at a time and remove them from your load balancer while they are being updated? There are good reasons to do all of those things, but the less you have to do, the better. It should be possible to atomically update a running server without restarting it and without clearing any caches.

The problem with deploying new code to a running server is quite simple to understand. A request that starts on one version of the code might access other files during the request, and if those files are updated to a new version mid-request you end up with strange side effects. For example, in PHP you might autoload files on demand when instantiating objects, and if the code has changed mid-request the caller and the implementation could easily get out of sync. At Etsy, it was quite normal to have to split significant code changes over 3 deploys before implementing atomic deploys. One deploy to push the new files. A second deploy to push the changes to make use of the new code, and a final deploy to clean up any outdated code that isn’t needed anymore.

The problem is simple enough, and the solution is actually quite simple as well. We just need to make sure that we can run concurrent requests on two versions of our code. While the problem statement is simple, actually making it possible to run different versions of the code concurrently isn’t necessarily easy.

When trying to address this problem in the past, I’ve made use of PHP’s realpath cache and APC (a PHP opcode cache, which uses inodes as keys). During a deploy the realpath cache retains the inodes from the previous version, and the opcode cache would retain the actual code from the previous version. This means that requests that are currently in progress during a deploy can continue to use the previous version’s code as they finish. WePloy is an implementation of this approach which works quite well.

With PHP 5.5 there is a new opcode cache called Opcache (which is also available for PHP 5.2-5.4). This cache is not inode-based. It uses realpaths as the cache keys, so the inode trick isn’t going to work anymore. Relying on getting the inodes from a cache also isn’t a terribly robust way of handling the problem because there are still a couple of tiny race windows related to new processes with empty caches starting up at exactly the wrong time. It is also too PHP-oriented in that it relies on very specific PHP behaviour.

Instead of relying on PHP-specific caching, we took a new look at this problem and decided to push the main responsibility to the web server itself. The base characteristic of any atomic deploy mechanism is that existing requests need to continue executing as if nothing has changed. The new code should only be visible to new requests. In order to accomplish this in a generic manner we need two document roots that we toggle between and a new request needs to know which docroot it should use. We wrote a simple Apache module that calls realpath() on the configured document root. This allows us to make the document root a symlink which we can toggle between two directories. The Apache module sets the document root to this resolved path for the request so even if a deploy happens in the middle of the request and the symlink is changed to point at another directory the current request will not be affected. This avoids any tricky signaling or IPC mechanisms someone might otherwise use to inform the web server that it should switch its document root. Such mechanisms are also not request-aware so a server with multiple virtual hosts would really complicate such methods. By simply communicating the docroot change to Apache via a symlink swap we simplify this and also fit right into how existing deploy tools tend to work.

We called this new Apache module mod_realdoc.

If you look at the code closely you will see that we are hooking into Apache first thing in the post_read_request hook. This is run as soon as Apache is finished reading the request from the client. So, from this point on in the request, the document root will be set to the target of the symlink and not the symlink itself. Another thing you will notice is that the result of the realpath() is cached. You can control the stat frequency with the RealpathEvery Apache configuration directive. We have it set to 2s here.

Note that since we have two separate docroots and our opcode cache is realpath-based, we have to have enough space for two complete copies of our site in the cache. By having two docroots and alternating between them on successive deploys we reuse entries that haven’t changed across two deploys and avoid “thundering herd” cache issues on normal deploys.

If you understand things so far and have managed to compile and install mod_realdoc you should be able to simply deploy to a second directory and when the directory is fully populated just flip the docroot symlink to point to it. Don’t forget to flip the symlink atomically by creating a temporary one and renaming it with “mv -T”. Your deploys will now be atomic for simple PHP, CGI, static files and any other technology that makes use of the docroot as provided by Apache.
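To make the swap step concrete, here is a minimal sketch in PHP, assuming the /var/www/A and /var/www/B layout used as an example in the next paragraph. PHP’s rename() maps to the same atomic rename(2) call that “mv -T” uses, so a request always sees either the old target or the new one:

<?php
// Hypothetical deploy step: find the idle docroot, populate it, then
// atomically repoint the /var/www/site symlink at it.
$current = realpath('/var/www/site');   // e.g. /var/www/A
$next    = ($current === '/var/www/A') ? '/var/www/B' : '/var/www/A';

// ... rsync the new build into $next here ...

// Build the new symlink under a temporary name, then rename() it over the
// old one. The rename is atomic, so no request ever sees a missing or
// half-updated docroot.
@unlink('/var/www/site.tmp');
symlink($next, '/var/www/site.tmp');
rename('/var/www/site.tmp', '/var/www/site');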

However, you will likely have a bit more work to do for more complex scenarios. You need to make sure that nothing during your request uses the absolute path to the document_root symlink. For example, if you configure Apache’s DOCUMENT_ROOT for your site to be /var/www/site/htdocs and then have /var/www/site be a symlink that alternates between /var/www/A and /var/www/B, you need to check your code for any hardcoded instances of /var/www/site/htdocs. This includes your PHP include_path setting. One way of doing this is to set your include_path as the very first thing you do if you have a front controller in your application. You can use something like this:

ini_set('include_path', $_SERVER['DOCUMENT_ROOT'].'/../include');

That means once mod_realdoc has resolved /var/www/site/htdocs to /var/www/A/htdocs your include_path will be /var/www/A/htdocs/../include for the remainder of this PHP request and even if the symlink is switched to /var/www/B halfway through the request it won’t be visible to this request.

At Etsy we don’t actually have a front controller where we could easily make this app-level ini_set() call, so we wrote a little PHP extension to do it for us. It is called incpath.

This extension is quite simple. It has three ini settings. incpath.docroot_sapi_list specifies which SAPIs should get the docroot from the SAPI itself. incpath.realpath_sapi_list lists the SAPIs which should do the realpath() call natively; in that case the extension is essentially acting as a PHP-side version of mod_realdoc, resolving the symlink itself. And finally, incpath.search_replace_pattern specifies the string to replace in the existing include_path. It is easier to understand with an example. At Etsy we have it configured something like this:

incpath.docroot_sapi_list = apache2handler
incpath.realpath_sapi_list = cli
incpath.search_replace_pattern = /var/www/site/htdocs

This means that when running PHP under Apache we will get the document root from Apache (apache2handler) and we will look for “/var/www/site/htdocs” in the include_path and replace it with the document root we got from Apache. For cli we will do the realpath() in the extension and use that to substitute into the include_path. Our PHP configuration then has the include_path set to:

/var/www/site/htdocs/../include:.

which the incpath extension will modify to be either /var/www/A/htdocs/../include or /var/www/B/htdocs/../include.

This include_path substitution is done in the RINIT PHP request hook which runs at the beginning of every request before any PHP code has run. The original include_path is restored at the end of the request in the RSHUTDOWN PHP hook. You can, of course, specify different search_replace_pattern values for different virtual hosts and everything should work fine. You can also skip this extension entirely and do it at the app-level or even through PHP’s auto_prepend functionality.
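For reference, a rough app-level equivalent of what incpath does might look like the following, for example in an auto_prepend_file. The pattern string plays the same role as incpath.search_replace_pattern and reuses the example paths from above:

<?php
// Hypothetical auto_prepend_file: swap the symlinked prefix in the
// include_path for the docroot that mod_realdoc resolved for this request.
$pattern  = '/var/www/site/htdocs';
$resolved = $_SERVER['DOCUMENT_ROOT'];   // already a resolved real path under mod_realdoc

if ($resolved !== $pattern) {
    ini_set('include_path',
        str_replace($pattern, $resolved, ini_get('include_path')));
}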

Some caveats. First and foremost this is about getting atomicity for a single web request. This will not address multi-request atomicity issues. For example, if you have a request that triggers AJAX requests back to the server, the initial request and the AJAX request may be handled by different versions of the code. It also doesn’t address changes to shared resources. If you change your database schema in some incompatible way such that the current and the new version of the code cannot run concurrently then this won’t help you. Any shared resources need to stay compatible across your deploy versions. For static assets this means you need proper asset versioning to guarantee that you aren’t accessing incompatible js/css/images.

If you are currently using Apache and a symlink-swapping deploy tool like Capistrano, then mod_realdoc should come in handy for you, and it is likely to let you remove a graceful restart from your deploy procedure. For non-Apache, like nginx, it shouldn’t be all that tricky to write a similar plugin which does the realpath() call and fixates the document root at the top of a request insulating that request from a mid-request symlink change. If you use this Apache module or write your own for another server, please let us know in the comments.


Introducing Kale

Posted on June 11, 2013

In the world of Ops, monitoring is a tough problem. It gets harder when you have lots and lots of critical moving parts, each requiring constant monitoring. At Etsy, we’ve got a bunch of tools that we use to help us monitor our systems. You might be familiar with some of them: Nagios, StatsD, Graphite, and Ganglia. Today, we’d like to introduce you to a new tool that we’ve been working on for the past few months.

This tool is designed to solve the problem of metrics overload. What kind of overload are we talking about? Well, at Etsy, we really love to make graphs. We graph everything! Anywhere we can slap a StatsD call, we do. As a result, we’ve found ourselves with over a quarter million distinct metrics. That’s far too many graphs for a team of 150 engineers to watch all day long! And even if you group metrics into dashboards, that’s still an awful lot of dashboards if you want complete coverage.

Of course, if a graph isn’t being watched, it might misbehave and no one would know about it. And even if someone caught it, lots of other graphs might be misbehaving in similar ways, and chances are low that folks would make the connection.

We’d like to introduce you to the Kale stack, which is our attempt to fix both of these problems. It consists of two parts: Skyline and Oculus. We first use Skyline to detect anomalous metrics. Then, we search for that metric in Oculus, to see if any other metrics look similar. At that point, we can make an informed diagnosis and hopefully fix the problem.

Skyline

Skyline is an anomaly detection system. It shows all the current metrics that have been determined to be anomalous:

[Screenshot: Skyline’s list of currently anomalous metrics]

You can hover over all the metric names and view the graphs directly. Our algorithms do a good job filtering most of the non-anomalous metrics, but for now, they certainly aren’t as good as humans at pattern matching. So, our philosophy is to err on the side of noise – it’s very easy to scroll through a few more false positives if it means you get to catch all the real anomalies.

Once you’ve found a metric that looks suspect, you can click through to Oculus and analyze it for correlations with other metrics!

Oculus

Oculus is the anomaly correlation component of the Kale system. Once you’ve identified an interesting or anomalous metric, Oculus will find all of the other metrics in your systems which look similar.

It lets you search for metrics using your choice of two comparison algorithms, and shows you other metrics which are similar for the same time period.

You can even save interesting metrics into a collection (complete with your own notes), for example if a particular set of related graphs occurred during a complex outage:

[Screenshot: saving metrics to a collection]

Oculus will then search through all of your saved metric collections along with your other metrics, and will show you any matches alongside your regular search results. What’s that? You’re seeing the same pattern of graphs as you saved during that last site outage you had? Gee, thanks Oculus!


Going further into the juicy technical depths of each system is a tad beyond the scope of this introductory post, but we’ll be sure to post more about it in the coming weeks. We will also be speaking about it at Velocity next week (see the abstract here), so if you’re around, come and say hi! In the meantime, we are open sourcing everything today, and we are looking forward to feedback from the community. Go ahead and try it out! You can find Skyline and Oculus at http://github.com/etsy/skyline and http://github.com/etsy/oculus.

monitoring <3,
Abe and Jon

Abe tweets at @abestanway, and Jon tweets at @jonlives

Update: It came to our attention that the “Loupe” name is also used by a commercial product, and so we have changed the name to Kale.


Leveraging Big Data To Create More Secure Web Applications

Posted on June 4, 2013

Here at Etsy, we take measuring things very seriously. We have previously discussed how we harness data to make decisions about software, operations and products, but we have said little about just how useful data can be for the information security practices of a modern web application.

Our Data Team has written several blog articles about how we build, maintain and take advantage of our data infrastructure. Over the years, our data stack has grown to consist of several technologies, including Hadoop and Cascading (which we’ve written about in the past). At a high level, Cascading leverages Hadoop’s infrastructure while abstracting standard data processing operations, such as splits and joins, away from the underlying mapper and reducer tasks. Cascading allows us to write analytics jobs quickly and easily in familiar languages (we use both the JRuby and Scala DSLs for Cascading). With a mature Hadoop stack and a multitude of helpful data engineers, the Security Team at Etsy has been taking advantage of the data stack in increasing amounts over the past few months to strengthen our security posture.

Broadly speaking, there are three main types of security practices in which we utilize big data: reactive security mechanisms, proactive security mechanisms, and incident response security practices.

Reactive Security Mechanisms

Reactive security mechanisms usually consist of real-time event monitoring and alerting. These security mechanisms focus on events that trigger immediate responses based on regular information querying: they query the same data and they query it often.

Some examples of reactive security mechanisms at Etsy are automated Splunk searches and anomaly detection based on StatsD/Graphite logs.


We use saved Splunk searches to identify anomalous patterns in access logs and error logs, such as cross-site scripting and increasing failed log-in rates. These searches typically run once a minute in order to give us a real-time monitoring mechanism. We also built an anomaly detection system based on logs of potentially suspicious events such as failed logins. These two mechanisms are similar with respect to the frequency of data aggregated (constantly) and the frequency of analysis (almost constantly).
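For illustration, emitting one of those suspicious-event counters is a one-liner from application code. The sketch below is not our actual StatsD client; the metric name and host are placeholders, and it simply speaks the standard “metric:value|c” counter format over UDP:

<?php
// Emit a counter for a suspicious event so it lands in Graphite, where the
// saved searches and anomaly detection can pick it up.
function count_security_event($metric, $host = 'statsd.example.com', $port = 8125) {
    $fp = @fsockopen("udp://$host", $port, $errno, $errstr);
    if ($fp) {
        fwrite($fp, "security.$metric:1|c");
        fclose($fp);
    }
}

count_security_event('login.failed');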

As you may guess, reactive security mechanisms cannot be easily implemented in Hadoop. Although Hadoop is fast (isn’t that the point!), the main benefit of using big data infrastructure is to churn through huge quantities of data. For reactive security mechanisms, we want to get as close as possible to real-time results. It wouldn’t be efficient or rational to run this type of analysis on our Hadoop cluster.

Although reactive security mechanisms aren’t performed on our cluster, the initial data gathering step is perfectly suited for Hadoop. Figuring out where the thresholds lie for certain metrics by performing predictive analytics and forecasting on past data is a fantastic way to save time that would previously have been spent over several weeks fine-tuning Splunk queries.

Proactive Security Mechanisms

Proactive security mechanisms seek to reduce attack surface or eliminate entire vulnerability classes. This category includes mechanisms such as content security policy, output encoding libraries or full-site SSL. These mechanisms are intended to improve the long-term security posture of the application, rather than collect data about an on-going attack.

Similarly to reactive security mechanisms, we can use predictive analytics and forecasting in Hadoop to weigh the value of our proactive security mechanisms. For example, when determining if our load balancers could handle activating full-site SSL for all Etsy sellers, we ran Hadoop jobs that analyzed past traffic of our sellers to figure out how many requests were made by sellers that were not HTTPS. Armed with this data, as well as metrics from our load balancers, we were able to push out full-site SSL for sellers without encountering unexpected capacity issues.

require 'helpers/analytics'

analytics_cascade do
  analytics_flow do
    analytics_source 'event_logs'
    tap_db_snapshot 'users_index'

    assembly 'event_logs' do
      group_by 'user_id', 'scheme' do
        count 'value'
      end
    end

    assembly 'users_index' do
      project 'user_id', 'is_seller'
    end

    assembly 'ssl_traffic' do
      project 'user_id', 'is_seller', 'scheme', 'value'
      group_by 'is_seller', 'scheme' do
        count 'value'
      end
    end

    analytics_sink 'ssl_traffic'
  end
end

The Cascading.jruby source code for this task is relatively straightforward.


The output of this Hadoop job allowed us to visualize Seller HTTP traffic vs Seller HTTPS traffic.

However, unlike reactive security mechanisms, we can use Hadoop to create proactive security mechanisms. The key difference is that these security mechanisms cannot be used to monitor critical metrics that require immediate response or attention. Fortunately, the results of proactive security mechanisms typically don’t require immediate attention. In the SSL example, before we were able to push out full-site SSL for sellers, we had a daily Hadoop job that would break down request patterns to show us which URLs/patterns were requested most often over HTTP. We used this data to iteratively change the scheme of these high value URLs from HTTP to HTTPS while still having data to support the fact that this wouldn’t surpass our restriction on load-balancer terminated SSL connections.

Incident Response

Web application incident response is something that is done often in practice, as it is not limited to responding to full-blown compromise. We often need to investigate a threat that has recently targeted organizations similar to our own, or investigate a new exploit that may affect our technology stack. Typically we are looking for identifying details such as URL patterns or IP addresses that have repeatedly accessed our application. Since similar incident response actions are performed frequently, our incident response practices need to be repeatable and, since an incident can occur at any time, we must be able to get the results of our analysis quickly. Even though we want our analysis to be generalizable, it is dependent on the particular threat we are responding to and thus the parameters often need to be changed.

Given all of these conditions, incident response is a perfect example of when to use big data. Incident response is ad-hoc analysis of a large dataset that is driven by an event or incident. We are not going to do it more than once and it needs to be fast. This is a textbook use-case for Hadoop and we take advantage of it constantly for this purpose. Writing template Hadoop jobs that scan our access logs for visits from target IP addresses or to known malicious URL patterns, and that are easily pluggable with new incident details, has proved invaluable to our incident response practices.

Conclusion

The security posture of an application is directly proportional to the amount of information that is known about the application. Big data can be a great source of this kind of information: it can be used to gather the data behind reactive security mechanisms, to gather the data that informs proactive security mechanisms, to directly create new proactive security mechanisms, and to perform incident response.

Although the advantages of analytics from a data science perspective are well known and well documented, the advantages of analytics from a security perspective have not been explored in depth. We have found big data to be extraordinarily useful in creating both reactive and proactive security mechanisms, as well as in aiding incident response. We hope that this will help other organizations in using their data and analytics capabilities more effectively.

Want to know more about how Etsy uses Hadoop to create a more secure web application? The authors of this blog post (Mike Arpaia & Kyle Barry) will be presenting a more in-depth discussion of this topic at the upcoming Black Hat USA conference and Nordic Security Conference. If this has been of interest we hope you’ll get a chance to check out our presentations.

You can follow Mike on Twitter at @mikearpaia and you can follow Kyle on Twitter at @allofmywats.


Culture Hacking With A Staff Database

Posted on May 31, 2013

When your startup is just you and the cat, your company communication problems are largely whether or not the cat listens when you tell it to get off your keyboard. Scale to a handful of people and you still know everyone, what they’re working on, where they are, and how to get hold of them when you need them. As you scale further, communication becomes increasingly complex. And then Mother Nature throws a hurricane at you and you really need to know that everyone is safe and well, and do it quickly and efficiently.

We’ve tried to solve “who” problems with an open, accessible staff database.

Free your staff data

Most companies will have some form of HR system with all sorts of useful information in it, but much of that is private and confidential, and locked away (and rightly so). What’s needed is something more basic: Who is this person? What do they look like? Where are they? What do they do? How do I get in touch with them?

But, because we took our HR data and re-published the “safe” bits to a database, the staff directory data is completely at our mercy. We’re not subject to what the designers of our HR information system think people should look at. And if what we need changes, we can adapt to suit.

Even better, because the data is in the same ecosystem as our regular web site data, anyone who knows how to build something for etsy.com can also build something with our staff data with no particular extra knowledge. Making it easy to hack (unsurprisingly) makes for more hacks, and more hacks allows for useful and unexpected tools to crop up to help you scale better, without any particular central planning.

Who are you?

The obvious thing to build on top of this database is a “staff directory”. We have one, it’s great. It lives with our other internal web-based support tools, easily available to all. For an organisation merrily blazing past Dunbar’s number, it’s critical in helping us maintain a sort of “outboard brain” for who’s who in the company. (This is an ongoing problem, since we’re still hiring.)

Extending that, during a past Hack Week, a group of us built a game where you were given a photo of someone, and multiple choices for who they were. Add to that score tracking and a leader board and we had people enthusiastically learning who everyone in the company was. (It was too easy for some people, so we quickly developed a mode where you had to actually type in the right name – the game equivalent of meeting them in a hallway.)


Smooth small bumps as well as large ones

Hacks don’t have to be complex to be useful. Adding a “You’ve Got Mail” button to the directory simplified the process of sorting the incoming physical mail to the office and meant that people didn’t have to go and check the mail room regularly. Being able to smooth out even minor niggles like this contributes to the feeling of an office that just works.

They don’t even have to be permanent features to be useful – one-off queries can be valuable too. At one point, I needed to work out which of my colleagues sold vintage items on Etsy in order to do some research on a new feature. I could have emailed everyone and hoped to get replies, but instead I could extract the list of staff shops from the directory, and then extract their listings from the listings table, and work out exactly who sold vintage.
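That kind of one-off looks roughly like the sketch below. The table and column names are invented for illustration; the point is simply that the staff data lives alongside the production data, so the whole job is a couple of ordinary queries:

<?php
// $pdo is a connected PDO handle to the same database ecosystem the site uses.
$shop_ids = $pdo->query(
    'SELECT shop_id FROM staff_directory WHERE shop_id IS NOT NULL'
)->fetchAll(PDO::FETCH_COLUMN);

$vintage_staff_shops = array();
if ($shop_ids) {
    $in = implode(',', array_map('intval', $shop_ids));
    $vintage_staff_shops = $pdo->query(
        "SELECT DISTINCT shop_id FROM listings
          WHERE shop_id IN ($in) AND is_vintage = 1"
    )->fetchAll(PDO::FETCH_COLUMN);
}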

Are you OK?

A rather more acute and impromptu use for the data came during Hurricane Sandy, when we wanted to be able to do a roll call to make sure our colleagues were safe and sound. It was trivial to take the data, export name, location, phone numbers, and team to CSV and import it into Google Docs, and then we had a shared document where we could track how everyone was.

At the same time, we wanted everyone to be able to get everyone’s contact details into their phones easily in case we needed to get hold of people quickly. Some quick checking of Google and Wikipedia produced the vCard spec and, while not cutting edge, the 3.0 version is very amenable to quick hacking. (More recent versions are fine too, of course, but 3.0 hit the spot for the amount of time and energy I had available for hacking.)

There are many, many fields in the full spec, but it’s the work of moments to identify the main ones we care about: name, email, phone numbers, and organization. We also happened to have a field for Skype nicks, so I went ahead and added that. (Staff members have voluntarily entered their phone numbers in our staff database, and the vCards are only accessible from the staff directory, so we felt safe that these vCards would be acceptable. You should always consider your company’s data sources and culture, as well as information security issues, when hacking on company data.)

In addition, we have photos of many of our staff members on the about us page, so I did some quick work to include a photo where we had one. (The vCard spec allows you to embed photos as base64-encoded strings.) Nothing complicated, and only a few minutes to implement, but it adds a nice polish to the cards.

We can generate sets of vCards for departments and teams, the whole organisation, just new hires, etc., to make managing your contacts easier.
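Building the cards themselves is just string assembly. Here is a hedged sketch of the sort of thing involved; the field names on $person are hypothetical, but the properties (N, FN, ORG, TEL, EMAIL, PHOTO) come straight from the vCard 3.0 spec:

<?php
// Turn one staff-directory row into a vCard 3.0 string.
function staff_vcard(array $person) {
    $lines = array(
        'BEGIN:VCARD',
        'VERSION:3.0',
        'N:' . $person['last_name'] . ';' . $person['first_name'],
        'FN:' . $person['first_name'] . ' ' . $person['last_name'],
        'ORG:Etsy;' . $person['team'],
        'EMAIL;TYPE=INTERNET:' . $person['email'],
        'TEL;TYPE=CELL:' . $person['mobile'],
    );
    if (!empty($person['photo_jpeg'])) {
        // Inline photo as base64; a full implementation would also fold long lines per the spec.
        $lines[] = 'PHOTO;ENCODING=b;TYPE=JPEG:' . base64_encode($person['photo_jpeg']);
    }
    $lines[] = 'END:VCARD';
    return implode("\r\n", $lines) . "\r\n";
}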

Call me, maybe

The last hurricane-inspired hack we did was an emergency broadcast system so that we could quickly update staff members by SMS about rapidly-changing events or urgent alerts (such as unplanned office closures). We already have a Twilio account, along with PHP libraries to use it, so all that was required was to validate the phone numbers to make sure they were valid US numbers and to write a little bit of code to filter the directory by location (we only want to SMS Brooklynites if the Brooklyn office is closed, for example). The actual sending is simply a matter of making a suitable request to the Twilio REST API. The implementation also included plenty of error logging and reporting and the like, because it’s important to know who you didn’t alert too.
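Stripped of the validation and logging, the sending loop is not much more than the sketch below. The credentials and numbers are placeholders, and it talks to the Twilio REST API directly with curl rather than showing our actual helper library:

<?php
// Send one SMS per recipient via the Twilio REST API.
function broadcast_sms(array $to_numbers, $body, $sid, $token, $from) {
    $url = "https://api.twilio.com/2010-04-01/Accounts/$sid/Messages.json";
    foreach ($to_numbers as $to) {
        $ch = curl_init($url);
        curl_setopt_array($ch, array(
            CURLOPT_POST           => true,
            CURLOPT_POSTFIELDS     => http_build_query(array(
                'From' => $from, 'To' => $to, 'Body' => $body,
            )),
            CURLOPT_USERPWD        => "$sid:$token",
            CURLOPT_RETURNTRANSFER => true,
        ));
        if (curl_exec($ch) === false) {
            // It's important to know who you *didn't* alert.
            error_log("SMS to $to failed: " . curl_error($ch));
        }
        curl_close($ch);
    }
}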


Future work may include extending the system to accept replies, so that we can get the bulk of roll calls done quickly and be able to concentrate on finding and helping those who need it.

When HR systems go to the dogs

One of the notable features of our staff directory’s database schema is the “human” column. The office dog pack is a sufficiently important and intrinsic part of the office culture that it was considered necessary to include them in the database. The past Hack Week saw an expansion of this with a full Doggie DB hack that lets you get to know the pack better. And it might seem trivial on the surface, but that’s because you haven’t had to use the “You’ve Got Poop” button….


All these hacks were (a) easy, (b) unforeseen when we started the staff database project, and (c) helped us in growing the business and keeping our colleagues safe and informed. Grab your company’s data and see what you can do.

You can follow Ian on Twitter at @indec.


At Percona MySQL Conference This Week

Posted on April 23, 2013

A few of us are at the Percona MySQL Conference this week in Santa Clara. Please come say hello, and if you have a chance, stop by the talk about development at scale at 1:20 PM today, Tuesday the 23rd, in Ballroom B.


March 2013 Site Performance Report

Posted on April 9, 2013

Four more months have gone by since our last update, and it’s time for another performance report!  We made it through the holiday season with flying colors, and our engineers have been hard at work launching new features since the new year started.  Let’s see what impact this had on performance (spoiler: we keep getting faster).

Server Side Performance:

Here are the median and 95th percentile load times for signed in users on our core pages on Thursday, 3/14/13:

[Chart: median and 95th percentile server-side load times by page, 3/14/13]

As you can see we had small decreases in load time across the board (the one exception being a tiny uptick in median search performance).  We expected this, since we made a couple of small changes to our application code in the last couple of months that had a positive impact across every page.  The larger decrease in the 95th percentile load time for the homepage was primarily due to the removal of the Taste Test module.

As a quick reminder, the “Baseline” page is an extremely simple page that just includes our header and footer, and uses our normal controller architecture.  Improvements in the load time of this page mean improvements across every page on Etsy.

We also thought it would be fun to look back and compare our performance today to our performance from the very first performance report, back in August of 2011.  Since we don’t look at averages anymore, we can only compare the 95th percentile:

[Chart: 95th percentile load times, August 2011 vs. March 2013]

Looking good!  We have made huge strides on the homepage, shop page, and profile page, and a small improvement on the search page.  The listing page hasn’t seen much optimization given its position as the fastest of these pages, so we’re happy with its modest gains.  It’s extremely gratifying to see how much progress we have made over the last ~18 months, and we hope to continue this trend in the months to come.

Front-end Performance:

We changed our strategy for monitoring front-end performance in our last update, and promised more stable metrics going forward.  That proved to be the case:

[Chart: front-end performance, March 2013]

Performance looks good here as well – reductions in load time for almost every page type again.  The one outlier here is our baseline page, which saw a significant increase in document complete time since our last update.  We’ve started digging into this, and it looks like a couple of extra HTTP requests have snuck into this page – a JS file and a sprite or two.  We’ll be consolidating these requests in the coming days and we expect to see the document complete time on this page drop back down.

As far as the rest of the improvements go, we believe that they are due to a combination of changes that we’ve deliberately made to improve performance, and a better hit rate at our CDNs. We are currently load balancing static content across multiple CDNs, and we’ve worked to improve cacheability so that fewer requests have to hit our origin.

Conclusion:

Overall we are happy with the way performance has been trending, especially because we’ve been focusing our efforts over the last 4 months on projects that don’t involve these core pages.  For the next couple of quarters our plan is to establish clear performance SLAs and bring all of our major pages into compliance with them.  Beyond that we are always looking for big wins and evaluating new standards like SPDY, WebP, and pre-fetching to see if they make sense for Etsy.  It’s an exciting time to be working on web performance!

Does this sound like something you would like to work on?  Our team is hiring!

You can follow Jonathan on Twitter at @jonathanklein


Re-Exploring New Technologies: The Turbo Encabulator

Posted on April 1, 2013

One of the things that we like to do here at Etsy is to push the envelope.

Pushing the envelope means reaching deep into unknown technologies and approaches, and to experiment in areas that we are not familiar with. It’s in this vein that we’d like to share some early and unfiltered work: The Turbo Encabulator.

Each member of the TE team has stretched their cognitive abilities into new realms, and we’re pretty happy with the results thus far. We know that in many ways, engineering is about solidifying a solution well before the problem statement can be fully described within a socio-technical environment.

Of course, we also acknowledge that the notion of socio-emotional technocracy needs to fit within this paradigm. This brings us to some questions:

Understandably, these challenges are unilateral in their nature and we’ve made a large investment in creating an erinaceous atmosphere in order to produce results.

One of these results is the Turbo Encabulator. As you can imagine, sinusoidal repleneration threw us for a loop. :)
An internal tech talk can be seen below:


There and Back Again: Migrating Geolocation Data to GeoNames

Posted on March 26, 2013

People are passionate about where they live. At Etsy we need to keep track of lots of different locations, such as buyers’ billing and shipping addresses and sellers’ shop locations. As Etsy continues to expand internationally we wanted to provide better localization and translations for our location place names. We determined that the best way to effect this change was to move from using a closed location API provider to internal services backed by the open GeoNames data set.

Before we could start using GeoNames as our data source we had to map all our existing user locations onto the GeoNames data. There is no established method for creating this mapping so we looked at the data we had and figured out a method. For our existing user locations we had city, region, country, country code, latitude and longitude in a format like this:

tokyo             tokyo prefecture   japan            jp   35.670 139.740
luxembourg city   luxemburg          luxembourg       lu   49.609 6.129
newport           england            united kingdom   gb   50.700 -1.295
earth             texas              united states    us   34.277 -102.508

To execute our mapping we created a PHP script that was able to leverage the much broader amount of data GeoNames provides us and quickly determine if old and new locations matched up. This script is now available on Etsy’s github page. The script relies on a MySQL DB backend that contains the standard data tables distributed by GeoNames with indexes on place names, country codes, and the latitude and longitude columns.

For each row in our source data we attempt a few different strategies to try to find a match to the GeoNames data.

Our first step is to see if there are any exact text matches to the place name we have in the GeoNames data set. If there are one or more exact matches, we sort them by distance from the source latitude and longitude, and if the nearest place is less than 20km away we call it a match. 20km is a rather large radius and we could easily have used much lower limits, but we found that since we sorted by distance already, a large radius gave us more positive matches when our data sets disagreed on where the center of large cities should be.
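A sketch of that first pass is below. It assumes the standard GeoNames dump loaded into a MySQL table named geoname (column names vary a little between imports, so treat them as placeholders) and uses a plain haversine calculation for the distance sort:

<?php
// First matching strategy: exact name match in the right country, keeping
// the nearest candidate if it falls within $max_km.
function exact_match(PDO $pdo, $name, $country_code, $lat, $lon, $max_km = 20) {
    $stmt = $pdo->prepare(
        'SELECT geonameid, name, latitude, longitude
           FROM geoname
          WHERE name = :name AND country_code = :cc'
    );
    $stmt->execute(array(':name' => $name, ':cc' => strtoupper($country_code)));

    $best = null;
    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
        // Great-circle (haversine) distance in kilometres.
        $d = 6371 * 2 * asin(sqrt(
            pow(sin(deg2rad($row['latitude'] - $lat) / 2), 2) +
            cos(deg2rad($lat)) * cos(deg2rad($row['latitude'])) *
            pow(sin(deg2rad($row['longitude'] - $lon) / 2), 2)
        ));
        if ($d <= $max_km && ($best === null || $d < $best['distance'])) {
            $best = $row + array('distance' => $d);
        }
    }
    return $best;   // null means fall through to the wildcard/alternatenames passes
}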

If we don’t find an exact text match we look again twice more, once with wildcards around the place name in our source data and once searching the “alternatenames” column in the GeoNames data set instead of the “name” column. This helps us find locations whose names may have changed or have alternate spellings. Results are sorted by distance and again we keep the closest if it falls within our threshold.

If we still have not found a match we take a substring of the first several letters of the name and do a wildcard search for that. (The length of the substring depends on the length of the source name and is generally about 35%.) This helps eliminate problems resulting from inconsistent inclusion of things like ‘City’, ‘Township’ and ‘Borough’ from place names in our source data.

For this project it was important that we found a match for every location and so we added a final “match at all costs” step.  If none of the previous steps have succeeded we use the source location’s latitude and longitude and get a list of all locations in the source country that are within a few tenths of a degree.  We then sort the results and pick the closest.  This was necessary for less than 1% of our data and allowed us to continue providing local search services even when we were unable to match a city exactly. This was a very exploratory, iterative process as we discovered what worked and what did not, and where our existing data was incomplete or inaccurate. From tuning the radius of the search areas, to dropping all the region data outside of the US (we found that it was contributing a lot of false positive matches, and the results were better without it), we revised and refined the logic until it gave us satisfactory results.

After we established a mapping we added a new field to store each seller’s “GeoNameID” in our DB. Where previously we stored all the fields listed above, now we need only the GeoNameID.  We do still store the old data format as well, which has allowed us to make an easy and transparent transition from using the old data to the new data.  With access to the full breadth of data provided by GeoNames we have been able to speed up our Local Search feature.  We now also localize spellings of place names for our international users, so someone searching in German can search near Köln while someone searching in English can look for Cologne.

I will be giving a talk on how we handle search and localized place name auto-suggest at Lucene/Solr Revolution in San Diego in May, and we’ll be publishing more Code as Craft blog posts on geolocation services in the coming months as well.

You can follow John Marc on Twitter @thejohnmarc


Data Corruption To Go: The Perils Of sql_mode = NULL

Posted on March 19, 2013

A little while back, we hit an extremely embarrassing issue: 32-bit signed integer overflow on some primary keys.

In our architecture, an unsigned 64-bit key is generated by a global ticket server, but the sharded tables’ schemas did not use BIGINT UNSIGNED as the column data-type. Inserts into the affected tables started failing en masse with “duplicate key” errors on key 2147483647.

We quickly realized the issue and ALTERed the impacted tables, but the MySQL behavior of silently truncating all values larger than 2147483647, with the only fail-safe being the primary key constraint, was worrying. Any columns that lacked a similar constraint would be experiencing silent data corruption.

Some digging led us to the MySQL variable sql_mode.

What is sql_mode anyway?

By default MySQL is very accepting of bad data and bad queries. At best, it may emit a WARNING, but these are almost always ignored by client libraries such as PHP’s PDO.

sql_mode is a run-time setting for MySQL and enabling it makes the database much more vocal by changing most WARNINGs to ERRORs. This helps prevent silent data corruption on write, and unexpected results on read. Here are a few examples of issues it helped fix within our stack and the settings we used:

PROBLEM: Missing default values, data-type and data-length mismatches

CREATE TABLE `sql_test` (
    `id` TINYINT UNSIGNED NOT NULL PRIMARY KEY,
    `value` VARCHAR(3) NOT NULL,
    `column_not_null` VARCHAR(1) NOT NULL
) Engine = InnoDB;

Consider the following inserts:

  1. INSERT INTO sql_test (id, value, column_not_null) VALUES (10000, 'z', 'y');
  2. INSERT INTO sql_test (id, value, column_not_null) VALUES (1, 'abcde', 'x');
  3. INSERT INTO sql_test (id, value) VALUES (2, 'qwerty');

Here’s what the table would look like when sql_mode is disabled:

+-----+-------+-----------------+
| id  | value | column_not_null |
+-----+-------+-----------------+
|   1 | abc   | x               |
|   2 | qwe   |                 |
| 255 | z     | y               |
+-----+-------+-----------------+

Massively different from the expected values!

When sql_mode='STRICT_ALL_TABLES' is set though, each of those inserts will fail hard so that you can recover and handle the problem in your application.

  1. ERROR 1264 (22003): Out of range value for column 'id' at row 1
  2. ERROR 1406 (22001): Data too long for column 'value' at row 1
  3. ERROR 1364 (HY000): Field 'column_not_null' doesn't have a default value

PROBLEM: Character-set issues

CREATE TABLE `charset_test` (
    `message` VARCHAR(3000) NOT NULL -- NB: this could also be `message` TEXT NOT NULL
) Engine = InnoDB DEFAULT CHARSET=UTF8;

Inserting a string containing a supplementary (4-byte) UTF-8 character would seem to work, but on a subsequent SELECT the returned string would be quite different: it comes back truncated at the first 4-byte character.

This is because MySQL’s UTF8 implementation can only accommodate characters up to 3 bytes in length, but Unicode characters could be up to 4 bytes long. The data stored in the message column is subtly truncated and corrupted.

With the stricter sql_mode='STRICT_ALL_TABLES', inserts containing 4-byte characters will fail hard with: ERROR 1366 (HY000): Incorrect string value

Note that the sql_mode setting here is only an alarm bell; it is not fixing the underlying issue. The real solution is twofold:

  1. Change the character-set of the connection to MySQL to UTF8MB4. If using PHP’s PDO, pass it in via the connection DSN. Or when running queries by hand, make sure to execute SET NAMES utf8mb4 before executing anything.

  2. Modify the column and the table definitions to UTF8MB4:
    ALTER TABLE `charset_test`
        DEFAULT CHARSET utf8mb4,
        MODIFY `message` VARCHAR(3000) CHARACTER SET utf8mb4 NOT NULL;

One caveat to note when using the UTF8MB4 charset: the rule-of-thumb for an indexed VARCHAR column being at most 255 characters is no longer valid. MySQL has a length limit of 767 bytes for the index prefix and it has to accommodate the worst case. Thus when using the 3-byte UTF8 character set, 255 characters fit in, but when using 4-byte UTF8MB4 only 191 characters fit.

If we absolutely need 255 characters indexed (for example if the column holds email addresses), a workaround is to SHA-1 hash the value, and store and index that column instead.
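In practice that looks something like the sketch below (table and column names are made up): keep the full address in an unindexed column, add an indexed CHAR(40) column holding its SHA-1, and always look rows up by the hash.

<?php
// $pdo is a connected PDO handle.
// Schema idea: email VARCHAR(255) NOT NULL, email_sha1 CHAR(40) NOT NULL, KEY (email_sha1)
$email = strtolower(trim('First.Last@Example.com'));

$stmt = $pdo->prepare('SELECT user_id, email FROM users WHERE email_sha1 = :hash');
$stmt->execute(array(':hash' => sha1($email)));
$row = $stmt->fetch(PDO::FETCH_ASSOC);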

PROBLEM: Unsigned int decrements

CREATE TABLE `decrement_test` (
    `id` int(11) NOT NULL,
    `value` int(11) unsigned NOT NULL,
    PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

INSERT INTO decrement_test(id, value)
    VALUES(1, 20) ON DUPLICATE KEY UPDATE
    value = value + VALUES(value);

SELECT * FROM decrement_test;
+----+-------+
| id | value |
+----+-------+
| 1  | 20    |
+----+-------+

INSERT INTO decrement_test(id, value)
    VALUES(1, -1) ON DUPLICATE KEY UPDATE
    value = value + VALUES(value);

If you were expecting the value to now be 19, you would be wrong:

SELECT * FROM decrement_test;
+----+-------+
| id | value |
+----+-------+
| 1  | 20    |
+----+-------+

When sql_mode='STRICT_ALL_TABLES' is enabled, the decrement INSERT statement will fail with: ERROR 1264 (22003): Out of range value for column 'value' at row 1

Note that in order to do atomic decrements as shown above, the column data type would need to be signed-integer rather than unsigned-integer.

PROBLEM: Incorrect group-by

CREATE TABLE `group_by_test` (
    `id` INT NOT NULL PRIMARY KEY,
    `name` VARCHAR(30) NOT NULL,
    `status` INT NOT NULL,
    `region` INT NOT NULL
) Engine = InnoDB DEFAULT CHARSET = UTF8MB4;

INSERT INTO group_by_test(id, name, status, region) VALUES (1, 'etsy', 10, 100);
INSERT INTO group_by_test(id, name, status, region) VALUES (2, 'ebay', 12, 100);
INSERT INTO group_by_test(id, name, status, region) VALUES (3, 'amazon', 13, 150);
INSERT INTO group_by_test(id, name, status, region) VALUES (4, 'fab', 10, 100);

SELECT COUNT(1), status, name FROM group_by_test GROUP BY status;
+----------+--------+--------+
| COUNT(1) | status | name   |
+----------+--------+--------+
|   2      | 10     | etsy   |
|   1      | 12     | ebay   |
|   1      | 13     | amazon |
+----------+--------+--------+

The `name` column isn’t in the GROUP BY clause but is in the SELECT, so MySQL pulls the first matching value and puts it into the result set. The output in the `name` column depends on the relative order of rows in the table and may not be repeatable, which is almost never desirable.

By setting sql_mode='ONLY_FULL_GROUP_BY', we can prevent such unreliable SQL from executing. The query would instead fail with: ERROR 1055 (42000): 'etsy.group_by_test.name' isn't in GROUP BY

Just throw the switch and test in Production?

Unfortunately our application was continually executing some variations of the bad queries shown above, especially the bad GROUP BY and bad DEFAULT/NOT NULL, and we couldn’t just enable the mode in Production without causing a lot of disruption.

We had to find the bad queries and table definitions first. In order to do that, we set up a test database with the Production schema on it and enabled the strict sql_mode='ONLY_FULL_GROUP_BY,STRICT_ALL_TABLES' on it.

We slurped all queries from our Production DBs using a local tcpdump and netcat’ed them over to the test database:

prodb-shard> sudo /usr/sbin/tcpdump -i eth0
  "port 3306 and tcp[1] & 7 == 2 and tcp[3] & 7 == 2"
  -s 65535 -x -n -q -tttt |
  nc testdb.etsy.com 20000

On the test box, we used the combination of Percona Toolkit’s pt-query-digest and a custom Go script to execute these queries (NB: Percona Playback did not exist at the time). Our Production servers execute ~4000 queries/sec (~30000 packets/sec in and out) and while pt-query-digest has a built-in execute mode, it cannot keep up with the volume and the tcpdump capture would drop packets. The Go script helped multiplex this query volume and keep packet loss low (~1%).

testdb> nc -l 20000 | pt-query-digest --type tcpdump
  --no-report --print | grep -v "SET NAMES utf8" |
  go run shard_executor.go -db-user=username
  -db-password=password -db-host=localhost
  -db-charset=utf8mb4 -log "/var/tmp/shard.log"
  -db-name=etsy_shard -threads=8

Note that pt-query-digest leaks a small amount of memory continually and will need to be restarted every so often.

All errors were logged to disk and we now had visibility into exactly what needed fixing:

testdb> cut -c 21- /var/tmp/shard.log | grep "^Error" |
  sort | uniq

Along with fixing bad queries, we also needed to modify a large number of tables to change the character-set for VARCHAR and TEXT columns from UTF8 to UTF8MB4 in order to accommodate 4-byte UTF-8 characters. We generated the ALTERs with this one-off script and Schemanator allowed us to execute all of these without any downtime!

Note that the script does not generate ALTER statements for indexed VARCHAR/TEXT columns longer than 191 characters for reasons explained above. In our case, we worked around it by reducing the column lengths where feasible, and using the hashing method otherwise.

A few weeks of bug squashing and schema changes later, our databases and applications were finally ready to accept sql_mode= 'ONLY_FULL_GROUP_BY,STRICT_ALL_TABLES'.
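If you would rather stage the rollout per connection before (or instead of) changing the global my.cnf setting, PDO can apply it at connect time. A sketch with placeholder credentials, which also sets the utf8mb4 connection charset discussed above:

<?php
$pdo = new PDO(
    'mysql:host=localhost;dbname=etsy_shard;charset=utf8mb4',   // charset in the DSN requires PHP >= 5.3.6
    'username',
    'password',
    array(
        // Run once per connection, before any application queries.
        PDO::MYSQL_ATTR_INIT_COMMAND =>
            "SET SESSION sql_mode = 'ONLY_FULL_GROUP_BY,STRICT_ALL_TABLES'",
        // Surface the new ERRORs as exceptions instead of silent failures.
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    )
);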

CHECK YOUR FLAGS!

We strongly encourage everyone running MySQL to review the various sql_mode flags and choose the ones that make the most sense for their set up.

'ONLY_FULL_GROUP_BY,STRICT_ALL_TABLES' is the absolute minimum we’d recommend.

We don’t use any of MySQL’s DATE/TIME data-types nor any Engine other than InnoDB in our schemas, but if you do, we strongly suggest also using 'NO_ZERO_IN_DATE,NO_ZERO_DATE,NO_ENGINE_SUBSTITUTION'.


You can follow Keyur on Twitter at @keyurdg


Java: Not Even Once

Posted on March 18, 2013

Note: This post was co-written by Avleen Vig (@avleen) and Zane Lackey (@zanelackey).

In mid-January of this year we started an initiative to remove Java browser plugins from all employee systems at Etsy, as we feel this is a best practice to be striving towards. To that end, we wanted to discuss the challenges we encountered when removing Java browser plugins in the hope that it will help other organizations with the removal process.

The first question we needed to answer before removing Java was “Who actually needs Java in their browser?” Most organizations face the unfortunate reality that some number of internal systems use Java applets, and that there are groups which use these systems on a daily or weekly basis. In our case, our operations team needed Java to be able to access a group of internal network appliances, such as IPMI console interfaces. Once we identified these requirements, we disabled Java browser plugins across the rest of the organization and set about trying to engineer a safer way for the team to access these network appliances.

Initially, we looked at three approaches:

  1. Installing the Java plugin in a dedicated browser/browser profile and asking the team to use it to only access the network appliances.
  2. Writing a wrapper script to copy the Java plugin into the plugins directory, launch a dedicated browser, then remove the Java plugin when the browser closes.
  3. Using nssecurity to whitelist the hosts that could instantiate Java in the browser.

However, none of these approaches fulfilled our design goals: the approach should be safe by default, be easy to maintain, and ideally not require the Java browser plugin on the team’s laptops at all.

We realized the only way to approach the situation that met our requirements would be to have Java installed in a dedicated, controlled, and isolated environment.

This model is similar to a bastion host, or “jump box”. We opted to use NoMachine as the remote desktop protocol because of its performance and usability over high-latency, low-bandwidth links. We have operations engineers located in multiple countries and we also occasionally need to use 3G/4G mobile services to diagnose problems, so this was critically important.

The installation method we followed for CentOS 6 and FreeNX was:

  1. Install the required packages. All of these come from the standard CentOS repos:
    yum install -y jre freenx gnome-desktop gnome-session 
        gnome-panel nautilus firefox fre

    (After installation, be sure to replace the provided SSH keys: https://help.ubuntu.com/community/FreeNX#Using_custom_SSH_keys)

  2. Create a symlink from the Java browser plugin, to the plugins directory used by Firefox:
    ln -s /usr/java/latest/lib/amd64/libnpjp2.so
    /usr/lib64/mozilla/plugins/libnpjp2.so
  3. Now create /etc/nxserver/node.conf with these contents:
    ENABLE_PASSDB_AUTHENTICATION="1"
    NX_LOG_LEVEL=7
    NX_LOGFILE=/var/log/nx/nxserver.log
    SESSION_LOG_CLEAN=0
    COMMAND_MD5SUM="md5sum"

FreeNX is now configured!

Each of your users who wants to use the new system needs to do the following steps once to add their user to the NX authentication database:

(NOTE: These steps below follow the Ubuntu FreeNX instructions of using a shared account/SSH key for access to the jump system. In this circumstance the risk was accepted as the network appliances that are the end target also use a shared account, so no additional risk was introduced. Obviously different circumstances will have different security requirements.)

  1. Copy /etc/nxserver/client.id_dsa.key to your local machine and save it as ~/.ssh/nx_dsa
  2. On the jump host, run sudo nxserver --adduser <username>
    This adds your account to the nxserver database. Note: at this step it will also add a key to your .ssh/authorized_keys2 file; if you manage that file in your configuration management system it will get overwritten, so you should add the key there instead.
  3. On the server again, run sudo nxserver --passwd <username>
    This sets your password in the nxserver database
  4. Download the “NoMachine Player v4” from: http://www.nomachine.com/preview/download-package.php
  5. Start the player, click “New connection”.
    1. Name: jumpbox
    2. Host: jump.your.domain
    3. Select: Use the NoMachine login, then click the ... button
    4. Check Use an alternate key and point it to ~/<username>/.ssh/nx_dsa
    5. Press the X twice and then click the jumpbox connection you see.
    6. Enter your login details.
    7. Click Create a new session
    8. Click Create a new GNOME virtual desktop
    9. You should get a gnome desktop, with a firefox icon.

Using the jump system approach, we’re now able to firewall off these hosts from Internet browsing, we can re-image them on a frequent basis, and we have a very small number of hosts that need to receive Java updates. This approach also allows us to configure our endpoint management software to nuke Java from orbit if the browser plugin ever shows up on employee systems. While we generally abhor single points of failure in our infrastructure, we felt comfortable in this case because the use of configuration management meant that the environment could be quickly rebuilt if needed. Finally, this approach could also be used to isolate other inherently risky technologies when they are still required by certain groups.

In closing, we genuinely hope this helps other organizations facing similar challenges on the road to removing Java browser plugins from the enterprise.
