Leveraging Big Data To Create More Secure Web Applications

Posted by on June 4, 2013 / No Responses

Here at Etsy, we take measuring things very seriously. We have previously discussed how we harness data to make decisions about software, operations and products, but we have said little about just how useful data can be for the information security practices of a modern web application.

Our Data Team has written several blog articles about how we build, maintain and take advantage of our data infrastructure. Over the years, our data stack has grown to consist of several technologies, including Hadoop and Cascading (which we’ve written about in the past). At a high level, Cascading leverages Hadoop’s infrastructure while abstracting standard data processing operations, such as splits and joins, away from the underlying mapper and reducer tasks. Cascading allows us to write analytics jobs quickly and easily in familiar languages (we use both the JRuby and Scala DSLs for Cascading). With a mature Hadoop stack and a multitude of helpful data engineers, the Security Team at Etsy has been taking increasing advantage of the data stack over the past few months to strengthen our security posture.

Broadly speaking, there are three main types of security practices in which we utilize big data: reactive security mechanisms, proactive security mechanisms, and incident response security practices.

Reactive Security Mechanisms

Reactive security mechanisms usually consist of real-time event monitoring and alerting. These mechanisms focus on events that trigger immediate responses and rely on regular, repeated querying: they query the same data, and they query it often.

Some examples of reactive security mechanisms at Etsy are automated Splunk searches and anomaly detection based on StatsD/Graphite logs.


We use saved Splunk searches to identify anomalous patterns in access logs and error logs, such as cross-site scripting and increasing failed log-in rates. These searches typically run once a minute in order to give us a real-time monitoring mechanism. We also built an anomaly detection system based on logs of potentially suspicious events such as failed logins. These two mechanisms are similar with respect to the frequency of data aggregated (constantly) and the frequency of analysis (almost constantly).

As you may guess, reactive security mechanisms cannot be easily implemented in Hadoop. Although Hadoop is fast at churning through huge quantities of data (isn’t that the point?), it is optimized for throughput rather than low latency. For reactive security mechanisms, we want to get as close as possible to real-time results, so it wouldn’t be efficient or rational to run this type of analysis on our Hadoop cluster.

Although reactive security mechanisms aren’t run on our cluster, the initial data-gathering step is perfectly suited for Hadoop. Performing predictive analytics and forecasting on past data to figure out where the alert thresholds for certain metrics should lie is a fantastic way to save time that would previously have been spent over several weeks fine-tuning Splunk queries.
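As a purely illustrative sketch of that idea, here is how a batch of historical daily counts (say, failed logins summarized by a Hadoop job) could be turned into an alert threshold. The sample numbers and the mean-plus-three-standard-deviations rule are assumptions, not our actual model.

<?php
// Illustrative only: derive an alerting threshold from historical daily
// failed-login counts produced by a Hadoop job.

$dailyFailedLogins = array(1340, 1512, 1287, 1610, 1455, 1398, 1530);

$n    = count($dailyFailedLogins);
$mean = array_sum($dailyFailedLogins) / $n;

$variance = 0;
foreach ($dailyFailedLogins as $count) {
    $variance += pow($count - $mean, 2) / $n;
}
$stddev = sqrt($variance);

// Alert when the live Splunk search sees a count this far above normal.
$threshold = (int) ceil($mean + 3 * $stddev);

echo "Suggested failed-login alert threshold: {$threshold}\n";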

Proactive Security Mechanisms

Proactive security mechanisms seek to reduce attack surface or eliminate entire vulnerability classes. This category includes mechanisms such as Content Security Policy, output encoding libraries, or full-site SSL. These mechanisms are intended to improve the long-term security posture of the application, rather than collect data about an ongoing attack.

Similarly to reactive security mechanisms, we can use predictive analytics and forecasting in Hadoop to weigh the value of our proactive security mechanisms. For example, when determining whether our load balancers could handle activating full-site SSL for all Etsy sellers, we ran Hadoop jobs that analyzed our sellers’ past traffic to figure out how many of their requests were not made over HTTPS. Armed with this data, as well as metrics from our load balancers, we were able to push out full-site SSL for sellers without encountering unexpected capacity issues.

require 'helpers/analytics'

analytics_cascade do
  analytics_flow do
    analytics_source 'event_logs'
    tap_db_snapshot 'users_index'

    assembly 'event_logs' do
      group_by 'user_id', 'scheme' do
        count 'value'
      end
    end

    assembly 'users_index' do
      project 'user_id', 'is_seller'
    end

    assembly 'ssl_traffic' do
      project 'user_id', 'is_seller', 'scheme', 'value'
      group_by 'is_seller', 'scheme' do
        count 'value'
      end
    end

    analytics_sink 'ssl_traffic'
  end
end

The Cascading.jruby source code for this task is relatively straightforward.


The output of this Hadoop job allowed us to visualize Seller HTTP traffic vs Seller HTTPS traffic.

However, unlike reactive security mechanisms, proactive security mechanisms can be created directly with Hadoop. The key difference is that a Hadoop job cannot monitor critical metrics that require an immediate response, and fortunately the results behind proactive security mechanisms typically don’t require immediate attention. In the SSL example, before we were able to push out full-site SSL for sellers, we had a daily Hadoop job that would break down request patterns to show us which URLs/patterns were requested most often over HTTP. We used this data to iteratively change the scheme of these high-value URLs from HTTP to HTTPS, with data to confirm that each change wouldn’t exceed our limit on load-balancer-terminated SSL connections.

Incident Response

Web application incident response is something that is done often in practice, as it is not limited to responding to a full-blown compromise. We often need to investigate a threat that has recently targeted organizations similar to our own, or investigate a new exploit that may affect our technology stack. Typically we are looking for identifying information, such as URL patterns or IP addresses that have repeatedly accessed our application. Since similar incident response actions are performed frequently, our incident response practices need to be repeatable, and since an incident can occur at any time, we must be able to get the results of our analysis quickly. Even though we want our analysis to be generalizable, it is dependent on the particular threat we are responding to, and thus the parameters often need to be changed.

Given all of these conditions, incident response is a perfect example of when to use big data. Incident response is ad-hoc analysis of a large dataset, driven by an event or incident. Each particular analysis is usually run only once, and it needs to be fast. This is a textbook use case for Hadoop and we take advantage of it constantly for this purpose. Writing template Hadoop jobs that scan our access logs for visits from target IP addresses or to known malicious URL patterns, and that can easily be plugged with new incident details, has proved invaluable to our incident response practices.
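To make the shape of such a template concrete, here is a minimal, standalone sketch of the filtering step. In practice this logic runs as a Cascading job over access logs in HDFS; the file name and the indicator lists below are hypothetical.

<?php
// Pull access-log lines that match pluggable incident indicators.
$suspectIps      = array('203.0.113.7', '198.51.100.23');
$suspectPatterns = array('#/admin/backup\.php#', '#\.\./\.\./etc/passwd#');

$log = fopen('access.log', 'r');
while (($line = fgets($log)) !== false) {
    // In combined log format the client IP is the first field.
    $fields = explode(' ', $line, 2);
    $ip     = $fields[0];

    $hit = in_array($ip, $suspectIps, true);
    foreach ($suspectPatterns as $pattern) {
        if (preg_match($pattern, $line)) {
            $hit = true;
        }
    }

    if ($hit) {
        echo $line; // matching lines get pulled out for closer review
    }
}
fclose($log);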

Conclusion

The security posture of an application is directly proportional to the amount of information that is known about the application. Big data can be a great source of this kind of information: we use it to gather data for reactive security mechanisms, gather data for proactive security mechanisms, directly create new proactive security mechanisms, and perform incident response.

Although the advantages of analytics from a data science perspective are well known and well documented, the advantages of analytics from a security perspective have not been explored in depth. We have found big data to be extraordinarily useful in creating both reactive and proactive security mechanisms, as well as in aiding incident response. We hope that this will help other organizations use their data and analytics capabilities more effectively.

Want to know more about how Etsy uses Hadoop to create a more secure web application? The authors of this blog post (Mike Arpaia & Kyle Barry) will be presenting a more in-depth discussion of this topic at the upcoming Black Hat USA conference and Nordic Security Conference. If this has been of interest we hope you’ll get a chance to check out our presentations.

You can follow Mike on Twitter at @mikearpaia and you can follow Kyle on Twitter at @allofmywats.


Culture Hacking With A Staff Database

Posted by on May 31, 2013 / 6 Comments

When your startup is just you and the cat, your company communication problems are largely whether or not the cat listens when you tell it to get off your keyboard. Scale to a handful of people and you still know everyone, what they’re working on, where they are, and how to get hold of them when you need them. As you scale further, communication becomes increasingly complex. And then Mother Nature throws a hurricane at you and you really need to know that everyone is safe and well, and do it quickly and efficiently.

We’ve tried to solve “who” problems with an open, accessible staff database.

Free your staff data

Most companies will have some form of HR system with all sorts of useful information in it, but much of that is private and confidential, and locked away (and rightly so). What’s needed is something more basic: Who is this person? What do they look like? Where are they? What do they do? How do I get in touch with them?

But, because we took our HR data and re-published the “safe” bits to a database, the staff directory data is completely at our mercy. We’re not subject to what the designers of our HR information system think people should look at. And if what we need changes, we can adapt to suit.

Even better, because the data is in the same ecosystem as our regular web site data, anyone who knows how to build something for etsy.com can also build something with our staff data with no particular extra knowledge. Making it easy to hack (unsurprisingly) makes for more hacks, and more hacks allows for useful and unexpected tools to crop up to help you scale better, without any particular central planning.

Who are you?

The obvious thing to build on top of this database is a “staff directory”. We have one, it’s great. It lives with our other internal web-based support tools, easily available to all. For an organisation merrily blazing past Dunbar’s number, it’s critical in helping us maintain a sort of “outboard brain” for who’s who in the company. (This is an ongoing problem, since we’re still hiring.)

Extending that, during a past Hack Week, a group of us built a game where you were given a photo of someone, and multiple choices for who they were. Add to that score tracking and a leader board and we had people enthusiastically learning who everyone in the company was. (It was too easy for some people, so we quickly developed a mode where you had to actually type in the right name – the game equivalent of meeting them in a hallway.)


Smooth small bumps as well as large ones

Hacks don’t have to be complex to be useful. Adding a “You’ve Got Mail” button to the directory simplified the process of sorting the incoming physical mail to the office and meant that people didn’t have to go and check the mail room regularly. Being able to smooth out even minor niggles like this contributes to the feeling of an office that just works.

They don’t even have to be permanent features to be useful – one-off queries can be valuable too. At one point, I needed to work out which of my colleagues sold vintage items on Etsy in order to do some research on a new feature. I could have emailed everyone and hoped to get replies, but instead I extracted the list of staff shops from the directory, pulled their listings from the listings table, and worked out exactly who sold vintage.

Are you OK?

A rather more acute and impromptu use for the data came during Hurricane Sandy, when we wanted to be able to do a roll call to make sure our colleagues were safe and sound. It was trivial to take the data, export name, location, phone numbers, and team to CSV and import it into Google Docs, and then we had a shared document where we could track how everyone was.

At the same time, we wanted everyone to be able to get everyone’s contact details into their phones easily in case we needed to get hold of people quickly. Some quick checking of Google and Wikipedia produced the vCard spec and, while not cutting edge, the 3.0 version is very amenable to quick hacking. (More recent versions are fine too, of course, but 3.0 hit the spot for the amount of time and energy I had available for hacking.)

There are many, many fields in the full spec, but it’s the work of moments to identify the main ones we care about: name, email, phone numbers, and organization. We also happened to have a field for Skype nicks, so I went ahead and added that. (Staff members have voluntarily entered their phone numbers in our staff database, and the vCards are only accessible from the staff directory, so we felt safe that these vCards would be acceptable. You should always consider your company’s data sources and culture, as well as information security issues, when hacking on company data.)

In addition, we have photos of many of our staff members on the about us page, so I did some quick work to include a photo where we had one. (The vCard spec allows you to embed photos as base64-encoded strings.) Nothing complicated, and only a few minutes to implement, but it adds a nice polish to the cards.
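For a sense of how little code this takes, here is a hypothetical sketch of building one card. The field names in $person are made up, and a real version would also handle escaping and line folding, which this one skips.

<?php
// Build a vCard 3.0 entry from one staff-directory row (illustrative fields).
function build_vcard(array $person) {
    $lines   = array();
    $lines[] = 'BEGIN:VCARD';
    $lines[] = 'VERSION:3.0';
    $lines[] = 'N:' . $person['last_name'] . ';' . $person['first_name'];
    $lines[] = 'FN:' . $person['first_name'] . ' ' . $person['last_name'];
    $lines[] = 'ORG:Etsy;' . $person['team'];
    $lines[] = 'EMAIL;TYPE=INTERNET:' . $person['email'];
    if (!empty($person['mobile'])) {
        $lines[] = 'TEL;TYPE=CELL:' . $person['mobile'];
    }
    if (!empty($person['skype'])) {
        // X- properties are the standard escape hatch for extra fields.
        $lines[] = 'X-SKYPE:' . $person['skype'];
    }
    if (!empty($person['photo_jpeg'])) {
        // vCard 3.0 lets you embed the photo as a base64-encoded string.
        $lines[] = 'PHOTO;ENCODING=b;TYPE=JPEG:' . base64_encode($person['photo_jpeg']);
    }
    $lines[] = 'END:VCARD';

    return implode("\r\n", $lines) . "\r\n";
}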

We can generate sets of vCards for departments and teams, the whole organisation, just new hires, etc., to make managing your contacts easier.

Call me, maybe

The last hurricane-inspired hack we did was an emergency broadcast system so that we could quickly update staff members by SMS about rapidly-changing events or urgent alerts (such as unplanned office closures). We already have a Twilio account, along with PHP libraries to use it, so all that was required was validating the phone numbers to make sure they were valid US numbers, a little bit of code to filter the directory by location (we only want to SMS Brooklynites if the Brooklyn office is closed, for example), and a suitable request to the Twilio REST API to do the actual sending. The implementation also included plenty of error logging and reporting and the like, because it’s important to know who you didn’t alert too.
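A stripped-down sketch of the sending step, hitting the Twilio REST API directly with cURL rather than going through the PHP library we actually use. The credentials, numbers, recipient filter, and message below are placeholders.

<?php
// Broadcast one message to a filtered list of recipients via Twilio's REST API.
$accountSid = 'ACxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx';
$authToken  = 'your_auth_token';
$fromNumber = '+15550001234';

$recipients = array('+15550005678', '+15550009012'); // e.g. staff filtered by office
$body       = 'The Brooklyn office is closed today. Check email for details.';

$url = "https://api.twilio.com/2010-04-01/Accounts/{$accountSid}/Messages.json";

foreach ($recipients as $to) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
        'From' => $fromNumber,
        'To'   => $to,
        'Body' => $body,
    )));
    curl_setopt($ch, CURLOPT_USERPWD, "{$accountSid}:{$authToken}");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    $response = curl_exec($ch);
    $status   = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    // Knowing who you didn't reach matters as much as who you did.
    if ($response === false || $status >= 400) {
        error_log("SMS to {$to} failed (HTTP {$status}): " . curl_error($ch));
    }
    curl_close($ch);
}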


Future work may include extending the system to accept replies, so that we can get the bulk of roll calls done quickly and be able to concentrate on finding and helping those who need it.

When HR systems go to the dogs

One of the notable features of our staff directory’s database schema is the “human” column. The office dog pack is a sufficiently important and intrinsic part of the office culture that it was considered necessary to include them in the database. The past Hack Week saw an expansion of this with a full Doggie DB hack that lets you get to know the pack better. And it might seem trivial on the surface, but that’s because you haven’t had to use the “You’ve Got Poop” button….


All these hacks were (a) easy, (b) unforeseen when we started the staff database project, and (c) helped us in growing the business and keeping our colleagues safe and informed. Grab your company’s data and see what you can do.

You can follow Ian on Twitter at @indec.


At Percona MySQL Conference This Week

Posted by on April 23, 2013 / 2 Comments

A few of us are at the Percona MySQL Conference this week in Santa Clara. Please come say hello, and if you have a chance today stop by the talk about development at scale at 1:20PM today Tuesday the 23rd in Ballroom B.


March 2013 Site Performance Report

Posted by on April 9, 2013 / 9 Comments

Four more months have gone by since our last update, and it’s time for another performance report!  We made it through the holiday season with flying colors, and our engineers have been hard at work launching new features since the new year started.  Let’s see what impact this had on performance (spoiler: we keep getting faster).

Server Side Performance:

Here are the median and 95th percentile load times for signed in users on our core pages on Thursday, 3/14/13:

Server Side Performance Chart

As you can see we had small decreases in load time across the board (the one exception being a tiny uptick in median search performance).  We expected this, since we made a couple of small changes to our application code in the last couple of months that had a positive impact across every page.  The larger decrease in the 95th percentile load time for the homepage was primarily due to the removal of the Taste Test module.

As a quick reminder, the “Baseline” page is an extremely simple page that just includes our header and footer, and uses our normal controller architecture.  Improvements in the load time of this page mean improvements across every page on Etsy.

We also thought it would be fun to look back and compare our performance today to our performance from the very first performance report, back in August of 2011.  Since we don’t look at averages anymore, we can only compare the 95th percentile:

Performance comparison 1.5 years

Looking good!  We have made huge strides on the homepage, shop page, and profile page, and a small improvement on the search page.  The listing page hasn’t seen much optimization given its position as the fastest of these pages, so we’re happy with its modest gains.  It’s extremely gratifying to see how much progress we have made over the last ~18 months, and we hope to continue this trend in the months to come.

Front-end Performance:

We changed our strategy for monitoring front-end performance in our last update, and promised more stable metrics going forward.  That proved to be the case:

Frontend Performance - March 2013

Performance looks good here as well – reductions in load time for almost every page type again.  The one outlier here is our baseline page, which saw a significant increase in document complete time since our last update.  We’ve started digging into this, and it looks like a couple of extra HTTP requests have snuck into this page – a JS file and a sprite or two.  We’ll be consolidating these requests in the coming days and we expect to see the document complete time on this page drop back down.

As far as the rest of the improvements go, we believe that they are due to a combination of changes that we’ve deliberately made to improve performance, and a better hit rate at our CDNs.  We are currently load balancing static content across multiple CDNs, and we’ve worked to improve cacheability so that fewer requests have to hit our origin.

Conclusion:

Overall we are happy with the way performance has been trending, especially because we’ve been focusing our efforts over the last 4 months on projects that don’t involve these core pages.  For the next couple of quarters our plan is to establish clear performance SLAs and bring all of our major pages into compliance with them.  Beyond that we are always looking for big wins and evaluating new standards like SPDY, WebP, and pre-fetching to see if they make sense for Etsy.  It’s an exciting time to be working on web performance!

Does this sound like something you would like to work on?  Our team is hiring!

You can follow Jonathan on Twitter at @jonathanklein


Re-Exploring New Technologies: The Turbo Encabulator

Posted by on April 1, 2013 / 4 Comments

One of the things that we like to do here at Etsy is to push the envelope.

Pushing the envelope means reaching deep into unknown technologies and approaches, and to experiment in areas that we are not familiar with. It’s in this vein that we’d like to share some early and unfiltered work: The Turbo Encabulator.

Each member of the TE team has stretched their cognitive abilities into new realms, and we’re pretty happy with the results thus far. We know that in many ways, engineering is about solidifying a solution well before the problem statement can be fully described within a socio-technical environment.

Of course, we also acknowledge that the notion of socio-emotional technocracy needs to fit within this paradigm. This brings us to some questions:

Understandably, these challenges are unilateral in their nature and we’ve made a large investment in creating an erinaceous atmosphere in order to produce results.

One of these results is the Turbo Encabulator. As you can imagine, sinusoidal repleneration threw us for a loop. :)
An internal tech talk can be seen below:


There and Back Again: Migrating Geolocation Data to GeoNames

Posted by on March 26, 2013 / No Responses

People are passionate about where they live. At Etsy we need to keep track of lots of different locations, such as buyers’ billing and shipping addresses and sellers’ shop locations. As Etsy continues to expand internationally we wanted to provide better localization and translations for our location place names. We determined that the best way to effect this change was to move from using a closed location API provider to internal services backed by the open GeoNames data set.

Before we could start using GeoNames as our data source we had to map all our existing user locations onto the GeoNames data. There is no established method for creating this mapping so we looked at the data we had and figured out a method. For our existing user locations we had city, region, country, country code, latitude and longitude in a format like this:

tokyo             tokyo prefecture   japan            jp   35.670 139.740
luxembourg city   luxemburg          luxembourg       lu   49.609 6.129
newport           england            united kingdom   gb   50.700 -1.295
earth             texas              united states    us   34.277 -102.508

To execute our mapping we created a PHP script that was able to leverage the much broader amount of data GeoNames provides us and quickly determine if old and new locations matched up. This script is now available on Etsy’s github page. The script relies on a MySQL DB backend that contains the standard data tables distributed by GeoNames with indexes on place names, country codes, and the latitude and longitude columns.

For each row in our source data we attempt a few different strategies to try to find a match to the GeoNames data.

Our first step is to see if there are any exact text matches to the place name we have in the GeoNames data set. If there are one or more exact matches, we sort them by distance from the source latitude and longitude, and if the nearest place is less than 20km away we call it a match. 20km is a rather large radius and we could easily have used a much lower limit, but we found that, since we sorted by distance already, a large radius gave us more positive matches when our data sets disagreed on where the center of large cities should be.
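A rough sketch of that first step follows. It assumes a PDO handle, a fairly standard GeoNames import (table and column names may differ from yours), and a haversine helper; the function names are illustrative, and the real script adds the further steps described below.

<?php
// Step one: exact name matches, nearest match wins if it is within 20km.
function haversine_km($lat1, $lon1, $lat2, $lon2) {
    $r    = 6371; // mean Earth radius in km
    $dLat = deg2rad($lat2 - $lat1);
    $dLon = deg2rad($lon2 - $lon1);
    $a    = pow(sin($dLat / 2), 2)
          + cos(deg2rad($lat1)) * cos(deg2rad($lat2)) * pow(sin($dLon / 2), 2);
    return 2 * $r * asin(sqrt($a));
}

function match_exact(PDO $db, $city, $countryCode, $lat, $lon, $maxKm = 20) {
    $stmt = $db->prepare(
        'SELECT geonameid, name, latitude, longitude
           FROM geoname
          WHERE name = :name AND country_code = :cc'
    );
    $stmt->execute(array(':name' => $city, ':cc' => $countryCode));

    $best = null;
    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
        $km = haversine_km($lat, $lon, $row['latitude'], $row['longitude']);
        if ($km <= $maxKm && ($best === null || $km < $best['km'])) {
            $best = array('geonameid' => $row['geonameid'], 'km' => $km);
        }
    }

    // null means: fall through to the wildcard / alternatenames steps
    return $best;
}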

If we don’t find an exact text match we look again twice more, once with wildcards around the place name in our source data and once searching the “alternatenames” column in the GeoNames data set instead of the “name” column. This helps us find locations whose names may have changed or have alternate spellings. Results are sorted by distance and again we keep the closest if it falls within our threshold.

If we still have not found a match we take a substring of the first several letters of the name and do a wildcard search for that. (The length of the substring depends on the length of the source name and is generally about 35%.) This helps eliminate problems resulting from inconsistent inclusion of things like ‘City’, ‘Township’ and ‘Borough’ from place names in our source data.

For this project it was important that we found a match for every location and so we added a final “match at all costs” step.  If none of the previous steps have succeeded we use the source location’s latitude and longitude and get a list of all locations in the source country that are within a few tenths of a degree.  We then sort the results and pick the closest.  This was necessary for less than 1% of our data and allowed us to continue providing local search services even when we were unable to match a city exactly. This was a very exploratory, iterative process as we discovered what worked and what did not, and where our existing data was incomplete or inaccurate. From tuning the radius of the search areas, to dropping all the region data outside of the US (we found that it was contributing a lot of false positive matches, and the results were better without it), we revised and refined the logic until it gave us satisfactory results.

After we established a mapping we added a new field to store each seller’s “GeoNameID” in our DB. Where previously we stored all the fields listed above, now we need only the GeoNameID.  We do still store the old data format as well, which has allowed us to make an easy and transparent transition from using the old data to the new data.  With access to the full breadth of data provided by GeoNames we have been able to speed up our Local Search feature.  We now also localize spellings of place names for our international users, so someone searching in German can search near Köln while someone searching in English can look for Cologne.

I will be giving a talk on how we handle search and localized place name auto-suggest at Lucene/Solr revolution in San Diego in May and we’ll be publishing more Code as Craft blog posts on geolocation services in the coming months here as well.

You can follow John Marc on Twitter @thejohnmarc


Data Corruption To Go: The Perils Of sql_mode = NULL

Posted by on March 19, 2013 / 9 Comments

A little while back, we hit an extremely embarrassing issue: 32-bit signed integer overflow on some primary keys.

In our architecture, an unsigned 64-bit key is generated by a global ticket server, but the sharded tables’ schemas did not use BIGINT UNSIGNED as the column data-type. Inserts into the affected tables started failing en masse with “duplicate key” errors on key 2147483647.

We quickly realized the issue and ALTERed the impacted tables, but the MySQL behavior of silently truncating all values larger than 2147483647 with the only fail-safe being the primary key constraint was worrying. Any columns that lacked a similar constraint would be experiencing silent data corruption.

Some digging led us to the MySQL variable sql_mode.

What is sql_mode anyway?

By default MySQL is very accepting of bad data and bad queries. At best, it may emit a WARNING, but these are almost always ignored by client libraries such as PHP’s PDO.

sql_mode is a run-time setting for MySQL and enabling it makes the database much more vocal by changing most WARNINGs to ERRORs. This helps prevent silent data corruption on write, and unexpected results on read. Here are a few examples of issues it helped fix within our stack and the settings we used:

PROBLEM: Missing default values, data-type and data-length mismatches

CREATE TABLE `sql_test` (
    `id` TINYINT UNSIGNED NOT NULL PRIMARY KEY,
    `value` VARCHAR(3) NOT NULL,
    `column_not_null` VARCHAR(1) NOT NULL
) Engine = InnoDB;

Consider the following inserts:

  1. INSERT INTO sql_test (id, value, column_not_null) VALUES (10000, 'z', 'y');
  2. INSERT INTO sql_test (id, value, column_not_null) VALUES (1, 'abcde', 'x');
  3. INSERT INTO sql_test (id, value) VALUES (2, 'qwerty');

Here’s what the table would look like when sql_mode is disabled:

+-----+-------+-----------------+
| id  | value | column_not_null |
+-----+-------+-----------------+
|   1 | abc   | x               |
|   2 | qwe   |                 |
| 255 | z     | y               |
+-----+-------+-----------------+

Massively different from the expected values!

When sql_mode='STRICT_ALL_TABLES' is set though, each of those inserts will fail hard so that you can recover and handle the problem in your application.

  1. ERROR 1264 (22003): Out of range value for column 'id' at row 1
  2. ERROR 1406 (22001): Data too long for column 'value' at row 1
  3. ERROR 1364 (HY000): Field 'column_not_null' doesn't have a default value

PROBLEM: Character-set issues

CREATE TABLE `charset_test` (
    `message` VARCHAR(3000) NOT NULL -- NB: this could also be `message` TEXT NOT NULL
) Engine = InnoDB DEFAULT CHARSET=UTF8;

Inserting a string containing a supplementary UTF-8 character (one that needs 4 bytes to encode, such as an emoji) would seem to work, but on a subsequent SELECT the returned string would come back quite different: everything from the first 4-byte character onward is missing.

This is because MySQL’s UTF8 implementation can only accommodate characters up to 3 bytes in length, but Unicode characters could be up to 4 bytes long. The data stored in the message column is subtly truncated and corrupted.

With the stricter sql_mode='STRICT_ALL_TABLES', inserts containing 4-byte characters will fail hard with: ERROR 1366 (HY000): Incorrect string value

Note that the sql_mode setting here is only an alarm bell; it does not fix the underlying issue. The real solution is twofold:

  1. Change the character-set of the connection to MySQL to UTF8MB4. If using PHP’s PDO, pass it in via the connection DSN (a minimal sketch follows this list). Or when running queries by hand, make sure to execute SET NAMES utf8mb4 before executing anything.

  2. Modify the column and the table definitions to UTF8MB4:
    ALTER TABLE `charset_test`
        DEFAULT CHARSET utf8mb4,
        MODIFY `message` VARCHAR(3000) CHARACTER SET utf8mb4 NOT NULL;
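For PDO, the DSN change from step 1 is one line. Host, credentials, and database name below are placeholders, and the charset DSN parameter requires PHP 5.3.6 or later.

<?php
// Connect with the session character set already at utf8mb4.
$db = new PDO(
    'mysql:host=dbhost.example.com;dbname=etsy_shard;charset=utf8mb4',
    'etsy',
    'password'
);

// Equivalent when running queries by hand: issue this before anything else.
$db->exec("SET NAMES utf8mb4");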

One caveat to note when using the UTF8MB4 charset: the rule-of-thumb for an indexed VARCHAR column being at most 255 characters is no longer valid. MySQL has a length limit of 767 bytes for the index prefix and it has to accommodate the worst case. Thus when using the 3-byte UTF8 character set, 255 characters fit in, but when using 4-byte UTF8MB4 only 191 characters fit.

If we absolutely need 255 characters indexed (for example if the column holds email addresses), a workaround is to SHA-1 hash the value, and store and index that column instead.

PROBLEM: Unsigned int decrements

CREATE TABLE `decrement_test` (
    `id` int(11) NOT NULL,
    `value` int(11) unsigned NOT NULL,
    PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

INSERT INTO decrement_test(id, value)
    VALUES(1, 20) ON DUPLICATE KEY UPDATE
    value = value + VALUES(value);

SELECT * FROM decrement_test;
+----+-------+
| id | value |
+----+-------+
| 1  | 20    |
+----+-------+

INSERT INTO decrement_test(id, value)
    VALUES(1, -1) ON DUPLICATE KEY UPDATE
    value = value + VALUES(value);

If you were expecting the value to now be 19, you would be wrong:

SELECT * FROM decrement_test;
+----+-------+
| id | value |
+----+-------+
| 1  | 20    |
+----+-------+

When sql_mode='STRICT_ALL_TABLES' is enabled, the decrement INSERT statement will fail with: ERROR 1264 (22003): Out of range value for column 'value' at row 1

Note that in order to do atomic decrements as shown above, the column data type would need to be signed-integer rather than unsigned-integer.

PROBLEM: Incorrect group-by

CREATE TABLE `group_by_test` (
    `id` INT NOT NULL PRIMARY KEY,
    `name` VARCHAR(30) NOT NULL,
    `status` INT NOT NULL,
    `region` INT NOT NULL
) Engine = InnoDB DEFAULT CHARSET = UTF8MB4;

INSERT INTO group_by_test(id, name, status, region) VALUES (1, 'etsy', 10, 100);
INSERT INTO group_by_test(id, name, status, region) VALUES (2, 'ebay', 12, 100);
INSERT INTO group_by_test(id, name, status, region) VALUES (3, 'amazon', 13, 150);
INSERT INTO group_by_test(id, name, status, region) VALUES (4, 'fab', 10, 100);

SELECT COUNT(1), status, name FROM group_by_test GROUP BY status;
+----------+--------+--------+
| COUNT(1) | status | name   |
+----------+--------+--------+
|   2      | 10     | etsy   |
|   1      | 12     | ebay   |
|   1      | 13     | amazon |
+----------+--------+--------+

The `name` column isn’t in the GROUP BY clause, but it is in the SELECT, so MySQL pulls the first matching value and puts it into the result set. The output in the `name` column depends on the relative order of rows in the table, may not be repeatable, and is almost never desirable.

By setting sql_mode='ONLY_FULL_GROUP_BY', we can prevent such unreliable SQL from executing. The query would instead fail with: ERROR 1055 (42000): 'etsy.group_by_test.name' isn't in GROUP BY

Just throw the switch and test in Production?

Unfortunately our application was continually executing some variations of the bad queries shown above, especially the bad GROUP BY and bad DEFAULT/NOT NULL, and we couldn’t just enable the mode in Production without causing a lot of disruption.

We had to find the bad queries and table definitions first. In order to do that, we set up a test database with the Production schema on it and enabled the strict sql_mode='ONLY_FULL_GROUP_BY,STRICT_ALL_TABLES' on it.

We slurped all queries from our Production DBs using a local tcpdump and netcat’ed them over to the test database:

prodb-shard> sudo /usr/sbin/tcpdump -i eth0
  "port 3306 and tcp[1] & 7 == 2 and tcp[3] & 7 == 2"
  -s 65535 -x -n -q -tttt |
  nc testdb.etsy.com 20000

On the test box, we used the combination of Percona Toolkit’s pt-query-digest and a custom Go script to execute these queries (NB: Percona Playback did not exist at the time). Our Production servers execute ~4000 queries/sec (~30000 packets/sec in and out) and while pt-query-digest has a built-in execute mode, it cannot keep up with the volume and the tcpdump capture would drop packets. The Go script helped multiplex this query volume and keep packet loss low (~1%).

testdb> nc -l 20000 | pt-query-digest --type tcpdump
  --no-report --print | grep -v "SET NAMES utf8" |
  go run shard_executor.go -db-user=username
  -db-password=password -db-host=localhost
  -db-charset=utf8mb4 -log "/var/tmp/shard.log"
  -db-name=etsy_shard -threads=8

Note that pt-query-digest leaks a small amount of memory continually and will need to be restarted every so often.

All errors were logged to disk and we now had visibility into exactly what needed fixing:

testdb> cut -c 21- /var/tmp/shard.log | grep "^Error" |
  sort | uniq

Along with fixing bad queries, we also needed to modify a large number of tables to change the character-set for VARCHAR and TEXT columns from UTF8 to UTF8MB4 in order to accommodate 4-byte UTF-8 characters. We generated the ALTERs with this one-off script and Schemanator allowed us to execute all of these without any downtime!

Note that the script does not generate ALTER statements for indexed VARCHAR/TEXT columns longer than 191 characters for reasons explained above. In our case, we worked around it by reducing the column lengths where feasible, and using the hashing method otherwise.

A few weeks of bug squashing and schema changes later, our databases and applications were finally ready to accept sql_mode= 'ONLY_FULL_GROUP_BY,STRICT_ALL_TABLES'.

CHECK YOUR FLAGS!

We strongly encourage everyone running MySQL to review the various sql_mode flags and choose the ones that make the most sense for their set up.

'ONLY_FULL_GROUP_BY,STRICT_ALL_TABLES' is the absolute minimum we’d recommend.

We don’t use any of MySQL’s DATE/TIME data-types nor any Engine other than InnoDB in our schemas, but if you do, we strongly suggest also using 'NO_ZERO_IN_DATE,NO_ZERO_DATE,NO_ENGINE_SUBSTITUTION'
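One low-friction way to roll this out, sketched below under the assumption that you manage MySQL connections through PHP’s PDO (you can just as well set sql_mode globally in my.cnf), is to pin the mode at connect time. Host and credentials are placeholders.

<?php
// Set sql_mode on every new connection.
$db = new PDO(
    'mysql:host=dbhost.example.com;dbname=etsy_shard;charset=utf8mb4',
    'etsy',
    'password',
    array(
        PDO::MYSQL_ATTR_INIT_COMMAND =>
            "SET sql_mode = 'ONLY_FULL_GROUP_BY,STRICT_ALL_TABLES'",
    )
);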

References and resources

You can follow Keyur on Twitter at @keyurdg


Java: Not Even Once

Posted by on March 18, 2013 / 10 Comments

Note: This post was co-written by Avleen Vig (@avleen) and Zane Lackey (@zanelackey).

In mid-January of this year we started an initiative to remove Java browser plugins from all employee systems at Etsy, as we feel this is a best practice to be striving towards. To that end, we wanted to discuss the challenges we encountered when removing Java browser plugins in the hope that it will help other organizations with the removal process.

The first question we needed to answer before removing Java was “Who actually needs Java in their browser?” Most organizations face the unfortunate reality that some number of internal systems use Java applets, and that there are groups which use these systems on a daily or weekly basis. In our case, our operations team needed Java to be able to access a group of internal network appliances, such as IPMI console interfaces. Once we identified these requirements, we disabled Java browser plugins across the rest of the organization and set about trying to engineer a safer way for the team to access these network appliances.

Initially, we looked at three approaches:

  1. Installing the Java plugin in a dedicated browser/browser profile and asking the team to use it to only access the network appliances.
  2. Writing a wrapper script to copy the Java plugin into the plugins directory, launch a dedicated browser, then remove the Java plugin when the browser closes.
  3. Using nssecurity to whitelist the hosts that could instantiate Java in the browser.

However, none of these approaches fulfilled our design goals: the solution had to be safe by default, easy to maintain, and ideally shouldn’t require the Java browser plugin on the team’s laptops at all.

We realized the only way to approach the situation that met our requirements would be to have Java installed in a dedicated, controlled, and isolated environment.

This model is similar to a bastion host, or “jump box”. We opted to use NoMachine as the remote desktop protocol because of its better performance and usability over high-latency links. We have operations engineers located in multiple countries and we also occasionally need to use 3G/4G mobile services to diagnose problems, so this was critically important.

The installation method we followed for CentOS 6 and FreeNX was:

  1. Install the required packages. All of these come from the standard CentOS repos:
    yum install -y jre freenx gnome-desktop gnome-session 
        gnome-panel nautilus firefox fre

    (After installation, be sure to replace the provided SSH keys: https://help.ubuntu.com/community/FreeNX#Using_custom_SSH_keys)

  2. Create a symlink from the Java browser plugin, to the plugins directory used by Firefox:
    ln -s /usr/java/latest/lib/amd64/libnpjp2.so
    /usr/lib64/mozilla/plugins/libnpjp2.so
  3. Now create
    /etc/nxserver/node.conf

    with these contents:
    ENABLE_PASSDB_AUTHENTICATION="1"
    NX_LOG_LEVEL=7
    NX_LOGFILE=/var/log/nx/nxserver.log
    SESSION_LOG_CLEAN=0
    COMMAND_MD5SUM="md5sum"

FreeNX is now configured!

Each of your users who wants to use the new system needs to do the following steps once to add their user to the NX authentication database:

(NOTE: These steps below follow the Ubuntu FreeNX instructions of using a shared account/SSH key for access to the jump system. In this circumstance the risk was accepted as the network appliances that are the end target also use a shared account, so no additional risk was introduced. Obviously different circumstances will have different security requirements.)

  1. Copy /etc/nxserver/client.id_dsa.key to your local machine and save it as ~/.ssh/nx_dsa
  2. On the jump host, run sudo nxserver --adduser <username>
    This adds your account to the nxserver database.  Note: at this step it will also add a key to your .ssh/authorized_keys2 file, if you manage this in your configuration management system it will get overwritten, so you should add the key there.
  3. On the server again, run sudo nxserver --passwd <username>
    This sets your password in the nxserver database
  4. Download the “NoMachine Player v4” from: http://www.nomachine.com/preview/download-package.php
  5. Start the player, click “New connection”.
    1. Name: jumpbox
    2. Host: jump.your.domain
    3. Select: Use the NoMachine login, then click the ... button
    4. Check Use an alternate key and point it to ~/<username>/.ssh/nx_dsa
    5. Press the X twice and then click the jumpbox connection you see.
    6. Enter your login details.
    7. Click Create a new session
    8. Click Create a new GNOME virtual desktop
    9. You should get a gnome desktop, with a firefox icon.

Using the jump system approach, we’re now able to firewall off these hosts from Internet browsing, we can re-image them on a frequent basis, and we have a very small number of hosts that need to receive Java updates. This approach also allows us to configure our endpoint management software to nuke Java from orbit if the browser plugin ever shows up on employee systems. While we generally abhor single points of failure in our infrastructure, we felt comfortable in this case because the use of configuration management meant that the environment could be quickly rebuilt if needed. Finally, this approach could also be used to isolate other inherently risky technologies when they are still required by certain groups.

In closing, we genuinely hope this helps other organizations facing similar challenges on the road to removing Java browser plugins from the enterprise.


Schemanator: Love Child of Deployinator and Schema Changes

Posted by on January 11, 2013 / 14 Comments

We’ve previously written about our sharded master-master pair database architecture and how that and the Etsy ORM allows us to perform schema changes while keeping the site up. That we can do this at all is really awesome but to actually do this is still hard, risky and time-consuming. We run the site on half of our database servers while we make schema changes to the other half. Then we switch sides and do it again. It’s many steps on many machines and it’s happening while 100+ developers are pushing code. Let’s take a closer look at how we pull this off and the automation we’ve developed around this process.

I want to pause to acknowledge that being able to run the site on half of our database servers is in itself the result not only of good design in the ORM but of good ongoing capacity planning in making sure we can carry a full load on half the servers.

To understand Schemanator let’s first take a look at how we did schema changes “by hand”. To complete a schema change we need to change the ORM’s configuration a total of 4 times: first to take one half of the databases out of production so we can apply changes to them, then to put them back, then to take the other half out and apply changes, and finally to put them back. At Etsy we can deploy relatively quickly – generally less than 20 minutes from commit to live — and we have an even more streamlined process for updating just the application configuration — a “config push” as we call it. The challenge with this comes from the fact that at any point during a day, people are waiting their turn to deploy code in our push queue. To do our schema change, we’ll need to wait through the push queue 4 times. Not hard, but not amenable to getting things done quickly.

Once we’ve done the first config push to pull out half the DBs, we’re ready to apply the schema changes to the set of dormant servers. For this we’d use some shell-fu in a screen or tmux session to run commands on all the databases at once. When those changes were done, we’d get back in the push queue to put the updated DBs back in production, watch that it’s stable and then take the other side out. Then go back to the terminal, connect to all the other databases and run the changes on them. When that’s done, it’s back to the push queue to return the ORM to its full configuration. Along the way we’d have to deal with Nagios, run checksums on the new schemas on each host, and monitor DB status and site errors. And at the end we should end up with something we can call a log of the changes we made.

We did versions of that process for a while. It worked but no one is perfect and we did have an outage stemming from errors made while doing schema changes. This incident helped crystallize the “quantum of deployment” concept, as applied to schema changes, for us. We asked ourselves “What’s the smallest number of steps, with the smallest number of people and the smallest amount of ceremony required to get new schemas running on our servers?” With an answer to that, we knew we could, and had to, put a button on it.

Thus was born Schemanator.

From Continuous Deployment to Concurrent Deployment

From the description of the schema change process above it’s clear that one of the pain points is having to wait in the push queue to deploy configuration changes. In order to be truly awesome, Schemanator would have to provide an avenue for the operator to bypass the push queue altogether. In practical terms, not having to wait in the queue shaves at least an hour off the whole process. This part of Schemanator is the first part we wanted to address since it delivered the most overall time savings in the shortest time.

To help understand how we made this part work here’s a bit more about our ORM setup. On each request, the configuration file is read in. Part of that configuration is the set of DSNs for all available database servers. We store these in a hash keyed by the shard number and side like this:

$server_config["database"] = array(
    'etsy_shard_001_A' => 'mysql:host=dbshard01.etsy.com;port=3306;
dbname=etsy_shard;user=etsy',
    'etsy_shard_001_B' => 'mysql:host=dbshard02.etsy.com;port=3306;
dbname=etsy_shard;user=etsy',
    'etsy_shard_002_A' => 'mysql:host=dbshard03.etsy.com;port=3306;
dbname=etsy_shard;user=etsy',
    'etsy_shard_002_B' => 'mysql:host=dbshard04.etsy.com;port=3306;
dbname=etsy_shard;user=etsy',
    ...
);

Before Schemanator, we would literally just comment out the lines with the DSNs we didn’t want the ORM to use. We didn’t want to automate commenting out lines of code in our config file, and even if we did, doing so would have still left us having to wait in the push queue. So we made a slight change to the ORM setup routine. When the ORM sets up, it now also checks whether a special “disabled connections” file exists. If it does, it’s read in as a list of DSNs to ignore. Those servers are skipped when the ORM sets up and the application ends up not using them. Since the config is read at the beginning of the web request and PHP is shared-nothing, once the disabled connections file is in place, all subsequent requests on that server will respect it.
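A minimal sketch of that check follows; the file path, the one-key-per-line format, and the variable names are illustrative, not our actual ORM code.

<?php
// During ORM setup: if a "disabled connections" file exists in the docroot,
// drop those entries so the application never connects to them.
// $server_config is the DSN hash shown above.
$disabledFile = '/var/www/docroot/disabled_connections';

$disabled = array();
if (file_exists($disabledFile)) {
    $disabled = array_filter(array_map('trim', file($disabledFile)));
}

foreach ($server_config['database'] as $key => $dsn) {
    if (in_array($key, $disabled, true)) {
        unset($server_config['database'][$key]);
    }
}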

By carefully deploying just that one special disabled connections file into the live docroot we get the changes we need to the ORM…and we can do this while people are pushing code changes to the site. Specifically, we updated our general deploy to exclude the disabled connections file to eliminate possible race conditions on deploy and we set up Deployinator to allow us to deploy the disabled connections file on its own.

But there was a problem. Like many PHP sites, we use APC for opcode caching, and part of each deploy included gracefully restarting Apache to clear the opcode cache. Dropping the one file in is all well and good but we’d still have to clear it from APC. After considering some options, we chose to turn on apc.stat. This tells APC to stat each file before returning the cached opcodes; if the file is newer than the cached opcodes, APC re-reads the file and updates the cache. We run our docroot from a RAM disk so the extra stats aren’t a problem. With apc.stat on, we could drop in our disabled connections file and the next request would start using it. No restarts required. We did need to increase the size of our APC cache to allow this. We were also able to stop doing restarts on each deploy, and since most deploys only change a small subset of our codebase, we saw an improvement in our cache-hit ratio. With this process in hand, and after much testing, we were able to allow the disabled connections file to be deployed concurrently with our regular deploys.

We call this part of Schemanator “Side Splitter”. It provides a Web GUI for selecting which DB servers to disable and a button to deploy that configuration. Pushing the button writes and deploys the disabled connections file. There are also a number of sanity checks to make sure we don’t do things like pull out both sides of the same shard.  We use this not just as part of schema changes, but also when we need to pull DBs out for maintenance.


UI For selecting which sides of which shards to disable.

Ch-ch-changes

Next we needed a process for applying the schema changes to the databases. A number of the choices we had to make here were strongly influenced by our environment. Etsy is in PHP and we use Gearman for asynchronous jobs so using anything but PHP/Gearman for Schemanator would have meant re-inventing many wheels.

There are many things that could possibly go wrong while running schema changes. We tried to anticipate as many of these as we could: the operator’s internet connection could die mid-process, the application could keep connecting to disabled databases, gearman workers could die mid-job, the DDL statements might fail on a subset of servers, etc., etc. We knew that we couldn’t anticipate everything and that we certainly couldn’t write code to recover from any possible error. With that awareness – that something unexpected would eventually happen during schema changes – we designed Schemanator to allow the operator to pause or stop the process along the way. Each sub-task pauses and shows the operator an appropriate set of graphs and monitors to allow them to evaluate if it’s safe to proceed. This increases our likelihood of detecting trouble and gives us a way to bail out of the process, if that is the best course of action.

Central to Schemanator is the concept of the “changeset” – a data structure where we store everything about the schema change: the SQL, who created it, test results, checksums, and more. The changeset acts as both a recipe for Schemanator to apply the changes and a record of the work done.
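Concretely, a changeset looks something like the following; the field names and values here are made up, and the real structure carries more detail.

<?php
// Illustrative shape of a changeset.
$changeset = array(
    'id'           => 1234,
    'author'       => 'jsmith',
    'created_at'   => '2013-01-08 14:22:01',
    'sql'          => array(
        'ALTER TABLE favorites ADD COLUMN note VARCHAR(255) NOT NULL DEFAULT ""',
    ),
    'checksums'    => array(),  // filled in when the schema-only test run completes
    'test_results' => null,     // pass/fail plus any errors from that test run
    'log'          => array(),  // per-host record of what was applied, and when
);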

The early parts of the Schemanator workflow center around defining and testing the changeset. To test a changeset, a gearman job loads up the current schema with no data and applies the changes there. If there are errors the changeset will be marked as failing and the errors reported back. We also generate the checksums we’ll look for later when we apply the changes in production.

We recently added a test that inspects the post-change schema to make sure that we’re keeping to our SQL standards. For example, we make sure all tables are InnoDB with UTF-8 as the default character set, and that we don’t add any AUTO_INCREMENT fields, which would be trouble in our Master-Master setup. We recently had an issue where some tables had a foreign key field as INT(11) but the related table had the field as BIGINT. This caused errors when trying to store the BIGINT in the INT(11) field. To catch this going forward, Schemanator now checks that our keys are all BIGINTs. Once the changeset passes all the tests, it can be run.
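A simplified sketch of the kind of checks described above, run against the test schema through information_schema. Here $db is assumed to be a PDO handle, and the real checks cover more cases (for example the default character set) than this fragment does.

<?php
// Simplified post-change standards checks via information_schema.
$errors = array();

// Every table should be InnoDB.
$rows = $db->query(
    "SELECT TABLE_NAME, ENGINE
       FROM information_schema.TABLES
      WHERE TABLE_SCHEMA = DATABASE() AND ENGINE <> 'InnoDB'"
);
foreach ($rows as $row) {
    $errors[] = "{$row['TABLE_NAME']} is {$row['ENGINE']}, expected InnoDB";
}

// No AUTO_INCREMENT columns, and primary key columns should be BIGINT.
$rows = $db->query(
    "SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, EXTRA, COLUMN_KEY
       FROM information_schema.COLUMNS
      WHERE TABLE_SCHEMA = DATABASE()"
);
foreach ($rows as $row) {
    if (stripos($row['EXTRA'], 'auto_increment') !== false) {
        $errors[] = "{$row['TABLE_NAME']}.{$row['COLUMN_NAME']} is AUTO_INCREMENT";
    }
    if ($row['COLUMN_KEY'] === 'PRI' && strtolower($row['DATA_TYPE']) !== 'bigint') {
        $errors[] = "{$row['TABLE_NAME']}.{$row['COLUMN_NAME']} is {$row['DATA_TYPE']}, expected BIGINT";
    }
}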

When Schemanator runs a changeset, the first thing it does is tell Nagios not to alert for the servers we’re about to update. We use xb95’s nagios-api, which allows Schemanator to set and cancel downtime. Schemanator’s “Preflight Check” screen shows the current status of the DB cluster and also checks Nagios to see if any of the DB servers have alerts or warnings. This gives an at-a-glance view to know if it’s OK to proceed.


Schemanator: Preflight Checks

Once things look OK, the operator can click the button to “Do it: Pull those sides out”. Schemanator will deploy the first of the 4 disabled connections configs. When that config is live, Schemanator drops you back on a page full of graphs where you can confirm that traffic has moved off of the pulled servers and is being adequately handled by the remaining servers. Once that is stable the operator clicks a button to apply the SQL to the dormant servers. This is handled by a set of Gearman jobs – one job per database – that connect to the DBs and apply the updates. Schemanator monitors each job and polls the MySQL Process List on the DBs so the operator has a good view of what’s happening on each server. Any errors or checksum mismatches bubble up in the UI so the operator can decide how to deal with them.


Schemanator: Showing the progress of an alter running on a remote DB and the process list from that DB with our alter highlighted in yellow

When all the workers are done, Schemanator prompts the user that it’s ready to move on. From here it’s a matter of repeating these steps with the right variations until the changes have been applied to all the databases. Schemanator handles setting and canceling Nagios downtime, checksumming any tables that were updated and logging everything.

We’ve been using Schemanator to help with schema changes for a few months now and it’s achieved most of the goals we’d hoped for: it has increased the speed and confidence with which we can make schema changes, increased the reliability that we don’t forget any of the many steps involved, and freed people from having to devote what was often a full day to schema changes. We generally make schema changes only once a week, but with Schemanator, if we have to do them spur of the moment, that’s no longer out of the question. While Schemanator doesn’t do everything, we feel it’s the appropriate level of automation for a sensitive operation like schema changes.


mctop – a tool for analyzing memcache get traffic

Posted by on December 13, 2012 / 9 Comments

Here at Etsy we (ab)use our memcache infrastructure pretty heavily as a caching layer between our applications and our database tiers. We functionally partition our memcache instances into small pools and overall it works fabulously well. We have however suffered occasionally from what we call “hot keys”.

What is a “Hot” key?

A “hot key” is a single key hashed to an individual memcache instance with a very high get rate, often being called once for every page view. For the most part network bandwidth across all memcache instances within a pool is relatively balanced. These hot keys, however, contribute a significant additional amount of egress network traffic and have the potential to saturate the available network bandwidth of the interface.


The graph above is an example of a recent hot key issue. The graph y-axis represents bytes per second inbound and outbound of memcached01’s network interface.

As we hit peak traffic, memcached01’s network interface was completely saturated at approximately 960Mbps (it’s a 1Gbps NIC). This had a particularly nasty impact on get latency:


As we began to push past 800Mbps outbound, 90th percentile get request latency jumped from 5ms to 35ms. Once the NIC was saturated latency spiked to over 200ms.

Diagnosing the Issue

This wasn’t the first time a hot key had been responsible for unusually high network bandwidth utilization, so this was our first line of investigation. Comparatively, memcached01’s bandwidth utilization was significantly higher than that of the other servers in the pool.

Diagnosing which key was causing problems was a slow process; our troubleshooting took the following steps:

  1. Take a brief 60-second packet capture of the egress network traffic from memcached01.
  2. Using tshark (Wireshark’s awesome command line cousin), extract the key and response size from the memcache VALUE responses in the captured packet data.
  3. Post-process the tshark output to aggregate counts, estimate requests per second and calculate the estimated bandwidth per key (a sketch of this aggregation follows the list).
  4. Sort that list by bandwidth, then further investigate the top key.
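The post-processing in step 3 boils down to a small aggregation. Here is an illustrative sketch that assumes the tshark output has already been reduced to one “key bytes” pair per line; the input file name and format are made up.

<?php
// Aggregate (key, bytes) pairs from a 60-second capture into per-key call
// counts, estimated requests/sec and estimated bandwidth.
$captureSeconds = 60;
$stats = array();

$fh = fopen('tshark_values.txt', 'r'); // lines like: "user:12345:feed 48231"
while (($line = fgets($fh)) !== false) {
    $parts = preg_split('/\s+/', trim($line));
    if (count($parts) < 2) {
        continue;
    }
    list($key, $bytes) = $parts;
    if (!isset($stats[$key])) {
        $stats[$key] = array('calls' => 0, 'bytes' => 0);
    }
    $stats[$key]['calls'] += 1;
    $stats[$key]['bytes'] += (int) $bytes;
}
fclose($fh);

// Estimated bandwidth per key in bits/sec, highest first.
$bandwidth = array();
foreach ($stats as $key => $s) {
    $bandwidth[$key] = ($s['bytes'] * 8) / $captureSeconds;
}
arsort($bandwidth);

foreach (array_slice($bandwidth, 0, 10, true) as $key => $bps) {
    printf("%-40s %8.1f req/s %12.0f bits/s\n",
        $key, $stats[$key]['calls'] / $captureSeconds, $bps);
}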

Once a potentially offending key was found, we’d repeat this process from a couple of client machines to validate it as the culprit. Once the key was confirmed, engineers would look at alternate approaches to handling the data contained in the key.

In this particular case, we were able to disable some backend code that was utilizing that key with no user facing impact and relieve the network pressure.

Overall, this diagnostic process is quite manual and time-intensive. Sixty seconds of packet capture at 900Mbps generates close to 6GB of packet data for tshark to process, and if the process needs to be repeated on multiple machines the pain is multiplied.

Welcome mctop!

Given this wasn’t a new issue for us, I decided to have a crack at building a small tool to let us interactively inspect, in real time, the request rate and estimated bandwidth use by key. The end result is the tool “mctop” we’re open sourcing today.

Inspired by “top”, mctop passively sniffs the network traffic passing in and out of a server’s network interface and tracks the responses to memcache get commands. The output is presented on the terminal and allows sorting by total calls, requests/sec and bandwidth. This gives us an instantaneous view of our memcache get traffic.


mctop is now available as a gem on rubygems.org and the source is available over at github.

Patches welcome, we hope you find it useful!
