March 2013 Site Performance Report

Posted by Jonathan Klein | Filed under performance

Four more months have gone by since our last update, and it’s time for another performance report!  We made it through the holiday season with flying colors, and our engineers have been hard at work launching new features since the new year started.  Let’s see what impact this had on performance (spoiler: we keep getting faster).

Server Side Performance:

Here are the median and 95th percentile load times for signed in users on our core pages on Thursday, 3/14/13:

Server Side Performance Chart

As you can see we had small decreases in load time across the board (the one exception being a tiny uptick in median search performance).  We expected this, since we made a couple of small changes to our application code in the last couple of months that had a positive impact across every page.  The larger decrease in the 95th percentile load time for the homepage was primarily due to the removal of the Taste Test module.

As a quick reminder, the “Baseline” page is an extremely simple page that just includes our header and footer, and uses our normal controller architecture.  Improvements in the load time of this page mean improvements across every page on Etsy.

We also thought it would be fun to look back and compare our performance today to our performance from the very first performance report, back in August of 2011.  Since we don’t look at averages anymore, we can only compare the 95th percentile:

Performance comparison 1.5 years

Looking good!  We have made huge strides on the homepage, shop page, and profile page, and a small improvement on the search page.  The listing page hasn’t seen much optimization given its position as the fastest of these pages, so we’re happy with its modest gains.  It’s extremely gratifying to see how much progress we have made over the last ~18 months, and we hope to continue this trend in the months to come.

Front-end Performance:

We changed our strategy for monitoring front-end performance in our last update, and promised more stable metrics going forward.  That proved to be the case:

Frontend Performance - March 2013

Performance looks good here as well – reductions in load time for almost every page type again.  The one outlier here is our baseline page, which saw a significant increase in document complete time since our last update.  We’ve started digging into this, and it looks like a couple of extra HTTP requests have snuck into this page – a JS file and a sprite or two.  We’ll be consolidating these requests in the coming days and we expect to see the document complete time on this page drop back down.

As far as the rest of the improvements go, we believe that they are due to a combination of changes that we’ve deliberately made to improve performance, and a better hit rate at our CDNs.  We are currently load balancing static content across multiple CDNs, and we’ve worked to improve cacheability so that fewer requests have to hit our origin.

Conclusion:

Overall we are happy with the way performance has been trending, especially because we’ve been focusing our efforts over the last 4 months on projects that don’t involve these core pages.  For the next couple of quarters our plan is to establish clear performance SLAs and bring all of our major pages into compliance with them.  Beyond that we are always looking for big wins and evaluating new standards like SPDY, WebP, and pre-fetching to see if they make sense for Etsy.  It’s an exciting time to be working on web performance!

Does this sound like something you would like to work on?  Our team is hiring!

You can follow Jonathan on Twitter at @jonathanklein


Re-Exploring New Technologies: The Turbo Encabulator

Posted by jallspaw | Filed under engineering, infrastructure, operations, philosophy, video

One of the things that we like to do here at Etsy is to push the envelope.

Pushing the envelope means reaching deep into unknown technologies and approaches, and experimenting in areas that we are not familiar with. It’s in this vein that we’d like to share some early and unfiltered work: The Turbo Encabulator.

Each member of the TE team has stretched their cognitive abilities into new realms, and we’re pretty happy with the results thus far. We know that in many ways, engineering is about solidifying a solution well before the problem statement can be fully described within a socio-technical environment.

Of course, we also acknowledge that the notion of socio-emotional technocracy needs to fit within this paradigm. This brings us to some questions:

  • What heterogenius surface areas can we explore as prototypes for availability matching?
  • Under what conditions will our biases influence positively the results of anarcho-syndicalist architectures?
  • Where can we look for indications that our Mean-Time-To-Innocence (MTTI) is improving?

Understandably, these challenges are unilateral in their nature and we’ve made a large investment in creating an erinaceous atmosphere in order to produce results.

One of these results is the Turbo Encabulator. As you can imagine, sinusoidal repleneration threw us for a loop. :)
An internal tech talk can be seen below:


There and Back Again: Migrating Geolocation Data to GeoNames

Posted by John Marc | Filed under data, databases, engineering, internationalization

People are passionate about where they live. At Etsy we need to keep track of lots of different locations, such as buyers’ billing and shipping addresses and sellers’ shop locations. As Etsy continues to expand internationally we wanted to provide better localization and translations for our location place names. We determined that the best way to effect this change was to move from using a closed location API provider to internal services backed by the open GeoNames data set.

Before we could start using GeoNames as our data source we had to map all our existing user locations onto the GeoNames data. There is no established method for creating this mapping so we looked at the data we had and figured out a method. For our existing user locations we had city, region, country, country code, latitude and longitude in a format like this:

tokyo             tokyo prefecture   japan            jp   35.670 139.740
luxembourg city   luxemburg          luxembourg       lu   49.609 6.129
newport           england            united kingdom   gb   50.700 -1.295
earth             texas              united states    us   34.277 -102.508

To execute our mapping we created a PHP script that was able to leverage the much broader amount of data GeoNames provides us and quickly determine if old and new locations matched up. This script is now available on Etsy’s github page. The script relies on a MySQL DB backend that contains the standard data tables distributed by GeoNames with indexes on place names, country codes, and the latitude and longitude columns.

For each row in our source data we attempt a few different strategies to try to find a match to the GeoNames data.

Our first step is to see if there are any exact text matches to the place name we have in the GeoNames data set. If there is one or more exact matches we sort them by distance from the source latitude and longitude and if the nearest place is less than 20km away we call it a match. 20km is a rather large radius and we could easily run through with much lower limits, but we found that since we sorted by distance already a large radius gave us more positive matches when our data sets disagreed on where the center of large cities should be.
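
To make that first pass concrete, here is a rough sketch of how it could be implemented in PHP. The table and column names (a geoname table with name, country, latitude and longitude columns) and the helper functions are assumptions for illustration; the actual script on our github page differs in its details.

<?php
// Sketch: exact-name match against a local GeoNames MySQL mirror, keeping
// the closest candidate if it falls within the 20km radius described above.
function findExactMatch(PDO $db, $name, $countryCode, $lat, $lon)
{
    $stmt = $db->prepare(
        'SELECT geonameid, name, latitude, longitude
           FROM geoname
          WHERE name = :name AND country = :cc'
    );
    $stmt->execute(array(':name' => $name, ':cc' => $countryCode));

    $best = null;
    $bestKm = null;
    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
        $km = haversineKm($lat, $lon, $row['latitude'], $row['longitude']);
        if ($km <= 20 && ($bestKm === null || $km < $bestKm)) {
            $best   = $row;
            $bestKm = $km;
        }
    }
    return $best; // null means fall through to the wildcard/alternatenames passes below
}

function haversineKm($lat1, $lon1, $lat2, $lon2)
{
    $r    = 6371; // mean Earth radius in km
    $dLat = deg2rad($lat2 - $lat1);
    $dLon = deg2rad($lon2 - $lon1);
    $a    = pow(sin($dLat / 2), 2)
          + cos(deg2rad($lat1)) * cos(deg2rad($lat2)) * pow(sin($dLon / 2), 2);
    return 2 * $r * asin(sqrt($a));
}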

If we don’t find an exact text match we look again twice more, once with wildcards around the place name in our source data and once searching the “alternatenames” column in the GeoNames data set instead of the “name” column. This helps us find locations whose names may have changed or have alternate spellings. Results are sorted by distance and again we keep the closest if it falls within our threshold.

If we still have not found a match we take a substring of the first several letters of the name and do a wildcard search for that. (The length of the substring depends on the length of the source name and is generally about 35%.) This helps eliminate problems resulting from inconsistent inclusion of things like ‘City’, ‘Township’ and ‘Borough’ from place names in our source data.

For this project it was important that we found a match for every location and so we added a final “match at all costs” step.  If none of the previous steps have succeeded we use the source location’s latitude and longitude and get a list of all locations in the source country that are within a few tenths of a degree.  We then sort the results and pick the closest.  This was necessary for less than 1% of our data and allowed us to continue providing local search services even when we were unable to match a city exactly. This was a very exploratory, iterative process as we discovered what worked and what did not, and where our existing data was incomplete or inaccurate. From tuning the radius of the search areas, to dropping all the region data outside of the US (we found that it was contributing a lot of false positive matches, and the results were better without it), we revised and refined the logic until it gave us satisfactory results.

After we established a mapping we added a new field to store each seller’s “GeoNameID” in our DB. Where previously we stored all the fields listed above, now we need only the GeoNameID.  We still store the old data format as well, which has allowed us to make an easy and transparent transition from the old data to the new.  With access to the full breadth of data provided by GeoNames we have been able to speed up our Local Search feature.  We now also localize spellings of place names for our international users, so someone searching in German can search near Köln while someone searching in English can look for Cologne.

I will be giving a talk on how we handle search and localized place name auto-suggest at Lucene/Solr revolution in San Diego in May and we’ll be publishing more Code as Craft blog posts on geolocation services in the coming months here as well.

You can follow John Marc on Twitter @thejohnmarc


Data Corruption To Go: The Perils Of sql_mode = NULL

Posted by Keyur Govande | Filed under databases, engineering, operations

A little while back, we hit an extremely embarrassing issue: 32-bit signed integer overflow on some primary keys.

In our architecture, an unsigned 64-bit key is generated by a global ticket server, but the sharded tables’ schemas did not use BIGINT UNSIGNED as the column data-type. Inserts into the affected tables started failing en masse with “duplicate key” errors on key 2147483647.

We quickly realized the issue and ALTERed the impacted tables, but MySQL’s behavior of silently truncating all values larger than 2147483647, with the primary key constraint as the only fail-safe, was worrying. Any columns that lacked a similar constraint would be experiencing silent data corruption.

Some digging led us to the MySQL variable sql_mode.

What is sql_mode anyway?

By default MySQL is very accepting of bad data and bad queries. At best, it may emit a WARNING, but these are almost always ignored by client libraries such as PHP’s PDO.

sql_mode is a run-time setting for MySQL and enabling it makes the database much more vocal by changing most WARNINGs to ERRORs. This helps prevent silent data corruption on write, and unexpected results on read. Here are a few examples of issues it helped fix within our stack and the settings we used:

PROBLEM: Missing default values, data-type and data-length mismatches

CREATE TABLE `sql_test` (
    `id` TINYINT UNSIGNED NOT NULL PRIMARY KEY,
    `value` VARCHAR(3) NOT NULL,
    `column_not_null` VARCHAR(1) NOT NULL
) Engine = InnoDB;

Consider the following inserts:

  1. INSERT INTO sql_test (id, value, column_not_null) VALUES (10000, 'z', 'y');
  2. INSERT INTO sql_test (id, value, column_not_null) VALUES (1, 'abcde', 'x');
  3. INSERT INTO sql_test (id, value) VALUES (2, 'qwerty');

Here’s what the table would look like when sql_mode is disabled:

+-----+-------+-----------------+
| id  | value | column_not_null |
+-----+-------+-----------------+
|   1 | abc   | x               |
|   2 | qwe   |                 |
| 255 | z     | y               |
+-----+-------+-----------------+

Massively different from the expected values!

When sql_mode='STRICT_ALL_TABLES' is set though, each of those inserts will fail hard so that you can recover and handle the problem in your application.

  1. ERROR 1264 (22003): Out of range value for column 'id' at row 1
  2. ERROR 1406 (22001): Data too long for column 'value' at row 1
  3. ERROR 1364 (HY000): Field 'column_not_null' doesn't have a default value
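
One way to get this behavior from PHP (a sketch, not a drop-in recipe) is to set the mode when the connection is opened and let PDO surface MySQL errors as exceptions:

<?php
// Sketch: enable a strict sql_mode per connection so the bad inserts above
// fail hard instead of silently mangling data. DSN and credentials are placeholders.
$pdo = new PDO(
    'mysql:host=localhost;dbname=etsy_test',
    'user',
    'password',
    array(
        PDO::MYSQL_ATTR_INIT_COMMAND => "SET sql_mode = 'STRICT_ALL_TABLES'",
        PDO::ATTR_ERRMODE            => PDO::ERRMODE_EXCEPTION,
    )
);

try {
    $pdo->exec("INSERT INTO sql_test (id, value, column_not_null) VALUES (10000, 'z', 'y')");
} catch (PDOException $e) {
    // ERROR 1264 (out of range value for column 'id'): the application can
    // now recover and handle the problem instead of silently storing id = 255.
    error_log($e->getMessage());
}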

PROBLEM: Character-set issues

CREATE TABLE `charset_test` (
    `message` VARCHAR(3000) NOT NULL -- NB: this could also be `message` TEXT NOT NULL
) Engine = InnoDB DEFAULT CHARSET=UTF8;

Inserting a string containing a supplementary (4-byte) UTF-8 character, such as an emoji, would seem to work, but on a subsequent SELECT the returned string would come back truncated at the first 4-byte character.

This is because MySQL’s UTF8 implementation can only accommodate characters up to 3 bytes in length, but Unicode characters could be up to 4 bytes long. The data stored in the message column is subtly truncated and corrupted.

With the stricter sql_mode='STRICT_ALL_TABLES', inserts containing 4-byte characters will fail hard with: ERROR 1366 (HY000): Incorrect string value

Note that the sql_mode setting here is only an alarm bell; it does not fix the underlying issue. The real solution is twofold:

  1. Change the character-set of the connection to MySQL to UTF8MB4. If using PHP’s PDO, pass it in via the connection DSN. Or when running queries by hand, make sure to execute SET NAMES utf8mb4 before executing anything.

  2. Modify the column and the table definitions to UTF8MB4:
    ALTER TABLE `charset_test`
        DEFAULT CHARSET utf8mb4,
        MODIFY `message` VARCHAR(3000) CHARACTER SET utf8mb4 NOT NULL;
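
For the connection side of that fix, a minimal PDO sketch (host and credentials are placeholders) looks something like:

<?php
// Sketch: connect with a 4-byte-safe character set so supplementary
// characters survive the round trip.
$pdo = new PDO(
    'mysql:host=localhost;dbname=etsy_test;charset=utf8mb4', // PHP >= 5.3.6 honors charset= in the DSN
    'user',
    'password',
    array(PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION)
);
// On older PHP versions, fall back to an explicit:
// $pdo->exec('SET NAMES utf8mb4');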

One caveat to note when using the UTF8MB4 charset: the rule-of-thumb for an indexed VARCHAR column being at most 255 characters is no longer valid. MySQL has a length limit of 767 bytes for the index prefix and it has to accommodate the worst case. Thus when using the 3-byte UTF8 character set, 255 characters fit in, but when using 4-byte UTF8MB4 only 191 characters fit.

If we absolutely need 255 characters indexed (for example if the column holds email addresses), a workaround is to SHA-1 hash the value, and store and index that column instead.
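
Here is a rough sketch of that workaround; the table and column names are invented for the example, with the hash stored in an indexed CHAR(40) column:

<?php
// Sketch: look a user up via the indexed SHA-1 of the email rather than the
// full-width email column. Example schema for the hash column:
//   ALTER TABLE users ADD email_sha1 CHAR(40) NOT NULL, ADD INDEX (email_sha1);
function findUserByEmail(PDO $pdo, $email)
{
    $stmt = $pdo->prepare(
        'SELECT * FROM users WHERE email_sha1 = :hash AND email = :email'
    );
    $stmt->execute(array(
        ':hash'  => sha1($email), // the lookup goes through the narrow indexed hash
        ':email' => $email,       // exact comparison guards against hash collisions
    ));
    return $stmt->fetch(PDO::FETCH_ASSOC);
}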

PROBLEM: Unsigned int decrements

CREATE TABLE `decrement_test` (
    `id` int(11) NOT NULL,
    `value` int(11) unsigned NOT NULL,
    PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

INSERT INTO decrement_test(id, value)
    VALUES(1, 20) ON DUPLICATE KEY UPDATE
    value = value + VALUES(value);

SELECT * FROM decrement_test;
+----+-------+
| id | value |
+----+-------+
| 1  | 20    |
+----+-------+

INSERT INTO decrement_test(id, value)
    VALUES(1, -1) ON DUPLICATE KEY UPDATE
    value = value + VALUES(value);

If you were expecting the value to now be 19, you would be wrong:

SELECT * FROM decrement_test;
+----+-------+
| id | value |
+----+-------+
| 1  | 20    |
+----+-------+

When sql_mode='STRICT_ALL_TABLES' is enabled, the decrement INSERT statement will fail with: ERROR 1264 (22003): Out of range value for column 'value' at row 1

Note that in order to do atomic decrements as shown above, the column data type would need to be signed-integer rather than unsigned-integer.

PROBLEM: Incorrect group-by

CREATE TABLE `group_by_test` (
    `id` INT NOT NULL PRIMARY KEY,
    `name` VARCHAR(30) NOT NULL,
    `status` INT NOT NULL,
    `region` INT NOT NULL
) Engine = InnoDB DEFAULT CHARSET = UTF8MB4;

INSERT INTO group_by_test(id, name, status, region) VALUES (1, 'etsy', 10, 100);
INSERT INTO group_by_test(id, name, status, region) VALUES (2, 'ebay', 12, 100);
INSERT INTO group_by_test(id, name, status, region) VALUES (3, 'amazon', 13, 150);
INSERT INTO group_by_test(id, name, status, region) VALUES (4, 'fab', 10, 100);

SELECT COUNT(1), status, name FROM group_by_test GROUP BY status;
+----------+--------+--------+
| COUNT(1) | status | name   |
+----------+--------+--------+
|   2      | 10     | etsy   |
|   1      | 12     | ebay   |
|   1      | 13     | amazon |
+----------+--------+--------+

The `name` column isn’t in the GROUP BY clause but is in the SELECT, so MySQL pulls the first matching value into the result set. The output in the `name` column therefore depends on the relative order of rows in the table, may not be repeatable, and is almost never what you want.

By setting sql_mode='ONLY_FULL_GROUP_BY', we can prevent such unreliable SQL from executing. The query would instead fail with: ERROR 1055 (42000): 'etsy.group_by_test.name' isn't in GROUP BY

Just throw the switch and test in Production?

Unfortunately our application was continually executing some variations of the bad queries shown above, especially the bad GROUP BY and bad DEFAULT/NOT NULL, and we couldn’t just enable the mode in Production without causing a lot of disruption.

We had to find the bad queries and table definitions first. In order to do that, we set up a test database with the Production schema on it and enabled the strict sql_mode='ONLY_FULL_GROUP_BY,STRICT_ALL_TABLES' on it.

We slurped all queries from our Production DBs using a local tcpdump and netcat’ed them over to the test database:

prodb-shard> sudo /usr/sbin/tcpdump -i eth0 \
  "port 3306 and tcp[1] & 7 == 2 and tcp[3] & 7 == 2" \
  -s 65535 -x -n -q -tttt | \
  nc testdb.etsy.com 20000

On the test box, we used the combination of Percona Toolkit’s pt-query-digest and a custom Go script to execute these queries (NB: Percona Playback did not exist at the time). Our Production servers execute ~4000 queries/sec (~30000 packets/sec in and out) and while pt-query-digest has a built-in execute mode, it cannot keep up with the volume and the tcpdump capture would drop packets. The Go script helped multiplex this query volume and keep packet loss low (~1%).

testdb> nc -l 20000 | pt-query-digest --type tcpdump \
  --no-report --print | grep -v "SET NAMES utf8" | \
  go run shard_executor.go -db-user=username \
  -db-password=password -db-host=localhost \
  -db-charset=utf8mb4 -log "/var/tmp/shard.log" \
  -db-name=etsy_shard -threads=8

Note that pt-query-digest leaks a small amount of memory continually and will need to be restarted every so often.

All errors were logged to disk and we now had visibility into exactly what needed fixing:

testdb> cut -c 21- /var/tmp/shard.log | grep "^Error" | \
  sort | uniq

Along with fixing bad queries, we also needed to modify a large number of tables to change the character-set for VARCHAR and TEXT columns from UTF8 to UTF8MB4 in order to accommodate 4-byte UTF-8 characters. We generated the ALTERs with this one-off script and Schemanator allowed us to execute all of these without any downtime!

Note that the script does not generate ALTER statements for indexed VARCHAR/TEXT columns longer than 191 characters for reasons explained above. In our case, we worked around it by reducing the column lengths where feasible, and using the hashing method otherwise.

A few weeks of bug squashing and schema changes later, our databases and applications were finally ready to accept sql_mode='ONLY_FULL_GROUP_BY,STRICT_ALL_TABLES'.

CHECK YOUR FLAGS!

We strongly encourage everyone running MySQL to review the various sql_mode flags and choose the ones that make the most sense for their setup.

'ONLY_FULL_GROUP_BY,STRICT_ALL_TABLES' is the absolute minimum we’d recommend.

We don’t use any of MySQL’s DATE/TIME data-types nor any Engine other than InnoDB in our schemas, but if you do, we strongly suggest also using 'NO_ZERO_IN_DATE,NO_ZERO_DATE,NO_ENGINE_SUBSTITUTION'.


You can follow Keyur on Twitter at @keyurdg


Java: Not Even Once

Posted by avleenetsy | Filed under infrastructure, operations, security

Note: This post was co-written by Avleen Vig (@avleen) and Zane Lackey (@zanelackey).

In mid-January of this year we started an initiative to remove Java browser plugins from all employee systems at Etsy, as we feel this is a best practice to be striving towards. To that end, we wanted to discuss the challenges we encountered when removing Java browser plugins in the hope that it will help other organizations with the removal process.

The first question we needed to answer before removing Java was “Who actually needs Java in their browser?” Most organizations face the unfortunate reality that some number of internal systems use Java applets, and that there are groups which use these systems on a daily or weekly basis. In our case, our operations team needed Java to be able to access a group of internal network appliances, such as IPMI console interfaces. Once we identified these requirements, we disabled Java browser plugins across the rest of the organization and set about trying to engineer a safer way for the team to access these network appliances.

Initially, we looked at three approaches:

  1. Installing the Java plugin in a dedicated browser/browser profile and asking the team to use it to only access the network appliances.
  2. Writing a wrapper script to copy the Java plugin into the plugins directory, launch a dedicated browser, then remove the Java plugin when the browser closes.
  3. Using nssecurity to whitelist the hosts that could instantiate Java in the browser.

However, none of these approaches fulfilled our design goals: the solution had to be safe by default, easy to maintain, and ideally wouldn’t require the Java browser plugin on the team’s laptops at all.

We realized the only way to approach the situation that met our requirements would be to have Java installed in a dedicated, controlled, and isolated environment.

This model is similar to a bastion host, or “jump box”. We opted to use NoMachine as the remote desktop protocol because of its performance and usability over low-bandwidth, high-latency links. We have operations engineers located in multiple countries and we also occasionally need to use 3G/4G mobile services to diagnose problems, so this was critically important.

The installation method we followed for CentOS 6 and FreeNX was:

  1. Install the required packages. All of these come from the standard CentOS repos:
    yum install -y jre freenx gnome-desktop gnome-session \
        gnome-panel nautilus firefox fre

    (After installation, be sure to replace the provided SSH keys: https://help.ubuntu.com/community/FreeNX#Using_custom_SSH_keys)

  2. Create a symlink from the Java browser plugin, to the plugins directory used by Firefox:
    ln -s /usr/java/latest/lib/amd64/libnpjp2.so \
    /usr/lib64/mozilla/plugins/libnpjp2.so
  3. Now create /etc/nxserver/node.conf with these contents:
    ENABLE_PASSDB_AUTHENTICATION="1"
    NX_LOG_LEVEL=7
    NX_LOGFILE=/var/log/nx/nxserver.log
    SESSION_LOG_CLEAN=0
    COMMAND_MD5SUM="md5sum"

FreeNX is now configured!

Each of your users who wants to use the new system needs to do the following steps once to add their user to the NX authentication database:

(NOTE: These steps below follow the Ubuntu FreeNX instructions of using a shared account/SSH key for access to the jump system. In this circumstance the risk was accepted as the network appliances that are the end target also use a shared account, so no additional risk was introduced. Obviously different circumstances will have different security requirements.)

  1. Copy /etc/nxserver/client.id_dsa.key to your local machine and save it as ~/.ssh/nx_dsa
  2. On the jump host, run sudo nxserver --adduser <username>
    This adds your account to the nxserver database.  Note: at this step it will also add a key to your .ssh/authorized_keys2 file; if you manage that file in your configuration management system it will get overwritten, so you should add the key there as well.
  3. On the server again, run sudo nxserver --passwd <username>
    This sets your password in the nxserver database
  4. Download the “NoMachine Player v4” from: http://www.nomachine.com/preview/download-package.php
  5. Start the player, click “New connection”.
    1. Name: jumpbox
    2. Host: jump.your.domain
    3. Select: Use the NoMachine login, then click the ... button
    4. Check Use an alternate key and point it to ~/.ssh/nx_dsa (the key you saved in step 1)
    5. Press the X twice and then click the jumpbox connection you see.
    6. Enter your login details.
    7. Click Create a new session
    8. Click Create a new GNOME virtual desktop
    9. You should get a gnome desktop, with a firefox icon.

Using the jump system approach, we’re now able to firewall off these hosts from Internet browsing, we can re-image them on a frequent basis, and we have a very small number of hosts that need to receive Java updates. This approach also allows us to configure our endpoint management software to nuke Java from orbit if the browser plugin ever shows up on employee systems. While we generally abhor single points of failure in our infrastructure, we felt comfortable in this case because the use of configuration management meant that the environment could be quickly rebuilt if needed. Finally, this approach could also be used to isolate other inherently risky technologies when they are still required by certain groups.

In closing, we genuinely hope this helps other organizations facing similar challenges on the road to removing Java browser plugins from the enterprise.


Schemanator: Love Child of Deployinator and Schema Changes

Posted by Jeff Kolber | Filed under databases, operations

We’ve previously written about our sharded master-master pair database architecture and how it, together with the Etsy ORM, allows us to perform schema changes while keeping the site up. That we can do this at all is really awesome, but actually doing it is still hard, risky and time-consuming. We run the site on half of our database servers while we make schema changes to the other half. Then we switch sides and do it again. It’s many steps on many machines and it’s happening while 100+ developers are pushing code. Let’s take a closer look at how we pull this off and the automation we’ve developed around this process.

I want to pause to acknowledge that being able to run the site on half of our database servers is in itself the result not only of good design in the ORM but of good ongoing capacity planning in making sure we can carry a full load on half the servers.

To understand Schemanator let’s first take a look at how we did schema changes “by hand”. To complete a schema change we need to change the ORM’s configuration a total of 4 times – first to take one half of the databases out of production so we can apply changes to them, and then to put them back. Then we take the other half out, apply changes, and finally put them back. At Etsy we can deploy relatively quickly – generally less than 20 minutes from commit to live — and we have an even more streamlined process for updating just the application configuration — a “config push” as we call it. The challenge with this comes from the fact that at any point during a day, people are waiting their turn to deploy code in our push queue. To do our schema change, we’ll need to wait through the push queue 4 times. Not hard, but not amenable to getting things done quickly.

Once we’ve done the first config push to pull out half the DBs, we’re ready to apply the schema changes to the set of dormant servers. For this we’d use some shell-fu in a screen or tmux session to run commands on all the databases at once. When those changes were done, we’d get back in the push queue to put the updated DBs back in production, watch that it’s stable and then take the other side out. Then go back to the terminal, connect to all the other databases and run the changes on them. When that’s done, it’s back to the push queue to return the ORM to its full configuration. Along the way we’d have to deal with Nagios, run checksums on the new schemas on each host, and monitor DB status and site errors. And at the end we should end up with something we can call a log of the changes we made.

We did versions of that process for a while. It worked, but no one is perfect and we did have an outage stemming from errors made while doing schema changes. This incident helped crystallize the “quantum of deployment” concept as applied to schema changes for us. We asked ourselves “What’s the smallest number of steps, with the smallest number of people and the smallest amount of ceremony required to get new schemas running on our servers?” With an answer to that, we knew we could, and had to, put a button on it.

Thus was born Schemanator.

From Continuous Deployment to Concurrent Deployment

From the description of the schema change process above it’s clear that one of the pain points is having to wait in the push queue to deploy configuration changes. In order to be truly awesome, Schemanator would have to provide an avenue for the operator to bypass the push queue altogether. In practical terms, not having to wait in the queue shaves at least an hour off the whole process. This part of Schemanator is the first part we wanted to address since it delivered the most overall time savings in the shortest time.

To help understand how we made this part work here’s a bit more about our ORM setup. On each request, the configuration file is read in. Part of that configuration is the set of DSNs for all available database servers. We store these in a hash keyed by the shard number and side like this:

$server_config["database"] = array(
    'etsy_shard_001_A' => 'mysql:host=dbshard01.etsy.com;port=3306;dbname=etsy_shard;user=etsy',
    'etsy_shard_001_B' => 'mysql:host=dbshard02.etsy.com;port=3306;dbname=etsy_shard;user=etsy',
    'etsy_shard_002_A' => 'mysql:host=dbshard03.etsy.com;port=3306;dbname=etsy_shard;user=etsy',
    'etsy_shard_002_B' => 'mysql:host=dbshard04.etsy.com;port=3306;dbname=etsy_shard;user=etsy',
    ...
);

Before Schemanator, we would literally just comment out the lines with the DSNs we didn’t want the ORM to use. We didn’t want to automate commenting out lines of code in our config file, and even if we did, doing so would have still left us waiting in the push queue. So we made a slight change to the ORM setup routine. When the ORM sets up, it now also checks whether a special “disabled connections” file exists. If it does, it’s read in as a list of DSNs to ignore. Those servers are skipped when the ORM sets up and the application ends up not using them. Since the config is read at the beginning of the web request and PHP is shared-nothing, once the disabled connections file is in place, all subsequent requests on that server will respect it.
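
A minimal sketch of that check might look like the following; the file path, the file format (one DSN key per line), and the function name are assumptions for illustration:

<?php
// Sketch: filter the ORM's DSN map against a "disabled connections" file.
function filterDisabledConnections(array $dsns, $disabledFile = '/path/to/docroot/disabled_connections')
{
    if (!is_readable($disabledFile)) {
        return $dsns; // no file deployed, use the full config
    }
    $disabled = array_filter(array_map('trim', file($disabledFile)));
    foreach ($disabled as $key) {
        unset($dsns[$key]); // e.g. 'etsy_shard_001_A'
    }
    return $dsns;
}

$server_config['database'] = filterDisabledConnections($server_config['database']);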

By carefully deploying just that one special disabled connections file into the live docroot we get the changes we need to the ORM…and we can do this while people are pushing code changes to the site. Specifically, we updated our general deploy to exclude the disabled connections file to eliminate possible race conditions on deploy and we set up Deployinator to allow us to deploy the disabled connections file on its own.

But there was a problem. Like many PHP sites we use APC for opcode caching. Part of each deploy included gracefully restarting Apache to clear the opcode cache. Dropping the one file in is all well and good, but we’d still have to clear it from APC. After considering some options, we chose to turn on apc.stat. This tells APC to stat each file before returning the cached opcodes; if the file is newer than the cached opcodes, APC re-reads the file and updates the cache. We run our docroot from a RAM disk so the extra stats aren’t a problem. With apc.stat on we could drop in our disabled connections file and the next request would start using it. No restarts required. We did need to increase the size of our APC cache to allow this. We were able to stop doing restarts on each deploy, and since most deploys only change a small subset of our codebase, we saw an improvement in our cache-hit ratio. With this process in hand, and after much testing, we were able to allow the disabled connections file to be deployed concurrently with our regular deploys.

We call this part of Schemanator “Side Splitter”. It provides a Web GUI for selecting which DB servers to disable and a button to deploy that configuration. Pushing the button writes and deploys the disabled connections file. There are also a number of sanity checks to make sure we don’t do things like pull out both sides of the same shard.  We use this not just as part of schema changes, but also when we need to pull DBs out for maintenance.

Schemanator UI for selecting which sides of which shards to disable.

Ch-ch-changes

Next we needed a process for applying the schema changes to the databases. A number of the choices we had to make here were strongly influenced by our environment. Etsy is in PHP and we use Gearman for asynchronous jobs so using anything but PHP/Gearman for Schemanator would have meant re-inventing many wheels.

There are many things that could possibly go wrong while running schema changes. We tried to anticipate as many of these as we could: the operator’s internet connection could die mid-process, the application could keep connecting to disabled databases, gearman workers could die mid-job, the DDL statements might fail on a subset of servers, etc., etc. We knew that we couldn’t anticipate everything and that we certainly couldn’t write code to recover from any possible error. With that awareness – that something unexpected would eventually happen during schema changes – we designed Schemanator to allow the operator to pause or stop the process along the way. Each sub-task pauses and shows the operator an appropriate set of graphs and monitors to allow them to evaluate if it’s safe to proceed. This increases our likelihood of detecting trouble and gives us a way to bail out of the process, if that is the best course of action.

Central to Schemanator is the concept of the “changeset” – a data structure where we store everything about the schema change: the SQL, who created it, test results, checksums, and more. The changeset acts as both a recipe for Schemanator to apply the changes and a record of the work done.

The early parts of the Schemanator workflow center around defining and testing the changeset. To test a changeset, a gearman job loads up the current schema with no data and applies the changes there. If there are errors the changeset will be marked as failing and the errors reported back. We also generate the checksums we’ll look for later when we apply the changes in production.

We recently added a test that inspects the post-change schema to make sure that we’re keeping to our SQL standards. For example, we make sure all tables are InnoDB with UTF-8 as the default character set, and that we don’t add any AUTO_INCREMENT fields, which would be trouble in our master-master setup. We recently had an issue where some tables had a foreign key field as INT(11) but the related table had the field as BIGINT. This caused errors when trying to store the BIGINT in the INT(11) field. To catch this going forward, Schemanator now checks that our keys are all BIGINTs. Once the changeset passes all the tests, it can be run.
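
Checks like these can be phrased as queries against information_schema. Here is a rough sketch of the idea; the rule set, messages, and function name are illustrative rather than Schemanator’s actual code:

<?php
// Sketch: validate a freshly-altered test schema against a few house rules
// by reading information_schema.
function checkSchemaStandards(PDO $pdo, $schema)
{
    $problems = array();

    $tables = $pdo->prepare(
        "SELECT TABLE_NAME, ENGINE FROM information_schema.TABLES
          WHERE TABLE_SCHEMA = :schema AND ENGINE <> 'InnoDB'"
    );
    $tables->execute(array(':schema' => $schema));
    foreach ($tables->fetchAll(PDO::FETCH_ASSOC) as $t) {
        $problems[] = "{$t['TABLE_NAME']}: engine is {$t['ENGINE']}, expected InnoDB";
    }

    $cols = $pdo->prepare(
        "SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, EXTRA
           FROM information_schema.COLUMNS
          WHERE TABLE_SCHEMA = :schema
            AND (EXTRA LIKE '%auto_increment%'
                 OR (COLUMN_KEY = 'PRI' AND DATA_TYPE <> 'bigint'))"
    );
    $cols->execute(array(':schema' => $schema));
    foreach ($cols->fetchAll(PDO::FETCH_ASSOC) as $c) {
        $problems[] = "{$c['TABLE_NAME']}.{$c['COLUMN_NAME']}: auto_increment or non-BIGINT key ({$c['DATA_TYPE']})";
    }

    return $problems; // an empty array means the changeset passes this test
}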

When Schemanator runs a changeset, the first thing it does is tell Nagios not to alert for the servers we’re about to update. We use xb95’s nagios-api, which allows Schemanator to set and cancel downtime. Schemanator’s “Preflight Check” screen shows the current status of the DB cluster and also checks Nagios to see if any of the DB servers have alerts or warnings. This gives an at-a-glance view to know if it’s OK to proceed.

Schemanator: Preflight Checks

Once things look OK, the operator can click the button to “Do it: Pull those sides out”. Schemanator will deploy the first of the 4 disabled connections configs. When that config is live, Schemanator drops you back on a page full of graphs where you can confirm that traffic has moved off of the pulled servers and is being adequately handled by the remaining servers. Once that is stable, the operator clicks a button to apply the SQL to the dormant servers. This is handled by a set of Gearman jobs – one job per database – that connect to the DBs and apply the updates. Schemanator monitors each job and polls the MySQL process list on the DBs so the operator has a good view of what’s happening on each server. Any errors or checksum mismatches bubble up in the UI so the operator can decide how to deal with them.

Schemanator: Showing the progress of an alter running on a remote DB and the process list from that DB with our alter highlighted in yellow

When all the workers are done, Schemanator prompts the user that it’s ready to move on. From here it’s a matter of repeating these steps with the right variations until the changes have been applied to all the databases. Schemanator handles setting and canceling Nagios downtime, checksumming any tables that were updated and logging everything.

We’ve been using Schemanator to help with schema changes for a few months now, and it’s achieved most of the goals we’d hoped for: it has increased the speed of, and our confidence in, schema changes; made the process more reliable, since we no longer forget any of the many steps involved; and freed people from having to devote what was often a full day to schema changes. We generally make schema changes only once a week, but with Schemanator, doing them spur of the moment is no longer out of the question. While Schemanator doesn’t do everything, we feel it’s the appropriate level of automation for a sensitive operation like schema changes.


mctop – a tool for analyzing memcache get traffic

Posted by Marcus Barczak | Filed under infrastructure, operations

Here at Etsy we (ab)use our memcache infrastructure pretty heavily as a caching layer between our applications and our database tiers. We functionally partition our memcache instances into small pools and overall it works fabulously well. We have however suffered occasionally from what we call “hot keys”.

What is a “Hot” key?

A “hot key” is a single key hashed to an individual memcache instance with a very high get rate, often being called once for every page view. For the most part network bandwidth across all memcache instances within a pool is relatively balanced. These hot keys, however, contribute a significant additional amount of egress network traffic and have the potential to saturate the available network bandwidth of the interface.

memcached01 network traffic during a recent hot key incident

The graph above is an example of a recent hot key issue. The graph y-axis represents bytes per second inbound and outbound of memcached01’s network interface.

As we hit peak traffic, memcached01’s network interface was completely saturated at approximately 960Mbps (it’s a 1Gbps NIC). This had a particularly nasty impact on get latency:

90th percentile memcache get latency during the same incident

As we began to push past 800Mbps outbound, 90th percentile get request latency jumped from 5ms to 35ms. Once the NIC was saturated latency spiked to over 200ms.

Diagnosing the Issue

This wasn’t the first time a hot key had been responsible for unusually high network bandwidth utilization, so this was our first line of investigation. Comparatively, memcached01’s bandwidth utilization was significantly higher than the other servers in the pool.

Diagnosing which key was causing problems was a slow process; our troubleshooting took the following steps:

  1. Take a brief 60-second packet capture of the egress network traffic from memcached01
  2. Using tshark (Wireshark’s awesome command-line cousin), extract the key and response size from the memcache VALUE responses in the captured packet data.
  3. Post-process the tshark output to aggregate counts, estimate requests per second and calculate the estimated bandwidth per key (a rough sketch of this step follows the list).
  4. Sort that list by bandwidth, then further investigate the top key.
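
Here is a rough sketch of what step 3’s post-processing can look like, assuming the tshark step has already been reduced to one "key bytes" pair per VALUE response on standard input (that input format is an assumption for the example):

<?php
// Sketch: aggregate per-key stats from a 60-second capture and print the
// heaviest keys by estimated bandwidth.
$captureSeconds = 60;
$stats = array();

while (($line = fgets(STDIN)) !== false) {
    list($key, $bytes) = explode(' ', trim($line));
    if (!isset($stats[$key])) {
        $stats[$key] = array('hits' => 0, 'bytes' => 0);
    }
    $stats[$key]['hits']  += 1;
    $stats[$key]['bytes'] += (int) $bytes;
}

// Sort worst-first by total bytes, i.e. estimated bandwidth.
uasort($stats, function ($a, $b) { return $b['bytes'] - $a['bytes']; });

foreach (array_slice($stats, 0, 20, true) as $key => $s) {
    printf("%-40s %8.1f req/s %10.2f KB/s\n",
        $key,
        $s['hits'] / $captureSeconds,
        $s['bytes'] / $captureSeconds / 1024);
}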

Once a potentially offending key was found, we’d repeat this process from a couple of client machines to validate it as the offender. Once the key was confirmed, engineers would look at alternate approaches to handling the data contained in it.

In this particular case, we were able to disable some backend code that was utilizing that key with no user facing impact and relieve the network pressure.

Overall, this diagnostic process is quite manual and time-intensive. Sixty seconds of packet capture at 900Mbps generates close to 6GB of packet data for tshark to process, and if the process needs to be repeated on multiple machines, the pain is also multiplied.

Welcome mctop!

Given this wasn’t a new issue for us, I decided to have a crack at building a small tool that allows us to interactively inspect, in real time, the request rate and estimated bandwidth use by key. The end result is “mctop”, the tool we’re open sourcing today.

Inspired by “top”, mctop passively sniffs the network traffic passing in and out of a server’s network interface and tracks the responses to memcache get commands. The output is presented on the terminal and allows sorting by total calls, requests/sec and bandwidth. This gives us an instantaneous view of our memcache get traffic.

mctop running in a terminal, showing per-key calls, requests/sec and bandwidth

mctop is now available as a gem on rubygems.org and the source is available over at github.

Patches welcome, we hope you find it useful!


Measuring Front-end Performance With Real Users

Posted by Jonathan Klein | Filed under performance

When we published our last performance update, we got a comment from Steve Souders about the lack of RUM data for front-end performance.  Steve followed up his comment with a blog post, stating that real users typically experience load times that are twice as slow as your synthetic measurements.  We wanted to test this theory, and share some of our full page load time data from real users as well. To gather our real user data we turned to two sources: LogNormal and the Google Analytics Site Speed Report.  Before we put up the data, there are a few caveats to make:

  • For the day in question (11/14/12) we are providing data for the top three browsers that our customers use, all of which support the Navigation Timing API.  This gives us the most accurate RUM data we can get, but introduces a small bias.  This sample encompassed 43% of our customers on this day.
  • This isn’t completely apples to apples, since Google Analytics (GA) uses average load time by default and WebPagetest/LogNormal use median load time.  The problem with averages has been well documented, so it’s a shame that GA still gives us averages only.  To get rough median numbers from GA we used the technique described in this post.  This results in the range that you will see on the chart below.
  • The WebPagetest numbers are for logged out users, and we don’t have signed in vs signed out data from LogNormal or Google Analytics on that day, so those numbers cover all users (both logged-in and logged-out).  We expect numbers for logged-out users to be slightly faster, since there is less logic to do on the backend and there are some missing UI elements on the front-end in some cases.
  • The WebPagetest 50/50 numbers are calculated by taking the average of the empty cache and full cache WebPagetest measurements (more on that below).

With those points out of the way, here is the data:

So what’s going on here?  Our RUM data is faster than our synthetic data in all cases, and in all cases except for one (Shop pages in Chrome 23) our two RUM sources agree. Let’s see if we can explain the difference between our findings and Steve’s.  According to Google Analytics, 72% of our visitors are repeat visitors, which probably means that their cache is at least partly full.  Since cache is king when it comes to performance, this gives real users a huge advantage performance-wise over a synthetic test with an empty cache.  In addition, around 60% of our visits are from signed-in users, who likely visit a lot of the same URLs (their shop page, profile page, their listings), which means that their cache hit rate will be even higher.  We tried to account for this with the WebPagetest 50/50 numbers, but it’s possible that the hit rate of our customers is higher than that (this is on our list of things to test).  Also, the WebPagetest requests were using a DSL connection (1.5 Mbps/384 Kbps, with 50ms round trip latency), and our users tend to have significantly more bandwidth than that:

It’s encouraging to see that LogNormal and Google Analytics agree so closely, although GA provides a wide range of possible medians, so we can’t be 100% confident about this assessment.  The one anomaly there is Shop pages in Chrome 23, and we don’t have a great explanation for this discrepancy.  Sample size is fairly similar (GA has 38K samples to LogNormal’s 60K), and the logged-in vs. logged-out numbers are the same in LogNormal, so it’s not related to that.  The histogram in LogNormal looks pretty clean, and the margin of error is only 56ms.  GA and LogNormal do use separate sampling mechanisms, so there could be a bias in one of them that causes this difference.  Luckily it isn’t large enough to worry too much about.

It’s worth pointing out that when we start looking at higher percentiles in our real user monitoring things start to degrade pretty quickly.  The 95th percentile load time as reported in LogNormal for Chrome 23 is 8.9 seconds – not exactly fast (in Google Analytics the 95th percentile falls into the 7-9 seconds bucket).  Once you get out this far you are essentially monitoring the performance of last mile internet connectivity, which is typically well beyond your control (unless you can build fiber to your customers’ doorsteps).

Overall we are showing different results than what Steve predicted, but we think this can be largely explained by our huge percentage of repeat visitors, and by the fact that we are using a DSL connection for our synthetic tests.  The takeaway message here is that having more data is always a good thing, and it’s important to look at both synthetic and RUM data when monitoring performance.  We will be sure to post both sets of data in our next update.


October 2012 Site Performance Report

Posted by Jonathan Klein | Filed under performance

It’s been about four months since our last performance report, and we wanted to provide an update on where things stand as we go into the holiday season and our busiest time of the year.  Overall the news is very good!

Server Side Performance

Here are the median and 95th percentile load times for core pages, Wednesday 10/24/12:

As you can see, load times declined significantly across all pages.  A portion of this improvement is due to ongoing efforts we are making in Engineering to improve performance in our application code.  The majority of this dip, however, resulted from upgrading all of our webservers to new machines using Sandy Bridge processors.  With Sandy Bridge we saw not only a significant drop in load time across the board, but also a dramatic increase in the amount of traffic that a given server can handle before performance degrades.  You can clearly see when the cutover happened in the graph below:

This improvement is a great example of how operational changes can have a dramatic impact on performance.  We tend to focus heavily on making software changes to reduce load time, but it is important to remember that sometimes vertically scaling your infrastructure and buying faster hardware is the quickest and most effective way to speed up your site.  It’s a good reminder that when working on performance projects you should be willing to make changes in any layer of the stack.

Front-end Performance

Since our last update we have a more scientific way of measuring front-end performance, using a hosted version of WebPagetest.  This enables us to run many synthetic tests a day, and slice the data however we want.  Here are the latest numbers, gathered with IE8 from Virginia over a DSL connection as a signed in Etsy user:

These are median numbers across all of the runs on 10/24/12, and we run tests every 30 minutes.  Most of the pages are slower as compared to the last update, and we believe that this is due to using our hosted version of WebPagetest and aggregating many tests instead of looking at single tests on the public instance.  By design, our new method of measurement should be more stable over the long term, so our next update should give a more realistic view of trends over time.

You might be surprised that we are using synthetic tests for this front-end report instead of Real User Monitoring (RUM) data.  RUM is a big part of performance monitoring at Etsy, but when we are looking at trends in front-end performance over time, synthetic testing allows us to eliminate much of the network variability that is inherent in real user data.  This helps us tie performance regressions to specific code changes, and get a more stable view of performance overall.  We believe that this approach highlights elements of page load time that developers can impact, instead of things like CDN performance and last mile connectivity which are beyond our control.

New Baseline Performance Measurements

Another new thing we created is an extremely basic page that allows us to track the lower limit on load time for Etsy.com.  This page just includes our standard header and footer, with no additional code or assets.  We generate some artificial load on this page and monitor its performance.  This page represents the overhead of our application framework, which includes things like our application configuration (config flag system), translation architecture, security filtering and input sanitization, ORM, and our templating layer.  Having visibility into these numbers is important, since improving them impacts every page on the site.  Here is the current data on that page:

Over the next few months we hope to bring these numbers down while at the same time bringing the performance of the rest of the site closer to our baseline.


Scaling User Security

Posted by Zane Lackey | Filed under engineering, security

Summer is ending
New security features
Sweeping in like Fall

The Etsy Security Team is extremely happy to announce the simultaneous release of three important security features: Two factor authentication, full site SSL support, and viewable login history data. We believe that these protections are industry best practice, and we’re excited to offer them proactively to our members on an opt-in basis as a further commitment to account safety. A high level overview of the features is available here, while on Code as Craft we wanted to talk a bit more about the engineering that went into the SSL and two factor authentication features.

Rolling out Full Site SSL

When we initially discussed making the site fully accessible over SSL, we thought it might be a simple change given our architecture at the time. During this time period we relied on our load balancers to both terminate SSL and maintain the logic as to which pages were forced to HTTPS, which were forced to HTTP, and which could be either. To test out our “simple change” hypothesis we set up a test where we attempted to make the site fully SSL by disabling our load balancer rules that forced some pages down to HTTP. After this triggered a thrilling explosion in the error logs, we realized things weren’t going to be quite that easy.

The first step was to make our codebase HTTPS friendly. This meant cleaning up a significant number of hard coded “http://” links, and making all our URL generating functions HTTPS aware. In some cases this meant taking advantage of scheme relative URLs. We also needed to verify that all of our image storage locations, and our various CDNs could all play nicely with SSL.

Next up was moving the logic for enforcing whether a URL could be HTTP, HTTPS, or both from the load balancer to the application itself. With the logic on the load balancer, adding or changing a rule went like this:

  1. Engineer makes a ticket for ops
  2. Ops logs into the load balancer admin interface
  3. Ops needs to update the rule in three places (our development, preprod, and prod environments)
  4. Ops lets engineer know rule was updated, hopes all is well
  5. In case of rollback, ops has to go back into admin interface, find the rule (the admin interface did not sort them in any meaningful way), change the rule back, and hope for the best

In this flow, changes have to be tracked through a change ticket system rather than source control, and there is no way for other engineers to see what has been updated. Why is application logic in our load balancers anyway? Wat?

To address these issues, we moved all HTTPS vs HTTP logic into the web server via a combination of .htaccess rules and hooks in our controller code. This new approach provided us with far greater granularity on how to apply rules to specific URLs. Now we can specify how URLs are handled for groups of users (sellers, admins, etc) or even individual users instead of using load balancer rules in an all-or-nothing global fashion. Finally, the move meant all of this code now lives in git which enables transparency across the organization.
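
Conceptually, the controller-side hook boils down to something like the sketch below. The class name, route map, and redirect behavior are invented for illustration, and the real implementation also layers in the per-user-group granularity described above.

<?php
// Sketch: per-URL scheme enforcement running early in each request.
// 'https' = force HTTPS, 'http' = force HTTP, 'either' = leave alone.
class SchemeEnforcer
{
    private $rules = array(
        '/your/account' => 'https',
        '/listing'      => 'either',
    );

    public function beforeAction($path)
    {
        $required = isset($this->rules[$path]) ? $this->rules[$path] : 'either';
        $isHttps  = !empty($_SERVER['HTTPS']) && $_SERVER['HTTPS'] !== 'off';

        if ($required === 'https' && !$isHttps) {
            $this->redirect('https');
        } elseif ($required === 'http' && $isHttps) {
            $this->redirect('http');
        }
    }

    private function redirect($scheme)
    {
        header('Location: ' . $scheme . '://' . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'], true, 301);
        exit;
    }
}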

HSTS

HSTS is a new header that instructs browsers to only connect to a specific domain over HTTPS, in order to provide a defense against certain man-in-the-middle attacks. As part of our rollout we are making use of this header when a user opts in to full site SSL. Initially we are setting a low timeout value for HSTS during rollout to ensure things operate smoothly, and we’ll be increasing this value via a config push over time as we become confident there will be no issues.
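
Emitting the header itself is a one-liner; the sketch below shows the low-initial-timeout idea (the opt-in check and the max-age value are illustrative):

<?php
// Sketch: send HSTS only for members who opted in to full site SSL, starting
// with a deliberately short max-age that gets raised later via a config push.
if ($user_opted_into_full_ssl && $request_is_https) {
    $maxAge = 300; // seconds
    header("Strict-Transport-Security: max-age={$maxAge}");
}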

Why not use full site SSL for all members and visitors?

First and foremost, rolling out SSL as the default for all site traffic is something we’re actively working on and feel is the best practice to be striving towards. As with any large scale change in site-wide traffic, capacity and performance are significant concerns. Our goal with making this functionality available on an opt-in basis at first is to provide it to those members who use riskier shared network mediums such as public WiFi. Going forward, we’re analyzing metrics around CDN performance (especially for our international members), page performance times of SSL vs non-SSL, and overall load balancer SSL capacity. When we’re confident in the performance and capacity figures, we’re excited to continue moving towards defaulting to full site SSL for all members and visitors.

Two factor authentication

Our main focus during the course of our two factor authentication project (aside from security) was how to develop and apply metrics to create the best user experience possible over the long term. Specifically, the questions we wanted to be able to answer about the voice call/SMS delivery providers we selected were:

  • “Does provider A deliver codes faster than provider B?”
  • “Does provider A deliver codes more reliably than provider B?”
  • “Can we easily swap out provider A for provider C? or D? or F?”

Provider abstraction

From the beginning we decided that we did not want to be tied to a single provider, so abstraction was critical. We went about achieving this in two ways:

  1. Only relying on the provider for transmission of the code to a member. All code generation and verification is performed in our application, and the providers are simply used as “dumb” delivery mechanisms.
  2. Abstracting our code to keep it as generic and provider-agnostic as possible. This makes it easy to swap providers around and plug in new ones whenever we wish.
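
In spirit, the abstraction looks something like the following sketch; the interface, class names, and helpers are invented for the example:

<?php
// Sketch: a provider-agnostic delivery interface. Our application generates,
// stores, and verifies the codes; providers are only "dumb" delivery pipes.
interface TwoFactorDeliveryProvider
{
    /** Deliver $code to $phoneNumber via SMS or voice call. */
    public function deliver($phoneNumber, $code);
}

class ProviderA implements TwoFactorDeliveryProvider
{
    public function deliver($phoneNumber, $code)
    {
        // Call provider A's HTTP API here.
        return true;
    }
}

function generateCode()
{
    // Illustrative only; a real implementation would use a CSPRNG.
    return str_pad((string) mt_rand(0, 999999), 6, '0', STR_PAD_LEFT);
}

// Swapping providers (or A/B testing them) is just a matter of choosing
// which implementation to construct.
function sendTwoFactorCode(TwoFactorDeliveryProvider $provider, $phoneNumber)
{
    $code = generateCode(); // generated and verified by us, never by the provider
    return $provider->deliver($phoneNumber, $code);
}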

Metrics and performance testing

There are two main provider metrics we analyze when it comes to signins with 2FA:

  1. Time from code generation to signin (aka: How long did the provider take to deliver the code?)
  2. Number of times a code is requested to be resent (aka: Was the provider reliable in delivering the code?)

These metrics allow us to analyze a provider’s quality over time, and allow us to make informed choices such as which provider we should use for members in specific geographical locations.
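
As a sketch of how those numbers can be captured (the metric names, the StatsD-style wire format over UDP, and the host and port are all assumptions for illustration):

<?php
// Sketch: emit the two provider metrics in StatsD wire format over UDP.
function statsd_send($payload, $host = 'udp://127.0.0.1', $port = 8125)
{
    $sock = @fsockopen($host, $port);
    if ($sock) {
        fwrite($sock, $payload);
        fclose($sock);
    }
}

function record_two_factor_signin($provider, $codeGeneratedAt, $resendCount)
{
    $deliveryMs = (int) round((microtime(true) - $codeGeneratedAt) * 1000);

    // "How long did the provider take to deliver the code?"
    statsd_send("twofactor.{$provider}.generation_to_signin:{$deliveryMs}|ms");

    // "Was the provider reliable in delivering the code?" Resends suggest it wasn't.
    statsd_send("twofactor.{$provider}.resend:{$resendCount}|c");
}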

In order to collect this data from multiple providers, we make heavy use of A/B testing. This approach lets us easily balance the usage of providers A and B one week, and B and C the next. Finally, from a SPOF and resiliency point of view this architecture also makes it painless to fail over to another provider if one goes down.

In closing, we hope you’ll give these new features a shot by visiting your Security Settings page, and we’re excited to continue building proactive security mechanisms for our members.



This post was co-written by Kyle Barry and Zane Lackey on behalf of the Etsy Security Team.

