October 2012 Site Performance Report

Posted by Jonathan Klein | Filed under performance

It’s been about four months since our last performance report, and we wanted to provide an update on where things stand as we go into the holiday season and our busiest time of the year.  Overall the news is very good!

Server Side Performance

Here are the median and 95th percentile load times for core pages on Wednesday, 10/24/12:

As you can see, load times declined significantly across all pages.  A portion of this improvement is due to ongoing efforts we are making in Engineering to improve performance in our application code.  The majority of this dip, however, resulted from upgrading all of our webservers to new machines using Sandy Bridge processors.  With Sandy Bridge we saw not only a significant drop in load time across the board, but also a dramatic increase in the amount of traffic that a given server can handle before performance degrades.  You can clearly see when the cutover happened in the graph below:

This improvement is a great example of how operational changes can have a dramatic impact on performance.  We tend to focus heavily on making software changes to reduce load time, but it is important to remember that sometimes vertically scaling your infrastructure and buying faster hardware is the quickest and most effective way to speed up your site.  It’s a good reminder that when working on performance projects you should be willing to make changes in any layer of the stack.

Front-end Performance

Since our last update we have adopted a more scientific way of measuring front-end performance, using a hosted version of WebPagetest.  This enables us to run many synthetic tests per day and slice the data however we want.  Here are the latest numbers, gathered with IE8 from Virginia over a DSL connection as a signed-in Etsy user:

These are median numbers across all of the runs on 10/24/12, and we run tests every 30 minutes.  Most of the pages are slower compared to the last update, and we believe that this is due to using our hosted version of WebPagetest and aggregating many tests instead of looking at single tests on the public instance.  By design, our new method of measurement should be more stable over the long term, so our next update should give a more realistic view of trends over time.

You might be surprised that we are using synthetic tests for this front-end report instead of Real User Monitoring (RUM) data.  RUM is a big part of performance monitoring at Etsy, but when we are looking at trends in front-end performance over time, synthetic testing allows us to eliminate much of the network variability that is inherent in real user data.  This helps us tie performance regressions to specific code changes, and get a more stable view of performance overall.  We believe that this approach highlights elements of page load time that developers can impact, instead of things like CDN performance and last mile connectivity which are beyond our control.

New Baseline Performance Measurements

Another new thing we created is an extremely basic page that allows us to track the lower limit on load time for Etsy.com.  This page just includes our standard header and footer, with no additional code or assets.  We generate some artificial load on this page and monitor its performance.  This page represents the overhead of our application framework, which includes things like our application configuration (config flag system), translation architecture, security filtering and input sanitization, ORM, and our templating layer.  Having visibility into these numbers is important, since improving them impacts every page on the site.  Here is the current data on that page:

Over the next few months we hope to bring these numbers down while at the same time bringing the performance of the rest of the site closer to our baseline.


Scaling User Security

Posted by Zane Lackey | Filed under engineering, security

Summer is ending
New security features
Sweeping in like Fall

The Etsy Security Team is extremely happy to announce the simultaneous release of three important security features: Two factor authentication, full site SSL support, and viewable login history data. We believe that these protections are industry best practice, and we’re excited to offer them proactively to our members on an opt-in basis as a further commitment to account safety. A high level overview of the features is available here, while on Code as Craft we wanted to talk a bit more about the engineering that went into the SSL and two factor authentication features.

Rolling out Full Site SSL

When we initially discussed making the site fully accessible over SSL, we thought it might be a simple change given our architecture at the time. During this time period we relied on our load balancers to both terminate SSL and maintain the logic as to which pages were forced to HTTPS, which were forced to HTTP, and which could be either. To test out our “simple change” hypothesis we set up a test where we attempted to make the site fully SSL by disabling our load balancer rules that forced some pages down to HTTP. After this triggered a thrilling explosion in the error logs, we realized things weren’t going to be quite that easy.

The first step was to make our codebase HTTPS-friendly. This meant cleaning up a significant number of hard-coded “http://” links and making all of our URL-generating functions HTTPS-aware. In some cases this meant taking advantage of scheme-relative URLs. We also needed to verify that all of our image storage locations and our various CDNs could play nicely with SSL.
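As a rough illustration of what “HTTPS aware” means in practice, here is a minimal sketch of the kind of URL helpers involved; the function names and hostnames are made up for this example and are not Etsy’s actual code.

<?php
// Hypothetical sketch only, not Etsy's actual helpers.
// A scheme-relative URL ("//host/path") inherits the scheme of the current page,
// so the same markup works whether the page was served over HTTP or HTTPS.
function asset_url($path, $cdn_host = 'cdn.example.com') {
    return '//' . $cdn_host . '/' . ltrim($path, '/');
}

// When an absolute URL is unavoidable, derive the scheme from the request
// instead of hard coding "http://".
function absolute_url($path) {
    $is_https = !empty($_SERVER['HTTPS']) && $_SERVER['HTTPS'] !== 'off';
    $scheme   = $is_https ? 'https' : 'http';
    return $scheme . '://' . $_SERVER['HTTP_HOST'] . '/' . ltrim($path, '/');
}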

Next up was moving the logic for enforcing whether a URL could be HTTP, HTTPS, or both from the load balancer to the application itself. With the logic on the load balancer, adding or changing a rule went like this:

  1. Engineer makes a ticket for ops
  2. Ops logs into the load balancer admin interface
  3. Ops needs to update the rule in three places (our development, preprod, and prod environments)
  4. Ops lets engineer know rule was updated, hopes all is well
  5. In case of rollback, ops has to go back into the admin interface, find the rule (the admin interface did not sort them in any meaningful way), change the rule back, and hope for the best

In this flow, changes have to be tracked through a change ticket system rather than source control, and there is no way for other engineers to see what has been updated. Why is application logic living in our load balancers anyway? Wat?

To address these issues, we moved all HTTPS vs HTTP logic into the web server via a combination of .htaccess rules and hooks in our controller code. This new approach provided us with far greater granularity in how rules are applied to specific URLs. Now we can specify how URLs are handled for groups of users (sellers, admins, etc.) or even individual users, instead of using load balancer rules in an all-or-nothing global fashion. Finally, the move meant all of this code now lives in git, which enables transparency across the organization.
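To make the idea concrete, here is a minimal sketch of what an application-level rule might look like. The hook and helper names are hypothetical; the post doesn’t show the internals of our controller layer, so treat this as an illustration of the approach rather than the actual implementation.

<?php
// Hypothetical controller hook; illustrative only.
// $scheme_rule is 'http', 'https', or 'either', and can now be chosen per
// controller, per URL, or even per user instead of globally on the load balancer.
function enforce_scheme($scheme_rule, $user = null) {
    $is_https = !empty($_SERVER['HTTPS']) && $_SERVER['HTTPS'] !== 'off';

    // Example of per-user granularity: members who opted in to full site SSL
    // are always forced onto HTTPS (prefersFullSiteSsl() is made up here).
    if ($user !== null && $user->prefersFullSiteSsl()) {
        $scheme_rule = 'https';
    }

    if ($scheme_rule === 'https' && !$is_https) {
        header('Location: https://' . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'], true, 301);
        exit;
    }
    if ($scheme_rule === 'http' && $is_https) {
        header('Location: http://' . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'], true, 301);
        exit;
    }
    // 'either': fall through and serve the page over whichever scheme was requested.
}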

HSTS

HSTS is a new header that instructs browsers to only connect to a specific domain over HTTPS, which provides a defense against certain man-in-the-middle attacks. As part of our rollout we are making use of this header when a user opts in to full site SSL. Initially we are setting a low timeout value for HSTS to ensure things operate smoothly, and we’ll be increasing this value via a config push over time as we become confident there will be no issues.
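For reference, the header itself is simple to emit. A sketch of the opt-in behaviour described above might look like the following; the max-age value and the opt-in flag are illustrative, not our actual settings.

<?php
// Sketch only: start with a deliberately short max-age during rollout and raise
// it later via a config push. The value and variable name here are made up.
$hsts_max_age = 300; // a few minutes while we gain confidence

if ($user_opted_in_to_full_site_ssl) {
    header('Strict-Transport-Security: max-age=' . $hsts_max_age);
}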

Why not use full site SSL for all members and visitors?

First and foremost, rolling out SSL as the default for all site traffic is something we’re actively working on and feel is the best practice to be striving towards. As with any large scale change in site-wide traffic, capacity and performance are significant concerns. Our goal with making this functionality available on an opt-in basis at first is to provide it to those members who use riskier shared network mediums such as public WiFi. Going forward, we’re analyzing metrics around CDN performance (especially for our international members), page performance times of SSL vs non-SSL, and overall load balancer SSL capacity. When we’re confident in the performance and capacity figures, we’re excited to continue moving towards defaulting to full site SSL for all members and visitors.

Two factor authentication

Our main focus during the course of our two factor authentication project (aside from security) was how to develop and apply metrics to create the best user experience possible over the long term. Specifically, the questions we wanted to be able to answer about the voice call/SMS delivery providers we selected were:

  • “Does provider A deliver codes faster than provider B?”
  • “Does provider A deliver codes more reliably than provider B?”
  • “Can we easily swap out provider A for provider C? or D? or F?”

Provider abstraction

From the beginning we decided that we did not want to be tied to a single provider, so abstraction was critical. We went about achieving this in two ways:

  1. Only relying on the provider for transmission of the code to a member. All code generation and verification is performed in our application, and the providers are simply used as “dumb” delivery mechanisms.
  2. Abstracting our code to keep it as generic and provider-agnostic as possible. This makes it easy to swap providers around and plug in new ones whenever we wish (a rough sketch of this abstraction follows this list).
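Here is a rough sketch of what that abstraction might look like; the interface and class names are hypothetical, since the post doesn’t show our actual code.

<?php
// Hypothetical provider-agnostic interface; names are made up for illustration.
// The application generates and verifies the code itself; a provider only ever
// receives a phone number and a message to deliver.
interface CodeDeliveryProvider {
    /** @return bool true if the provider accepted the SMS for delivery */
    public function sendSms($phone_number, $message);

    /** @return bool true if the provider accepted the voice call request */
    public function sendVoiceCall($phone_number, $message);

    /** @return string short identifier used when recording metrics */
    public function getName();
}

// Swapping providers is then just a matter of constructing a different
// implementation; an A/B test can decide which one to hand back here.
function get_delivery_provider($variant) {
    return ($variant === 'B') ? new ProviderB() : new ProviderA(); // hypothetical classes
}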

Metrics and performance testing

There are two main provider metrics we analyze when it comes to signins with 2FA:

  1. Time from code generation to signin (aka: How long did the provider take to deliver the code?)
  2. Number of times a code is requested to be resent (aka: Was the provider reliable in delivering the code?)

These metrics allow us to analyze a provider’s quality over time, and allow us to make informed choices such as which provider we should use for members in specific geographical locations.
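As an illustration of how those two metrics might be recorded, here is a small sketch using a StatsD-style client; the class, variables, and metric names are invented for this example.

<?php
// Hypothetical sketch; the StatsD-style client and metric names are made up.
// Metric 1: how long the provider took to deliver the code.
$delivery_ms = ($signin_timestamp - $code_generated_timestamp) * 1000;
StatsD::timing('twofactor.' . $provider_name . '.delivery_time', $delivery_ms);

// Metric 2: how often the member had to ask for the code to be resent.
if ($code_was_resent) {
    StatsD::increment('twofactor.' . $provider_name . '.resend');
}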

In order to collect this data from multiple providers, we make heavy use of A/B testing. This approach lets us easily balance the usage of providers A and B one week, and B and C the next. Finally, from a SPOF and resiliency point of view this architecture also makes it painless to fail over to another provider if one goes down.

In closing, we hope you’ll give these new features a shot by visiting your Security Settings page, and we’re excited to continue building proactive security mechanisms for our members.



This post was co-written by Kyle Barry and Zane Lackey on behalf of the Etsy Security Team.


Announcing the Etsy Security Bug Bounty Program

Posted by Zane Lackey | Filed under security

On April 17 of this year we launched our responsible disclosure page (http://www.etsy.com/help/article/2463). At the time, our goal was to provide security researchers with a direct point of contact if they had identified a vulnerability in our site, API, or mobile application. Thus far we’ve received excellent reports from researchers, as well as some exciting offers from Nigerian princes.

Today, we’d like to take this a step further and announce the launch of our security bug bounty program. Our goal is to reward security researchers who follow responsible disclosure principles and proactively reach out to us if they’ve identified a vulnerability which would impact the safety of our marketplace or members. We believe that this is industry best practice. Our bounty program will pay a minimum of $500 for qualifying vulnerabilities, subject to a few conditions and with qualification determined by the Etsy Security Team. This bounty will be increased at our discretion for distinctly creative or severe security bugs. To give it the proper Etsy feel, we’ll also be throwing in some handmade thank-yous, such as an Etsy Security Team T-shirt. Additionally, we’ll be retroactively applying the bounty to vulnerabilities that have been reported to us since the launch of our responsible disclosure page earlier this year.

You can find the full information on the new program here: http://www.etsy.com/help/article/2463


The Engineer Exchange Program

Posted by marcetsy | Filed under engineering, people, philosophy

Co-authors:

  • Marc Hedlund, SVP, Product Development at Etsy
  • Raffi Krikorian, Director, Platform Services at Twitter

Your first week at any new job is (at least if you chose a good job!) filled with tons to learn, new ways of doing things, and working models that you might have considered unattainable in the job you just left. How great would it be to have that experience more than once per new job you take? Twitter and Etsy are working together on a new project to help our engineers learn from each others’ practices, with the idea of making both of our engineering teams better as a result.  We hope to learn what makes each other tick, how we celebrate our successes and learn from our failures, and how we can each be better in the end.

This week, one of Etsy’s Staff Engineers is traveling to San Francisco to spend a week at Twitter, observing and helping out, learning what Twitter does particularly well, and seeing differences that may reinforce or refute beliefs we’ve held as core. Likewise, a Twitter Platform Engineer is traveling to Brooklyn for the week, and watching what Etsy does well and poorly, all while helping out (and, of course, deploying on her first day).

New engineers at Etsy go through a several-week bootcamp, working with different teams to learn the codebase, meet people across the group, and take on small tasks. Likewise, new engineers at Twitter go through a “new hire orientation” process where they learn about the Twitter architecture, see firsthand Twitter’s raw scale, and play with the back-end technologies.  These engineers will go through these same steps for the week (albeit a bit accelerated), contributing code and pushing to production, not just observing from a distance.

It takes a level of trust to let an unknown engineer into the fold, let them sit in on meetings and make changes to code. Of course some people would be uncomfortable with letting this happen; companies we’ve both worked for would have fits before allowing it. But we believe the value of cross-pollination of ideas and practices is far too high to be blocked by these concerns. While this is an experiment, we’re hopeful it makes both teams stronger, and we’ll be looking for other exchanges to do soon.


What Hardware Powers Etsy.com?

Posted by Laurie Denness | Filed under engineering, infrastructure, operations

Traditionally, discussing hardware configurations when running a large website is something done inside private circles, normally to discuss how vendor X did something very poorly and how vendor Y’s support sucks.

With the advent of the “cloud”, this has changed slightly. Suddenly people are talking about how big their instances are, and how many of them they run. I think this is a great practice to get into with physical servers in datacenters too. After all, none of this is intended to be some sort of competition; it’s about helping out people in similar situations to ours, and broadcasting solutions that others may not know about… pretty much like everything else we post on this blog.

The great folk at 37signals started this trend recently by posting about their hardware configurations after attending the Velocity conference… one of the aforementioned places where hardware gossiping takes place.

So, in the interest of continuing this trend, here are the classes of machine we use to power over $69.5 million of sales for our sellers in July.

Database Class

As you may already know, we have quite a few pairs of MySQL machines to store our data, and as such we’re relying on them heavily for performance and (a certain amount of) reliability.

For any job that requires an all-round performant box with good storage, good processing power, and a good level of redundancy, we utilise HP DL380 servers. These clock in at 2U of rack space, 2x 8 core Intel E5630 CPUs (@ 2.53ghz), 96GB of RAM (for that all-important MySQL buffer cache) and 16x 15,000 RPM 146GB hard disks. This gives us the right balance of disk space to store user data, and spindles/RAM to retrieve it quickly enough. The machines have 4x 1gbit ethernet ports, but we only use one.

Why not SSDs?

We’re just starting to test our first round of database class machines with SSDs. Traditionally we’ve had other issues to solve first, such as getting the right balance of amount of user data (e.g. the amount of disk space used on a machine) vs the CPU and memory. However, as you’ll see in our other configs, we have plenty of SSDs throughout the infrastructure, so we certainly are going to give them a good testing for databases too.


A picture of our various types of hardware, with the HP to the left/middle and web/utility boxes on the right

Web/Gearman Worker/Memcache/Utility/Job Class

This is a pretty wide catch-all, but in general we try to favour as few machine classes as possible. For most tasks, from handling web traffic (Apache/PHP) to any role where there are many machines and redundancy is solved at the app level, we use one type of machine. This way hardware re-use is promoted and machines can change roles quickly and easily. Having said that, there are some slightly different configurations in this category for components that are easy to change, e.g. the amount of memory and disks.

We’re pretty much in love with this 2U Supermicro chassis, which allows for 4x nodes that share two power supplies and 12x 3.5″ disks on the front of the chassis.


Supermicro Chassis with 4 easily serviceable nodes

A general configuration for these would be 2x 8 core Intel E5620 CPUs (@ 2.40ghz), 12GB-96GB of RAM, and either a 600GB 7200rpm hard disk or an Intel 160GB SSD.

Note the lack of RAID on these configurations; we’re pretty heavily reliant on Cobbler and Chef, which means rebuilding a system from scratch takes just 10 minutes. In our view, why power two drives when our datacenter staff can replace a failed drive, rebuild the machine, and have it back in production in under 20 minutes? Obviously this only works where it is appropriate: clusters of machines where the data on each individual machine is not important. Web servers, for example, have no important data, since logs are sent constantly to our centralised logging host and the web code is easily deployed back on to the machine.

We have Nagios checks to let us know when the filesystem becomes un-writeable (and SMART checks also), so we know when a machine needs a new disk.

Each machine has 2x 1gbit ethernet ports; in this case we’re only using one.

Hadoop

In the last 12 months we’ve been working on building up our Hadoop cluster, and after evaluating a few hardware configurations ended up with a very similar chassis design to the one used above. However, we’re using a chassis with 24x 2.5″ disk slots on the front, instead of the 12x 3.5″ design used above.


Hadoop nodes… and a lot of disk lights

Each node (with 4 in a 2U chassis) has 2x 12 core Intel E5646 CPUs (@ 2.40ghz), 96GB of RAM, and 6x 1TB 2.5″ 7200rpm disks. That’s 96 cores, 384GB of RAM and 24TB per 2U of rack space.

Our Hadoop jobs are very CPU heavy, and storage and disk throughput are less of an issue, hence the small amount of disk space per node. If we had more I/O and storage requirements, we would have considered 2U Supermicro servers with 12x 3.5″ disks per node instead.

As with the above chassis, each node has 2x 1gbit ethernet ports, but we’re only utilising one at the minute.


This graph illustrates the power usage on one set of machines showing the difference between Hadoop jobs running and not

Search/Solr

Just a month ago, this would’ve been grouped into the general utility boxes above, but we’ve got something new and exciting for our search stack. We use the same chassis as in our general example, but this time with the awesome new Sandy Bridge line of Intel CPUs. We’ve got 2x 16 core Intel E5-2690 CPUs in these nodes, clocked at 2.90ghz, which gives us machines that can handle over 4 times the workload of the generic nodes above, whilst using the same density configuration and not that much more power. That’s 128x 2.9ghz CPU cores per 2U (granted, that includes HyperThreading).

This works so well because search is really CPU bound; we’ve been using SSDs to get around I/O issues in these machines for a few years now. The nodes have 96GB of RAM and a single 800GB SSD for the indexes. This follows the same pattern of not bothering with RAID; the SSD is perfectly fast enough on its own, and we have BitTorrent index distribution which means getting the indexes to the machine is super fast.

Fewer machines = less to manage, less power, and less space.


Output of the “top” command with 32 cores on Sandy Bridge architecture

Backups

Supermicro wins this game too. We’re using the catchily named 6047R-E1R36N. The 36 in this model number is the important part… this is a 4U chassis with 36x 3.5″ disks. We load up these chassis with 2TB 7200rpm drives, which, when coupled with an LSI RAID controller with 1GB of battery-backed write-back cache, gives a blistering 1.2 gigabytes/second of sequential write throughput and a total of 60TB of usable disk space across two RAID6 volumes.


36 disk Supermicro chassis. Note the disks crammed into the back of the chassis as well as the front!

Why two RAID6 volumes? Well, it means a little more waste (4 drives for parity instead of 2) but as a result of that you do get a bit more resiliency against losing a number of drives, and rebuild times are halved if you just lose a single drive. Obviously RAID monitoring is pretty important, and we have checks for either SMART (single disk machines) or the various RAID utilities on all our other machines in Nagios.

In this case we’re taking advantage of the 2x 1gbit ethernet connections, bonded together to the switch to give us redundancy and the extra bandwidth we need. In the future we may even run fiber to these machines, to get the full potential out of the disks, but right now we don’t get above a gigabit/second for all our backups.

Special cases

Of course there are always exceptions to the rules. The only other hardware profile we have is HP DL360 servers (1U, 4x 2.5″ 15,000rpm 146GB SAS disks), which we use for roles that don’t need much horsepower but that we deem important enough to have RAID. For example, DNS servers, LDAP servers, and our Hadoop Namenodes are all machines that don’t require much disk space, but need more data safety than our regular single-disk configurations provide.

Networking

I didn’t go into too much detail on the networking side of things in this post. Consider this part 1, and watch this space for our networking gurus to take you through our packet shuffling infrastructure at a later date.

Continue the trend

If you’re anything like us, you love a good spot of hardware porn. What cool stuff do you have?

 

This post was written by Laurie Denness (@lozzd), who would love it if you came and helped us make Etsy.com even better using this hardware. Why not come and help us?


Posting PostMortems for a (generally) Non-Technical Audience

Posted by jallspaw | Filed under engineering, infrastructure, operations, outages

The other day I posted to the Etsy News blog about some recent outages we’ve had.

We haven’t given this much information about site outages in the past, and this particular post was written for the non-technical-minded members of the community. The process of writing it was a challenge for me.

It underscored the lesson that you can’t fully appreciate how complicated some of the failure scenarios in our field of web operations are until you actually have to explain them to someone who isn’t familiar with software and infrastructure fundamentals. :)

In the end, I got a lot of positive feedback on it from a number of Etsy members. This includes sellers who spend a great deal of their time making the things they sell, which is exactly our goal.

Generally, we want the technical bits of Etsy to disappear for them. But when things go wrong, I think it’s worth giving them some details about what happened and what we’re doing to help avoid similar issues in the future.

If you’re interested, here is the post.


Performance tuning syslog-ng

Posted by avleenetsy | Filed under infrastructure, operations

You may have noticed a theme in many of our blog posts. While we do push the technology envelope in many ways, we also like to stick to tried-and-true, data-driven methods to help keep us sane.

One of the critical components of our infrastructure is centralised logging. Our systems generate a vast amount of logging, and we like to keep it all. Everything. The whole bag.

When faced with similar situations, other companies have embraced a lot of new technology:

  • Flume
  • Scribe
  • Logstash

At Etsy, we’ve stayed old-school – practically prehistoric by technology standards. Our logging infrastructure still uses syslog as the preferred transport because it Just Works™.

…Mostly.

Syslog-ng is a powerful tool and has worked well, as long as we’ve paid a little attention to performance tuning it. This is a collection of our favourite syslog-ng tuning tips.

Our central syslog server, an 8-core box with 12GB of RAM, currently handles around 60,000 events per second at peak, at ~25% CPU load.

Rule ordering

If, like us, you have a central syslog server with all of your hosts sending their vital information (whether it be system logs, Apache logs, Squid logs, or anything else) you will probably have a lot of filters set up to match certain hostnames, and certain log lines so you can get the logs into their correct places.

For example, you may be trying to find all Apache access log lines across all of your webservers, so you end up with all the logs in a single file destination (something like /log/web/access.log, perhaps).

This feature is widely used and extremely powerful for sorting your logs, but it’s also very expensive in terms of CPU time. Regex matching in particular has to run against every log event that comes in, and when you’re pushing tens of thousands of them every second, it can begin to hurt.

In the case of our Apache access log example, you would use the “flags(final)” attribute to tell syslog-ng to stop processing a line once it has matched, so it doesn’t even have to check the other regexes. That is all well and good, but have you considered what order syslog-ng checks those match statements in internally?

For example, we use Chef pretty extensively at Etsy to automate a lot of things; syslog-ng is one of them. Each machine role has its own syslog-ng definitions, and our Chef recipes automatically generate both the client and server config for use with that role. To do this, we template out the configuration, and drop the files into /etc/syslog-ng.d/, which is included from the main syslog-ng.conf.

Now, if your biggest traffic (log-wise) happens to be from your webservers, and your config file ends up being called “web.conf”, syslog-ng will quite happily parse all your configs in filename order, and when it compiles its rules, the config that checks whether a line is from Apache will end up at the end of the list. You are potentially running tens if not hundreds of other match statements and regexes for no reason whatsoever for the bulk of your traffic.

Luckily the fix is extremely simple: If using one config file, keep the most used rules at the top. If you use syslog-ng.d, keep your most used rules in a file that begins with a “0-” to force it to the top of the list. This tiny change alone halved the CPU we were using on our syslog-ng server.
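As a sketch of what such a “first in line” rule might look like (the filter, source, and path names here are illustrative, not our actual config):

# /etc/syslog-ng.d/0-web.conf -- the "0-" prefix sorts this file first, so the
# highest-volume rule is evaluated before anything else.
filter f_apache_access { program("apache2"); };

destination d_web_access { file("/log/web/access.log"); };

log {
    source(s_network);
    filter(f_apache_access);
    destination(d_web_access);
    flags(final);   # matched: stop here and skip every later filter and regex
};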

tl;dr: Make sure syslog-ng is parsing the most frequently logged lines first, not last. 

Log windows and buffer sizing

Math time! Syslog-ng has a great feature called “flow control” – when your servers and applications send more traffic than syslog-ng can handle, it will buffer the lines in memory until it can process them. The alternative would be to drop the log lines, resulting in data loss.

The four variables associated with flow control that we will look at are:

  • log_fifo_size – The size of the output buffer. Each destination has its own one of these. If you set it globally, one buffer is created for each destination, of the size you specify.
  • log_iw_size – The initial size of the input control window. Syslog-ng uses this as a control to see if the destination buffer has enough space before accepting more data. Applies once to each source (not per-connection!)
  • log_fetch_limit – The number of messages to pull at a time from each source. Applies to every connection of the source individually.
  • max_connections – The maximum number of TCP connections to accept. You do use TCP, don’t you?!

Each of your servers shouldn’t open more than one TCP connection to your central syslog server, so make sure you set max_connections to a value higher than the number of servers sending you logs.

After that, it’s time to break out your calculator and commit these equations to memory:

log_iw_size = max_connections * log_fetch_limit
log_fifo_size = log_iw_size * (10~20)

There is some variance on how you calculate log_fifo_size which you will have to experiment with. The principle is this:

Syslog-ng will fetch at most log_fetch_limit * max_connections messages each time it polls the sources. Your log_fifo_size should be able to hold many polls before it fills up. When your destination (file on disk? another syslog server?) is not able to accept messages quickly enough, they will accumulate in the log_fifo buffer, so make this BIG.

log_fifo_size = (log_iw_size = max_connections * log_fetch_limit) * 20
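Putting illustrative numbers into those equations, a sketch of the relevant source and destination options might look like this (assuming roughly 500 clients, each holding a single TCP connection; all of the values are examples, not our production settings):

source s_network {
    tcp(ip(0.0.0.0) port(514)
        max_connections(500)     # higher than the number of hosts sending logs
        log_fetch_limit(100)
        log_iw_size(50000));     # 500 * 100
};

destination d_web_access {
    # 50000 * 20; each destination gets its own output buffer of this size
    file("/log/web/access.log" log_fifo_size(1000000));
};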

Experiment with disabling power saving features

At Etsy, we take pride in trying to do our bit for the planet. It’s what our business is built on: encouraging good behaviours around recycling and re-using precious resources. It’s why we announced earlier this year that we are now a member of the B-Corp movement, so we can measure our success in giving back to the world in various ways. One of the criteria for this involves how much energy we use as an organisation; a tough subject when you run a datacenter, which is well known for using huge amounts of power. Because of this, we pride ourselves on working with vendors that also care about their power footprint, and on squeezing out the most savings we can without affecting performance.

However, this isn’t to say that the power saving modes provided with servers are perfect. We have hundreds of servers that can scale back their CPUs and other devices, saving power with absolutely zero measured effect on performance or any other metric (and, believe us, we tested). There are, however, two recorded cases where this hasn’t worked out well for us.

Briefly, the principle behind power saving in servers is to “scale back” CPU performance (that is, to decrease the clock speed and thus power usage) when the server doesn’t need it. With many modern machines having multiple CPUs, the CPUs can often be the biggest power draw, so the savings involved here are huge.


Power usage on a typical server. This server would use 472 watts of power if the CPUs were constantly at full speed, the maximum they hit in the last 24 hours was 280 watts.

What if your server demands a lot of CPU power? Well, that’s no problem: the CPU can scale back up to full speed instantly, with basically no effect on response time.

But what if your server does a lot of context switching, or has very bursty CPU usage? Two prime examples of this are Nagios and syslog-ng. Nagios has to spin up processes to execute checks in a very spiky manner, and sometimes there are hundreds of them at once, so the cost of switching between all of those processes (even if the time each individual one takes is tiny) is huge; this is known as context switching. A similar thing happens with syslog-ng, where the incoming log events are very bursty, so the CPU actually spends more time scaling back and forth than doing our processing.

In these two instances, we switched from power saving mode to static full power mode, and the amount of CPU consumed was halved. More importantly, that wasted CPU shows up as system time: time spent context switching and waiting for other events was suddenly reduced dramatically, and all CPU cores can operate on whatever needs doing as soon as possible.

There are some great tools that allow you to watch your CPU scaling in action, for example i7z, a great command line tool (and UI, if you fancy) that is easy to get running and gives you a great view into the inner workings of Intel CPUs.

The important point here is that we would’ve actually purchased a more power hungry machine to scale with our log traffic if we hadn’t found this, somewhat defeating the purpose of the power saving feature in the first place.

tl;dr: Experiment with the power settings on your servers and unlock their full potential; if it makes no difference, put them back.

Summary

In total these changes took about 5 days of research and testing, and half a day of effort to implement. Before making these changes we were regularly hitting the limits of what this server was capable of processing, and we had considered buying much larger servers and moving to other methods of log aggregation.

A small dose of profiling and planning has reduced the load on the server to 1/5th, reduced our power consumption and improved the efficiency of the whole system:


This graph illustrates the decrease in CPU usage when we performed two of the steps above: from 70% CPU consumed to 15% with two very simple changes.

This post was co-written by Avleen Vig (@avleen) and Laurie Denness (@lozzd) battling to solve hard (and sometimes easy) problems to make operations better for everyone. Why not come and help us?


Static Analysis for PHP

Posted by Nick Galbreath | Filed under engineering, infrastructure, security

At Etsy we have three tiers of static analysis on our PHP code that run either on every commit or periodically every hour. They form an important part of our continuous deployment pipeline, along with one-button deploys, fast unit and functional tests, copious amounts of graphing, and a fantastic development environment, to make sure code flows safely and securely to production.

Sanity and Style Checking

These checks eliminate basic problems and careless errors.

An obvious rule is that syntax errors should never make it into the source repository, let alone production. So “php -l” is run as a pre-commit check on each changed file.

We don’t use PHP’s native templating, where code and HTML are mixed in one file. Our PHP files are purely code, and output is rendered via another templating system. To make sure this works correctly, we check that there is no text before the initial <?php tag; otherwise that text is sent to the client and prevents us from setting HTTP headers. Likewise, we make sure that PHP tags are balanced and that there is no trailing text.
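A minimal sketch of that check, simplified from whatever the real pre-commit hook does, might look like this:

<?php
// Simplified sketch of the "no output outside PHP tags" check.
function file_is_pure_php($path) {
    $src = file_get_contents($path);

    // Any byte before the opening <?php tag would be sent to the client
    // and prevent us from setting HTTP headers later.
    if (strpos($src, '<?php') !== 0) {
        return false;
    }

    // Anything after a closing ?> tag (other than whitespace) is also output.
    $close = strrpos($src, '?>');
    if ($close !== false && trim(substr($src, $close + 2)) !== '') {
        return false;
    }

    return true;
}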

There are also a number of basic coding style checks. Nothing particularly exotic, but they help make the code readable and consistent across an ever-growing engineering department. Most of these are implemented using CodeSniffer.

Formal Checks using HpHp

Facebook’s HipHop for PHP is a full reimplementation of the PHP/Apache stack, including a compiler, a new PHP runtime, and a web server. Making your application run under it requires some serious surgery, since it’s missing many modules you might depend on. However, it has a fantastic static analyzer which can be used independently. It does global analysis of your entire code base for:

  • Too many or too few arguments in a function/method call
  • Undeclared global or local variables
  • Use of return value of a function that actually returns nothing
  • Functions that have a required argument after an optional one
  • Unknown functions, methods, or base classes
  • Constants declared twice

and a few others. It doesn’t always track the latest PHP versions exactly, so you’ll have to whitelist some of the errors, but overall it’s been wildly successful, with almost no false positives.

Why not do all this using PHP’s built-in tokenizer token_get_all or CodeSniffer? HpHp can analyze 10,000 files in a few seconds, while the built-in function is a few orders of magnitude slower. Since it’s so fast, we can run the static analysis as a pre-commit hook that prevents bugs from even being checked in, which it does almost every day.
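To make the list above concrete, here are some contrived examples of the kinds of problems the analyzer flags; none of this is real Etsy code.

<?php
// Contrived examples of errors a global static analyzer will catch.

function add($a, $b) {
    return $a + $b;
}
add(1, 2, 3);          // too many arguments in a function call

function log_event($msg) {
    echo $message;     // undeclared local variable (typo: $msg vs $message)
}

function notify() {
    // returns nothing
}
$result = notify();    // using the return value of a function that returns nothing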

Security Checks

We run ClamAV antivirus checks on our source repository. Has it found anything yet? No (phew). But at a 1GB/sec scan rate and the low, low cost of free, there is no reason not to do it. We aren’t worried so much about PHP code, but occasionally Word and PDF documents are put into the source repo. ClamAV also scans for URLs and matches them against the Google Safe Browsing database of malware and phishing sites.

The other security checks are alerts and triggers for code reviews. Files are scanned for commonly misused or abused functions involving cryptography, random numbers, and process management. If new instances are found, an alert for a code review is raised. Many times the code is just fine, but sometimes adjustments are needed, or the goal can be achieved without using cryptography at all. Likewise, we alert on functions that “take a password”, such as ftp_login, since we want to avoid passwords being checked into source control. Some files are sensitive enough that any change triggers an alert for a full review.
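As a very rough sketch of the “alert for review” idea (the trigger list and the function name below are only examples; the real tooling is more extensive):

<?php
// Hypothetical sketch of flagging risky function calls for code review.
$review_triggers = array('mcrypt_encrypt', 'openssl_encrypt', 'rand', 'exec', 'ftp_login');

function find_review_triggers($path, array $triggers) {
    $hits = array();
    foreach (token_get_all(file_get_contents($path)) as $token) {
        if (is_array($token) && $token[0] === T_STRING && in_array($token[1], $triggers, true)) {
            $hits[] = $token[1];   // candidate for a code-review alert
        }
    }
    return $hits;
}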

A lot more detail can be found in the presentation Static Analysis for PHP, first given at PHPDay in Verona, Italy on May 19, 2012. They put on a great conference, and we highly recommend it for next year.


No more locks with Xtrabackup

Posted by Arie Kachler | Filed under databases, operations

Percona’s Xtrabackup is a great product. It performs binary backups of heavily loaded Mysql servers amazingly fast. We’ve used it here at Etsy for years and it works very well for us. Mysqldump is another way of doing backups, but for servers with hundreds of gigabytes of data it’s too slow to be useful, especially for restoring a backup. It can literally take days to restore a couple of hundred gigabytes generated with Mysqldump. Usually when you need to restore from a backup, you’re in some sort of emergency, and waiting days is not an option.

But Xtrabackup has a significant shortcoming: it needs a global lock at the end of its procedure. Not good for a server doing hundreds or thousands of queries per second. Percona touts Xtrabackup as “non-blocking”, which is for the most part true, but not entirely.

Restoring a Mysql server, most of the time, involves installing an OS, Mysql, and any additional packages you want, then restoring the latest Xtrabackup data, and finally giving the new server the replication coordinates where it left off before the crash so it can catch up to the point where its data is consistent with its master. Here’s where the lock comes in: to get a reliable reading of replication coordinates, Xtrabackup issues a “FLUSH TABLES WITH READ LOCK” in the final stage of its process. When the lock is granted, Xtrabackup reads the “MASTER STATUS” and releases the lock. FTWRL is very disruptive to a busy server. It tells Mysql to start the process of read/write locking all tables, which in turn causes all new connections to wait for the lock to be released. Mysql then waits for all outstanding queries to finish, and then grants the lock. If there’s a long-running query when FTWRL is requested, you will undoubtedly get a query pile-up, which can quickly overwhelm the maximum number of connections your server is configured to accept, and your application will stall.

Percona’s documentation states that there is a --no-lock option for Xtrabackup. It also states “Use this option to disable table lock with FLUSH TABLES WITH READ LOCK. Use it only if ALL your tables are InnoDB and you DO NOT CARE about the binary log position of the backup”. We don’t want any locks, but we do want the “binary log position of the backup”, aka replication coordinates.

It turns out that the replication coordinates are hidden in the backup files when you run it with the --no-lock option. You just have to know how to get to them. Xtrabackup’s backup procedure involves copying Mysql’s data files to another location, knowing the exact point in time when the copying started, and creating an extra file, named xtrabackup_logfile, which contains all writes that occurred during the copying. Restoring a backup with Xtrabackup requires a “prepare” phase, which is basically applying all of the writes from the xtrabackup_logfile onto the data files. When you do the prepare phase, a new file named xtrabackup_binlog_pos_innodb will appear in the restore directory. This file contains the replication coordinates that we need to reestablish replication. With or without the --no-lock option, xtrabackup_binlog_pos_innodb is created in your restore directory!

Even a 1-second stall can be disruptive for a busy server, but our locks used to last around 30 seconds. That was before we adopted the --no-lock option combined with getting replication coordinates from the xtrabackup_binlog_pos_innodb file for restores.

An important thing to note is that this only works if you exclusively use InnoDB tables. You shouldn’t be using MyISAM tables anyway; use them only for the unavoidable: the `mysql` database, which contains grants and other internal metadata.


Better Random Numbers in PHP using /dev/urandom

Posted by Nick Galbreath | Filed under security

The design of PHP’s basic random number generators, rand and its newer variant mt_rand, is based on the C Standard Library. For better or worse, both use a single global state, and this state can be reset using srand (or mt_srand). This means anyone (a developer, a third-party module, a library) could set the state to a fixed value, and every random number that follows will be the same for every request. Sometimes this is the desired behavior, but it can also have disastrous consequences. For instance, everyone’s password reset code could end up being the same.

Recently, Argyros and Kiayias, in I Forgot Your Password: Randomness Attacks Against PHP Applications, suggested there might be more fundamental problems in how PHP constructs the state of the random number generator: just by seeing the output of a few calls to rand or mt_rand, one can predict the next output. Combined with certain password reset implementations, this would allow an attacker to perform account takeover. (This paper is also going to be presented on July 25 at Black Hat USA.)

Quite some time ago, Etsy switched over to a different way of generating random numbers, using /dev/urandom, that prevents both issues. /dev/urandom is a special pseudo-file on unix-like operating systems that generates “mostly random” bytes and is non-blocking. /dev/random (with no “u”) is for truly cryptographic applications such as key generation and is blocking: once you exhaust its supply of randomness, it blocks until it distills new randomness from the environment. Therefore, you don’t want to use /dev/random in your web application. To see why, connect to a (non-production!) remote machine and type in “cat /dev/random > /dev/null”, and then in another window try to log in. You won’t be able to, since SSH can’t read from /dev/random and therefore can’t complete the connection.

A pedagogical replacement of rand, mt_rand with /dev/urandom using the mcrypt module might be:

                                               
// equiv to rand, mt_rand
// returns int in *closed* interval [$min, $max]
function devurandom_rand($min = 0, $max = 0x7FFFFFFF) {
    $diff = $max - $min;
    if ($diff < 0 || $diff > 0x7FFFFFFF) {
        throw new RuntimeException("Bad range");
    }
    $bytes = mcrypt_create_iv(4, MCRYPT_DEV_URANDOM);
    if ($bytes === false || strlen($bytes) != 4) {
        throw new RuntimeException("Unable to get 4 bytes");
    }
    $ary = unpack("Nint", $bytes);
    $val = $ary['int'] & 0x7FFFFFFF;   // 32-bit safe
    $fp = (float) $val / 2147483647.0; // convert to [0,1]
    return round($fp * $diff) + $min;
}

A long time ago, Etsy didn’t even have mcrypt installed, and so we read directly from /dev/urandom using fopen and fread (see also stream_set_read_buffer).
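A sketch of that approach, reading raw bytes straight from /dev/urandom, might look like this:

<?php
// Sketch: read raw bytes directly from /dev/urandom when mcrypt isn't available.
function devurandom_bytes($count) {
    $fp = fopen('/dev/urandom', 'rb');
    if ($fp === false) {
        throw new RuntimeException('Unable to open /dev/urandom');
    }
    $bytes = fread($fp, $count);
    fclose($fp);
    if ($bytes === false || strlen($bytes) !== $count) {
        throw new RuntimeException("Unable to read $count bytes");
    }
    return $bytes;
}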

Using /dev/urandom is quite a bit slower than using the native PHP functions. However, as a percentage of total page load time, it’s effectively free and unnoticeable.  Do we graph this? Yes we do!  Here’s a day’s worth of random:

Note that the above code for converting bytes to an integer will show a slight bias with very large ranges, so we can’t use it for Etsy’s monte-carlo long-range simulation forecasting hand-made supercomputer, but for all other (non-cryptographic) web applications it is likely to be fine. For other algorithms and details on this topic, the main reference is Knuth’s Art of Computer Programming: Seminumerical Algorithms. A more modern treatment can be found in any of the Numerical Recipes books. The Java source code for java.util.Random is also a good reference. Enjoy!

