December 2013 Site Performance Report

Posted by on January 23, 2014 / 4 Comments

It’s a new year, and we want to kick things off by filling you in on site performance for Q4 2013. Over the last three months front-end performance has been pretty stable, while backend load time has increased slightly across the board.

Server Side Performance

Here are the median and 95th percentile load times for signed in users on our core pages on Wednesday, December 18th:

Server Side Performance December 2013

There was an across-the-board increase in both median and 95th percentile load times over the last three months, with a larger jump on our search results page. Two main factors contributed to this increase: higher traffic during the holiday season and an increase in international traffic, which is slower due to translations. On the search page specifically, browsing in US English is significantly faster than browsing in any other language. This isn’t sustainable over the long term as our international traffic grows, so we will be devoting significant effort to improving it over the next quarter.

Synthetic Front-end Performance

As usual, we are using our private instance of WebPagetest to get synthetic measurements of front-end load time. We use a DSL connection and test with IE8, IE9, Firefox, and Chrome. The main difference in this report is that we have switched from measuring Document Complete to measuring Speed Index, since we believe it provides a better representation of user-perceived performance. To make sure we can compare against historical data, we pulled Speed Index data from October for the “old” numbers. Here is the data; all of the numbers are medians over a 24-hour period:

Synthetic Front-End Performance December 2013

Start render didn’t really change at all, and Speed Index was up on some pages and down on others. Our search results page, which had the biggest increase on the backend, actually saw a 0.2 second decrease in Speed Index. Since this is a new metric we are tracking, we aren’t sure how stable it will be over time, but we believe that it provides a more accurate picture of what our visitors are really experiencing.

One of the downsides of our current wpt-script setup is that we don’t save waterfalls for old tests – we only save the raw numbers. Thus when we see something like a 0.5 second jump in Speed Index for the shop page, it can be difficult to figure out why that jump occurred. Luckily we are Catchpoint customers as well, so we can turn to that data to get granular information about what assets were on the page in October vs. December. The data there shows that all traditional metrics (render start, document complete, total bytes) have gone down over the same period. This suggests that the jump in Speed Index is due to loading order, or perhaps a change in what’s being shown above the fold. Our inability to reconcile these numbers illustrates a need to have visual diffs, or some other mechanism to track why Speed Index is changing. Saving the full WebPagetest results would accomplish this goal, but that would require rebuilding our EC2 infrastructure with more storage – something we may end up needing to do.

Overall we are happy with the switch to Speed Index for our synthetic front-end load time numbers, but it exposed a need for better tooling.

Real User Front-end Performance

These numbers come from mPulse, and are measured via JavaScript running in real users’ browsers:

Real User Front-end Performance December 2013

There aren’t any major changes here, just slight movement that is largely within rounding error. The one outlier is search, which got slower here even though our synthetic numbers showed it getting faster. This illustrates the difference between measuring onload, which mPulse does, and measuring Speed Index, which is currently only available in WebPagetest. This is one of the downsides of Real User Monitoring – since you want the overhead of measurement to be low, the data that you can capture is limited. RUM excels at measuring things like redirects, DNS lookup times, and time to first byte, but it doesn’t do a great job of providing a realistic picture of how long the full page took to render from the customer’s point of view.

Conclusion

We have a backend regression to investigate, and front-end tooling to improve, but overall there weren’t any huge surprises. Etsy’s performance is still pretty good relative to the industry as a whole, and relative to where we were a few years ago. The challenge going forward is going to center around providing a great experience on mobile devices and for international users, as the site grows and becomes more complex.


Static Analysis with OCLint

Posted by on January 15, 2014 / 2 Comments

At Etsy, we’re big believers in making tools do our work for us.

On the mobile apps team we spend most of our time focused on building new features and thinking about how the features of Etsy fit into an increasingly mobile world. One of the great things about working at Etsy is that we have a designated period called Code Slush around the winter holidays where product development slows down and we can take stock of where we are and do things that we think are important or useful, but that don’t fit into our normal release cycles. Even though our apps team releases significantly less frequently than our web stack, making it easier to continue developing through the holiday season, we still find it valuable to take this time out at the end of the year.

This past slush we spent some of that time contributing to the OCLint project and integrating it into our development workflow. OCLint, as the name suggests, is a linter tool for Objective-C. It’s somewhat similar to the static analyzer that comes built into Xcode, and it’s built on the same clang infrastructure. OCLint is a community open source project, and all of the changes we’re discussing have been contributed back and are available with the rest of OCLint on their GitHub page.

If you run OCLint on your code it will tell you things like, “This method is suspiciously long” or “The logic on this if statement looks funny”. In general, it’s great at identifying these sorts of code smells. We thought it would be really cool if we could extend it to find definite bugs and to statically enforce contracts in our code base. In the remainder of this post, we’re going to talk about what those checks are and how we take advantage of them, both in our code and in our development process.

Rules

Objective-C is a statically typed, object-oriented language. Its type system gets the job done, but it’s fairly primitive in certain ways. Often, additional contracts on a method are specified as comments. One thing that comes up regularly is knowing which methods a subclass is required to implement. Typically this is indicated in a comment above the method.

For example, UIActivity.h contains the comment // override methods above a list of several of its methods.

This sort of contract is trivial to check at compile time, but it’s not part of the language, making these cases highly error prone. OCLint to the rescue! We added a check for methods that subclasses are required to implement. Furthermore, you can use the magic of Objective-C categories to mark up existing system libraries.

To mark declarations, OCLint uses clang’s __attribute__((annotate(""))) feature to pass information from your code to the checker. To make these marks on a system method like the -activityType method in UIActivity, you would stick the following in a header somewhere:

@interface UIActivity (StaticChecks)
...
- (NSString *)activityType
__attribute__((annotate("oclint:enforce[subclass must implement]")));
...
@end

That __attribute__ stuff is ugly and hard to remember, so we #defined it away:

#define OCLINT_SUBCLASS_MUST_IMPLEMENT \
__attribute__((annotate("oclint:enforce[subclass must implement]")))

Now we can just do:

@interface UIActivity (StaticChecks)
...
- (NSString *)activityType OCLINT_SUBCLASS_MUST_IMPLEMENT;
...
@end


We’ve contributed back a header file with these sorts of declarations, culled from the documentation in UIKit, that anyone using OCLint can import into their project. We added this file to our project’s .pch file so it’s included in every one of our classes automatically.

Some other checks we’ve added:

Protected Methods

This is a common feature in OO languages – methods that only a class and its subclasses can call. Once again, this is usually indicated in Objective-C by comments, or sometimes by sticking the declarations in a category in a separate header. Now we can just tack OCLINT_PROTECTED_METHOD onto the end of the declaration, as sketched below. This makes the intent clear, obvious, and statically checked.
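
For illustration, marking a protected method could look something like the following, mirroring the subclass-must-implement macro above. The macro definition, the annotation string, and the class and method names here are our own invented example, not necessarily the exact convention the rule expects.

#define OCLINT_PROTECTED_METHOD \
__attribute__((annotate("oclint:enforce[protected method]")))

@interface ETSListingCardView : UIView
// Intended to be called only by ETSListingCardView and its subclasses.
- (void)reloadCardContents OCLINT_PROTECTED_METHOD;
@end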

Prohibited Calls

This is another great way to embed institutional knowledge directly into the codebase. You can mark methods as deprecated using clang, but that produces an immediate compiler error. We’ll talk more about our workflow later, but doing it through OCLint allows us to migrate from old to new methods gradually, and to easily use things while debugging that we wouldn’t want to commit.

We have categories on NSArray and NSDictionary that we use instead of the built-in methods, as discussed here. Marking the original library methods as prohibited lets anyone coming into our code base know that they should be using our versions instead of the built-in ones. We also have a marker on NSLog, so that people don’t accidentally check in debug logs. Frequently the replacement method calls the prohibited method itself, but wraps it with a bunch of checks and error handling logic; we use OCLint’s error suppression mechanism to hide the violation that would otherwise be generated by making the original call. This is more syntactically convenient than dealing with clang pragmas, as you would have to with the deprecated attribute.
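
As a concrete sketch (the macro name and annotation string are assumptions patterned on the macros above, and the choice of method is just an example), marking a built-in accessor as prohibited might look like:

#define OCLINT_PROHIBITED_CALL \
__attribute__((annotate("oclint:enforce[prohibited call]")))

@interface NSArray (StaticChecks)
// Use our safe wrapper instead of calling this directly.
- (id)objectAtIndex:(NSUInteger)index OCLINT_PROHIBITED_CALL;
@end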

Ivar Assignment Outside Getters

We prefer to use properties whenever possible as opposed to bare ivar accesses. Among other things, this is more syntactically and semantically regular and makes it much easier to set breakpoints on changes to a given property when debugging.  This rule will emit an error when it sees an ivar assigned outside its getter, setter, or the owning class’s init method.
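
A contrived example of what the rule flags (the class and property are invented for illustration):

@interface ETSCartViewController : UIViewController
@property (nonatomic, copy) NSString *couponCode;
@end

@implementation ETSCartViewController

- (void)applyCoupon:(NSString *)code {
    _couponCode = code;        // flagged: bare ivar assignment outside init/getter/setter
    self.couponCode = code;    // fine: goes through the property setter
}

@end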

-isEqual: without -hash

In Cocoa, if you override the -isEqual: method that checks for object equality, it’s important to also override the -hash method. Otherwise you’ll see weird behavior in collections when using the object as a dictionary key. This check finds classes that implement -isEqual: without implementing -hash. This is another great example of getting contracts out of comments and into static checks.
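
As a refresher, the contract is that equal objects must return equal hashes; a minimal, hypothetical example:

@interface ETSListingID : NSObject
@property (nonatomic, assign) long long listingID;
@end

@implementation ETSListingID

- (BOOL)isEqual:(id)object {
    if (![object isKindOfClass:[ETSListingID class]]) return NO;
    return self.listingID == [(ETSListingID *)object listingID];
}

// Required companion: equal objects must hash identically.
- (NSUInteger)hash {
    return (NSUInteger)self.listingID;
}

@end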

Workflow

We think that OCLint adds a lot of value to our development process, but there were a couple of barriers we had to deal with to make sure our team picked it up. First of all, any codebase written without OCLint’s rules strictly enforced from the start will have tons of minor violations. Sometimes the lower priority things it warns about are actually done deliberately to increase code clarity. To cut down on the noise we went through and disabled a lot of the rules, leaving only the ones we thought added significant value. Even with that, there were still a number of things it complained about frequently – things like not using Objective-C collection literals. We didn’t want to change a huge amount of code all at once to get our violation count down to zero, so we needed a way to see only the violations relevant to the current change. Thus, we wrote a little script to run OCLint only on the changed files. This also allows us to easily mark something as no longer recommended without generating tons of noise, removing it entirely from our codebase, or filling up Xcode’s warnings and errors.

Finally, we wanted to make it super easy for our developers to start using it. We didn’t want to require them to run it manually before every commit; that would be just one more thing to forget and one more thing anyone joining our team would have to know about. Plus, it’s kind of slow to run all of its checks on a large codebase. Instead, we worked together with our terrific testing and automation team to integrate it into our existing GitHub pull request workflow. Now, whenever we make a pull request, it automatically kicks off a Jenkins job that runs OCLint on the changed files. When the job is done, it posts a summary as a comment right on the pull request, along with a link to the full report on Jenkins. This ended up feeling very natural, and similar to how we interact with the PHP code sniffer on our web stack.

Conclusion

We think OCLint is a great way to add static checks to your Cocoa code. There are some interesting things going on with clang plugins and direct Xcode integration, but for now we’re going to stick with OCLint. We like its base of existing rules, the ease of gradually applying those rules to our code base, and its reporting options and Jenkins integration.

We also want to thank the maintainer and the other contributors for the hard work they’ve put into the project. If you use these rules in interesting ways, or even boring ones, we’d love to hear about it. Interested in working at a place that cares about the quality of its software and about solving its own problems instead of just letting them mount? Our team is hiring!


Android Staggered Grid

Posted by on January 13, 2014 / 3 Comments

While building the new Etsy for Android app, a key goal for our team was delivering a great user experience regardless of device. From phones to tablets, phablets and even tabphones, we wanted all Android users to have a great shopping experience without compromise.

A common element throughout our mobile apps is the “Etsy listing card”. Listing cards are usually the first point of contact users have with our sellers’ unique items, whether they’re casually browsing through categories or searching for something specific. On these screens, when a listing card is shown, we think our users should see the images without cropping.

Android Lists and Grids

A simple enough requirement, but on Android things aren’t always that simple. We wanted these cards in a multi-column grid, with a column count that changes with device orientation while keeping grid position. We needed header and footer support, and scroll listeners for neat tricks like endless loading and quick return items. This wasn’t achievable using a regular old ListView or GridView.

Furthermore, neither the Android ListView nor GridView is extensible in any meaningful way. A search for existing open source libraries didn’t reveal any that met our requirements, including the unfinished StaggeredGridView available in the AOSP source.

Considering all of these things, we committed to building our own Android staggered grid view. The result is a UI component that is built on top of the existing Android AbsListView source for stability, but supports multiple columns with varying row sizes and more.

Android Trending - Devices

How it Works

The StaggeredGridView works much like the existing Android ListView or GridView. The example below shows how to add the view to your XML layout, specifying the margin between items and the number of columns in each orientation.

<com.etsy.android.grid.StaggeredGridView
    xmlns:android="http://schemas.android.com/apk/res/android"
    xmlns:app="http://schemas.android.com/apk/res-auto"
    android:id="@+id/grid_view"
    android:layout_width="match_parent"
    android:layout_height="match_parent"
    app:item_margin="8dp"
    app:column_count_portrait="2"
    app:column_count_landscape="3" />

You can of course set the layout margin and padding as you would any other view.

To show items in the grid, create any regular old ListAdapter and assign it to the grid. Then there’s one last step: you need to ensure that the ListAdapter’s views maintain their aspect ratio, so that when column widths adjust on rotation, each item’s height responds accordingly.

How do you do this? The AndroidStaggeredGrid includes a couple of utility classes including the DynamicHeightImageView which you can use in your adapter. This custom ImageView overrides onMeasure() and ensures the measured height is relative to the width based on the set ratio. Alternatively, you can implement any similar custom view or layout with the same measurement logic.

public void setHeightRatio(double ratio) {
    if (ratio != mHeightRatio) {
        mHeightRatio = ratio;
        requestLayout();
    }
}

@Override
protected void onMeasure(int widthMeasureSpec, int heightMeasureSpec) {
    if (mHeightRatio > 0.0) {
        int width = MeasureSpec.getSize(widthMeasureSpec);
        int height = (int) (width * mHeightRatio);
        setMeasuredDimension(width, height);
    }
    else {
        super.onMeasure(widthMeasureSpec, heightMeasureSpec);
    }
}

And that’s it. The DynamicHeightImageView will maintain the aspect ratio of your items and the grid will take care of recycling views in the same manner as a ListView. You can check out the GitHub project for more details on how it’s used including a sample project.
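
As a rough sketch, a getView() implementation using it might look like the following, assuming an ArrayAdapter of listings; the layout, view IDs, listing model, and image loader here are placeholders rather than code from the app:

@Override
public View getView(int position, View convertView, ViewGroup parent) {
    if (convertView == null) {
        convertView = LayoutInflater.from(getContext())
                .inflate(R.layout.grid_item_listing, parent, false);
    }
    DynamicHeightImageView imageView =
            (DynamicHeightImageView) convertView.findViewById(R.id.listing_image);
    Listing listing = getItem(position);

    // Keep the view's height proportional to the image, whatever the column width.
    imageView.setHeightRatio((double) listing.getImageHeight() / listing.getImageWidth());
    // Hand the URL to whatever image loader you use.
    imageLoader.load(listing.getImageUrl(), imageView);

    return convertView;
}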

But There’s More

Unlike the GridView, you can add header and footer views to the StaggeredGridView. You can also apply internal padding to the grid that doesn’t affect the header and footer views. An example screen using these options is shown below. On our search results screen we use a full-width header and add a little extra horizontal grid padding for 10-inch tablets.

Android Search
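
Wiring this up looks much like a plain ListView. The sketch below assumes the grid mirrors ListView’s addHeaderView API, and the layout and adapter names are placeholders:

StaggeredGridView gridView = (StaggeredGridView) findViewById(R.id.grid_view);

// Add the full-width header before setting the adapter, as with ListView.
View header = getLayoutInflater().inflate(R.layout.search_header, gridView, false);
gridView.addHeaderView(header);

gridView.setAdapter(listingAdapter);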

Into the Real World

During the development process we fine-tuned the grid’s performance using a variety of real world Android devices available in the Etsy Device Lab. When we released the new Etsy for Android app at the end of November, the AndroidStaggeredGrid was used throughout. Post launch we monitored and fixed some lingering bugs found with the aid of the awesome crash reporting tool Crashlytics.

We decided to open source the AndroidStaggeredGrid: a robust, well-tested, real-world UI component for the Android community to use. It’s available on GitHub or via Maven, and we are accepting pull requests.

Finally, a friendly reminder that the bright folks at Etsy mobile are hiring.

You can follow Deniz on Twitter at @denizmveli.


Migrating to Chef 11

Posted by on October 16, 2013 / 8 Comments

Configuration management is critical to maintaining a stable infrastructure. It helps ensure systems and software are configured in a consistent and deterministic way. For configuration management we use Chef.  Keeping Chef up-to-date means we can take advantage of new features and improvements. Several months ago, we upgraded from Chef 10 to Chef 11 and we wanted to share our experiences.

Prep

We started by setting up a new Chef server running version 11.6.0.  This was used to validate our Chef backups and perform testing across our nodes.  The general plan was to upgrade the nodes to Chef 11, then point them at the new Chef 11 server when we were confident that we had addressed any issues.  The first order of business: testing backups.  We’ve written our own backup and restore scripts and we wanted to be sure they’d still work under Chef 11.  Also, these scripts would come in handy to help us quickly iterate during break/fix cycles and keep the Chef 10 and Chef 11 servers in sync.  Given that we can have up to 70 Chef developers hacking on cookbooks, staying in sync during testing was crucial to avoiding time lost to troubleshooting issues related to cookbook drift.

Once the backup and restore scripts were validated, we reviewed the known breaking changes present in Chef 11.  We didn’t need much in the way of fixes other than a few attribute precedence issues and updating our knife-lastrun handler to use run_context.loaded_recipes instead of node.run_state().

Unforeseen Breaking Changes

After addressing the known breaking changes, we moved on to testing classes of nodes one at a time.  For example, we upgraded a single API node to Chef 11, validated that Chef ran cleanly against the Chef 10 server, then proceeded to upgrade the entire API cluster and monitor it before moving on to another cluster.  In the case of the API cluster, we found an unknown breaking change that prevented those nodes from forwarding their logs to our log aggregation hosts.  This episode was initially a bit of a head-scratcher and warrants a little attention, as it may help others during their upgrade.

The recipe we use to configure syslog-ng sets several node attributes for various bits and bobs.  The following line in our cookbook is where all the fun started:

if !node.default[:syslog][:items].empty?

That statement evaluated to false on the API nodes running Chef 11 and resulted in a vanilla syslog-ng.conf file that didn’t direct the service to forward any logs.  Thinking that we could reference those nested attributes via the :default symbol, we updated the cookbook.  The Chef 11 nodes were content, but all of the Chef 10 nodes were failing to converge because of the change.  It turns out that accessing default attributes via the node.default() method and via the node[:default] symbol are not equivalent.  To work around this, we updated the recipe to check for Chef 11 or Chef 10 behavior and assign our variables accordingly.  See below for an example illustrating this:

if node[:syslog].respond_to?(:has_key?)
    # Chef 11
    group = node[:syslog][:group] || raise("Missing group!")
    items = node[:syslog][:items]
else
    # Chef 10
    group = node.default[:syslog][:group] || raise("Missing group!")
    items = node.default[:syslog][:items]
end

In Chef 11, the :syslog symbol points to the key we need in the attribute namespace (it’s an ImmutableHash object) and responds to the .has_key?() method; in that case, we pull in the needed attributes Chef 11-style.  If the client is Chef 10, that test fails and we pull in the attributes using the .default() method.

Migration

Once we had upgraded all of our nodes and addressed any issues, it was time to migrate to the Chef 11 server.  To be certain that we could recreate the build and that our Chef 11 server cookbooks were in good shape, we rebuilt the Chef 11 server before proceeding.  Since we use a CNAME record to refer to our Chef server in the nodes’ client.rb config file, we thought that we could simply update our internal DNS systems and break for an early beer.  To be certain, however, we ran a few tests by pointing a node at the FQDN of the new Chef server.  It failed its Chef run.

Chef 10, by default, communicates with the server via HTTP; Chef 11 uses HTTPS.  In general, Chef 11 Server redirects Chef 11 clients attempting to use HTTP over to HTTPS.  However, this breaks down when the client requests cookbook versions from the server, and the client receives an HTTP 405 response.  The reason is that the client sends a POST to the following API endpoint to determine which versions of the cookbooks in its run_list need to be downloaded:

/environments/production/cookbook_versions

If Chef is communicating via HTTP, the POST request is redirected to use HTTPS.  No big deal, right?  Well, RFC 2616 is pretty clear that when a request is redirected, “[t]he action required MAY be carried out by the user agent without interaction with the user if and only if the method used in the second request is GET…”  When the Chef 11 client attempts to hit the /environments/production/cookbook_versions endpoint via GET, Chef 11 Server responds with an HTTP 405, as it only allows POST requests to that resource.

The fix was to update all of our nodes’ client configuration files to use HTTPS to communicate with the Chef Server.  dsh (distributed shell) made this step easy.
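
The change itself is a one-liner in each node’s client.rb (the hostname below is just an example):

# /etc/chef/client.rb
chef_server_url "https://chef.example.com"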

Just before we finalized the configuration update, we put a freeze on all Chef development and used our backup and restore scripts to populate the new Chef 11 server with all the Chef objects (nodes, clients, cookbooks, data bags, etc) from the Chef 10 server.  After validating the restore operation, we completed the client configuration updates and shut down all Chef-related services on the Chef 10 server.  Our nodes happily picked up where they’d left off and continued to converge on subsequent Chef runs.

Post-migration

Following the migration, we found two issues with chef-client that required deep dives to understand and correct.  First, we had a few nodes whose chef-client processes were exhausting all available memory.  Initially, we switched to running chef-client in forking mode.  Doing so mitigated the issue to an extent (the forked child released its allocated memory when it completed and was reaped), but we were still seeing an unusual memory utilization pattern.  Those nodes were running a recipe that included nested searches for nodes, and instead of collecting just the node names from those searches, we were holding on to whole node objects.  For a long-running chef-client process, this steadily consumed available memory.  Once we corrected that issue, memory utilization fell to acceptable levels.

See the following screenshot illustrating the memory consumption for one of these nodes immediately following the migration and after we updated the recipe to collect only the FQDNs we needed:

deploy_host_chef_nested_searches_mem_util

Here’s an example of the code in the recipe that created our memory monster:

# find nodes by role, the naughty, memory hungry way
roles = search(:role, '*:*')    # NAUGHTY
roles.each do |r|
  nodes_dev = search(:node, "role:#{r.name} AND fqdn:*dev.*")    # HUNGRY
  template "/etc/xanadu/#{r.name.downcase}.cfg" do
    ...
    variables(
      :nodes => nodes_dev
    )
  end
end

Here’s the same code example, collecting only the FQDNs instead:

# find nodes by role, the community-friendly, energy-conscious way
search(:role, '*:*') do |r|
  fqdns = []
  search(:node, "role:#{r.name} AND fqdn:*dev.*") do |n|
    fqdns << n.fqdn
  end
  template "/etc/xanadu/#{r.name.downcase}.cfg" do
    ...
    variables(
      :nodes => fqdns
    )
  end
end

Second, we found that when chef-client failed to connect to the server, it would leave behind its PID file, preventing future instances of chef-client from starting.  This has been fixed in version 11.6.0 of chef-client.

Conclusion

Despite running into a few issues following the upgrade, thorough testing and Opscode’s documented breaking changes helped make our migration fairly smooth. Further, the improvements made in Chef 11 have helped us improve our cookbooks. Finally, because our configuration management system is updated, we can confidently focus our attention on other issues.


September 2013 Site Performance Report

Posted by on October 14, 2013 / 3 Comments

As we enter the fourth quarter of 2013, it’s time for another site performance report about how we did in Q3. Our last report highlighted the big performance boost we saw from upgrading to PHP 5.4, and this report will examine a general front-end slowdown that we saw over the last few months.

Server Side Performance

Here are the median and 95th percentile load times for signed in users on our core pages on Wednesday, September 18th:

On the server side we saw a modest decrease on most pages, with some pages (e.g. the profile page) seeing a slight increase in load time. As we have mentioned in past reports, we are not overly worried about the performance of our core pages, so the main thing we are looking for here is to avoid a regression. We managed to achieve this goal, and bought ourselves a little extra time on a few pages through some minor code changes. This section isn’t very exciting, but in this case no news is good news.

Synthetic Front-end Performance

The news here is a mixed bag. As usual, we are using our private instance of WebPagetest to get synthetic measurements of front-end load time. We use a DSL connection and test with IE8, IE9, Firefox, and Chrome. Here is the data; all of the numbers are medians over a 24-hour period:

On the plus side, we saw a pretty significant decrease in start render time almost across the board, but document complete time increased everywhere, and increased dramatically on the listing page. Both of these global effects can be explained by rolling out deferred JavaScript everywhere – something we mentioned back in our June report. At that time we had only done it on the shop page; since then we have enabled it on all pages by default. This explains the decrease in start render time on all pages except the shop page. The profile page also had a slight uptick in start render time, which we plan to investigate.

One of the side effects of deferring JavaScript is that document complete time tends to rise, especially in IE8. This is an acceptable tradeoff for us, since we care more about optimizing start render, and getting content in front of the user as quickly as possible. We’re also not convinced that a rise in document complete time will have a negative impact on business metrics, and we are running tests to figure that out now.

The massive increase in document complete time on the listing page is due to the rollout of a page redesign, which is much heavier and includes a large number of web fonts. We are currently setting up a test to measure the impact of web fonts on customer engagement, and looking for ways to reduce page weight on the listing page. While document complete isn’t a perfect metric, 8 seconds is extremely high, so this bears looking into. That said, we A/B tested engagement on the old page and the new, and all of the business metrics we monitor are dramatically better with the new version of the listing page. This casts further doubt on the impact of document complete on customer behavior, and illustrates that performance is not the only thing influencing engagement – design and usability obviously play a big role.

Real User Front-end Performance

These numbers come from mPulse, and are measured via JavaScript running in real users’ browsers:

The effect here mirrors what we saw on the synthetic side – a general upward trend, with a larger spike on the listing page. These numbers are for the “page load” event in mPulse, which is effectively the onload event. As Steve Souders and others have pointed out, onload is not a great metric, so we are looking for better numbers to measure on the real user side of things. Unfortunately there isn’t a clear replacement at this point, so we are stuck with onload for now.

Conclusion

Things continue to look good on the server side, but we are slipping on the front-end. Partly this has to do with imperfect measurement tools, and partly it has to do with an upward trend in page weight that is occurring all across the web – and Etsy is no exception. Retina images, web fonts, responsive CSS, new JavaScript libraries, and every other byte of content that we serve continue to provide challenges for the performance team. As we continue to get hard data on how much page weight impacts performance (at least on mobile), we can make a more compelling case for justifying every byte that we serve.

You can follow Jonathan on Twitter at @jonathanklein


Nagios, Sleep Data, and You

Posted by on September 28, 2013 / 15 Comments

Gettin’ Shuteye

Ian Malpass once commented that “[i]f Engineering at Etsy has a religion, it’s the Church of Graphs.”  And I believe!  Before I lay me down to sleep during an on-call shift, I say a little prayer that should something break, there’s a graph somewhere I can reference.  Lately, a few of us in Operations have begun tracking our sleep data via Jawbone UPs.  After a few months of this we got to wondering how this information could be useful in the context of Operations.  Sleep is important, and being on call can lead to interrupted sleep.  Even worse, after being woken up, the amount of time it takes to return to sleep varies by person and situation.  So we thought, “why not graph the effect of being on call against our sleep data?”

Gathering and Visualizing Data

We already visualize code deploys against the myriad graphs we generate, to lend context to whatever we’re measuring.  We use Nagios to alert us to system and service issues.  Since Nagios writes consistent entries to a log file, it was a simple matter to write a Logster parser to ship metrics to Graphite when a host or service event pages out to an operations engineer.  Those data points can then be displayed as “deploy lines” against our sleep data.
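
A minimal sketch of such a parser, using Logster’s LogsterParser/MetricObject interface, is shown below; the regex and metric names are illustrative rather than our production parser, and would need tuning for your own nagios.log and paging setup.

import re
from logster.logster_helper import LogsterParser, MetricObject

class NagiosNotificationLogster(LogsterParser):

    def __init__(self, option_string=None):
        self.host_notifications = 0
        self.service_notifications = 0
        self.reg = re.compile(r'(?P<type>HOST|SERVICE) NOTIFICATION:')

    def parse_line(self, line):
        # Count Nagios host and service notification lines.
        match = self.reg.search(line)
        if match:
            if match.group('type') == 'HOST':
                self.host_notifications += 1
            else:
                self.service_notifications += 1

    def get_state(self, duration):
        # Emit one metric per notification type; Graphite then draws these as the
        # vertical lines overlaid on our sleep dashboards.
        return [
            MetricObject('nagios.host_notifications', self.host_notifications, 'notifications'),
            MetricObject('nagios.service_notifications', self.service_notifications, 'notifications'),
        ]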

For the sleep data, we used (and extended) Aaron Parecki’s ‘jawbone-up’ gem, gathering summary and detail sleep information daily via Jon Cowie’s handy ‘jawboneup_to_graphite’ script.  Those data are then displayed on personal dashboards (using Etsy’s Dashboard project).

Results

So far, we’ve only just begun to collect and display this information.  As we learn more, we’ll be certain to share our findings.  In the meantime, here are examples from recent on-call shifts.

nagios_deploy_lines_prototype

This engineer appeared to get some sleep!

laurie_on_call_sleep_detail

Here, the engineer was alerted to a service in the critical state in the wee hours of the morning. From this graph we can tell that he was able to address the issue fairly quickly, and most importantly, get back to sleep fast.

NOTE:  Jawbone recently opened up their API.  Join the party and help build awesome apps and tooling around this device!


LXC – Automating Containers aka Virtual Madness (Part 2)

Posted by on September 23, 2013 / 5 Comments

Presenting Virtual Madness!

This is part 2 of our LXC blog post. If you missed the first half you can read it here.

We already have a lot of tooling around making it easy to create a virtual machine for each developer, so it made sense to build our LXC virtualization tools into the same interface.

blog1

As shown above, our main page gives a quick look at all the LXC containers that are running and the Jenkins instances to which they are attached. The containers are known colloquially as the Bobs, a reference to Bob the Builder. The interface also separates the Bobs by physical disk, making it easy to detect whether any of the virtual hosts’ disks are overloaded.

Creation of a new Bob follows the one-button deploy principle we strive for at Etsy:

blog2

The next numerically available hostname is automatically populated into the form by referencing all defined hosts in Chef and finding the first gap in the numbering. The first physical host with enough free disk space is pre-selected from the drop-down, based on the fact that we can fit 14 Bobs on each physical host (more on that later). Simply clicking the “Make it” button kicks off the process of creating an LXC container, and its progress is streamed via websockets to the browser in real time.

The process of creating a new container is roughly as follows. Which disk on a given physical host has room for a new Bob is determined by a simple rule: the first disk holds a maximum of four Bobs, and the second and third disks hold a maximum of five each. Only four are allowed on the first disk in order to save room for the host OS and a base LVM volume which we clone for each Bob. The first numerically available IP address in any of our virtual host subnets is allocated to the Bob by iterating through reverse DNS lookups on the range until an NXDOMAIN is returned, and nsupdate is used to dynamically create the DNS entries for the new container. An empty LVM volume is then created using lvcreate and mkfs.ext3, it is mounted, and rsync is used to clone the base volume onto the empty volume.
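
As a rough sketch, that volume-cloning portion boils down to a handful of commands; the volume group, size, and paths below are invented for illustration rather than taken from our tooling.

import subprocess

def run(cmd):
    print('+ ' + ' '.join(cmd))
    subprocess.check_call(cmd)

def provision_bob_volume(bob_name, vg='vg_bobs', size='20G', base_mount='/mnt/lxc/base'):
    device = '/dev/%s/%s' % (vg, bob_name)
    mount_point = '/mnt/lxc/%s' % bob_name

    run(['lvcreate', '-L', size, '-n', bob_name, vg])          # empty LVM volume
    run(['mkfs.ext3', device])                                 # filesystem for the container
    run(['mkdir', '-p', mount_point])
    run(['mount', device, mount_point])
    run(['rsync', '-a', base_mount + '/', mount_point + '/'])  # clone the base volume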

The base volume is created in advance by cloning the physical host’s root filesystem (which is configured almost exactly the same way as the Bobs) and then filtering out some unneeded items. /dev is fixed using mknod because a simple rsync will render it useless. Sed is used to fix all the network settings in /etc/sysconfig. Unneeded services are disabled by creating a small bash script inside the volume and then chrooting to run it.

Once the base volume is copied into our new volume, the Chef authorization key is removed so that the container is forced to automatically reregister with our Chef server as a new node with its newly specified hostname. Our default route to the gateway is added based on subnet and datacenter. An LXC config file is then created using the IP address and other options. lxc.cgroup.cpuset.cpus is set to 0 as our containers are not pinned to a specific set of CPU cores. This allows each executor to use as many CPU cores as can be allocated at any given time.

Finally, we bootstrap the node using Chef, bringing the node up to date with all our current packages and configurations. This is necessary because the base LVM volume from which the new container is cloned is created in advance and often has stale configurations.

There is also a batch UI which executes the exact same process as above but allows for the creation of Bobs in bulk:

blog3

Once Bobs are created, they are ready to be attached to Jenkins instances and given executor roles:

blog4

This interface allows for the choice of instance and label (executor), and the number of Bobs to attach. Unattached Bobs are surfaced by use of the Jenkins API. After iterating through all unattached Bobs, taking into account the rule of only one heavyweight executor per disk, a Groovy script is used to attach each Bob to Jenkins. The Jenkins API does not provide this functionality directly, but it does allow for the execution of Groovy scripts, which we leveraged to automate this.

For our final bit of automation around this process, we created a one-button interface to wipe and rebuild the base LVM container on all physical hosts. This is necessary when the LVM containers fall very far out of date and start to cause issues bringing Bobs up to date using Chef. This interface simply performs the base container creation task described above across any number of physical hosts.

blog5

All in all, this virtual madness allowed us to turn our many manual processes of setting up LXC containers into simple one-button deploys. How do you handle this kind of stuff? Let us know in the comments!

Co-written by Jayson Paul, Patrick McDonnell and Nassim Kammah.

You can follow Jayson on Twitter at @jaysonmpaul
You can follow Patrick on Twitter at @mcdonnps


LXC – Running 14,000 tests per day and beyond! (Part 1)

Posted by on September 23, 2013 / 9 Comments

Continuous Integration at Etsy

As Etsy continues to grow and hire more developers, we have faced the continuous integration scaling challenge of how to execute multiple concurrent test suites without slowing the pace of our deploy queue. With a deployment rate of up to 65 deploys / day and a total of 30 test suites (unit tests, integration tests, functional tests, smokers…) that run for every deploy, this means running the test suites 1950 times a day.

Our core philosophy is to work from the master branch as much as possible. While developers can run the test suites on their own virtual machines, that breaks the principle of testing in a clone of the production environment: since everyone has root access and can install anything on their VMs, we can’t ensure a sane, standard test environment there. Yet developers must ensure their code will not break the tests before they commit, or they will block the deploy queue. To this end, we developed a small library called “try” which sends a diff of the developer’s work to a Jenkins server that in turn applies it against the latest copy of the master branch and runs all the test suites (check out TryLib on GitHub). Unfortunately, providing this feedback loop to every developer comes at a cost. With up to 10 developers “trying” their changes simultaneously, we now need to run the test suites an additional 13,700 times a day (we average 515 try runs a day, with different test suite configurations).

Last but not least, the test feedback loop ought to be short to provide any value. Our current “SLA” for running all test suites is 5 minutes. Following the “divide and conquer” strategy, and with the help of the master project Jenkins plugin, we are able to run the test suites in parallel. Our initial setup had to maintain multiple executors per box, which caused connection issues and required preventing multiple executors from attempting to access the same Jenkins workspace at once.

Workload Considerations

While parallelization was initially a major victory, it soon became problematic.  Our workload consists of two different classes of tests: one class constrained by CPU (“lightweight executors”, which run our unit tests), and the other bound by disk I/O (“heavyweight executors”, which run our integration tests).  Prior to virtualization, we would find that if multiple heavyweight jobs ran on a single host, the host would slow to a crawl as its disk got hammered.

We also had issues with workspace contention when executing multiple jobs on a single host, and this need for filesystem isolation, in combination with resource contention and issues within Jenkins concerning too many parallel executors on a single host, led us toward the decision that virtualization was the best path forward.  We were, however, concerned with the logistics of managing potentially hundreds of VMs using our standard KVM/QEMU/libvirt environment.

Virtualization with LXC

Enter LXC, a technology wonderfully suited to virtualizing homogeneous workloads.  Knowing that the majority of our executors are constrained on CPU, we realized that a virtualization technology that forces us to be conscious about resource allocation did not make sense for our needs.  Instead of concerning ourselves with processor pinning, RAM provisioning, disk imaging, and managing multiple running kernels, LXC allows us to shove a bunch of jailed processes into LVM containers and call it a day.

While the workings of LXC are outside the scope of this post, here is a little bit of information about our specific configuration.  Our physical hosts are 2U Supermicro chassis containing four blades each.  Each blade has three attached SSDs for a total of 12 disks per chassis.  In our case, disk I/O is a key constraint; if we run two heavyweight jobs accessing a single disk simultaneously, the run time doubles for each job.  In order to maintain our 5 minute run time SLA, we ensure that only one heavyweight executor runs on each disk at any given time.

Previously, we had many executors running on each physical host, but with LXC, we have decided that each container (LXC’s terminology for a virtualized instance) shall host a single executor.  Maintaining a 1:1 container:executor mapping alleviates the connection issues we saw prior to virtualization within Jenkins while trying to maintain a large number of executors per host, and we never have to worry about multiple executors attempting to access the same Jenkins workspace at once.  It also allows for easy provisioning using the Virtual Madness interface discussed below.

Continue to Part 2

Co-written by Nassim Kammah, Patrick McDonnell and Jayson Paul. Be sure to stay tuned for Part 2 of this blog post, where Jayson discusses automating LXC container management.

You can follow Nassim on Twitter at @kepioo 
You can follow Patrick on Twitter at @mcdonnps
You can follow Jayson on Twitter at @jaysonmpaul


Improving Etsy for iOS with server-based logging

Posted by on September 11, 2013 / 1 Comment

One of the major challenges the Native Apps group faces at Etsy is maintaining and debugging a large codebase with a small team. Android and iOS are less mature platforms at Etsy and correspondingly lack some of the development tools and safety net that our fellow engineers have built up on the web side. Part of what the apps team does is build our own platform-specific tools. This is a case study for how we applied Etsy’s log-everything ethos to a particular class of bugs in Etsy for iOS.

Collection errors in Cocoa Touch

Crashing is one of the worst things an app can do. Cocoa has a number of patterns that encourage silent failures over crashing, most notably nil-messaging. But some parts of Cocoa have fail-fast behavior instead. The most common cases involve the collection classes, which throw exceptions for things like out-of-bounds array access or inserting a nil key or value into a dictionary. Server hiccups can cause problems when we use collections if we’re not careful, since we interpret empty fields returned from the Etsy API as nil objects. These issues were especially common when we were first developing our app, as the server API was constantly changing, but they can also be caused by all sorts of other server errors.

We could solve this problem by wrapping every array or dictionary access in an if statement to check for these cases, but that would be so much more verbose it seems likely to cause more bugs than it catches. Instead, we abstracted this pattern out into a series of wrapper methods and added them as a Safe category on each collection class, such as -[NSArray(Safe) safeObjectAtIndex:] and -[NSMutableDictionary(Safe) safeSetObject:forKey:]. These methods handle those exceptional cases, either taking no action or returning nil as appropriate. This way, if we have unexpected missing data, we might do something like display an empty label, but at least the app remains stable. Users can refresh or go back to a screen with good data.
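
For example, the dictionary variant simply drops the write when either argument is missing. A minimal sketch of the idea (not the exact production code):

@implementation NSMutableDictionary (Safe)

// Silently ignore writes with a nil key or value instead of throwing.
- (void)safeSetObject:(id)object forKey:(id<NSCopying>)key {
    if (object && key) {
        [self setObject:object forKey:key];
    }
}

@end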

Over time, we started using these methods all over the code base. They supported several convenient patterns, such as lazily loading cached objects from an array:

for (SomeClass * object in myArray) {
    NSUInteger index = [myArray indexOfObject:object];
    //Returns nil when out-of-bounds
    UIView * view = [cachedViews safeObjectAtIndex:index];
    if (!view) {
        view = //...

However, when Apple added subscripted collection indexing in Clang 3.4, we moved away from the Safe methods wherever it looked okay. We wanted our code to be more idiomatic as well as aesthetically pleasing; array[0] is much easier to scan than [array safeObjectAtIndex:0] and more consistent with library code. What we didn’t realize is that our liberal use of these Safe categories had been hiding logic bugs: benign-looking changes started causing app instability. For example, we discovered an NSNumberFormatter that had always returned nil due to bad input. When we switched to subscripting syntax, that minor failure turned into a serious crash.

Safer refactoring with Assertions and Expectations

In a subsequent postmortem, we realized that the intent of our code needed to be explicit. It was impossible to tell which Safe calls were necessary, which were simply convenient, and which were hiding bugs. Removing the category entirely wasn’t possible, since that would expose all the crashes we intended to fix in the first place.

Ultimately we came up with a multi-part plan to rationalize our use of these methods:

  1. Split our Safe methods into two variants. These variants have identical user visible behavior, but different intended semantics:
    1. A variant where we explicitly expect the out-of-bounds case and handle it by returning nil, ignoring the inputs, or whatever is necessary in that case.
    2. A variant where we don’t expect the out-of-bounds case and seeing it means something has gone wrong.
  2. Log the cases where something is wrong and upload them to our server-based logging system.
  3. Use the aggregated logs to identify spots in our code base where we’re catching potentially unexpected behavior.

The first part is simple. We added a set of methods, such as -[NSArray(Safe) objectOrNilAtIndex:], that explicitly indicate that nil is an acceptable value. This is useful for caching patterns and setting or getting optional values:

- (id) objectOrNilAtIndex:(int)index {
    if (index >= [self count])
        return nil;
    return [self objectAtIndex:index];
}

We kept our original “Safe” methods as the second variant, but modified them to do the necessary logging. For that logging, enter our new assertion macros: EtsyAssert and EtsyExpect. The Etsy logging system has two main logging levels, ERROR and INFO, to separate critical error logs (which should be immediately investigated) from information (which we just want archived).  EtsyAssert corresponds to ERROR, while EtsyExpect corresponds to INFO.

//EtsyAssert is an alias for NSAssert in debug builds.
//In release builds, failures will log an error with a stack trace to the server.

#ifdef NS_BLOCK_ASSERTIONS
    #define EtsyAssert(condition, description, ...) \
        if (!(condition)) { \
            SERVER_LOG_ERROR_WITH_TRACE(@"EtsyAssert", description, ##__VA_ARGS__); \
        }
#else
    #define EtsyAssert(condition, description, ...) \
        NSAssert(condition, (description), ##__VA_ARGS__)
#endif

//EtsyExpect will complain if an expectation fails, but won't ever throw an exception.
#define EtsyExpect(condition, description, ...) \
    if (!(condition)) { \
        SERVER_LOG_INFO(@"EtsyExpect", description, ##__VA_ARGS__); \
    }

EtsyAssert is a wrapper for NSAssert. It evaluates an expression and, when assertions are enabled, throws an exception if that expression evaluates to false. When assertions are disabled (typical in Release builds), it logs an error-level message. EtsyExpect is a little different: it indicates that we expect the code to work a particular way, and if it doesn’t, an error may (or may not) have occurred. We added EtsyExpect clauses to all of our old Safe methods. Since we were dealing with a potentially large number of existing minor issues, we used EtsyExpect so as not to interrupt regular developer workflows.

Let’s take a look at EtsyExpect in action:

- (id)safeObjectAtIndex:(int)index {
    EtsyExpect(index < [self count],
      @"Retrieving object at index %i from array of length %i", index, [self count]);
    return [self objectOrNilAtIndex:index];
}

Server-based logging for mobile clients

What makes EtsyExpect more useful than the built-in Cocoa logging operations is the server-side logging. SERVER_LOG_INFO and SERVER_LOG_ERROR are macro wrappers around our logging framework:

#define SERVER_LOG_ERROR(ns, description, ...) \
    [EtsyAnalytics logError:[NSString stringWithFormat:description, ##__VA_ARGS__] \
                  namespace:ns \
                       line:__LINE__ \
                   function:__PRETTY_FUNCTION__ \
                 attributes:nil]

The logError:namespace:line:function:attributes: method piggybacks on our existing native analytics framework. The EtsyAnalytics class sends zipped analytics data and log messages to an API server. Analytics data is ultimately used by our Big Data stack, but these log messages are printed directly to a file. From there they can be indexed, grepped, splunked, or watched in (almost) real time, like any other log. These logs look like this:

api02 : [Fri Aug 09 21:25:43 2013] [iPhone 5 6.1.2/2.7 rv:2.7.0] [redacted UDID] [EtsyExpect] [-[NSArray(Safe) safeObjectAtIndex:]+401] Retrieving object at index 5 from array of length 5 (called by 1 Etsy 0x0029dcc1 Etsy + 2600129)

What makes this powerful is the ability to see issues quickly. In real-life conditions, users hit edge cases that are rare or difficult to test. By aggregating errors sampled from 1% of app users, we can see where EtsyExpect and EtsyAssert fail. That increased visibility allows us to pick off, one by one, the bugs that our Safe methods were hiding. By leveraging this data, we can be sure of which cases are safe to change.
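
The sampling itself can be as simple as a per-session coin flip; the sketch below illustrates the idea rather than our actual implementation.

// Decide once per session whether this client uploads expectation logs (~1% of sessions).
static BOOL ETSShouldUploadLogs(void) {
    static BOOL shouldUpload;
    static dispatch_once_t onceToken;
    dispatch_once(&onceToken, ^{
        shouldUpload = (arc4random_uniform(100) == 0);
    });
    return shouldUpload;
}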

As an example: the log message above was a small logic error in the UIViewController subclass we use to display a listing. When the user swiped through a listing’s images, we preloaded the next image – without checking whether or not more images were available. App users were doing this thousands of times a day, but developers may have only encountered the issue a handful of times (if at all). While the bug was ultimately harmless, logging made it visible, so we were able to clean up this code with very little risk.

The future of logging

Moving a few logs out of the Xcode console and onto the Etsy servers turns out to be a very powerful tool. We’re still exploring what else we can do with it: for example, it’s also been useful for diagnosing invalid HTTP requests sent by the app. App logs can be correlated with server logs, so we can track certain errors from the client back to the server. Even watching the logs in Splunk helps us see which bugs we’ve fixed in the last version, and which we may have introduced. Ultimately, by making Etsy for iOS a little bit more like our web stack, we’ve found a way to make it more robust.

By the way, we’re hiring!


Etsy’s Device Lab

Posted by on August 9, 2013 / 9 Comments

At the end of 2012, Etsy CEO Chad Dickerson noted that mobile visits grew 244% year over year and represented more than 25% of all traffic. Halfway through 2013, this metric continues to steadily increase, with mobile web and app traffic accounting for about 40% of visits to Etsy.

As this traffic grows, our designers and developers need to be able to test how Etsy works on more devices, screen sizes, and operating systems. The Mobile and Corporate IT teams built a mobile device testing lab to help with this. What started as a gray cabinet full of devices is now a handcrafted shelf with labels and power, and we’ve begun encouraging product development teams to come and use it!

We’ve learned a ton so far from building this lab, and we’re eager to share these lessons with others as more companies consider creating their own testing suite.

Why build a testing lab?

Many designers and developers at Etsy have iPhones or very recent Android smartphones. Mobile web work was historically tested on an employee’s phone or in their desktop browser using spoofed user agents, which is great for day-to-day development, but we need to test our work on a wider range of real, physical devices that represent what our visitors are using when they arrive at Etsy.com. Small bugs and edge cases were being missed during development, and we weren’t necessarily testing on devices that our members are actually using.

We decided to build the device lab with devices that drive the most traffic to the site, as well as some older devices that we know can be problematic to develop for (particularly older Androids and Blackberries). There are a ton of articles on how to choose devices for your lab; see the end of this article for links!

“Testing your designs in the device lab is great. I had no idea how many strange aspect ratios were out there on Android devices. Being an Apple only mobile device user, the device lab was an invaluable resource for testing my designs across unfamiliar operating systems and hardware.” – Jason Huff, Designer

This has helped our teams not just test new projects, but also go back and find things to fix in already-shipped projects:

“The mobile device lab really is a phenomenal resource. While it’s easy to check features on your own phone or tablet, it’s easy to forget how many other devices are out there and being used by our members. When we went up to the lab for the first time, our team discovered a number of device specific scenarios that we hadn’t come across during our previous testing, and I’m so glad these were surfaced” – Triona Fritsch, Product Manager

Unexpectedly, the device lab has also been a brainstorming opportunity for teams who are new to mobile development. As people go through and test their existing projects, they also see how they could have reframed their approach from the beginning, and they’re inspired to do more mobile-first work.

“[The device lab] also gave us some new insight into scenarios that we should consider during the initial design phases for future mobile projects.” – Triona Fritsch, Product Manager

Setup

We currently have 34 devices on this custom-made bookshelf (made by Reade Bryan, who works on Etsy’s Facilities team). See the end of this article for more resources on selecting devices. Here’s our breakdown:

The bookshelf was designed to have different-height openings for our range of device sizes. Each bookshelf has a lip on the front so that we have the option of running cables behind it and putting labels on the front of it.

The bottom shelf holds all our power setup as well as custom built crates from Etsy seller EmmersonWoodworks. These crates hold cables for testers to borrow if they’d like to take devices back to their desk, as well as screen wipes, labels and other miscellaneous items that we need nearby.

We purchased a variety of stands from Etsy sellers, including PhoneStandz, DonorStand, woodworksRD and Fenton Creek WoodWorks. The stands ranged in material, size, and slot width – some older phones need slightly bigger stands since they’re thicker than newer phones. We typically stand phones upside-down in their stands to charge because the cables rarely sit nicely in the stands while charging.

We also use library cards for signing out devices that leave the room, which we purchased from jardindepapier. The library cards are color-coded by operating system (as is the washi tape from InTheClear we used to hold them down, label the front of the shelves and the back of each device). On the front of the library card pockets we include:

This way, testers can make sure they’re checking out a range of devices, screen sizes, and OS versions. We also developed a getting started guide that outlines device traffic, known quirks like overlays and fixed elements, and how to get set up with Adobe Edge Inspect.

Each device gets a cable drop to hold its cable in place so it doesn’t fall behind the shelves, a library card with its information, Adobe Edge Inspect installed, an asset tag, and a stand that fits it. We also install a background image that reminds testers of our two device lab rules:

  1. Please sign out devices
  2. Don’t upgrade the OS or apps

It does take some time to get each device set up in the lab, but the organization is really worth it to keep it easy for testers to use a device, sign it out, and plug it back in.

Major Lessons (So Far)

Power

Power has been the biggest challenge for us. As a Certified B Corporation, we want to track energy consumption and make sure we’ve set up the device lab in an optimal way. We’re measuring the amount of power we use, and we’ve also set the devices up on a power strip with a timer so they only charge for a few hours each day. We’re still finding the right balance.

Power was further complicated when we realized that most USB hubs are not set up for devices with a large power draw (like iPads). We had to learn the hard way that when you plug too many devices into a hub at once, the whole hub can stop working. We are now using a USB hub that can handle up to 32 devices and maxes out at 16 Amps (500mA per port). A safety protection device within the circuit board prevents a higher current from being drawn on any single port. This is ideal for lower-powered devices (most of the ones we have in the lab) but is not suitable for larger, more power-hungry devices like the iPad or other tablets. Tablets generally require a higher charging current, so they charge very slowly from the hub, or not at all. This is why we have some of the lab’s devices running directly into a power strip. The other nice thing about this hub is a fault indicator light, which helps identify problematic devices; devices that consume excess current are also disabled, leaving non-faulty devices charging.

Label all the things

Any web content editor will tell you that people don’t read, they scan. When a designer or developer comes to use the device lab, we want to make it so easy to use that they don’t have to think. Need to grab a cable to take back to your desk? Try the crate that says “BORROW THESE”. Need an iPhone 4? We’ve labeled the front of the bookcase where it lives and the back of the device so that it’s easy to match up later. Need to find the Blackberries among the devices? Look for the blue library cards (and blue washi tape to hold the cards down). The color-coding of library cards and tape was intentional to easily group Androids (green), iOS devices (manila), Kindles (yellow), Windows (pink), Blackberries (blue) and a Firefox phone (orange).

We wrapped washi tape around each of the USB cables available to borrow so that, should they walk away to someone’s desk and be forgotten, they’ll be relatively easy to spot later. We’ve also added a sliver of washi tape on or near the power button for each device so that it’s easy to find – this tends to be the most challenging part of using the device lab!

Testing on VMs

We have an awesome Security Team at Etsy who has been invaluable during the device lab setup. Recently, they helped install security certificates across devices to make it easier to test on VMs. Previously, testers would have to individually accept certificates device-by-device, which really hampered testing and development, especially when they were trying to tackle a lot of testing at once with Adobe Edge Inspect.

Adobe Edge Inspect

So far, this tool has been really handy for testing on lots of devices at once. We’re using a multi-device subscription (the free version of Adobe Edge Inspect only allows you to test on one device at a time) and installed the Adobe Edge Inspect app on all Androids, Kindles and iOS devices in the lab. It’s really awesome to feel like a puppet master and control all of those devices at once, but what’s really handy is capturing screenshots from all those devices at the same time so you can check them out on your laptop and audit them later.

We’ve come a long way with our device testing lab, and I’m excited to keep iterating on it and making it even easier to test with. In the future, we’ll be adding more automated testing to the devices, easier tracking of devices (including signing them out), and more analysis of use. As mobile traffic continues to increase, we’re eager to make sure that it’s incredibly easy for designers and developers to test their work and experience what our members do when they use Etsy from a mobile device.

before-after

More reading:

You can follow Lara, engineering manager for the mobile web team, on Twitter at @laraswanson
