Introducing nagios-herald

Posted on June 6, 2014

Alert Design

Alert design is not a solved problem. And it interests me greatly.

What makes for a good alert? Which information is most relevant when a host or service is unavailable? While the answer to those, and other, questions depends on a number of factors (including what the check is monitoring, which systems and services are deemed critical, what defines good performance, etc.), at a minimum, alerts should contain some amount of appropriate context to aid an on-call engineer in diagnosing and resolving an alerting event.

When writing Nagios checks, I ask the following questions to help suss out what may be appropriate context:

On the last point, about automating work, I believe that computers can, and should, do as much work as possible for us before they have to wake us up. To that end, I’m excited to release nagios-herald today!

nagios-herald: Rub Some Context on It

nagios-herald was created from a desire to supplement an on-call engineer’s awareness of conditions surrounding a notifying event. In other words, if a computer is going to page me at 3AM, I expect it to do some work for me to help me understand what’s failing. At its core, nagios-herald is a Nagios notification script. The power, however, lies in its ability to add context to Nagios alerts via formatters.

One of the best ways to see nagios-herald in action is to compare disk space alerts with and without context.

Disk Space Alert

[Figure: disk space alert without context]

I’ve got a vague idea of which volume is problematic, but I’d love to know more. For example, did disk usage suddenly spike? Or did it grow gradually, only tipping over the threshold as my head hit the pillow?

Disk Space Alert, Now *with* Context!

[Figure: disk space alert with context]

In the example alert above, a stacked bar chart clearly illustrates which volume the alert fired on. It includes a Ganglia graph showing the gradual increase in disk usage over the last 24 hours. And the output of the df command is highlighted, helping me understand which threshold this check exceeded.

For more examples of nagios-herald adding context, see the example alerts page in the GitHub repo.

“I Have Great Ideas for Formatters!”

I’m willing to bet that at some point, you looked at a Nagios alert and thought to yourself, “Gee, I bet this would be more useful if it had a little more information in it…”  Guess what? Now it can! Clone the nagios-herald repo, write your own custom formatters, and configure nagios-herald to load them.
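To give a sense of what a formatter looks like, here is a rough Ruby sketch. The class, module, and helper names below are assumptions rather than a verified piece of the nagios-herald API, so check the repo’s documentation for the real interface.

# A rough sketch of a custom formatter; class, module, and helper names
# are assumptions, not the verified nagios-herald API.
module NagiosHerald
  class Formatter
    class DiskSpaceSketch < Formatter
      # Append extra context to the alert body.
      def additional_info
        section = "additional_info"
        # Pull the raw check output from the Nagios environment (assumed helper).
        output = get_nagios_var("NAGIOS_SERVICEOUTPUT")
        add_text(section, "Raw check output: #{output}\n")
        add_html(section, "<b>Raw check output:</b> #{output}<br>")
      end
    end
  end
end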

I look forward to feedback from the community and pull requests!

Ryan tweets at @Ryan_Frantz and blogs at ryanfrantz.com.


Q1 2014 Site Performance Report

Posted on May 15, 2014

May flowers are blooming, and we’re bringing you the Q1 2014 Site Performance Report. There are two significant changes in this report: the synthetic numbers are from Catchpoint instead of WebPagetest, and we’re going to start labeling our reports by quarter instead of by month going forward.

The backend numbers for this report follow the trend from December 2013 – load times are slightly up across the board. The front-end numbers are slightly up as well, primarily due to experiments and redesigns. Let’s dive into the data!

Server Side Performance

Here are the median and 95th percentile load times for signed in users on our core pages on Wednesday, April 23rd:

[Table: server-side load times, median and 95th percentile]

There was a small increase in both median and 95th percentile load times over the last three months across the board, with a larger jump on the homepage. We are currently running a few experiments on the homepage, one of which is significantly slower than the other variants and is bringing up the 95th percentile. While we understand that this may skew test results, we want to get preliminary results from the experiment before we spend engineering effort on optimizing this variant.

As for the small increases everywhere else, this has been a pattern over the last six months, and is largely due to new features adding a few milliseconds here and there, increased usage from other countries (translating the site has a performance cost), and overall added load on our infrastructure. We expect to see a slow increase in load time for some period of time, followed by a significant dip as we upgrade or revamp pieces of our infrastructure that are suffering. As long as the increases aren’t massive, this is a healthy oscillation that makes good use of our engineering time.

Synthetic Front-end Performance

Because of some implementation details with our private WebPagetest instance, the data we have for Q1 isn’t consistent and clean enough to provide a true comparison between the last report and this one.  The good news is that we also use Catchpoint to collect synthetic data, and we have data going back to well before the last report.  This enabled us to pull the data from mid-December and compare it to data from April, on the same days that we pulled the server side and RUM data.

Our Catchpoint tests are run with IE9 only, and they run from New York, London, Chicago, Seattle, and Miami every two hours.  The “Webpage Response” metric is defined as the time it took from the request being issued to receiving the last byte of the final element on the page.  Here is that data:

[Table: synthetic front-end performance from Catchpoint]

The increase on the homepage is somewhat expected due to the experiments we are running and the increase in the backend time. The search page also saw a large increase in both Start Render and Webpage Response, but we are currently testing a completely revamped search results page, so this is also expected. The listing page also had a modest jump in start render time, and again this is due to differences in the experiments that were running in December vs. April.

Real User Front-end Performance

As always, these numbers come from mPulse, and are measured via JavaScript running in real users’ browsers:

[Table: real user front-end performance from mPulse]

No big surprises here: we see the same bump on the homepage and the search results page that we did in the server side and synthetic numbers. Everything else is essentially neutral, and isn’t particularly exciting. In future reports we are going to consider breaking this data out by region or by mobile vs. desktop, or perhaps providing percentiles other than the median (which is the 50th percentile).

Conclusion

We are definitely in a stage of increasing backend load time and front-end volatility due to experiments and general feature development. The performance team has been spending the past few months focusing on some internal tools that we hope to open source soon, as well as running a number of experiments ourselves to try to find some large perf wins. You will be hearing more about these efforts in the coming months, and hopefully some of them will influence future performance reports!

Finally, our whole team will be at Velocity Santa Clara this coming June, and Lara, Seth, and I are all giving talks.  Feel free to stop us in the hallways and say hi!


Web Experimentation with New Visitors

Posted on April 3, 2014

We strive to build Etsy with science, and therefore love how web experimentation and A/B testing help us drive our product development process. Several months ago we started a series of web experiments in order to improve Etsy’s homepage experience for first-time visitors. Testing against a specific population, like first-time visitors, allowed us to find issues and improve our variants without raising concerns in our community. This is how the page used to look for new visitors:

[Screenshot: the previous homepage shown to new visitors]

We established both qualitative and quantitative goals to measure improvements for the redesign. On the qualitative side, our main goal was to successfully communicate to new buyers that Etsy is a global marketplace made by people. On the quantitative side, we primarily cared about three metrics: bounce rate, conversion rate, and retention over time. Our aim was to reduce bounce rate (the percentage of visits that leave the site after viewing only the homepage) without affecting conversion rate (the proportion of visits that result in a purchase) or visit frequency. After conducting user surveys and usability tests and analyzing our target web metrics, we finally reached those goals and launched a better homepage for new visitors. Here’s what the new homepage looks like:

[Screenshot: the new homepage shown to new visitors]

Bucketing New Visitors

This series of web experiments marked the first time at Etsy that we tried to consistently run an experiment only for first-time visitors over a period of time. While identifying a new visitor is relatively straightforward, the logic to present that user with the same experience on subsequent visits is less trivial.

Bucketing a Visitor

At Etsy we use our open source Feature API for A/B testing. Every visitor is assigned a unique ID when they arrive at the website for the first time. To determine which bucket of a test a visitor belongs in, we generate a deterministic hash using the visitor’s unique ID and the experiment identifier. The main advantage of using this hash for bucketing is that we don’t have to worry about creating or managing multiple cookies every time we bucket a visitor into an experiment.
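As a minimal sketch of the idea (not the actual Feature API internals), deterministic bucketing can be as simple as hashing the pair and taking a modulus:

// Illustrative sketch only, not the actual Feature API internals.
// Hash the (visitor id, experiment) pair into a stable bucket from 0-99.
function bucket_for($visitor_id, $experiment_name) {
    $hash = hexdec(substr(md5($visitor_id . ':' . $experiment_name), 0, 7));
    return $hash % 100;
}

// With 'enabled' => 50, roughly half of all visitors land in the variant.
$in_variant = bucket_for($visitor_id, 'new_homepage') < 50;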

Identifying New Visitors

One simple way to identify a new visitor is by the absence of etsy.com cookies in the browser. In our first set of experiments we checked for the existence of the __utma cookie from Google Analytics, which we also used to define visits in our internal analytics stack.

Returning New Visitors

Before we define a returning new visitor, we first need to describe the concept of a visit. We use the Google Analytics visit definition, where a visit is a group of user interactions on our website within a given time frame. One visitor can produce multiple visits on the same day, or over the following days, weeks, or months. In a web experiment, the difference between a returning visitor and a returning new visitor is the relationship between the experiment start time and the visitor’s first landing time on the website. To put it simply, every visitor who lands on the website for the first time after the experiment start date will be treated as a new visitor, and will consistently see the same test variant on their first and subsequent visits.

As I mentioned before, we used the __utma cookie to identify visitors. One advantage of this cookie is that it tracks the first time a visitor landed on the website. Since we have access to both the first visit start time and the experiment start time, we can determine whether a visitor is eligible to see an experiment variant. The following diagram shows two visitors and their relationship to the experiment start time.

[Diagram: two visitors relative to the experiment start time]
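In rough PHP terms (a sketch, not our production code), the eligibility check looks something like this; the first-visit timestamp is the third field of the __utma cookie, and the experiment start variable is hypothetical:

// Sketch: read the first-visit timestamp out of the __utma cookie.
// Format: <domain hash>.<visitor id>.<first visit>.<previous visit>.<current visit>.<session count>
$parts = isset($_COOKIE['__utma']) ? explode('.', $_COOKIE['__utma']) : array();
$first_visit = isset($parts[2]) ? (int) $parts[2] : null;

// Only visitors whose first visit came after the experiment started are eligible.
$is_eligible = ($first_visit !== null) && ($first_visit >= $experiment_start_timestamp);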

Feature API

We added the logic to compare a visitor’s first landing time against an experiment start time as part of our internal Feature API. This way it’s really simple to set up web experiments targeting new visitors. Here is an example of how we set up an experiment configuration and an API entry point.

Configuration Set-up:

$server_config['new_homepage'] = [
   'enabled' => 50,
   'eligibility' => [
       'first_visit' => [
           'after' => TIMESTAMP
       ]
   ]
];

API Entry Point:

if (Feature::isEnabled('new_homepage')) {
   $controller = new Homepage_Controller();
   $controller->renderNewHomepage();
}

Unforeseen Events

When we first started analyzing the test results, we found that more than 10% of the visitors in the experiment had first visit landing times prior to our experiment start day. This suggested that old, seasoned Etsy users were being bucketed into this experiment. After investigating, we were able to correlate those visits to a specific browser: Safari 4+. The visits were a result of the browser making requests to generate thumbnail images for the Top Sites feature. These types of requests are generated any time a user is in the browser, even without visiting Etsy. On the web analytics side, this created a visit with a homepage view followed by an exit event. Fortunately, Safari provides a way to identify these requests using the additional HTTP header “X-Purpose: preview”. After filtering out these requests, we were able to correct this anomaly in our data. Below you can see that the experiment’s bounce rates decreased significantly after we got rid of these automated visits.

[Graph: bounce rates before and after filtering out the Safari preview requests]
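Server side, spotting these preview requests comes down to checking a single header; a minimal sketch (the logging and bucketing calls are hypothetical):

// Sketch: ignore Safari Top Sites thumbnail requests.
$is_preview = isset($_SERVER['HTTP_X_PURPOSE'])
    && strtolower($_SERVER['HTTP_X_PURPOSE']) === 'preview';

if (!$is_preview) {
    log_visit();      // hypothetical analytics call
    bucket_visitor(); // hypothetical bucketing call
}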

Although verifying the existence of cookies to determine whether a visitor is new may seem trivial, it is hard to be completely certain from this signal alone that a visitor has never been to your website before. One person can use multiple browsers and devices to view the same website: mobile, tablet, work or personal computer, or even a device borrowed from a friend. This is where deeper analysis can come in handy, such as filtering visits using attributes like user registration and signed-in events.

Conclusions

We are confident that web experimentation with new visitors is a good way to collect unbiased results and to reduce product development concerns such as disrupting existing users’ experiences with experimental features. Overall, this approach allows us to drive change. Going forward, we will use what we learned from these experiments as we develop new iterations of the homepage for other subsets of our members. Now that all the preparatory work is done, we can ramp up this experiment, for instance, to all signed-out visitors.

You can follow Diego on Twitter at @gofordiego


Responsive emails that really work

Posted on March 13, 2014

If you’ve ever written an HTML email, you’ll know that the state of the art is like coding for the web 15 years ago: tables and inline styles are the go-to techniques, CSS support is laughably incomplete, and your options for layout have none of the flexibility that you get on the “real web”.

Just like everywhere online, more and more people are using mobile devices to read their email.  At Etsy, more than half of our email opens happen on a mobile device!  Our desktop-oriented, fixed-width designs are beautiful, but mobile readers aren’t getting the best experience.

We want our emails to provide a great experience for everyone, so we’re experimenting with new designs that work on every client: iPhone, iPad, Android, Gmail.com, Outlook, Yahoo Mail, and more.  But given the sorry state of CSS and HTML for email, how can we make an email look great in all those places?

Thanks to one well-informed blog commenter and tons of testing across many devices we’ve found a way to make HTML emails that work everywhere.  You get a mobile-oriented design on phones, a desktop layout on Gmail, and a responsive, fluid design for tablets and desktop clients.  It’s the best of all worlds—even on clients that don’t support media queries.

A New Scaffolding

I’m going to walk you through creating a simple design that showcases this new way of designing HTML emails.  It’s a two-column layout that wraps to a single column on mobile:

[Diagram: a two-column layout that wraps to a single column on mobile]

For modern browsers, this would be an easy layout to implement—frameworks like Bootstrap provide layouts like this right out of the box.  But the limitations of HTML email make even this simple layout a challenge.

Client Limitations

What limitations are we up against?

On to the Code

Let’s start with a simple HTML structure.

<html>
 <body>
   <table cellpadding=0 cellspacing=0><tr><td>
     <table cellpadding=0 cellspacing=0><tr><td>
       <div>
         <h1>Header</h1>
       </div>
       <div>
         <h2>Main Content</h2>
         <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec gravida sem dictum, iaculis magna ornare, dignissim elit.</p>
         <p>...</p>
       </div>
       <div>
         <h2>Sidebar</h2>
         <p>Donec tincidunt tincidunt nunc, eget pulvinar risus sodales eu.</p>
       </div>
       <div>
         <p>Footer</p>
       </div>
     </td></tr></table>
   </td></tr></table>
 </body>
</html>

It’s straightforward: a header and footer with two content areas between, main content and a sidebar.  No fancy tags, just divs and tables and paragraphs—we’re still partying like it’s 1999.  (As we apply styling, we’ll see why both wrapping tables are required.)

Initial Styling

Android is the least common denominator of CSS support, allowing only inline CSS in style attributes and ignoring all other styles.  So let’s add inline CSS that gives us a mobile-friendly layout of a fluid single column:

<html>
 <body style="margin: 0; padding: 0; background: #ccc;">
   <table cellpadding=0 cellspacing=0 style="width: 100%;"><tr><td style="padding: 12px 2%;">
     <table cellpadding=0 cellspacing=0 style="margin: 0 auto; background: #fff; width: 96%;"><tr><td style="padding: 12px 2%;">
     <div>
       <h1>Header</h1>
     </div>
     <div>
       <h2 style="margin-top: 0;">Main Content</h2>
       <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec gravida sem dictum, iaculis magna ornare, dignissim elit.</p>
       <p>...</p>
     </div>
     <div>
       <h2 style="margin-top: 0;">Sidebar</h2>
       <p>Donec tincidunt tincidunt nunc, eget pulvinar risus sodales eu.</p>
     </div>
     <div style="border-top: solid 1px #ccc;">
       <p>Footer</p>
     </div>
     </td></tr></table>
   </td></tr></table>
 </body>
</html>

It honestly doesn’t look that different from the unstyled HTML (but the groundwork is there for your beautiful ideas!). The table-within-a-table wrapping all the content lets us place our content area on a colored background, with a small (2%) gutter on each side. Don’t forget the cellspacing and cellpadding attributes, too, or you’ll get extra spacing that can’t be removed with CSS!

[Screenshot: the single-column layout rendered with inline styles only]

Dealing with Gmail

This design is certainly adequate for both mobile and desktop clients, but it’s not the best we can do.  Desktop clients and large tablets have a lot of screen real estate that we’re wasting.

Our main target here is Gmail—desktop and laptop screens keep getting bigger, and we want Gmail users to get a full-width experience.  But Gmail doesn’t support media queries, the go-to way of showing different layouts on different-sized clients.  What can we do?

I mentioned earlier that Gmail supports a small subset of CSS inside <style> tags.  This is not a widely known feature of Gmail—most resources you’ll find tell you that Gmail only supports inline styles.  Only a handful of blog comments and forum posts mention this support.  I don’t know when Gmail’s CSS support was quietly improved, but I was certainly pleased to learn about this new way of styling my emails.

The catch is that you are limited to tag name selectors only—no classes or IDs are supported. Coupled with Gmail’s limited whitelist of HTML elements, your flexibility in styling different parts of your email differently is severely limited. Plus, the <style> tag must be in the <head> of your email, not in the <body>.

The trick is to make judicious use of CSS’s structural selectors: the descendant, adjacent, and child selectors.  By carefully structuring your HTML and mixing and matching these selectors, you can pinpoint elements for providing styles.  Here are the styles I’ve applied to show a two-column layout in Gmail:

        <head>
          <style type="text/css">
/*  1 */    table table {
/*  2 */      width: 600px !important;
/*  3 */    }
/*  4 */    table div + div { /* main content */
/*  5 */      width: 65%;
/*  6 */      float: left;
/*  7 */    }
/*  8 */    table div + div + div { /* sidebar */
/*  9 */      width: 33%;
/* 10 */      float: right;
/* 11 */    }
/* 12 */    table div + div + div + div { /* footer */
/* 13 */      width: 100%;
/* 14 */      float: none;
/* 15 */      clear: both;
/* 16 */    }
          </style>
        </head>

In the absence of classes and IDs to tell you what elements are being styled, comments are your friend!  Let’s walk through each of these selectors.

Lines 1-3 lock our layout to a fixed width.  Remember, this is our style for Gmail on the desktop, where a fluid design isn’t our goal.  We apply this to the inner wrapping table so that padding on the outer one remains, around our 600-pixel-wide content.  Without having both tables, we’d lose the padding that keeps our content from running into the client’s UI.

Next, we style the main content.  The selector on line 4, reading right to left, finds a div immediately following another div, inside a table.  That actually matches the main content, sidebar, and footer divs, but that’s OK for now.  We style it to take up the left two thirds of the content area (minus a small gutter between the columns).

The selector on line 8 styles the sidebar, by finding a div following a div following a div, inside a table. This selects both the footer and the sidebar, but not the main content, and overrides the preceding styles, placing the sidebar in the right-hand third of the content.

Finally, we select the footer on line 12—the only div that follows three others—and make it full-width. Since the preceding selectors and styles also applied to this footer div, we need to reset the float style back to none (on line 14).

With that, we have a two-column fixed layout for Gmail, without breaking the one-column view for Android:

[Screenshot: the two-column layout rendered in Gmail]

The styles we applied to the outer wrapping table keep our content centered, and the other inline styles that we didn’t override (such as the line above the footer) are still rendered.

For Modern Browsers

Finally, let’s consider Mail.app on iOS and Mac OS X. I’m lumping them together because they have similar rendering engines and CSS support—the media queries and full CSS selectors you know and love all work. The styles we applied for Gmail will also be applied on iPhones, giving a mobile-unfriendly fixed-width layout. We want Android’s single-column fluid layout instead. We can target modern, small-screen clients (like iPhones) with a media query:

/* Inside <style> in the <head> */
@media (max-width: 630px) {
  table table {
    width: 96% !important;
  }
  table div {
    float: none !important;
    width: 100% !important;
  }
}

These styles override the earlier ones to restore the single-column layout, but only for devices under 630 pixels wide—the point at which our fixed 600-pixel layout would begin to scroll horizontally.  Don’t forget the !important flag, which makes these styles override the earlier ones.  Gmail.com and Android will both ignore this media query.  iPads and Mail.app on the desktop, which are wider than 630 pixels, will also show the desktop style.

This is admittedly not the prettiest approach. With multiple levels of overriding selectors, you need to think carefully about the impact of any change to your styles.  As your design grows in complexity, you need to keep a handle on which elements will be selected where, particularly with the tag-only selectors for Gmail.  But it’s nearly the holy grail of HTML email: responsive, flexible layouts even in clients with limited support for CSS.

The biggest caveat of this approach (besides the complexity of the code) is the layout on Android tablets: they will display the same single-column layout as Android phones.  For us (and probably for you, too), Android tablets are a vanishingly small percentage of our users.  In any case, the layout isn’t unusable, it’s just not optimal, with wide columns and needlessly large images.

Bringing it All Together

You can find the complete code for this example in this gist: https://gist.github.com/kevingessner/9509148

You can extend this approach to build all kinds of complex layouts. Just keep in mind the three places where every item might be styled: inline CSS, tag-only selectors in a <style> tag, and one or more media query blocks. Apply your styles carefully, and don’t forget to test your layout in every client you can get your hands on!

I hope that in the future, Gmail on the web and on Android will enter the 21st century and add support for the niceties that CSS has added in recent years.  Until then, creating an HTML email that looks great on every client will continue to be a challenge.  But with a few tricks like these up your sleeve, you can make a beautiful email that gives a great experience for everyone on every client.

You can follow me on Twitter at @kevingessner

Want to help us make Etsy better, from email to accounting? We’re hiring!


Etsy’s Journey to Continuous Integration for Mobile Apps

Posted on February 28, 2014

Positive app reviews can greatly help user conversion and the image of a brand. On the other hand, bad reviews can have dramatic consequences; as Andy Budd puts it: “Mobile apps live and die by their ratings in an App Store”.

[Screenshot: one-star reviews of the Etsy iOS app]

The above reviews are actual reviews of the Etsy iOS app. As an Etsy developer, it is sad to read them, but it’s a fact: bugs sometimes sneak through our releases. On the web stack, we use our not-so-secret weapon of Continuous Delivery as a safety net to quickly address bugs that make it to production. However, releasing mobile apps requires a third party’s approval (the app store), which takes five days on average; once an app is approved, users decide when to upgrade – so they may be stuck with older versions. Based on our analytics data, we have 5 iOS and 10 Android versions currently in use by the public.

Through Continuous Integration (CI), we can detect and fix major defects in the development and validation phase of the project, before they negatively impact the user experience. This post explores Etsy’s journey to implementing a CI pipeline for our Android and iOS applications.

“Every commit should build the mainline on an integration machine”

This fundamental CI principle is the first step to detecting defects as soon as they are introduced: failing to compile. Building your app in an IDE does not count as Continuous Integration. Thankfully, both iOS and Android are command line friendly: building a release of the iOS app is as simple as running:

xcodebuild -scheme "Etsy" archive
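The Android side is just as scriptable; with a Gradle-based project, for instance, a release build is a single command (the task name may vary with your build setup):

./gradlew assembleRelease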

Provisioning integration machines

Integration machines are separate from developer machines – they provide a stable, controlled, reproducible environment for builds and tests. Ensuring that all the integration machines are identical is critical – using a provisioning framework to manage all the dependencies is a good solution to ensure uniformity and scalability.

At Etsy, we are pretty fond of Chef to manage our infrastructure – we naturally turned to it to provision our growing Mac Mini fleet. Equipped with the homebrew cookbook for installing packages and rbenv cookbook for managing the ruby environment in a relatively sane way, our sysops wizard Jon Cowie sprinkled a few hdiutil incantations (to manage disk images) and our cookbooks were ready. We are now able to programmatically install 95% of Xcode (some steps are still manual), Git, and all the Android packages required to build and run the tests for our apps.

Lastly, if you have ever had to deal with iOS provisioning profiles, you can relate to how annoying they are to manage and keep up to date; having a centralized system that manages all our profiles saves our engineers a lot of time and frustration.

Building on push and providing daily deploys

With our CI machines hooked up to our Jenkins server, setting up a plan to build the app on every git push is trivial. This simple step helps us detect missing files from commits or compilation issues multiple times a week – developers are notified in IRC or by email and build issues are addressed minutes after being detected. Besides building the app on push, we provide a daily build that any Etsy employee can install on their mobile device – the quintessence of dogfooding. An easy way to encourage our coworkers to install pre-release builds is to nag them when they use the app store version of the app.

[Screenshot: in-app prompt encouraging employees to install the pre-release build]

Testing

iOS devices come in many flavors, with seven different iPads, five iPhones and a few iPods; when it comes to Android, the plethora of devices becomes overwhelming. Even when focusing on the top tier of devices, the goal of CI is to detect defects as soon as they are introduced: we can’t expect our QA team to validate the same features over and over on every push!

Our web stack boasts a pretty extensive collection of test suites, and the test-driven development culture is palpable. Ultimately, our mobile apps leverage a lot of our web code base to deliver content: data is retrieved from the API and many screens are web views. Most of the core logic of our apps relies on the UI layer – which can be tested with functional tests. As such, our first approach was to focus on functional tests, given that the API was already tested on the web stack (with unit tests and smoke tests).

Functional tests for mobile apps are not new and the landscape of options is still pretty extensive; in our case, we settled on Calabash and Cucumber. The friendly format and predefined steps of Cucumber + Calabash allow our QA team to write tests themselves without any assistance from our mobile apps engineers.

To date, our functional tests run on iPad/iPhone iOS 6 and 7 and Android, and cover our Tier 1 features, including:

Because functional tests mimic the steps of an actual user, the tests require that certain assumed resources exist. In the case of the Checkout test, these are the following:

Our checkout test then consists of:

  1. signing in to the app with our test buyer account
  2. searching for an item (in the seller test account shop)
  3. adding it to the cart
  4. paying for the item using the prepaid credit card

Once the test is over, an ad-hoc mechanism in our backend triggers an order cancellation and the credit card is refunded.
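Expressed as a Cucumber scenario, that flow reads roughly like the sketch below; the step wording is illustrative rather than our exact predefined Calabash steps.

Feature: Checkout

  Scenario: Buy an item with the prepaid credit card
    Given I am signed in as the test buyer account
    When I search for an item from the test seller shop
    And I add the first result to my cart
    And I pay for the item using the prepaid credit card
    Then I should see the order confirmation screen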

A great example of our functional tests catching bugs is highlighted in the following screenshot from our iPad app:

[Screenshot: the iPad registration view with the “Male” option missing]

Our registration test navigates to this view and fills out all the visible fields. Additionally, the test cycles through the “Female”, “Male” and “Rather Not Say” options; in this case, the test failed (since the “Male” option was missing).

By running our test suite every time an engineer pushes code, we not only detect bugs as soon as they are introduced, we detect app crashes. Our developers usually test their work on the latest OS version but Jenkins has their back: our tests run simultaneously across different combinations of devices and OS versions.

Testing on physical devices

While our developers enjoy our pretty extensive device lab for manual testing, maintaining a set of devices and constantly running automated tests on them is a logistical nightmare and a full-time job. After multiple attempts at developing an in-house solution, we decided to use Appthwack to run our tests on physical devices. We run our tests for every push on a set of dedicated devices, and run nightly regressions on a broader range of devices by tapping into Appthwack’s cloud of devices. This integration is still very recent and we’re still working out some kinks related to testing on physical devices and the challenges of aggregating and reporting test status from over 200 devices.

Reporting: put a dashboard on it

With more than 15 Jenkins jobs to build and run the tests, it can be challenging to quickly surface critical information to the developers. A simple home grown dashboard can go a long way to communicating the current test status across all configurations:

[Screenshot: mobile apps CI dashboard]

Static analysis and code reviews

Automated tests cannot catch all bugs and potential crashes – similar to the web stack, developers rely heavily on code reviews prior to pushing their code. Like all code at Etsy, the apps are stored in GitHub Enterprise repositories, and code reviews consist of a pull request and an issue associated with it. By using the GitHub pull request builder Jenkins plugin, we are able to systematically trigger a build and do some static analysis (see our Static Analysis with OCLint post) on the review request and post the results to the GitHub issue:

[Screenshot: a pull request comment with lint results]

Infrastructure overview summary

All in all, our current infrastructure looks like the following:

[Diagram: mobile apps CI infrastructure overview]

Challenges and next steps

Building our continuous integration infrastructure was strenuous and challenges kept appearing one after another, such as the inability to automate the installation of some software dependencies. Once stable, we always have to keep up with new releases (iOS 7, Mavericks) which tend to break the tests and the test harness. Furthermore, functional tests are flaky by nature, requiring constant care and optimization.

We are currently at a point where our tests and infrastructure are reliable enough to detect app crashes and tier 1 bugs on a regular basis. Our next step, from an infrastructure point of view, is to expand our testing to physical devices via our test provider Appthwack. The integration has just started but already raises some issues: how can we concurrently run the same checkout test (add an item to cart, buy it using a gift card) across 200 devices – will we create 200 test accounts, one per device? We will post again on our status 6 months from now, with hopefully more lessons learned and success stories – stay tuned!

You can follow Nassim on Twitter at @kepioo 


Reducing Domain Sharding

Posted on February 19, 2014

This post originally appeared on the Perf Planet Performance Calendar on December 7th, 2013.

Domain sharding has long been considered a best practice for pages with lots of images.  The number of domains that you should shard across depends on how many HTTP requests the page makes, how many connections the client makes to each domain, and the available bandwidth.  Since it can be challenging to change this dynamically (and can cause browser caching issues), people typically settle on a fixed number of shards – usually two.

An article published earlier this year by Chromium contributor William Chan outlined the risks of sharding across too many domains, and Etsy was called out as an example of a site that was doing this wrong.  To quote the article: “Etsy’s sharding causes so much congestion related spurious retransmissions that it dramatically impacts page load time.”  At Etsy we’re pretty open with our performance work, and we’re always happy to serve as an example.  That said, getting publicly shamed in this manner definitely motivated us to bump the priority of reinvestigating our sharding strategy.

Making The Change

The code changes to support fewer domains were fairly simple, since we have abstracted away the process that adds a hostname to an image path in our codebase. Additionally, we had the foresight to exclude the hostname from the cache key at our CDNs, so there was no risk of a massive cache purge as we switched which domain our images were served from. We were aware that this would expire the cache in browsers, since they do include the hostname in their cache key, but this was not a blocker for us because of the improved end result. To make sure that we ended up with the right final number, we created variants for two, three, and four domains. Synthetic tests allowed us to rule out removing domain sharding entirely. We activated the experiment in June using our A/B framework, and ran it for about a month.
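Conceptually, that hostname abstraction boils down to hashing the image path to pick a stable shard. Here is an illustrative sketch (the hostnames and helper name are hypothetical):

// Illustrative sketch of deterministic domain sharding for image URLs.
function sharded_image_url($path, $num_shards = 2) {
    // Hash the path so a given image always maps to the same shard,
    // which keeps it cacheable in the browser across page views.
    $shard = (abs(crc32($path)) % $num_shards) + 1;
    return "https://img{$shard}.example-cdn.com" . $path;
}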

Results

After looking at all of the data, we found that the variant that sharded across two domains was the clear winner. Given how easy this change was to make, the results were impressive:

As it turns out, William’s article was spot on – we were sharding across too many domains, and network congestion was hurting page load times.  The new CloudShark graph supported this conclusion as well, showing a peak throughput improvement of 33% and radically reduced spurious retransmissions:

[CloudShark captures: before (four shards) and after (two shards)]

Lessons Learned

This story had a happy ending, even though in the beginning it was a little embarrassing.  We had a few takeaways from the experience:

Until SPDY/HTTP 2.0 comes along, domain sharding can still be a win for your site, as long as you test and optimize the number of domains to shard across for your content.


December 2013 Site Performance Report

Posted on January 23, 2014

It’s a new year, and we want to kick things off by filling you in on site performance for Q4 2013. Over the last three months front-end performance has been pretty stable, and backend load time has increased slightly across the board.

Server Side Performance

Here are the median and 95th percentile load times for signed in users on our core pages on Wednesday, December 18th:

[Table: server-side load times, December 2013]

There was an across the board increase in both median and 95th percentile load times over the last three months, with a larger jump on our search results page. There are two main factors that contributed to this increase: higher traffic during the holiday season and an increase in international traffic, which is slower due to translations. On the search page specifically, browsing in US English is significantly faster than any other language. This isn’t a sustainable situation over the long term as our international traffic grows, so we will be devoting significant effort to improving this over the next quarter.

Synthetic Front-end Performance

As usual, we are using our private instance of WebPagetest to get synthetic measurements of front-end load time. We use a DSL connection and test with IE8, IE9, Firefox, and Chrome. The main difference with this report is that we have switched from measuring Document Complete to measuring Speed Index, since we believe that it provides a better representation of user perceived performance. To make sure that we are comparing with historical data, we pulled Speed Index data from October for the “old” numbers. Here is the data, and all of the numbers are medians over a 24 hour period:

[Table: synthetic front-end performance, December 2013]

Start Render didn’t really change at all, and Speed Index was up on some pages and down on others. Our search results page, which had the biggest increase on the backend, actually saw a 0.2 second decrease in Speed Index. Since this is a new metric we are tracking, we aren’t sure how stable it will be over time, but we believe that it provides a more accurate picture of what our visitors are really experiencing.

One of the downsides of our current wpt-script setup is that we don’t save waterfalls for old tests – we only save the raw numbers. Thus when we see something like a 0.5 second jump in Speed Index for the shop page, it can be difficult to figure out why that jump occurred. Luckily we are Catchpoint customers as well, so we can turn to that data to get granular information about what assets were on the page in October vs. December. The data there shows that all traditional metrics (render start, document complete, total bytes) have gone down over the same period. This suggests that the jump in speed index is due to loading order, or perhaps a change in what’s being shown above the fold. Our inability to reconcile these numbers illustrates a need to have visual diffs, or some other mechanism to track why speed index is changing. Saving the full WebPagetest results would accomplish this goal, but that would require rebuilding our EC2 infrastructure with more storage – something we may end up needing to do.

Overall we are happy with the switch to speed index for our synthetic front-end load time numbers, but it exposed a need for better tooling.

Real User Front-end Performance

These numbers come from mPulse, and are measured via JavaScript running in real users’ browsers:

[Table: real user front-end performance, December 2013]

There aren’t any major changes here, just slight movement that is largely within rounding error. The one outlier is search, especially since our synthetic numbers showed that it got faster. This illustrates the difference between measuring onload, which mPulse does, and measuring speed index, which is currently only present in WebPagetest. This is one of the downsides of Real User Monitoring – since you want the overhead of measurement to be low, the data that you can capture is limited. RUM excels at measuring things like redirects, DNS lookup times, and time to first byte, but it doesn’t do a great job of providing a realistic picture of how long the full page took to render from the customer’s point of view.

Conclusion

We have a backend regression to investigate, and front-end tooling to improve, but overall there weren’t any huge surprises. Etsy’s performance is still pretty good relative to the industry as a whole, and relative to where we were a few years ago. The challenge going forward is going to center around providing a great experience on mobile devices and for international users, as the site grows and becomes more complex.


Static Analysis with OCLint

Posted on January 15, 2014

At Etsy, we’re big believers in making tools do our work for us.

On the mobile apps team we spend most of our time focused on building new features and thinking about how the features of Etsy fit into an increasingly mobile world. One of the great things about working at Etsy is that we have a designated period called Code Slush around the winter holidays where product development slows down and we can take stock of where we are and do things that we think are important or useful, but that don’t fit into our normal release cycles. Even though our apps team releases significantly less frequently than our web stack, making it easier to continue developing through the holiday season, we still find it valuable to take this time out at the end of the year.

This past slush we spent some of that time contributing to the OCLint project and integrating it into our development workflow. OCLint, as the name suggests, is a linter tool for Objective-C. It’s somewhat similar to the static analyzer that comes built into Xcode, and it’s built on the same clang infrastructure. OCLint is a community open source project, and all of the changes we’re discussing have been contributed back and are available with the rest of OCLint on its GitHub page.

If you run OCLint on your code it will tell you things like, “This method is suspiciously long” or “The logic on this if statement looks funny”. In general, it’s great at identifying these sorts of code smells. We thought it would be really cool if we could extend it to find definite bugs and to statically enforce contracts in our code base. In the remainder of this post, we’re going to talk about what those checks are and how we take advantage of them, both in our code and in our development process.

Rules

Objective-C is a statically typed Object Oriented language. Its type system gets the job done, but it’s fairly primitive in certain ways. Often, additional contracts on a method are specified as comments. One thing that comes up sometimes is knowing what methods a subclass is required to implement. Typically this is indicated in a comment above the method.

For example, UIActivity.h contains the comment // override methods above a list of several of its methods.

This sort of contract is trivial to check at compile time, but it’s not part of the language, making these cases highly error prone. OCLint to the rescue! We added a check for methods that subclasses are required to implement. Furthermore, you can use the magic of Objective-C categories to mark up existing system libraries.

To mark declarations, oclint uses clang’s __attribute__((annotate(""))) feature to pass information from your code to the checker. To make these marks on a system method like the -activityType method in UIActivity, you would stick the following in a header somewhere:

@interface UIActivity (StaticChecks)
...
- (NSString *)activityType
__attribute__((annotate("oclint:enforce[subclass must implement]")));
...
@end

That __attribute__ stuff is ugly and hard to remember, so we #defined it away:

#define OCLINT_SUBCLASS_MUST_IMPLEMENT \
__attribute__((annotate("oclint:enforce[subclass must implement]")))

Now we can just do:

@interface UIActivity (StaticChecks)
...
- (NSString *)activityType OCLINT_SUBCLASS_MUST_IMPLEMENT;
...
@end

 

We’ve contributed back a header file with these sorts of declarations, culled from the documentation in UIKit, that anyone using oclint can import into their project. We added this file to our project’s .pch file so it’s included in every one of our classes automatically.

Some other checks we’ve added:

Protected Methods

This is a common feature in OO languages – methods that only a subclass and its children can call. Once again, this is usually indicated in Objective-C by comments, or sometimes by sticking the declarations in a category in a separate header. Now we can just tack OCLINT_PROTECTED_METHOD onto the end of the declaration, as in the sketch below. This makes the intent clear, obvious, and statically checked.
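For example, marking a method in a base class might look like the following; the class, the method, and the rule name inside enforce[...] are assumptions that simply follow the pattern above.

#import <UIKit/UIKit.h>

// The rule name inside enforce[...] is an assumption, following the pattern above.
#define OCLINT_PROTECTED_METHOD \
__attribute__((annotate("oclint:enforce[protected method]")))

@interface ETSBaseViewController : UIViewController
// Hypothetical example: only this class and its subclasses should call this.
- (void)reloadListings OCLINT_PROTECTED_METHOD;
@end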

Prohibited Calls

This is another great way to embed institutional knowledge directly into the codebase. You can mark methods as deprecated using clang, but that produces an immediate compiler error. We’ll talk more about our workflow later, but doing it through oclint allows us to migrate from old to new methods gradually, and to easily use things while debugging that we wouldn’t want to commit.

We have categories on NSArray and NSDictionary that we use instead of the built in methods, as discussed here. Marking the original library methods as prohibited lets anyone coming into our code base know that they should be using our versions instead of the built in ones. We also have a marker on NSLog, so that people don’t accidentally check in debug logs. Frequently the replacement for the prohibited call calls the prohibited call itself, but with a bunch of checks and error handling logic. We use oclint’s error suppression mechanism to hide the violation that would be generated by making the original call. This is more syntactically convenient than dealing with clang pragmas like you would have to using the deprecated attribute.

Ivar Assignment Outside Getters

We prefer to use properties whenever possible as opposed to bare ivar accesses. Among other things, this is more syntactically and semantically regular and makes it much easier to set breakpoints on changes to a given property when debugging.  This rule will emit an error when it sees an ivar assigned outside its getter, setter, or the owning class’s init method.

-isEqual: without -hash

In Cocoa, if you override the -isEqual: method that checks for object equality, it’s important to also override the -hash method. Otherwise you’ll see weird behavior in collections when using the object as a dictionary key. This check finds classes that implement -isEqual: without implementing -hash. This is another great example of getting contracts out of comments and into static checks.
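The contract the rule enforces is the standard Cocoa pattern: equal objects must return equal hashes. A minimal sketch with a hypothetical listing class:

#import <Foundation/Foundation.h>

// Hypothetical class illustrating the -isEqual:/-hash contract.
@interface ETSListing : NSObject
@property (nonatomic, assign) long long listingId;
@end

@implementation ETSListing

- (BOOL)isEqual:(id)object {
    if (self == object) return YES;
    if (![object isKindOfClass:[ETSListing class]]) return NO;
    return self.listingId == ((ETSListing *)object).listingId;
}

// Equal objects must produce the same hash, or collections will misbehave.
- (NSUInteger)hash {
    return (NSUInteger)self.listingId;
}

@end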

Workflow

We think that oclint adds a lot of value to our development process, but there were a couple of barriers we had to deal with to make sure our team picked it up. First of all, any codebase written without oclint’s rules strictly enforced for all of development will have tons of minor violations. Sometimes the lower priority things it warns about are actually done deliberately to increase code clarity. To cut down on the noise we went through and disabled a lot of the rules, leaving only the ones we thought added significant value. Even with that, there were still a number of things it complained frequently about – things like not using Objective-C collection literals. We didn’t want to go through and change a huge amount of code all at once to get our violations down to zero, so we needed a way to see only the violations that were relevant to the current change. Thus, we wrote a little script to only run oclint on the changed files. This also allows us to easily mark something as no longer recommended without generating tons of noise, having to remove it entirely from our codebase, or fill up Xcode’s warnings and errors.

Finally, we wanted to make it super easy for our developers to start using it. We didn’t want to require them to run it manually before every commit. That would be just one more thing to forget and one more thing anyone joining our team would have to know about. Plus it’s kind of slow to run all of its checks on a large codebase. Instead, we worked together with our terrific testing and automation team to integrate it into our existing GitHub pull request workflow. Now, whenever we make a pull request, it automatically kicks off a Jenkins job that runs oclint on the changed files. When the job is done, it posts a summary as a comment right on the pull request, along with a link to the full report on Jenkins. This ended up feeling very natural and similar to how we interact with the PHP code sniffer on our web stack.

Conclusion

We think oclint is a great way to add static checks to your Cocoa code. There are some interesting things going on with clang plugins and direct Xcode integration, but for now we’re going to stick with oclint. We like its base of existing rules, the ease of gradually applying its rules to our code base, and its reporting options and jenkins integration.

We also want to thank the maintainer and the other contributors for the hard work they’ve put into the project. If you use these rules in interesting ways, or even boring ones, we’d love to hear about it. Interested in working at a place that cares about the quality of its software and about solving its own problems instead of just letting them mount? Our team is hiring!


Android Staggered Grid

Posted on January 13, 2014

While building the new Etsy for Android app, a key goal for our team was delivering a great user experience regardless of device. From phones to tablets, phablets and even tabphones, we wanted all Android users to have a great shopping experience without compromise.

A common element throughout our mobile apps is the “Etsy listing card”. Listing cards are usually the first point of contact users have with our sellers’ unique items, whether they’re casually browsing through categories or searching for something specific. On these screens, when a listing card is shown, we think our users should see the images without cropping.

Android Lists and Grids

A simple enough requirement but on Android things aren’t always that simple. We wanted these cards in a multi-column grid, with a column count that changes with device orientation while keeping grid position. We needed header and footer support, and scroll listeners for neat tricks like endless loading and quick return items. This wasn’t achievable using a regular old ListView or GridView.

Furthermore, neither the Android ListView nor the GridView is extensible in any meaningful way. A search for existing open libraries didn’t reveal any that met our requirements, including the unfinished StaggeredGridView available in the AOSP source.

Considering all of these things we committed to building an Android staggered grid view. The result is a UI component that is built on top of the existing Android AbsListView source for stability, but supports multiple columns of varying row sizes and more.

[Screenshot: the Etsy trending screen across several Android devices]

How it Works

The StaggeredGridView works much like the existing Android ListView or GridView. The example below shows how to add the view to your XML layout, specifying the margin between items and the number of columns in each orientation.

<com.etsy.android.grid.StaggeredGridView
    xmlns:android="http://schemas.android.com/apk/res/android"
    xmlns:app="http://schemas.android.com/apk/res-auto"
    android:id="@+id/grid_view"
    android:layout_width="match_parent"
    android:layout_height="match_parent"
    app:item_margin="8dp"
    app:column_count_portrait="2"
    app:column_count_landscape="3" />

You can of course set the layout margin and padding as you would any other view.

To show items in the grid, create any regular old ListAdapter and assign it to the grid. Then there’s one last step: you need to ensure that the ListAdapter’s views maintain their aspect ratio. When column widths adjust on rotation, each item’s height should respond.

How do you do this? The AndroidStaggeredGrid includes a couple of utility classes including the DynamicHeightImageView which you can use in your adapter. This custom ImageView overrides onMeasure() and ensures the measured height is relative to the width based on the set ratio. Alternatively, you can implement any similar custom view or layout with the same measurement logic.

public void setHeightRatio(double ratio) {
    if (ratio != mHeightRatio) {
        mHeightRatio = ratio;
        requestLayout();
    }
}

@Override
protected void onMeasure(int widthMeasureSpec, int heightMeasureSpec) {
    if (mHeightRatio > 0.0) {
        int width = MeasureSpec.getSize(widthMeasureSpec);
        int height = (int) (width * mHeightRatio);
        setMeasuredDimension(width, height);
    }
    else {
        super.onMeasure(widthMeasureSpec, heightMeasureSpec);
    }
}
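In the adapter, you then set the ratio from the listing’s known image dimensions; the view ID and model accessors below are hypothetical:

// Inside the adapter's getView(); R.id.listing_image and the listing accessors are hypothetical.
DynamicHeightImageView imageView =
        (DynamicHeightImageView) convertView.findViewById(R.id.listing_image);
// Set height relative to width so the cell reserves the right amount of space
// before the image finishes loading.
imageView.setHeightRatio((double) listing.getImageHeight() / listing.getImageWidth());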

And that’s it. The DynamicHeightImageView will maintain the aspect ratio of your items and the grid will take care of recycling views in the same manner as a ListView. You can check out the GitHub project for more details on how it’s used, including a sample project.

But There’s More

Unlike the GridView, you can add header and footer views to the StaggeredGridView. You can also apply internal padding to the grid that doesn’t affect the header and footer views. An example view using these options is shown below. On our search results screen we use a full width header and add a little extra horizontal grid padding for 10 inch tablets.

[Screenshot: the search results grid with a full-width header]

Into the Real World

During the development process we fine-tuned the grid’s performance using a variety of real world Android devices available in the Etsy Device Lab. When we released the new Etsy for Android app at the end of November, the AndroidStaggeredGrid was used throughout. Post launch we monitored and fixed some lingering bugs found with the aid of the awesome crash reporting tool Crashlytics.

We decided to open source the AndroidStaggeredGrid: a robust, well-tested, real-world UI component for the Android community to use. It’s available on GitHub or via Maven, and we are accepting pull requests.

Finally, a friendly reminder that the bright folks at Etsy mobile are hiring.

You can follow Deniz on Twitter at @denizmveli.


Migrating to Chef 11

Posted on October 16, 2013

Configuration management is critical to maintaining a stable infrastructure. It helps ensure systems and software are configured in a consistent and deterministic way. For configuration management we use Chef.  Keeping Chef up-to-date means we can take advantage of new features and improvements. Several months ago, we upgraded from Chef 10 to Chef 11 and we wanted to share our experiences.

Prep

We started by setting up a new Chef server running version 11.6.0.  This was used to validate our Chef backups and perform testing across our nodes.  The general plan was to upgrade the nodes to Chef 11, then point them at the new Chef 11 server when we were confident that we had addressed any issues.  The first order of business: testing backups.  We’ve written our own backup and restore scripts and we wanted to be sure they’d still work under Chef 11.  Also, these scripts would come in handy to help us quickly iterate during break/fix cycles and keep the Chef 10 and Chef 11 servers in sync.  Given that we can have up to 70 Chef developers hacking on cookbooks, staying in sync during testing was crucial to avoiding time lost to troubleshooting issues related to cookbook drift.

Once the backup and restore scripts were validated, we reviewed the known breaking changes present in Chef 11.  We didn’t need much in the way of fixes other than a few attribute precedence issues and updating our knife-lastrun handler to use run_context.loaded_recipes instead of node.run_state().

Unforeseen Breaking Changes

After addressing the known breaking changes, we moved on to testing classes of nodes one at a time. For example, we upgraded a single API node to Chef 11, validated that Chef ran cleanly against the Chef 10 server, then proceeded to upgrade the entire API cluster and monitor it before moving on to another cluster. In the case of the API cluster, we found an unknown breaking change that prevented those nodes from forwarding their logs to our log aggregation hosts. This episode initially presented a bit of a puzzle and warrants a little attention, as it may help others during their upgrade.

The recipe we use to configure syslog-ng sets several node attributes for various bits and bobs. The following line in our cookbook is where all the fun started:

if !node.default[:syslog][:items].empty?

That statement evaluated to false on the API nodes running Chef 11 and resulted in a vanilla syslog-ng.conf file that didn’t direct the service to forward any logs.  Thinking that we could reference those nested attributes via the :default symbol, we updated the cookbook.  The Chef 11 nodes were content but all of the Chef 10 nodes were failing to converge because of the change.  It turns out that accessing default attributes via the node.default() method and node[:default] symbol are not equivalent.  To work around this, we updated the recipe to check for Chef 11 or Chef 10 behavior and assign our variables accordingly.  See below for an example illustrating this:

if node[:syslog].respond_to?(:has_key?)
    # Chef 11
    group = node[:syslog][:group] || raise("Missing group!")
    items = node[:syslog][:items]
else
    # Chef 10
    group = node.default[:syslog][:group] || raise("Missing group!")
    items = node.default[:syslog][:items]
end

In Chef 11, the :syslog symbol points to the key in the attribute namespace (it’s an ImmutableHash object) we need and responds to the .has_key?() method; in that case, we pull in the needed attributes Chef 11-style.  If the client is Chef 10, that test fails and we pull in the attributes using the .default() method.

Migration

Once we had upgraded all of our nodes and addressed any issues, it was time to migrate to the Chef 11 server.  To be certain that we could recreate the build and that our Chef 11 server cookbooks were in good shape, we rebuilt the Chef 11 server before proceeding.  Since we use a CNAME record to refer to our Chef server in the nodes’ client.rb config file, we thought that we could simply update our internal DNS systems and break for an early beer.  To be certain, however, we ran a few tests by pointing a node at the FQDN of the new Chef server.  It failed its Chef run.

Chef 10, by default, communicates to the server via HTTP; Chef 11 uses HTTPS.  In general, Chef 11 Server redirects the Chef 11 clients attempting to use HTTP to HTTPS.  However, this breaks down when the client requests cookbook versions from the server.  The client receives an HTTP 405 response.  The reason for this is that the client sends a POST to the following API endpoint to determine which versions of the cookbooks from its run_list need to be downloaded:

/environments/production/cookbook_versions

If Chef is communicating via HTTP, the POST request is redirected to use HTTPS.  No big deal, right?  Well, RFC 2616 is pretty clear that when a request is redirected, “[t]he action required MAY be carried out by the user agent without interaction with the user if and only if the method used in the second request is GET…”  When the Chef 11 client attempts to hit the /environments/cookbook_versions endpoint via GET, Chef 11 Server will respond with an HTTP 405 as it only allows POST requests to that resource.

The fix was to update all of our nodes’ client configuration files to use HTTPS to communicate with the Chef server; the change is shown below. dsh (distributed shell) made it easy to roll out everywhere.
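The setting itself is a single line in each node’s client.rb (the hostname below is a placeholder):

# /etc/chef/client.rb -- point chef-client at the server over HTTPS
chef_server_url "https://chef.example.com"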

Just before we finalized the configuration update, we put a freeze on all Chef development and used our backup and restore scripts to populate the new Chef 11 server with all the Chef objects (nodes, clients, cookbooks, data bags, etc) from the Chef 10 server.  After validating the restore operation, we completed the client configuration updates and shut down all Chef-related services on the Chef 10 server.  Our nodes happily picked up where they’d left off and continued to converge on subsequent Chef runs.

Post-migration

Following the migration, we found two issues with chef-client that required deep dives to understand, and correct, what was happening.  First, we had a few nodes whose chef-client processes were exhausting all available memory.  Initially, we switched to running chef-client in forking mode.  Doing so mitigated this issue to an extent (as the forked child released its allocated memory when it completed and was reaped) but we were still seeing an unusual memory utilization pattern.  Those nodes were running a recipe that included nested searches for nodes.  Instead of returning the node names and searching on those, we were returning whole node objects.  For a long-running chef-client process, this continued to consume available memory.  Once we corrected that issue, memory utilization fell down to acceptable levels.

See the following screenshot illustrating the memory consumption for one of these nodes immediately following the migration, and after we updated the recipe to return only the attributes we needed instead of whole node objects:

[Graph: memory utilization on one node before and after the recipe fix]

Here’s an example of the code in the recipe that created our memory monster:

# find nodes by role, the naughty, memory hungry way
roles = search(:role, '*:*')    # NAUGHTY
roles.each do |r|
  nodes_dev = search(:node, "role:#{r.name} AND fqdn:*dev.*")    # HUNGRY
  template "/etc/xanadu/#{r.name.downcase}.cfg" do
    ...
    variables(
      :nodes => nodes_dev
    )
  end
end

Here’s the same code example, returning just the FQDNs we need instead:

# find nodes by role, the community-friendly, energy-conscious way
search(:role, '*:*') do |r|
  fqdns = []
  search(:node, "role:#{r.name} AND fqdn:*dev.*") do |n|
    fqdns << n.fqdn
  end
  template "/etc/xanadu/#{r.name.downcase}.cfg" do
    ...
    variables(
      :nodes => fqdns
    )
  end
end

Second, we found an issue where, in cases where chef-client would fail to connect to the server, it would leave behind its PID file, preventing future instances of chef-client from starting.  This has been fixed in version 11.6.0 of chef-client.

Conclusion

Despite running into a few issues following the upgrade, thorough testing and Opscode’s documented breaking changes helped make our migration fairly smooth. Further, the improvements made in Chef 11 have helped us improve our cookbooks. Finally, with our configuration management system up to date, we can confidently focus our attention on other issues.
