Is It A Bird? Is It A Plane? No, It’s Supergrep!

Posted by John Goulah | Filed under engineering

Etsy parties have come to earn quite the boisterous reputation — there may or may not have been kegs dragged up to the roof on a particularly balmy occasion, not to mention cadaver eating contests and mariachi bands. Thus, one of the greatest survival skills I have come to hone here at HQ is the art of selective hearing. Being able to focus in on the subtle wit of your coworker in the midst of a rowdy congo line is the only way you will leave at the end of the evening with your sanity (and self-respect) intact. Luckily for us, eliminating the noise in our data using technology is a little bit easier.

After a long week two years ago with a number of site incidents, our then-CTO Chad Dickerson sent an email to our all-engineering mailing list about a new tool:

For a few of the incidents we’ve had in the past week, we’ve found
obvious trails in the error logs leading directly to the source of the
problem. I talked to a bunch of folks and what I generally heard was
that the logs were noisy, which is technically true, but. . . . .

Through the magic of grep, you can get past the noise. I did some
digging through the error logs today to better understand what was in
there and wrote a simple ugly grep that kills all the stuff that
repeats a lot (not that we don’t need to fix it).

And thus supergrep was born. The original was a simple grep -v command line bash script to make our logs easier to parse, but the tool has evolved into a communication mechanism for the entire team to view our logs in real time over a web browser. Since then, we have cleaned up many of the noise problems that obscured our logs, while iterating the tool into something easily accessible by everyone on team. It’s a simple node app that everyone on the team can access as they are pushing code, to see if new errors appear. In addition we’ve linked the errors to the exact line of code that the error is happening on for easier analysis. Its yet another change-awareness tool in our kit that allows us to push code continuously and safely to production.

We’ve open sourced the code on our github repository, which you can find here: https://github.com/etsy/supergrep

13 responses

Javascript: Keeping It Classy

Posted by Ahmed Hashim | Filed under engineering, philosophy

Here at Etsy we employ a few internal tools for employees to help administer and moderate the public-facing site you have all come to know and love. One of these tools to which we introduced you to earlier (Compass) is used by our support team to handle member inquiries and oversee listings and shops. Last summer we were tasked with building it from the ground up with the sole purpose of managing emails from members to our support team. Since then, our support team has grown, as have its needs, and Compass has grown with them. Now, along with handling emails, it includes a host of new tools to allow staff to troubleshoot problems and communicate with each other while doing so. A side effect of this rapid growth was a lot of code on the front-end (mostly Javascript) hacked together by the team to keep up with our customers’ demands.

We’re using the same development stack as our public-facing site, so the code is shared between the public-facing site and Compass, which allows for consistency and ease of use when handling data. Essentially, deploying the public-facing site will deploy Compass and most of our internal tools stack at the same time (and vice versa). This lets everyone play nice with each other, and minimizes the amount of toes being stepped on by other developers. It also means that common libraries are always up to date and in sync, so if a developer changes a method for the public-facing site – it will update for Compass at the same time. Meaning we’ll be able to access the same data in the same way.

Problems

What had initially started off as a small project to build an email support tool quickly turned into a full-featured app that handled everything from sending emails to flagging and closing accounts. With all of the development happening in a matter of a few weeks, it meant that there wasn’t much time or thought given to going back and refactoring much of the slapdash code we had written for Compass.

With the quick and iterative development schedule, we established a design on the front-end and rushed through to fill it out based on the requirements we were given. Without setting aside any time to refactor, we frequently just added more code to the end of what we’d written in a previous iteration.

For example, whenever we had to add Javascript functionality to the app, we tacked it onto the end of the massive $(document).ready() call in the base file (base.js). Sometimes we’d move code out to other files to make it more sane, but still kept it pretty procedural and in a single $(document).ready() call within the file.

We had a few cases of copy-and-pasted code as well. It turned out that sometimes it was easier when building a new feature to harvest old code from a similar feature without DRY-ing it into functions and classes. This resulted in a lot of duplication of our original prototyped code:

Procedural Javascript that we used to prototype Compass

This also made it harder to integrate new developers into the team, as there wasn’t really a clear entry point in the file to add new code (or integration tests for the existing code). Since our way of doing things seemed to be going against the grain when compared to the rest of the stack, it made it difficult for people to move between the two when developing. It also resulted in less than spectacular front-end performance on Compass and other internal tools.

Due to the fact that most of the code was mostly in a single file which was being included on all pages, there was a ripple effect of bugs that appeared on a few pages when you changed something. Fixing one Javascript feature might break another somewhere else. There were also a lot of ugly hacks in the code put together as temporary stopgaps to patch things up. Because we never allotted any real time to go back and refactor, these hacks stuck around and added to the brittle nature of the code.

There were also places where we’d duplicated code from the public-facing site, meaning hundreds of lines of procedural jQuery statements that covered all aspects and outcomes of a single feature for different browsers, etc. A lot of this was unnecessary since the internal tools catered to a much smaller customer base (Etsy employees only), and we aren’t concerned with all edge cases of IE bugs (or IE at all for that matter).

What We Did

To start addressing the above issues, we looked around the rest of our stack and saw what other teams were doing for front-end solutions on their platforms. We borrowed a lot of the techniques and methodologies from the mobile web version of our site, and decided to employ those across our internal tools. The mobile team had some tough hurdles of their own:

  • They had to scale down more features of Etsy.com for a customer base larger than ours.
  • They needed to cater for various browser inconsistencies.
  • They were forced to squeeze the most front-end performance they could from the limited bandwidth that mobile devices are allotted.

In solving these problems, they outlined an awesome design for organizing and scaling a lean front-end layer for web apps.

Housekeeping

We first reorganized our directory and file structure to reflect the URL structure of the app. This meant one directory per page, with an index.js file in it to contain all the code that would be run on the page. In addition, we had a single base.js file that was strictly for keeping track of global app behavior and objects. In the end we had a file structure that looked something like this:

  • js/compass/base.js
  • js/compass/message/index.js

We also had a “root” modules folder for common bits of code that get used across the site:

  • js/compass/modules/assignment.js
  • js/compass/modules/image-overlay.js

These are just self-contained bits of code that can be instantiated from any page for elements that appear throughout multiple pages (for example, a global assignment button), or code that runs on the document across the app (image overlays, etc.). Most of these modules were objects that, upon instantiation, ran procedural functions which breathed life into the interactive elements on a page.

(function($, Etsy, Compass) {
    INPUT_TEXT = 'Search';

    Compass.SearchField = function(searchFieldSelector) {
        this.form = $(searchFieldSelector);

        this.setupFocusing();
    }

    Compass.SearchField.prototype = {
        setupFocusing: function() {
            this.form.find('input:text').focus(function() {
                if ($(this).val() == INPUT_TEXT) {
                    $(this).val('');
                }
            }).blur(function() {
                if ($(this).val() == '') {
                    $(this).val(INPUT_TEXT);
                }
            });
        }
    }
})(window.jQuery, window.Etsy, window.Compass);

By instantiating a new “SearchField” object and passing in the jQuery selector, it’ll automatically run the Javascript to enhance the user experience for the “focus” event handler. This can be applied on any part of the app, and doesn’t interfere with other elements on the page.

Using this philosophy of code organization does have some drawbacks, for example it can get a little tough to keep things DRY all the time. However, we figured that the tradeoff of having a neatly organized structure was worth allowing some code duplication in our stack.

Testing

Spending time refactoring the app was fairly straightforward, and allowed us an opportunity to add tests for Compass as well. Using a combination of RSpec, Capybara, and the Selenium testing suite, we were able follow a simple process to write our tests:

  1. Write a passing test for the feature before refactoring.
  2. Remove lame old Javascript code (causing the test to fail).
  3. Write sexy new Javascript, and make the test pass again.

Since most of our features more or less depended on Javascript, this allowed for a good portion of the app getting coverage during the cleanup.

Finishing Up

After it was all said and done, we were left with beautiful and concise files on each page of the app that were easy to maintain. Here’s an “after” shot of the index.js file for a Compass page that deals with handling resellers:

Swanky new Javascript

It’s been working out great so far, and addresses all of the issues we were having earlier pretty effectively. The overall Javascript codebase size is smaller, it loads more quickly, and is much easier to expand by new developers.

P.S. Those clever little “require” comments are an internal Javascript precompiler we’ve written which loads the corresponding files into the index.js file at runtime. We did this to follow the Rails sprockets conventions, which allows for compiler style requires on the server side instead of client side requires via jQuery. It also saves asset load time when the page renders, as they’re all loaded in a single file on the server instead of having the browser load them individually.

This makes it a little more difficult to debug individual files, but it’s not impossible with modern-day browser consoles (Firebug, Chrome developer tools, etc.). We also did some quick benchmarking against the old stack to set expectations and rules for allowable asset-load time per page.

Our next steps would be to migrate the public-facing magnification code/gzipping scheme to internal use and speed things up even more. Until then, we’re cruising steady with our optimized front-end performance, picking up thumbing Javascript features for an easy ride.

4 responses

Building websites with science

Posted by peterseibel | Filed under data, philosophy

At Etsy, we want to use data to inform what we do. We have an awesome culture of obsessing over graphs and building dashboards to track our progress. But beyond tracking the effects of our changes from minute to minute via graphs on big screens—which is a great way to see if a change has had some dramatic effect on, say, performance or error rates—there’s another, slightly slower paced way to use data: we can gather data over days or even weeks and then analyze it to determine whether the changes we have made to the web site have had the effects we expected them to.

There are different levels of awareness of data that different companies and teams within one company can have. Here at Etsy our search team, perhaps because of the inherently data-centric nature of their product, has been one of the leading users of data-driven development but the techniques they use can be applied to almost any kind of product development and we’re working to spread them throughout the company. In this post I’ll describe some of the different levels and finish up with some caveats about using science in product development.

Data? What’s that?

If you are building websites in the least scientific way possible your workflow probably goes like this: Think of a change we might make to the website (“I think I’ll change the background color on the Valentine’s Day page from white to pink”). This may be based on a theory about why that change makes sense (“It’ll put people in a more festive mood and they’ll buy more stuff”) or it may just “feel right”. In either case, you make the change and move on to your next idea. Unfortunately, people often disagree about whether something’s a good idea—everybody has an opinion and without any data there’s no way to escape the realm of pure opinion. The mark of a group operating at this level of no data use is the large amount of time people spend arguing about whether changes are worthwhile or not, both before and after the changes are made.

You know, we’ve got these web logs

Of course nobody really builds websites with quite this level of obliviousness since no one can resist the temptation to look at web logs and make up a story about what’s going on. A team operating at the next level of data awareness would think up an idea (pink background), make the change, and then look at the web logs to see how many people are buying things from the new, pink Valentine’s Day page. This is an improvement over the first level—if things change, at least you’ll know—but not much of one. The problem is that even though the data may tell you whether things are getting better or worse there’s almost no way to know whether the change is due to the pink background or some other factor. Maybe it’s getting closer to Valentine’s Day so people are more likely to buy Valentine’s Day related products. Or maybe over time, the user base has shifted to include people who are more likely to actually buy things. In other words there are any of a number of “confounding factors” that may be causing sales to go up or down, independent of the changed background. The mark of a team at this level is that the same amount of time is spent arguing but now the arguments are about how to interpret the data. Folks who think the change was good will argue that improving sales are the result of the change and declining sales due to something else while folks who don’t like the change will argue the opposite. But the arguments take just as long and are just as inconclusive.

Hey kids, let’s do “science”!

The obvious next step, for many people, is to try to control for the many confounding factors that result from the passage of time. Unfortunately there are a lot of wrong ways to do that. For instance, you might say, “Let’s use the pink background only on Thursdays and then compare the purchase rate on Thursday to that of rest of the week.” While this is a step in the right direction, and is probably easy to implement, it is not as useful as you might think. While you have controlled for some confounding factors, there are still many others that could still be intruding. Any difference that could possibly exist between the folks who visit on Thursday and those who come during the rest of the week could provide an alternate explanation of any difference in purchase rates. Just because there’s no obvious reason why Thursday visitors would be richer or more profligate or just more into Valentine’s Day than the average user, they might be and if they are it will confound the results. At this stage there are likely to be fewer useless arguments but there is a quite high chance of making bad decisions because the data don’t really say what folks think they say.

That’s so random

Luckily there’s one simple technique that neatly solves this problem. Instead of dividing visitors by what day they visit the site or some other seemingly arbitrary criteria, divide them by the one criteria which, by definition, is not linked to anything else: pure randomness. If every user is assigned to either the pink or white background at random, both groups should be representative of your total user base in all the respects you can think of and all the ones you can’t. If your total user base is 80% women, both groups will be approximately 80% women. If your user base has 1% of people who hate the color pink, so will both your experimental groups. And if your Thursday users, or any other subset, are for whatever reason more likely to buy than the average user, they will now be distributed evenly between both the pink and white background groups.

Another way to look at it is that if you observe a difference in purchasing rate between two randomly assigned groups then you only have to consider two possibilities: that the difference is actually due to the background color or that it’s due to random chance. That is, just as it’s possible to flip a coin ten times and get eight heads and two tails, it’s possible that the random assignment put a disproportionate number of big spenders in one group and that’s why the purchasing rates differ. However statistics provides tools that can tell you how likely it is that an observed difference is due to chance. For example, statistics can tell you something like: “If the background color in fact had no effect on purchasing rate, there’s a 0.002% chance that you would have seen the difference you did due to the random assignment of users to the two groups.” If the chance is small enough, then you can be quite confident that the only other possible explanation, that the background color had an effect, is true.

This is what data geeks mean when we say something is statistically significant: results are statistically significant when the chance of having achieved them by pure chance is sufficiently small. How small is sufficiently small is not fixed. One common threshold is 95% confidence which means that the chance of getting the results by chance has to be less than 5%. But note that at that threshold, one out of every twenty experiments where the change actually has no effect, will get a result that passes the threshold for statistical significance leading you to think the change did have an effect. This is called a false positive. You can control how many false positives you get by how you choose your threshold. With a higher confidence level, say 99%, you will have fewer false positives but will increase your rate of false negatives—times when the change did have an effect but the measured difference isn’t great enough to be considered statistically significant. In addition to the confidence level, the likelihood of a false negative depends on the size of the actual effect (the larger the effect, the less likely it is to be mistaken for random fluctuation) and the number of data points you have (more is better).

There are, however, two things to note about statistical significance. One is that there is no law that says what confidence level you have to use. For many changes 95% might be a good default. But imagine a change that is going to have a significant operational cost if rolled out to all users, maybe a change that will require expensive new database hardware. In such a case you might want to be 99% or even 99.9% confident that any observed improvement is real before deciding to turn on the feature for everyone.

The second thing to keep in mind is that statistical significance is not the same as practical significance. With enough data you can find statistically significant evidence for tiny improvements. But that doesn’t mean the tiny improvements are practically useful. Again, consider a feature that will cost real money to deploy. You may have highly statistically significant evidence that a change gives somewhere between a .0001% and .00015% improvement in conversion rate, but if even the top end of such a small change won’t produce enough new revenue to outweigh the cost of the feature you probably shouldn’t roll it out.

Any group that understands these issues and is running randomized tests is truly using data to understand the effects of their changes and to guide their decisions. The marks of a team operating at this level are that at least some people spend their time arguing about statistical arcana—what is the proper alpha level, Bayesian vs frequentist models, and exactly what statistical techniques to use for hypothesis testing—as well as coming up with more sophisticated statistical analyses to glean more knowledge from the data gathered.

Real Science™

Note, however, that no matter how fancy the analysis, the only knowledge built is about the effect of each particular change. Each new change requires starting all over—coming up with an idea, testing it with randomly assigned groups of users, and analyzing the data to see if it made things better or worse.

Which may not be too bad. If you have reasonably good intuition about what kind of changes are likely to be improvements, and you aren’t running out of ideas for things to do, you may be able to get pretty far with that level of data-driven development. But just as using data to decide what changes are actually successful is often better than relying on people’s intuition, you can also use data to help improving people’s intuition by by providing them with actual knowledge about what kinds of things generally do and do not work. To do this you need to follow the classical cycle of the scientific method:

  1. Ask a question
  2. Formulate a hypothesis
  3. Use the hypothesis to make a prediction
  4. Run experiments to test the prediction
  5. Analyze the results, refining or rejecting the hypothesis
  6. Communicate the results
  7. Repeat

For instance, you might ask a question like, what kind of images make people more likely to purchase things on my site? One hypothesis might be that pictures of people, as opposed to pictures of inanimate objects, make visitors more likely to purchase. An obvious prediction to make would be, adding a picture of a person to a page will increase the conversion rate for users who visit that page. And the prediction is easy to test: pick a page where it would make sense to add a picture of a person, do a random A/B test with some users seeing the person and others some other kind of picture, and measure the conversion rate for that page.

If the results of the experiment are in line with your prediction that’s good but you’re not done—next you want to replicate the result, maybe on a different kind of page. And you may also want to test other predictions that follow from the same hypothesis—does removing pictures of people from a page depress the conversion rate? You can also refine your hypothesis—perhaps the picture must include a clearly visible face. Or maybe you’ll discover that pictures of people increase sales only in certain contexts.

Whatever you discover you then need to communicate your results—write them down and make it part of your institutional knowledge about how to build your web site. It’s unlikely that you’ll end up with any equations as simple as E = mc2 or F = ma that can guide you unerringly to profitability or happier users or whatever it is you care most about. But you can still build useful knowledge. Even if you just know things like, “in these contexts pictures of people’s faces tend to increase click through” you can use that knowledge when designing new features instead of being doomed to randomly wander the landscape of possible features with only local A/B tests to tell you whether you’re going up or down.

The limits of science

Science, however, cannot answer all questions and it’s worth recognizing some of its limitations. Indeed science is neither necessary nor sufficient: Jonathan Ive at Apple, working with his hand-picked team of designers behind the tinted windows of a lab from which most Apple employees were barred, didn’t—so far as we know—use A/B testing to design the iPhone. And Google, for all their prowess at marshaling data, has not been able A/B test their way to a successful social networking product.

Here are a few of the practical problems with trying to design products purely with science:

A/B testing is a hill climbing technique If you imagine all the variants of all the features you could possibly build arranged on a landscape, with better variants at higher altitude, then A/B testing is a great way to climb toward better versions of an existing feature. But once we’re at a local maximum, every direction leads downhill. The only way off a local maximum is to leap into space and try to land (or hope that you do) somewhere on a gradient that leads to a higher peak.

If you do leap off toward a new hill, you can’t use A/B testing to compare where you land to where you were because most likely you’ll land somewhere below the peak of the new hill and quite possibly lower than where you were on the old hill. But if the new peak is actually higher than the old one, you’d do well to stay on the new hill and use A/B testing to climb to the higher local maximum. Once we’ve made the leap, science and experiments definitely come back into play but they’re of limited use in making the leap itself.

A/B testing is only as good as our A and B ideas Another way to look at A/B testing is as a technique for comparing ideas. If neither the A nor B idea is very good, then A/B testing will just tell you which sucks less. Knowledge built with experiments, will—we hope—help designers navigate the space of possible ideas but science doesn’t say a lot about where those good ideas ultimately come from. Even in “real” science exactly how scientists come up with interesting questions and good hypothesis is a bit mysterious.

You can’t easily learn about people who aren’t using your product The basis for statistical inference is the sample. We learn about a larger population by measuring the behavior of a random sample from that population. If we want to learn about the behavior of current Etsy users we’re all set—run an experiment with a random sample of users and draw our inferences. But what if we want to know whether a new feature will expand our user base? It’s harder (though not necessarily impossible) to get a random sample of people who aren’t currently using Etsy but who might be interested in some as-yet-undeveloped Etsy feature. Consequently, there’s a real risk of looking under the streetlamp (running experiments on our current users) because the light is better, rather than looking somewhere where we might actually find the keys to a hot new car.

Finally there’s one larger structural problem with the scientific method captured by a famous paraphrase of something Max Planck, founder of quantum theory and the 1918 physics Nobel laureate said: “Science progresses one funeral at a time.” In other words, it can be difficult to let go of hard-won scientific “truths”. After you’ve gone to all the trouble to ask an interesting question, devise a clever hypothesis, and design and run the painstaking experiments that convince everyone that your hypothesis is correct, you don’t really want to hear about how your hypothesis doesn’t fit with some new bit of data and is bound to be replaced by some new theoretical hotness. Or as Arthur C. Clarke put it in his First Law:

When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.

In product development, instead of distinguished but elderly scientists we have experienced product managers. But we have the same risk of holding on too long to hard-won “scientific” truths. If we do everything right, running experiments to test all our hypothesis about how the product works and even making occasional bold leaps into new territory to escape the trap of A/B incrementalism, eventually accumulated “scientific” knowledge may become a liability. And the danger is even worse in product development than it is in “real” science since we’re not dealing with universal and timeless truths. Indeed, each product you launch (to say nothing of what the rest of the world is doing) changes the landscape into which new products will be launched. What you learned yesterday may no longer apply tomorrow, in large part because of what you do today. Properly applied, science can lead us out of such difficulties but it’s worth noting that it was science, by allowing us to know things at all, that can mislead us into thinking we know a bit more than we really do.

So what to do?

Science is powerful and it’d be silly to abandon it entirely just because there are ways it can go wrong. And it’d be equally silly to think that science can somehow replace strong, talented product designers. The goal should be to combine the two.

The strength of the lone genius designer approach (c.f. Apple) is that when you actually have a genius you get a lot of good ideas and can take giant leaps to products that would never have been arrived at by incremental revisions of anything already in the world. Furthermore, with a single designer imposing their vision on a product, it is more likely to have a coherent, holistic design. On the other hand, if you give your lone genius completely free rein then there is no protection from their occasional bad ideas.

The data-driven approach, (c.f. Google), on the other hand, is much stronger on detecting the actual quality of ideas—bad ideas are abandoned no matter who originally thought they were good and the occasional big wins are clearly identified. But, as noted above, it doesn’t have much to say about where those ideas come from. And it can fall into the trap of focusing on things that are easy to measure and test—don’t know what shade of blue to use? A/B test 41 different shades. (An actual A/B test run at Google.)

Ultimately the goal is to make great products. Great ideas from designers are a necessary ingredient. And A/B testing can definitely improve products. But best is to use both: establish a loop between good design ideas leading to good experiments leading to knowledge about the product leading to even better design ideas. And then allow designers the latitude to occasionally try things that can’t yet be justified by science or even things that my go against current “scientific” dogma.

10 responses

Intrinsically Motivated Full Stack Product Hacker

Posted by Jason Davis | Filed under people

Photo by kerrdelune

Thales of Miletus was said to have been so intent on watching the stars that he fell into a well. Herein lies an invitation to join those of us, quite comfortable with this as an occupational hazard, happily contemplating from the bottom.

Etsy seeks full-stack, whole-site product hackers. These will be talented and intrinsically motivated individuals that can operate without supervision. How will you know that the job is right for you?

You consider critical thinking to be among your core competencies. Rigor is important to you. You are prone to quixotic behavior. You dedicate time and effort communicating nuance to Manichaeans. Honesty, integrity, and a firm grip on reality are much more important to you than being right. You recognize the inherent limitations of your own wetware, and you do your best to work around them. You find argument exhilarating.

You have mastered more than one technology stack in your life, and you are comfortable with the idea of mastering several more. You’re competent at high levels and low levels. But technology is a means and not the ends for you, and you don’t flinch at the idea of writing largely PHP for a living. We’re here to change people’s lives, and in that light whether this stupid thing is a LISP-1 or a LISP-2 seems irrelevant. You do know the difference between those categories, and when you’re wearing a different hat you care about it.

Most importantly for this job, it never even occurs to you that there might be someone else you could bug to fix your immediate problem. You take no dependencies. Not only are you not married to your implementation, you can name dozens of things that you frankly hate about it.

You are thoughtful, and you ship.

http://www.etsy.com/full-stack-product-hacker

(copy by mcfunley)

9 responses

Navigating Etsy Support

Posted by Joe | Filed under people, support

At Etsy we get about 10,000 emails a week from our members.  Since Etsy depends on the community, it’s important for us to respond quickly with detailed answers.  Additionally, we have to address the problem of spammers.  Previously we used a third party system for managing our support.  Quickly we found that in order to provide service of the highest possible quality, we had to build our own.  So we did!

The Compass Inbox (names were randomly generated for this example)

Introducing Compass, our tool for navigating the wide range of emails we get.  This tool manages the inboxes for many different departments.  Generally, it’s designed for communications that resemble a “Task.”  Tasks encapsulate email threads, and have a state of “open” or “closed.”  If we respond to an email asking for more information, we can leave the Task open, indicating that member still needs attention.  Tasks also support assignment, so that as emails come into a general mailbox, any Etsy employee may assign it to themselves indicating they will answer it.  Any future communication with that member will then automatically go to the same employee.

Since Tasks are the cornerstone to the operation of Compass, it’s very easy to use the system for many internal needs.  Recently, we just added a mailbox for the security team to receive feedback from the community on suspected security weaknesses.

The Knowledge Base ensures that we have consistent, detailed answers to email.  This was developed as a separate tool, which Compass calls using REST.  The Knowledge Base has over 1,500 answers to common problems.  These answers, or “articles,” are translated into several languages, and also have different versions for different mediums such as email, twitter, forums, and chat.

Compass leverages the content of the Knowledge Base by allowing a support agent to fill an email with the most appropriate article.  They usually do this using a shortcut, allowing them to give detailed, comprehensive answers very quickly.  Agents can then customize the email further to make it personal.  The benefits of this system go far beyond high quality responses: we track which articles are used, giving us easy access to metrics showing what people are asking about.  This allows us to track current trending issues by the hour and identify longer term product issues our members continually struggle with.

Currently, our most used article (by a somewhat large margin,) is “Contact seller.”  This article informs the member that they should direct their question to the seller of the item they’re asking about.  It shows us that it is difficult for many users of the site to distinguish between Etsy and the independent shop owner.  We’re always improving the site to provide a better experience in response to this collective feedback.

As with all things, the devil is in the details.  How should we handle spam?  Also, members often use attachments, so how should we handle malware?  We’ve used the same solution for both problems: Google.  All of our email accounts are actually GMail accounts in Google Apps.  Google handles all of the spam filtering for us, then we retrieve the legitimate emails over IMAP.  After we’ve parsed and stored the message, we upload any non-image attachments to Google Docs, which provides two great functions:  it checks for malware, and provides an in-browser view of the attachment.

It took a serious engineering effort to build this in-house support system, but the benefits have justified the costs.  Most days, we have a turnaround time of 12 hours or less, depending on volume.  We’re able to segment support into specialized teams, maximizing quality and efficiency.  We have detailed information about our users when viewing an email thread.  Notes left on an email thread are available across all tools.  We can create Tasks in Compass beyond email, tying in the “Contact Us” page and other system-generated messages.  Personally, I think the most important part about owning our own tool is control over the interface.  Many Etsy employees use Compass for 8 hours a day.  Even the smallest UI improvement makes a huge impact for over 50 users on 5 teams in 4 countries.

This was just a brief overview of what we’re doing with our internal tools.  Stay tuned for more specific posts about decisions we made, and things we learned.  Next up we’ll talk about the evolution of our javascript architecture.

If you’re interested in developing internal tools at Etsy, where your customer sits next to you and will come give you a hug when you release a new feature, the Marketplace Operations Engineering team is hiring.  Check out www.etsy.com/careers.

3 responses

Robots, Graphs, and Binary Search

Posted by Nick Galbreath | Filed under engineering, operations, security

We love our human customers. That said we get a lot of traffic from robots too. This is great, especially when they use the Etsy API. However, they sometimes misbehave. And they misbehave frequently in the late hours, not unlike a legendary East Village night club. This time Craig Ferguson isn’t at the door to keep an eye on things. Instead we monitor the robots attending our night club with graphs:


To do this we first compiled a list of where the robots live. This includes datacenters, places that make search engines and crawlers, and companies that lease out CPUs by the slice. The full list has over 1800 records, but here’s a sample:

So given this list, and an incoming IP address, how can one quickly tell who it belongs too, if anyone? Linear scan? No. Binary Search? Yes, and we like binary search.

To do this, the raw file is converted into a static PHP file. It converts all the dotted IPv4 addresses into the equivalent plain integers and turns all the data into an PHP array. As part of the file, there is a find method that uses a binary search algorithm modified to work with disjoint intervals. Can you spot the difference?

<?php
/* Autogenerated.  Do not edit */
class IpDb {
    public static function find($ipstr) {
        $ip = ip2long($ipstr);
        $haystack = self::$db;
        $high = count($haystack) - 1;
        $low = 0;
        while ($high >= $low) {
            $probe = floor(($high + $low) / 2);
            $row = $haystack[$probe];
            if ($row['_ip0'] > $ip) {
                $high = $probe - 1;
            } else if ($row['_ip1'] < $ip) {
                $low = $probe + 1;
            } else {
                return $row;
            }
        }
        return null;
    }
    static public $db = array (
  0 => 
  array (
    '_ip0' => 34603008,
    '_ip1' => 35127295,
    'owner' => 'http://akamai.com/',
  ),
  1 => 
  array (
    '_ip0' => 134488576,
    '_ip1' => 134489087,
    'owner' => 'http://www.peakwebhosting.com/',
  ),
  2 => ... etc...

The resulting file is pushed out in the same way all our other code is. Using this mini WHOIS database is just a PHP function call without any database or network access. The same technique can easily be done in other languages or used for other applications involving date-time intervals.

Fraud Engineering presentation thumbnailThis is just one of many ways we monitor the site for bad behavior (robots or otherwise). Seeing how much traffic comes from where “hands aren’t on keyboard” has given us many insights into how the site is used. Some more details on how this data is used can be found in presentation on Fraud Engineering given at the Merchant Risk Council conference in March.

Finally, let’s not forget that not all robots are bad.

Update: as soon as I posted this, my colleague Jerry Soung informed he implemented the same thing for a different IP database. Binary search FTW.

7 responses

Blameless PostMortems and a Just Culture

Posted by jallspaw | Filed under engineering, people

Last week, Owen Thomas wrote a flattering article over at Business Insider on how we handle errors and mistakes at Etsy. I thought I might give some detail on how that actually happens, and why.

Anyone who’s worked with technology at any scale is familiar with failure. Failure cares not about the architecture designs you slave over, the code you write and review, or the alerts and metrics you meticulously pore through.

So: failure happens. This is a foregone conclusion when working with complex systems. But what about those failures that have resulted due to the actions (or lack of action, in some cases) of individuals? What do you do with those careless humans who caused everyone to have a bad day?

Maybe they should be fired. Or maybe they need to be prevented from touching the dangerous bits again. Or maybe they need more training.

This is the traditional view of “human error”, which focuses on the characteristics of the individuals involved. It’s what Sidney Dekker calls the “Bad Apple Theory” – get rid of the bad apples, and you’ll get rid of the human error. Seems simple, right?

We don’t take this traditional view at Etsy. We instead want to view mistakes, errors, slips, lapses, etc. with a perspective of learning. Having blameless Post-Mortems on outages and accidents are part of that.

A Blameless Post-Mortem

What does it mean to have a ‘blameless’ Post-Mortem?
Does it mean everyone gets off the hook for making mistakes? No.

Well, maybe. It depends on what “gets off the hook” means. Let me explain.

Having a Just Culture means that you’re making effort to balance safety and accountability. It means that by investigating mistakes in a way that focuses on the situational aspects of a failure’s mechanism and the decision-making process of individuals proximate to the failure, an organization can come out safer than it would normally be if it had simply punished the actors involved as a remediation.

Having a “blameless” Post-Mortem process means that engineers whose actions have contributed to an accident can give a detailed account of:

  • what actions they took at what time,
  • what effects they observed,
  • expectations they had,
  • assumptions they had made,
  • and their understanding of timeline of events as they occurred.

…and that they can give this detailed account without fear of punishment or retribution.

Why shouldn’t they be punished or reprimanded? Because an engineer who thinks they’re going to be reprimanded are disincentivized to give the details necessary to get an understanding of the mechanism, pathology, and operation of the failure. This lack of understanding of how the accident occurred all but guarantees that it will repeat. If not with the original engineer, another one in the future.

We believe that this detail is paramount to improving safety at Etsy.

If we go with “blame” as the predominant approach, then we’re implicitly accepting that deterrence is how organizations become safer. This is founded in the belief that individuals, not situations, cause errors. It’s also aligned with the idea there has to be some fear that not doing one’s job correctly could lead to punishment. Because the fear of punishment will motivate people to act correctly in the future. Right?

This cycle of name/blame/shame can be looked at like this:

  1. Engineer takes action and contributes to a failure or incident.
  2. Engineer is punished, shamed, blamed, or retrained.
  3. Reduced trust between engineers on the ground (the “sharp end”) and management (the “blunt end”) looking for someone to scapegoat
  4. Engineers become silent on details about actions/situations/observations, resulting in “Cover-Your-Ass” engineering (from fear of punishment)
  5. Management becomes less aware and informed on how work is being performed day to day, and engineers become less educated on lurking or latent conditions for failure due to silence mentioned in #4, above
  6. Errors more likely, latent conditions can’t be identified due to #5, above
  7. Repeat from step 1

We need to avoid this cycle. We want the engineer who has made an error give details about why (either explicitly or implicitly) he or she did what they did; why the action made sense to them at the time. This is paramount to understanding the pathology of the failure. The action made sense to the person at the time they took it, because if it hadn’t made sense to them at the time, they wouldn’t have taken the action in the first place.

The base fundamental here is something Erik Hollnagel has said:

We must strive to understand that accidents don’t happen because people gamble and lose.
Accidents happen because the person believes that:
…what is about to happen is not possible,
…or what is about to happen has no connection to what they are doing,
…or that the possibility of getting the intended outcome is well worth whatever risk there is.


A Second Story

This idea of digging deeper into the circumstance and environment that an engineer found themselves in is called looking for the “Second Story”. In Post-Mortem meetings, we want to find Second Stories to help understand what went wrong.

From Behind Human Error here’s the difference between “first” and “second” stories of human error:

First Stories Second Stories
Human error is seen as cause of failure Human error is seen as the effect of systemic vulnerabilities deeper inside the organization
Saying what people should have done is a satisfying way to describe failure Saying what people should have done doesn’t explain why it made sense for them to do what they did
Telling people to be more careful will make the problem go away Only by constantly seeking out its vulnerabilities can organizations enhance safety

 

Allowing Engineers to Own Their Own Stories

A funny thing happens when engineers make mistakes and feel safe when giving details about it: they are not only willing to be held accountable, they are also enthusiastic in helping the rest of the company avoid the same error in the future. They are, after all, the most expert in their own error. They ought to be heavily involved in coming up with remediation items.

So technically, engineers are not at all “off the hook” with a blameless PostMortem process. They are very much on the hook for helping Etsy become safer and more resilient, in the end. And lo and behold: most engineers I know find this idea of making things better for others a worthwhile exercise.

So what do we do to enable a “Just Culture” at Etsy?

  • We encourage learning by having these blameless Post-Mortems on outages and accidents.
  • The goal is to understand how an accident could have happened, in order to better equip ourselves from it happening in the future
  • We seek out Second Stories, gather details from multiple perspectives on failures, and we don’t punish people for making mistakes.
  • Instead of punishing engineers, we instead give them the requisite authority to improve safety by allowing them to give detailed accounts of their contributions to failures.
  • We enable and encourage people who do make mistakes to be the experts on educating the rest of the organization how not to make them in the future.
  • We accept that there is always a discretionary space where humans can decide to make actions or not, and that the judgement of those decisions lie in hindsight.
  • We accept that the Hindsight Bias will continue to cloud our assessment of past events, and work hard to eliminate it.
  • We accept that the Fundamental Attribution Error is also difficult to escape, so we focus on the environment and circumstances people are working in when investigating accidents.
  • We strive to make sure that the blunt end of the organization understands how work is actually getting done (as opposed to how they imagine it’s getting done, via Gantt charts and procedures) on the sharp end.
  • The sharp end is relied upon to inform the organization where the line is between appropriate and inappropriate behavior. This isn’t something that the blunt end can come up with on its own.

Failure happens. In order to understand how failures happen, we first have to understand our reactions to failure.

One option is to assume the single cause is incompetence and scream at engineers to make them “pay attention!” or “be more careful!”
Another option is to take a hard look at how the accident actually happened, treat the engineers involved with respect, and learn from the event.

That’s why we have blameless Post-Mortems at Etsy, and why we’re looking to create a Just Culture here.

20 responses

Two Sides For Salvation

Posted by Arie Kachler | Filed under databases, infrastructure, operations

How do you make changes to your database’s structure that’s getting hammered 24×7 without any disruption? If you use Oracle and paid millions for it, it’s built in. If you use Mysql, it’s one of the holy grails of database operations, and one we’ve learned to do here at Etsy.

We have a sharded architecture, which means data is scattered across several “shards”. Each shard has different data than all others. Each shard is a master-master pair. MM pairs are masters and slaves at the same time. They not only give you fault tolerance, they divide the read and write load between them that’s impossible to do in the common master-slave(s) setup. MM pairs have their own set of challenges.

Don’t Let Your Database Generate Anything

The main problem with MM pairs is caused by non-deterministic values generated by the database engine itself, such as autoincrement fields, random numbers and timestamps. The solution to that is that we don’t let the db generate anything. Every value inserted/modified always comes from the application. This allows us to write to either of the two sides of a MM pair knowing it will get replicated to the other side correctly. I’ve heard that MM pairs don’t make sense since you’re executing everything twice. It’s true that you are executing everything twice, but you’re doing it already if you’re using a master-slave(s) setup, and the benefits that come from MM pairs are huge. In addition to giving you fault tolerance and load balancing, they are the key to being able to do non-disruptive, live schema changes.

The other part of the puzzle is our shard-aware software layer: our in-house built ORM. It does many different things, but for our current topic, it finds where in our shards a particular object’s data lives. Whenever we need to access the data for an object, the ORM first goes to one of two “index” servers we have, then go to the shard that has the needed data. These index servers are also a MM pair. Index servers get a very large amount of queries, but they are all extremely fast, all in the order of 10-100𝜇s. It’s common for sharded architectures not to have an index server. You simply decide on a sharding scheme when you start, say by user id, then divide the data among your shards knowing where ranges of users live. Everything works great until the number of users on a shard grows beyond what one shard can handle, and by then you’re already in trouble. By having an index server, we can move data between shards and simply update the index to point to the new location.

Our ORM reads a configuration file when it starts, that among other things, contains the list of shard servers available to it. We can add shards as needed with time and add them to the configuration file to start writing data to them, also migrating users so new shards are not idle at first and to balance the load among all shards.

The kicker: when we do schema changes, we take out one server from each of the MM pairs from the configuration file and gracefully restart the application. The ORM re-reads its configuration and knows only about the active shard sides. This leaves the application running on half of our database servers. Nobody notices. We immediately see in our many graphs that one side’s traffic plummets and the other side is taking all the load.

Note that half of the servers does not mean half of the data. All data lives on both sides of a MM pair. Replication is still going both ways, we never break it. The active side simply stops getting inserts/updates/deletes from the inactive side because nothing is connecting to it. But the inactive side still gets inserts/updates/deletes from the active side since it’s still a slave. We could break replication for the ALTERs, but there’s no benefit in doing so and adds an unnecessary step (with the one exception of the session we’re actively doing ALTERs in. We don’t want those to replicate.)
At this point we are ready to make as many changes as we need on the inactive side. In Mysql terms, ALTERs. These ALTERs can take anywhere from minutes to hours to complete and lock the tables they are modifying, but we’re operating on the inactive side and definitely don’t want any of our work to replicate to the other side, so we prepend ALTERs with SET SQL_LOG_BIN=0.
When these alters are done, they have been applied to the inactive side only.
Another change in the config file places these servers back into active mode. We wait for load to stabilize between both sides, replication to catch up if it has lagged behind, then we’re ready to repeat for the side that hasn’t been ALTER’ed.

Taking sides out of production is not only useful for schema changes, but for upgrades, configuration changes, and any other necessary downtime.

So this is all great, works well for us. We routinely do schema changes with no user impact. But what if you don’t have an ORM? Mysql Proxy may be your answer. It’s very simple to have web servers connect to a pool of available backend database servers with Mysql Proxy. You can read the documentation for it at Mysql’s website. An important feature of Mysql Proxy is that it allows you to change configuration on-the-fly, so you can take servers in and out without even having to stop or reload your application.

MM pairs have had a bad reputation of being quirky. They can be, but as long as you don’t let your database generate anything, they work. When you need to do frequent schema changes in a 24×7 environment, they are key to no-downtime schema changes.

If you want more details on our database architecture, you can also check here

13 responses

Etsy Hacker Grants: Supporting Women in Technology

Posted by Kellan Elliott-McCrea | Filed under Uncategorized

“Today, in conjunction with Hacker School, Etsy is announcing a new scholarship and sponsorship program for women in technology: we’ll be hosting the summer 2012 session of Hacker School in the Etsy headquarters, and we’re providing ten Etsy Hacker Grants of $5,000 each — a total of $50,000 — to women who want to join but need financial support to do so. Our goal is to bring 20 women to New York to participate, and we hope this will be the first of many steps to encourage more women into engineering at Etsy and across the industry.” – find out more, and then checkout Etsy Hacker Grants

4 responses

Kernel Debugging 101

Posted by avleenetsy | Filed under engineering, operations

A dark fog had been rolling in that night, and we had been setting up a new cluster of servers for our CI system. CentOS 6.2, LXC and random kernel panics were all there to lend a hand. The kernel panics were new to our party, having been absent at the previous cluster setup.

The first set of servers we installed had been running happily, however this new set were not. They would always kernel panic under the slightest load, and sometimes without much load at all.

A nice feature, which is enabled by default in CentOS 6, allows your kernel to dump a core file when it panics. When your system comes back, you can retrieve this core file, examine it and work out what happened. This is a short story on how we did it. The work we did is on CentOS but it can easily be applied to other Linux distributions too.

Basics and theory

When a kernel panics, it dumps a core file into /var/crash/, which can then be examined.
In your tool belt, you need:

  • /usr/bin/crash (installed through the “crash” package)
  • A debuginfo kernel (downloaded and installed from http://debuginfo.centos.org/6/x86_64/)
  • The vmcore file from /var/crash/....

crash is gdb-like. It uses gdb and lets you examine vmcore files from kernels.

Basic debugging

Start crash by pointing it at the vmlinux file installed by the debuginfo kernel, and the vmcore file:

sudo crash /usr/lib/debug/lib/modules/`uname -r`/vmlinux \
/var/crash/<time>/vmcore

You will see output like this:

      KERNEL: /usr/lib/debug/lib/modules/<kernel>/vmlinux
    DUMPFILE: /var/crash/127.0.0.1-2012-03-28-23:51:01/vmcore
  [PARTIAL DUMP]
        CPUS: 16
        DATE: Wed Mar 28 23:50:56 2012
      UPTIME: 00:23:26
LOAD AVERAGE: 0.95, 1.45, 1.01
       TASKS: 986
    NODENAME: buildtest07
     RELEASE: 2.6.32-220.7.1.el6.x86_64
     VERSION: #1 SMP Wed Mar 7 00:52:02 GMT 2012
     MACHINE: x86_64  (2400 Mhz)
      MEMORY: 24 GB
       PANIC: "Oops: 0000 [#1] SMP " (check log for details)
         PID: 0
     COMMAND: "swapper"
        TASK: ffff880337eb9580  (1 of 16)
[THREAD_INFO: ffff880637d18000]
         CPU: 12
       STATE: TASK_RUNNING (PANIC)

This tells us some important bits of information:

  • The command being run was swapperswapper is a kernel process responsible for scheduling time on the CPU. When a panic happens here, there’s a likelihood that we’re looking at something in the kernel space that broke, rather than something in user space.
  • The panic was an Oops.
  • There are dates, times, number of running processes and other handy information too.

If you now run:

crash> log

and jump to the end, you will see much longer output like this:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
IP: [<ffffffff8142bb40>] __netif_receive_skb+0x60/0x6e0
PGD 10e0fe067 PUD 10e0b0067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/virtual/block/dm-6/removable
CPU 12
Modules linked in: veth bridge stp llc e1000e serio_raw
                   i2c_i801 i2c_core sg iTCO_wdt
                   iTCO_vendor_support ioatdma dca i7core_edac
                   edac_core shpchp ext3 jbd mbcache sd_mod
                   crc_t10dif ahci dm_mirror dm_region_hash
                   dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted <kernel> #1 Supermicro X8DTT-H/X8DTT-H
RIP: 0010:[<ffffffff8142bb40>]  [<ffffffff8142bb40>] __netif_receive_skb+0x60/0x6e0
RSP: 0018:ffff88034ac83dc0  EFLAGS: 00010246
RAX: 0000000000000060 RBX: ffff8805353896c0 RCX: 0000000000000000
RDX: ffff88053e8c3380 RSI: 0000000000000286 RDI: ffff8805353896c0
RBP: ffff88034ac83e10 R08: 00000000000000c3 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000015 R14: ffff88034ac93770 R15: ffff88034ac93784
FS:  0000000000000000(0000) GS:ffff88034ac80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000060 CR3: 000000010e130000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffff880637d18000,
task ffff880337eb9580)
Stack:
 ffffc90013e37000 ffff880334bdc868 ffff88034ac83df0 0000000000000000
<0> ffff880334bdc868 ffff88034ac93788 ffff88034ac93700 0000000000000015
<0> ffff88034ac93770 ffff88034ac93784 ffff88034ac83e60 ffffffff8142c25a
Call Trace:
 <IRQ>
 [<ffffffff8142c25a>] process_backlog+0x9a/0x100
 [<ffffffff814308d3>] net_rx_action+0x103/0x2f0
 [<ffffffff81072001>] __do_softirq+0xc1/0x1d0
 [<ffffffff810d94a0>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff8100c24c>] call_softirq+0x1c/0x30
 [<ffffffff8100de85>] do_softirq+0x65/0xa0
 [<ffffffff81071de5>] irq_exit+0x85/0x90
 [<ffffffff814f4dc5>] do_IRQ+0x75/0xf0
 [<ffffffff8100ba53>] ret_from_intr+0x0/0x11
 <EOI>
 [<ffffffff812c4b0e>] ? intel_idle+0xde/0x170
 [<ffffffff812c4af1>] ? intel_idle+0xc1/0x170
 [<ffffffff813fa027>] cpuidle_idle_call+0xa7/0x140
 [<ffffffff81009e06>] cpu_idle+0xb6/0x110
 [<ffffffff814e5ffc>] start_secondary+0x202/0x245
Code: 00 44 8b 1d cb be 79 00 45 85 db 0f 85 61 06 00 00 f6 83 b9
      00 00 00 10 0f 85 5d 04 00 00 4c 8b 63 20 4c 89 65 c8 49 8d
      44 24 60 <49> 39 44 24 60 74 44 4d 8b ac 24 00 04 00 00 4d
      85 ed 74 37 49
RIP  [<ffffffff8142bb40>] __netif_receive_skb+0x60/0x6e0
 RSP <ffff88034ac83dc0>
CR2: 0000000000000060

At this level the most interesting thing to note is the line starting with BUG:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000058

This tells us why the panic happened: A NULL pointer dereference.

Intermediate debugging

Now some more gnarly stuff. If you’re still here, you’re pretty damn brave.
There some cool things we can do in crash which might help us.
For example, we can look at the list of running processes:

   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
      0      0   0  ffffffff81a8d020  RU   0.0       0      0  [swapper]
>     0      0   1  ffff880638628a80  RU   0.0       0      0  [swapper]
>     0      0   2  ffff880337d934c0  RU   0.0       0      0  [swapper]
>     0      0   3  ffff8806386294c0  RU   0.0       0      0  [swapper]
>     0      0   4  ffff880337dd3580  RU   0.0       0      0  [swapper]
>     0      0   5  ffff880637c84080  RU   0.0       0      0  [swapper]
>     0      0   6  ffff880337df1540  RU   0.0       0      0  [swapper]
>     0      0   7  ffff880637c84ac0  RU   0.0       0      0  [swapper]
>     0      0   8  ffff880337e33500  RU   0.0       0      0  [swapper]
      0      0   9  ffff880637c85500  RU   0.0       0      0  [swapper]
>     0      0  10  ffff880337e774c0  RU   0.0       0      0  [swapper]
>     0      0  11  ffff880637cf40c0  RU   0.0       0      0  [swapper]
>     0      0  12  ffff880337eb9580  RU   0.0       0      0  [swapper]
>     0      0  13  ffff880637cf4b00  RU   0.0       0      0  [swapper]
>     0      0  14  ffff880337ed7540  RU   0.0       0      0  [swapper]
>     0      0  15  ffff880637cf5540  RU   0.0       0      0  [swapper]
      1      0   8  ffff880638628040  IN   0.0   21364   1568  init
      2      0   8  ffff880337c714c0  IN   0.0       0      0  [kthreadd]
      3      2   0  ffff880337c70a80  IN   0.0       0      0  [migration/0]
      4      2   0  ffff880337c70040  IN   0.0       0      0  [ksoftirqd/0]
      5      2   0  ffff880337c99500  IN   0.0       0      0  [migration/0]
....

The list continues on for another 950 lines. The lines starting with a > indicate the processes currently active on a CPU. We are working on a 16 core system, and 14 of them are currently in kernel mode tending to CPU scheduling duties. (The other 2 further down were gmond and php.)

You can also change the context you’re currently in. For example, currently we’re looking at the CPU where the kernel panic happened. However, we can switch to another CPU if we wanted to using set -c <CPU> and examine things there. (We don’t want to do this yet.)

Speaking of our kernel panic! Let’s get back to that..
The code at the end of the previous section tells us a null pointer dereference was the cause of the crash, and it gives us some clues as to what happened:

IP: [<ffffffff8142bb40>] __netif_receive_skb+0x60/0x6e0

Here’s the code that was being executed! __netif_receive_skb, this sounds like it might have something to do with receiving things on the network interface, if I had to guess! (and if I did, I’d be right – many things in the kernel are named to be obvious).

We could even grep through the kernel source to find this!

shell> ack -a __netif_receive_skb
net/core/dev.c
    2705:int __netif_receive_skb(struct sk_buff *skb)

Looking at line 2705 of net/core/dev.c shows us the method that was running when things broke down.

Advanced debugging

Are you really still reading this? You’ve just gone from brave, to foolish! There is no turning back! Here’s the rabbit hole:

There’s a lot more information here! Just because we know the method that was invoked doesn’t help us that much. The method could be 2 lines and the solution could be clear, but more likely the method will be 100 lines long and make many references to other things. So let’s start poking at the memory to see what specifically happened:

IP: [<ffffffff8142bb40>] __netif_receive_skb+0x60/0x6e0

There are two each bits of information in this line:

  1. ffffffff8142bb40 is the memory address at which the problem occurred.
  2. 0x60 is the hex offset for the line in the memory that caused the problem. 0x60 is 96 in decimal – remember this!
    We can inspect the memory by doing this:

    crash> dis -rl ffffffff8142bb40
    /usr/src/debug/kernel-2.6.32-220.7.1.el6/.../net/core/dev.c: 2706
    0xffffffff8142bae0 <__netif_receive_skb>:       push   %rbp
    0xffffffff8142bae1 <__netif_receive_skb+1>:     mov    %rsp,%rbp
    0xffffffff8142bae4 <__netif_receive_skb+4>:     push   %r15
    0xffffffff8142bae6 <__netif_receive_skb+6>:     push   %r14
    0xffffffff8142bae8 <__netif_receive_skb+8>:     push   %r13
    0xffffffff8142baea <__netif_receive_skb+10>:    push   %r12
    0xffffffff8142baec <__netif_receive_skb+12>:    push   %rbx
    0xffffffff8142baed <__netif_receive_skb+13>:    sub    $0x28,%rsp
    0xffffffff8142baf1 <__netif_receive_skb+17>:    nopl   0x0(%rax,%rax,1)
    /usr/src/debug/kernel-2.6.32-220.7.1.el6/.../net/core/dev.c: 2715
    0xffffffff8142baf6 <__netif_receive_skb+22>:    cmpq   $0x0,0x18(%rdi)
    /usr/src/debug/kernel-2.6.32-220.7.1.el6/.../net/core/dev.c: 2706
    0xffffffff8142bafb <__netif_receive_skb+27>:    mov    %rdi,%rbx
    /usr/src/debug/kernel-2.6.32-220.7.1.el6/.../net/core/dev.c: 2715
    0xffffffff8142bafe <__netif_receive_skb+30>:    jne    0xffffffff8142bb16 <__netif_receive_skb+54>
    /usr/src/debug/kernel-2.6.32-220.7.1.el6/.../arch/x86/include/asm/atomic_64.h: 23
    0xffffffff8142bb00 <__netif_receive_skb+32>:    mov    0xbdcbae(%rip),%eax        # 0xffffffff820086b4
    /usr/src/debug/kernel-2.6.32-220.7.1.el6/.../net/core/dev.c: 1364
    0xffffffff8142bb06 <__netif_receive_skb+38>:    test   %eax,%eax
    0xffffffff8142bb08 <__netif_receive_skb+40>:    jne    0xffffffff8142c008 <__netif_receive_skb+1320>
    /usr/src/debug/kernel-2.6.32-220.7.1.el6/.../net/core/dev.c: 1367
    0xffffffff8142bb0e <__netif_receive_skb+46>:    movq   $0x0,0x18(%rdi)
    /usr/src/debug/kernel-2.6.32-220.7.1.el6/.../include/trace/events/net.h: 68
    0xffffffff8142bb16 <__netif_receive_skb+54>:    mov    0x79becb(%rip),%r11d        # 0xffffffff81bc79e8
    0xffffffff8142bb1d <__netif_receive_skb+61>:    test   %r11d,%r11d
    0xffffffff8142bb20 <__netif_receive_skb+64>:    jne    0xffffffff8142c187 <__netif_receive_skb+1703>
    /usr/src/debug/kernel-2.6.32-220.7.1.el6/.../net/core/dev.c: 2719
    0xffffffff8142bb26 <__netif_receive_skb+70>:    testb  $0x10,0xb9(%rbx)
    0xffffffff8142bb2d <__netif_receive_skb+77>:    jne    0xffffffff8142bf90 <__netif_receive_skb+1200>
    /usr/src/debug/kernel-2.6.32-220.7.1.el6/.../include/linux/netpoll.h: 86
    0xffffffff8142bb33 <__netif_receive_skb+83>:    mov    0x20(%rbx),%r12
    0xffffffff8142bb37 <__netif_receive_skb+87>:    mov    %r12,-0x38(%rbp)
    0xffffffff8142bb3b <__netif_receive_skb+91>:    lea    0x60(%r12),%rax
    0xffffffff8142bb40 <__netif_receive_skb+96>:    cmp    %rax,0x60(%r12)

WOW! Lots of output!

What you’re looking at is the full list of steps that took place, starting with the method that was invoked and ending with the fault.
If you look at the last line of the output, you should recognise two key things:

0xffffffff8142bb40 <__netif_receive_skb+96>:    cmp    %rax,0x60(%r12)
  1. The memory address is the one we peeked
  2. The number 96 is the decimal offset in the memory where the fault occurred.
    A few lines above this, we see a new filename and line number mentioned, include/linux/netpoll.h:86.
    Congratulations! This is the actual line which caused the NULL pointer dereference!

But what is it actually DOING?

cmp    %rax,0x60(%r12)

Let’s examine this bit. In assembly, cmp is called to compare two registers. %rax and %r12 are two CPU registers which we want to compare. If we got back to our log output, we see the registers:

RAX: 0000000000000060 RBX: ffff8805353896c0 RCX: 0000000000000000
RDX: ffff88053e8c3380 RSI: 0000000000000286 RDI: ffff8805353896c0
RBP: ffff88034ac83e10 R08: 00000000000000c3 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000

RAX has some data in it, but R12 is NULL!

Now take a look at code that was mentioned earlier:

 84 static inline int netpoll_receive_skb(struct sk_buff *skb)
 85 {
 86         if (!list_empty(&skb->dev->napi_list))
 87                 return netpoll_rx(skb);
 88         return 0;
 89 }

If you know linux networking internals, a few things can be understood from this:

  1. We’re dealing with New API (NAPI) code, based on the variable name napi_list
  2. NAPI is invoked when the system thinks there are a large number of interrupts being handled, and changing to RX POLLING would be more efficient
  3. sk_buff is the struct that holds data about linux sockets
  4. We’re checking if the list, &skb->dev->napi_list is empty, so we’re expecting it to be defined, and be of the right type.

What is probably going on here, is that expecting &skb->dev->napi_list to hold a list of incoming packets that need to be processed. Unfortunately, it turns out that something, most likely napi_list, is not set, causing the NULL pointed dereference.

Congratulations! You’ve found the problem. Finding what the cause of this is, however, will be done in a future tutorial.
In the meantime we have some options for workarounds:

  1. Try and test if napi_list is NULL just before the !list_empty check, and define it. This might have other unforeseen issues though – what if it causes a memory leak? It’s certainly not fixing the cause of the problem.
  2. Disable the use of NAPI by recompiling the network driver. This is a pretty simple operation and probably worth trying, especially if the system isn’t massively CPU bound and dropping packets.

We ended up going with option 2, as it was the fastest and most likely fix.

But we’ll still be sleeping with one eye open…

6 responses

« newer postsolder posts »

Recent Posts

About

Etsy At Etsy, our mission is to enable people to make a living making things, and to reconnect makers with buyers. The engineers who make Etsy make our living with a craft we love: software. This is where we'll write about our craft and our collective experience building and running the world's most vibrant handmade marketplace.

Code as Craft is proudly powered by WordPress.com VIP and the SubtleFlux theme.

© Copyright 2013 Etsy