Building websites with science

Posted on June 21, 2012

At Etsy, we want to use data to inform what we do. We have an awesome culture of obsessing over graphs and building dashboards to track our progress. But beyond tracking the effects of our changes from minute to minute via graphs on big screens—which is a great way to see if a change has had some dramatic effect on, say, performance or error rates—there’s another, slightly slower paced way to use data: we can gather data over days or even weeks and then analyze it to determine whether the changes we have made to the web site have had the effects we expected them to.

There are different levels of awareness of data that different companies and teams within one company can have. Here at Etsy our search team, perhaps because of the inherently data-centric nature of their product, has been one of the leading users of data-driven development but the techniques they use can be applied to almost any kind of product development and we’re working to spread them throughout the company. In this post I’ll describe some of the different levels and finish up with some caveats about using science in product development.

Data? What’s that?

If you are building websites in the least scientific way possible, your workflow probably goes like this: think of a change you might make to the website (“I think I’ll change the background color on the Valentine’s Day page from white to pink”). This may be based on a theory about why that change makes sense (“It’ll put people in a more festive mood and they’ll buy more stuff”) or it may just “feel right”. In either case, you make the change and move on to your next idea. Unfortunately, people often disagree about whether something’s a good idea: everybody has an opinion, and without any data there’s no way to escape the realm of pure opinion. The mark of a group operating at this level of no data use is the large amount of time people spend arguing about whether changes are worthwhile or not, both before and after the changes are made.

You know, we’ve got these web logs

Of course nobody really builds websites with quite this level of obliviousness since no one can resist the temptation to look at web logs and make up a story about what’s going on. A team operating at the next level of data awareness would think up an idea (pink background), make the change, and then look at the web logs to see how many people are buying things from the new, pink Valentine’s Day page. This is an improvement over the first level—if things change, at least you’ll know—but not much of one. The problem is that even though the data may tell you whether things are getting better or worse there’s almost no way to know whether the change is due to the pink background or some other factor. Maybe it’s getting closer to Valentine’s Day so people are more likely to buy Valentine’s Day related products. Or maybe over time, the user base has shifted to include people who are more likely to actually buy things. In other words there are any of a number of “confounding factors” that may be causing sales to go up or down, independent of the changed background. The mark of a team at this level is that the same amount of time is spent arguing but now the arguments are about how to interpret the data. Folks who think the change was good will argue that improving sales are the result of the change and declining sales due to something else while folks who don’t like the change will argue the opposite. But the arguments take just as long and are just as inconclusive.

Hey kids, let’s do “science”!

The obvious next step, for many people, is to try to control for the many confounding factors that result from the passage of time. Unfortunately there are a lot of wrong ways to do that. For instance, you might say, “Let’s use the pink background only on Thursdays and then compare the purchase rate on Thursday to that of the rest of the week.” While this is a step in the right direction, and is probably easy to implement, it is not as useful as you might think. You have controlled for some confounding factors, but many others could still be intruding. Any difference that could possibly exist between the folks who visit on Thursday and those who come during the rest of the week could provide an alternate explanation of any difference in purchase rates. Even if there’s no obvious reason why Thursday visitors would be richer or more profligate or just more into Valentine’s Day than the average user, they might be, and if they are it will confound the results. At this stage there are likely to be fewer useless arguments but there is quite a high chance of making bad decisions because the data don’t really say what folks think they say.
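To see how badly the Thursday scheme can mislead, here is a minimal simulation sketch in Python. Everything in it is invented for illustration: the function name `thursday_experiment`, the 5% and 8% purchase rates, and the assumption that Thursday visitors simply buy more. The pink background is given no effect at all, yet the comparison still appears to show a healthy lift.

```python
import random

def thursday_experiment(n_visitors=50_000, seed=0):
    """Simulate 'pink on Thursdays only' when pink has NO real effect.

    Hypothetical assumption: Thursday visitors buy at 8% and everyone
    else at 5%, regardless of the background color they see.
    """
    rng = random.Random(seed)
    purchases = {"pink": 0, "white": 0}
    visitors = {"pink": 0, "white": 0}
    for _ in range(n_visitors):
        is_thursday = rng.random() < 1 / 7
        # The background is chosen by day of week, not at random.
        background = "pink" if is_thursday else "white"
        # The purchase rate depends only on the day, never on the background.
        purchase_rate = 0.08 if is_thursday else 0.05
        visitors[background] += 1
        if rng.random() < purchase_rate:
            purchases[background] += 1
    return {bg: purchases[bg] / visitors[bg] for bg in visitors}

if __name__ == "__main__":
    rates = thursday_experiment()
    print(f"pink purchase rate:  {rates['pink']:.3f}")
    print(f"white purchase rate: {rates['white']:.3f}")
```

The pink group comes out ahead only because it is made up entirely of Thursday visitors. The data cannot distinguish "pink works" from "Thursday shoppers are different".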

That’s so random

Luckily there’s one simple technique that neatly solves this problem. Instead of dividing visitors by what day they visit the site or some other seemingly arbitrary criterion, divide them by the one criterion which, by definition, is not linked to anything else: pure randomness. If every user is assigned to either the pink or white background at random, both groups should be representative of your total user base in all the respects you can think of and all the ones you can’t. If your total user base is 80% women, both groups will be approximately 80% women. If your user base has 1% of people who hate the color pink, so will both your experimental groups. And if your Thursday users, or any other subset, are for whatever reason more likely to buy than the average user, they will now be distributed evenly between both the pink and white background groups.
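One common way to implement this kind of assignment, sketched below in Python, is to hash a user identifier and use the result to pick a group. This is an assumption for illustration, not a description of how Etsy does it. The function name `assign_variant`, the experiment name, and the user id format are all hypothetical. A hash is only pseudo-random, but it is unrelated to anything about the user's behavior, and it has the practical advantage that the same user always sees the same background.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, variants=("white", "pink")) -> str:
    """Map a user to a variant in a way that is stable per user
    and spread evenly across users."""
    # Mixing in the experiment name means the same user can land in
    # different groups for different experiments.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

if __name__ == "__main__":
    counts = {"white": 0, "pink": 0}
    for i in range(10_000):
        counts[assign_variant(f"user-{i}", "valentines_background")] += 1
    print(counts)  # the two counts should be close to an even split
```

Because the assignment ignores everything about the visitor (day of week, gender, spending habits), each of those traits should show up in roughly the same proportion in both groups.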

Another way to look at it is that if you observe a difference in purchasing rate between two randomly assigned groups then you only have to consider two possibilities: that the difference is actually due to the background color or that it’s due to random chance. That is, just as it’s possible to flip a coin ten times and get eight heads and two tails, it’s possible that the random assignment put a disproportionate number of big spenders in one group and that’s why the purchasing rates differ. However statistics provides tools that can tell you how likely it is that an observed difference is due to chance. For example, statistics can tell you something like: “If the background color in fact had no effect on purchasing rate, there’s a 0.002% chance that you would have seen the difference you did due to the random assignment of users to the two groups.” If the chance is small enough, then you can be quite confident that the only other possible explanation, that the background color had an effect, is true.
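As a rough sketch of what those tools look like, the Python below computes two things: the chance of getting eight or more heads in ten fair coin flips, and a p-value for the difference between two purchase rates using a two-proportion z-test with a normal approximation. The z-test is one standard choice among several, picked here for brevity. The function names and the visitor counts are made up for illustration.

```python
import math

def prob_at_least_k_heads(flips: int, k: int) -> float:
    """Chance of seeing k or more heads in `flips` tosses of a fair coin."""
    favorable = sum(math.comb(flips, heads) for heads in range(k, flips + 1))
    return favorable / 2 ** flips

def two_proportion_p_value(buyers_a, visitors_a, buyers_b, visitors_b):
    """Two-sided p-value: how likely is a gap at least this large
    if both groups truly share the same purchase rate?"""
    rate_a = buyers_a / visitors_a
    rate_b = buyers_b / visitors_b
    pooled = (buyers_a + buyers_b) / (visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    if std_err == 0:
        return 1.0  # nobody (or everybody) bought, so there is no gap to explain
    z = (rate_b - rate_a) / std_err
    # For a standard normal, 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

if __name__ == "__main__":
    print(prob_at_least_k_heads(10, 8))  # 0.0546875
    # Hypothetical counts: 1,000 visitors per group, 100 vs 150 purchases.
    print(two_proportion_p_value(100, 1000, 150, 1000))
```

A small p-value does not say how big the effect is. It only says that random assignment alone would rarely produce a gap this large.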

This is what data geeks mean when we say something is statistically significant: results are statistically significant when the probability of having achieved them by pure chance is sufficiently small. How small is sufficiently small is not fixed. One common threshold is 95% confidence, which means that the probability of getting the results by chance has to be less than 5%. But note that at that threshold, one out of every twenty experiments where the change actually has no effect will get a result that passes the threshold for statistical significance, leading you to think the change did have an effect. This is called a false positive. You can control how many false positives you get by how you choose your threshold. With a higher confidence level, say 99%, you will have fewer false positives but will increase your rate of false negatives: times when the change did have an effect but the measured difference isn’t great enough to be considered statistically significant. In addition to the confidence level, the likelihood of a false negative depends on the size of the actual effect (the larger the effect, the less likely it is to be mistaken for random fluctuation) and the number of data points you have (more is better).
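These trade-offs are easy to check by simulation. The Python sketch below runs many pretend experiments and reports what fraction come out "significant". With identical true rates in both groups, that fraction is the false positive rate. With a real difference, it is the share of experiments that avoid a false negative. The purchase rates, group sizes, and function names are hypothetical, and the z-test with a normal approximation is just one reasonable choice of test.

```python
import math
import random

def p_value(buyers_a, n_a, buyers_b, n_b):
    """Two-sided p-value from a two-proportion z-test (normal approximation)."""
    pooled = (buyers_a + buyers_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0
    z = (buyers_b / n_b - buyers_a / n_a) / std_err
    return math.erfc(abs(z) / math.sqrt(2))

def significant_fraction(rate_a, rate_b, n_per_group,
                         experiments=1000, alpha=0.05, seed=0):
    """Fraction of simulated experiments whose p-value falls below alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(experiments):
        buyers_a = sum(rng.random() < rate_a for _ in range(n_per_group))
        buyers_b = sum(rng.random() < rate_b for _ in range(n_per_group))
        if p_value(buyers_a, n_per_group, buyers_b, n_per_group) < alpha:
            hits += 1
    return hits / experiments

if __name__ == "__main__":
    # No real effect: every "significant" result is a false positive.
    print("false positives at 95%:", significant_fraction(0.10, 0.10, 500))
    # Real effect (10% vs 15%): misses are false negatives.
    print("detected at 95%, n=500: ", significant_fraction(0.10, 0.15, 500))
    print("detected at 99%, n=500: ", significant_fraction(0.10, 0.15, 500, alpha=0.01))
    print("detected at 95%, n=2000:", significant_fraction(0.10, 0.15, 2000))
```

Under these assumptions the first number should land near 5%, the stricter threshold should detect the real effect less often, and the larger groups should detect it more often.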

There are, however, two things to note about statistical significance. One is that there is no law that says what confidence level you have to use. For many changes 95% might be a good default. But imagine a change that is going to have a significant operational cost if rolled out to all users, maybe a change that will require expensive new database hardware. In such a case you might want to be 99% or even 99.9% confident that any observed improvement is real before deciding to turn on the feature for everyone.

The second thing to keep in mind is that statistical significance is not the same as practical significance. With enough data you can find statistically significant evidence for tiny improvements. But that doesn’t mean the tiny improvements are practically useful. Again, consider a feature that will cost real money to deploy. You may have highly statistically significant evidence that a change gives somewhere between a .0001% and .00015% improvement in conversion rate, but if even the top end of such a small change won’t produce enough new revenue to outweigh the cost of the feature you probably shouldn’t roll it out.
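A back-of-the-envelope version of that reasoning can be sketched in Python. The code below computes an approximate 95% confidence interval for the improvement in conversion rate, then asks whether even the top of that interval would pay for the feature. All the numbers (visitor counts, revenue per sale, yearly cost) and the function names are hypothetical, chosen only to show a result that is statistically significant but not practically useful.

```python
import math

def lift_interval(buyers_a, n_a, buyers_b, n_b, z=1.96):
    """Approximate 95% confidence interval for (rate_b - rate_a)."""
    rate_a, rate_b = buyers_a / n_a, buyers_b / n_b
    std_err = math.sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    diff = rate_b - rate_a
    return diff - z * std_err, diff + z * std_err

def worth_shipping(lift_high, yearly_visitors, revenue_per_sale, yearly_cost):
    """Does the feature pay for itself even in the best case?"""
    best_case_revenue = lift_high * yearly_visitors * revenue_per_sale
    return best_case_revenue > yearly_cost

if __name__ == "__main__":
    # Hypothetical test: 50 million visitors per group, 2.000% vs 2.012% conversion.
    low, high = lift_interval(1_000_000, 50_000_000, 1_006_000, 50_000_000)
    print(f"lift is between {low:.6f} and {high:.6f}")
    print("statistically significant:", low > 0)
    # Hypothetical economics: 10M visitors a year, $20 per sale, $100k a year to run.
    print("worth shipping:", worth_shipping(high, 10_000_000, 20.0, 100_000.0))
```

With these made-up inputs the interval sits entirely above zero, so the improvement is very likely real, but the best-case extra revenue still falls well short of the cost.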

Any group that understands these issues and is running randomized tests is truly using data to understand the effects of their changes and to guide their decisions. The marks of a team operating at this level are that at least some people spend their time arguing about statistical arcana—what is the proper alpha level, Bayesian vs frequentist models, and exactly what statistical techniques to use for hypothesis testing—as well as coming up with more sophisticated statistical analyses to glean more knowledge from the data gathered.

Real Science™

Note, however, that no matter how fancy the analysis, the only knowledge built is about the effect of each particular change. Each new change requires starting all over—coming up with an idea, testing it with randomly assigned groups of users, and analyzing the data to see if it made things better or worse.

Which may not be too bad. If you have reasonably good intuition about what kind of changes are likely to be improvements, and you aren’t running out of ideas for things to do, you may be able to get pretty far with that level of data-driven development. But just as using data to decide what changes are actually successful is often better than relying on people’s intuition, you can also use data to help improve people’s intuition by providing them with actual knowledge about what kinds of things generally do and do not work. To do this you need to follow the classical cycle of the scientific method:

  1. Ask a question
  2. Formulate a hypothesis
  3. Use the hypothesis to make a prediction
  4. Run experiments to test the prediction
  5. Analyze the results, refining or rejecting the hypothesis
  6. Communicate the results
  7. Repeat

For instance, you might ask a question like, what kind of images make people more likely to purchase things on my site? One hypothesis might be that pictures of people, as opposed to pictures of inanimate objects, make visitors more likely to purchase. An obvious prediction to make would be, adding a picture of a person to a page will increase the conversion rate for users who visit that page. And the prediction is easy to test: pick a page where it would make sense to add a picture of a person, do a random A/B test with some users seeing the person and others some other kind of picture, and measure the conversion rate for that page.

If the results of the experiment are in line with your prediction that’s good but you’re not done—next you want to replicate the result, maybe on a different kind of page. And you may also want to test other predictions that follow from the same hypothesis—does removing pictures of people from a page depress the conversion rate? You can also refine your hypothesis—perhaps the picture must include a clearly visible face. Or maybe you’ll discover that pictures of people increase sales only in certain contexts.

Whatever you discover, you then need to communicate your results: write them down and make them part of your institutional knowledge about how to build your web site. It’s unlikely that you’ll end up with any equations as simple as E = mc² or F = ma that can guide you unerringly to profitability or happier users or whatever it is you care most about. But you can still build useful knowledge. Even if you just know things like, “in these contexts pictures of people’s faces tend to increase click through” you can use that knowledge when designing new features instead of being doomed to randomly wander the landscape of possible features with only local A/B tests to tell you whether you’re going up or down.

The limits of science

Science, however, cannot answer all questions and it’s worth recognizing some of its limitations. Indeed science is neither necessary nor sufficient: Jonathan Ive at Apple, working with his hand-picked team of designers behind the tinted windows of a lab from which most Apple employees were barred, didn’t, so far as we know, use A/B testing to design the iPhone. And Google, for all their prowess at marshaling data, has not been able to A/B test their way to a successful social networking product.

Here are a few of the practical problems with trying to design products purely with science:

A/B testing is a hill-climbing technique. If you imagine all the variants of all the features you could possibly build arranged on a landscape, with better variants at higher altitude, then A/B testing is a great way to climb toward better versions of an existing feature. But once we’re at a local maximum, every direction leads downhill. The only way off a local maximum is to leap into space and try to land (or hope that you do) somewhere on a gradient that leads to a higher peak.

If you do leap off toward a new hill, you can’t use A/B testing to compare where you land to where you were because most likely you’ll land somewhere below the peak of the new hill and quite possibly lower than where you were on the old hill. But if the new peak is actually higher than the old one, you’d do well to stay on the new hill and use A/B testing to climb to the higher local maximum. Once we’ve made the leap, science and experiments definitely come back into play but they’re of limited use in making the leap itself.
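The hill-climbing picture can be made concrete with a toy Python sketch. The one-dimensional `landscape` below is entirely invented: a small hill peaking at 50 and a taller one peaking at 100. The `hill_climb` function stands in for a series of A/B tests, each one keeping whichever neighboring variant scores better.

```python
def landscape(x: int) -> int:
    """Made-up product quality: a low hill peaking at x=20 (height 50)
    and a higher hill peaking at x=80 (height 100)."""
    return max(50 - abs(x - 20), 100 - 2 * abs(x - 80))

def hill_climb(quality, start: int, step: int = 1) -> int:
    """Repeatedly move to the better neighbor, like a chain of A/B tests.
    Stops as soon as neither neighbor is an improvement."""
    position = start
    while True:
        best_neighbor = max((position - step, position + step), key=quality)
        if quality(best_neighbor) <= quality(position):
            return position
        position = best_neighbor

if __name__ == "__main__":
    old_peak = hill_climb(landscape, start=10)
    print(old_peak, landscape(old_peak))    # stuck on the low hill

    landing = 50                            # a leap onto the other hill
    print(landing, landscape(landing))      # lower than the old peak

    new_peak = hill_climb(landscape, start=landing)
    print(new_peak, landscape(new_peak))    # the higher peak
```

Comparing the landing spot directly against the old peak would say the leap was a mistake. Only after climbing from the landing spot does the new hill turn out to be the better one.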

A/B testing is only as good as our A and B ideas. Another way to look at A/B testing is as a technique for comparing ideas. If neither the A nor B idea is very good, then A/B testing will just tell you which sucks less. Knowledge built with experiments will, we hope, help designers navigate the space of possible ideas, but science doesn’t say a lot about where those good ideas ultimately come from. Even in “real” science exactly how scientists come up with interesting questions and good hypotheses is a bit mysterious.

You can’t easily learn about people who aren’t using your product. The basis for statistical inference is the sample. We learn about a larger population by measuring the behavior of a random sample from that population. If we want to learn about the behavior of current Etsy users we’re all set: run an experiment with a random sample of users and draw our inferences. But what if we want to know whether a new feature will expand our user base? It’s harder (though not necessarily impossible) to get a random sample of people who aren’t currently using Etsy but who might be interested in some as-yet-undeveloped Etsy feature. Consequently, there’s a real risk of looking under the streetlamp (running experiments on our current users) because the light is better, rather than looking somewhere where we might actually find the keys to a hot new car.

Finally there’s one larger structural problem with the scientific method, captured by a famous paraphrase of something Max Planck, founder of quantum theory and the 1918 physics Nobel laureate, said: “Science progresses one funeral at a time.” In other words, it can be difficult to let go of hard-won scientific “truths”. After you’ve gone to all the trouble to ask an interesting question, devise a clever hypothesis, and design and run the painstaking experiments that convince everyone that your hypothesis is correct, you don’t really want to hear about how your hypothesis doesn’t fit with some new bit of data and is bound to be replaced by some new theoretical hotness. Or as Arthur C. Clarke put it in his First Law:

When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.

In product development, instead of distinguished but elderly scientists we have experienced product managers. But we have the same risk of holding on too long to hard-won “scientific” truths. If we do everything right, running experiments to test all our hypotheses about how the product works and even making occasional bold leaps into new territory to escape the trap of A/B incrementalism, eventually accumulated “scientific” knowledge may become a liability. And the danger is even worse in product development than it is in “real” science since we’re not dealing with universal and timeless truths. Indeed, each product you launch (to say nothing of what the rest of the world is doing) changes the landscape into which new products will be launched. What you learned yesterday may no longer apply tomorrow, in large part because of what you do today. Properly applied, science can lead us out of such difficulties but it’s worth noting that it is science, by allowing us to know things at all, that can mislead us into thinking we know a bit more than we really do.

So what to do?

Science is powerful and it’d be silly to abandon it entirely just because there are ways it can go wrong. And it’d be equally silly to think that science can somehow replace strong, talented product designers. The goal should be to combine the two.

The strength of the lone genius designer approach (cf. Apple) is that when you actually have a genius you get a lot of good ideas and can take giant leaps to products that would never have been arrived at by incremental revisions of anything already in the world. Furthermore, with a single designer imposing their vision on a product, it is more likely to have a coherent, holistic design. On the other hand, if you give your lone genius completely free rein then there is no protection from their occasional bad ideas.

The data-driven approach (cf. Google), on the other hand, is much stronger on detecting the actual quality of ideas: bad ideas are abandoned no matter who originally thought they were good and the occasional big wins are clearly identified. But, as noted above, it doesn’t have much to say about where those ideas come from. And it can fall into the trap of focusing on things that are easy to measure and test. Don’t know what shade of blue to use? A/B test 41 different shades. (An actual A/B test run at Google.)

Ultimately the goal is to make great products. Great ideas from designers are a necessary ingredient. And A/B testing can definitely improve products. But best is to use both: establish a loop between good design ideas leading to good experiments leading to knowledge about the product leading to even better design ideas. And then allow designers the latitude to occasionally try things that can’t yet be justified by science or even things that may go against current “scientific” dogma.

Category: data, philosophy


I’m leading an office book discussion on Thomas Kuhn’s *The Structure of Scientific Revolutions* in August and am really looking forward to pointing to your blog entry here, to show the relevance to my co-workers!

Didn’t science tell you that black background and white text makes reading this a painful experience for many? At least the mobile version. Trust me. Test it.

Hey Scientist, I’m with you! Though we haven’t done any experiments on this I think they have been done in the past. I believe there may be a new design in the works.

Yours in readability,

I found this passage very odd:

“You can’t easily learn about people who aren’t using your product. The basis for statistical inference is the sample. We learn about a larger population by measuring the behavior of a random sample from that population. If we want to learn about the behavior of current Etsy users we’re all set: run an experiment with a random sample of users and draw our inferences. But what if we want to know whether a new feature will expand our user base? It’s harder (though not necessarily impossible) to get a random sample of people who aren’t currently using Etsy but who might be interested in some as-yet-undeveloped Etsy feature.”

It’s really not that difficult. This sort of testing is accomplished through focus groups. It’s a matter of pulling together focus groups of, for example, people who shop online at Amazon; bringing them to a focus group facility, setting them up with computer terminals, letting them play with Etsy, and then doing the traditional focus group debrief afterwards. Then you’ll have data from their online behavior at the terminal as well as the opinions they share afterwards. There are a wide variety of research firms that specialize in this.

Perhaps this reliance on testing among the same closed population group – people who use Etsy at the current moment – is the reason many have noticed a creeping “sameness” to Etsy’s visuals, whether it’s the front page, the finds, or other pages. If you’re in essence asking the same group over and over again what they like, you’re going to keep getting the same answers. And you’ll never identify what you might do differently to attract people who aren’t already Etsy users, or bring back those who left.

Another thing I found odd about this article is the total absence of any thought to participation by and feedback from Etsy sellers. It is a glaring omission in the development process. Human creativity and a deep understanding of what it means to use the site to sell and buy are indispensable resources, and it appears Etsy does not recognize this.

The A/B testing described here can quantify clicks, sales, bounce rates – but it is insensitive to factors like how easy (or not) it is to list products for sale and other functions sellers regularly perform. A/B testing doesn’t capture the behavior of people like myself – as a seller, I spend a ridiculous amount of time on the Etsy site, but I have largely given up searching for supplies on Etsy because it is a disorganized mess.

The A/B tests are oblivious to the reasons shoppers give up buying on Etsy, and what it would take to bring them back. But a low tech approach – simply asking sellers what they think – would bring you a wealth of information.

We now know that when you take the time to literally ask people what they think (for example, when doctors ask patients how they feel and actually listen to the answers, or criminologists ask criminals why they committed a crime and actually listen to the answers), you uncover vital information that the scientific method misses.

Your stated mission is to make Etsy a vibrant, thriving platform for people to make a living making things. You’ll never fulfill Etsy’s potential if you fail to ask questions of the very people you are trying to help.

The difficulty of quantifying the answers should not be a barrier to seeking knowledge.

Nina, indeed. Note that the passage you quote is in a section “The limits of science”. In fact, I was trying to make the same point you make in your comment: A/B testing can’t tell you everything you really need to know to develop successful products, exactly because of the limitations mentioned in that section. However I was writing about how we use A/B testing at Etsy–not about our whole product development process.

Glad to hear I interpreted that passage correctly. My point speaks to your statements “You can’t easily learn about people who aren’t using your product… It’s harder (though not necessarily impossible) to get a random sample of people who aren’t currently using Etsy but who might be interested in some as-yet-undeveloped Etsy feature.”

It is not true that you can’t easily learn about people who aren’t using your product, nor is it terribly difficult to put together focus groups that, if not statistically valid, come close to approximating the composition of a particular sample.

You can’t get this information through the methodology you’ve described here — but you can get it. You can’t quantify the results with the same precision, but the depth of the information you gain is much greater.

It seems that Etsy does not test site functionality among visitors unfamiliar with Etsy. In a word, this is nuts. You are missing a huge piece of the puzzle.

Take the debacle with the Weddings category, when only 2,000 odd listings would load, and at the bottom of the screen the message “there are no more listings to show” appeared. People familiar with Etsy knew there were tens of thousands more listings under Weddings, and clicked around until they could be seen. Someone unfamiliar with the site would have assumed they’d seen it all and moved on. Mistakes like this would not happen if new visitors were part of the test paradigm.

So, the search team is the leader at Etsy of your data driven product development? And you’re pushing all the departments to follow their lead? You are aware, I hope, that the search on Etsy along with the inadequate category browsing is the number one complaint from buyers and sellers here. They were never good to start with and recent changes are causing buyers to give up and leave the site. I know your data doesn’t tell you this, because Admin posts in the forums say that. But you must be measuring the wrong data, or looking at it wrong, because it is becoming unusable for certain niche categories of items to be found or browsed. Users in the forums are screaming for some inclusion into the product design development and the result is that we are sent to pages like this. And this does not alleviate concerns, it does just the opposite.


Holy smokes this is a fantastic article. I’m glad I went digging through the archives. I really appreciate how the author defines the local maxima issue and its relevance to A/B testing vs. creative jumps. Very useful analogue.