The Surprising Power of Online Experiments

In Brief

The Problem

When building websites and applications, too many companies make decisions—on everything from new product features, to look and feel, to marketing campaigns—using subjective opinions rather than hard data.

The Solution

Companies should conduct online controlled experiments to evaluate their ideas. Potential improvements should be rigorously tested, because large investments can fail to deliver, and some tiny changes can be surprisingly detrimental while others have big payoffs.

Implementation

Leaders should understand how to properly design and execute A/B tests and other controlled experiments, ensure their integrity, interpret their results, and avoid pitfalls.

In 2012 a Microsoft employee working on Bing had an idea about changing the way the search engine displayed ad headlines. Developing it wouldn't require much effort—just a few days of an engineer's time—but it was one of hundreds of ideas proposed, and the program managers deemed it a low priority. Then it languished for more than six months, until an engineer, who saw that the cost of writing the code for it would be small, launched a simple online controlled experiment—an A/B test—to assess its impact. Within hours the new headline variation was producing abnormally high revenue, triggering a "too good to be true" alert. Usually, such alerts indicate a bug, but not in this case. An analysis showed that the change had increased revenue by an astonishing 12%—which on an annual basis would come to more than $100 million in the United States alone—without hurting key user-experience metrics. It was the best revenue-generating idea in Bing's history, but until the test its value was underappreciated.

This case illustrates how difficult it can be to assess the potential of new ideas. Just as important, it demonstrates the benefit of having a capability for running many tests cheaply and concurrently—something more businesses are starting to recognize.

Today, Microsoft and several other leading companies—including Amazon, Booking.com, Facebook, and Google—each conduct more than 10,000 online controlled experiments annually, with many tests engaging millions of users. Start-ups and companies without digital roots, such as Walmart, Hertz, and Singapore Airlines, also run them regularly, though on a smaller scale. These organizations have discovered that an "experiment with everything" approach has surprisingly large payoffs. It has helped Bing, for instance, identify dozens of revenue-related changes to make each month—improvements that have collectively increased revenue per search by 10% to 25% each year. These enhancements, along with hundreds of other changes per month that increase user satisfaction, are the major reason that Bing is profitable and that its share of U.S. searches conducted on personal computers has risen to 23%, up from 8% in 2009, the year it was launched.

At a time when the web is vital to almost all businesses, rigorous online experiments should be standard operating procedure. If a company develops the software infrastructure and organizational skills to conduct them, it will be able to assess not just ideas for websites but also potential business models, strategies, products, services, and marketing campaigns—all relatively inexpensively. Controlled experiments can transform decision making into a scientific, evidence-driven process—rather than an intuitive reaction. Without them, many breakthroughs might never happen, and many bad ideas would be implemented, only to fail, wasting resources.

Yet we have found that too many organizations, including some major digital enterprises, are haphazard in their approach to experimentation, don't know how to run rigorous scientific tests, or conduct far too few of them.

Together we've spent more than 35 years studying and practicing experiments and advising companies in a broad range of industries about them. In these pages we'll share the lessons we've gleaned about how to design and execute them, ensure their integrity, interpret their results, and address the challenges they're likely to pose. Though we'll focus on the simplest kind of controlled experiment, the A/B test, our findings and suggestions apply to more-complex experimental designs as well.

Appreciate the Value of A/B Tests

In an A/B test the experimenter sets up two experiences: "A," the control, is usually the current system and considered the "champion," and "B," the treatment, is a modification that attempts to improve something—the "challenger." Users are randomly assigned to the experiences, and key metrics are computed and compared. (Univariable A/B/C tests and A/B/C/D tests and multivariable tests, in contrast, assess more than one treatment or modifications of different variables at the same time.) Online, the modification could be a new feature, a change to the user interface (such as a new layout), a back-end change (such as an improvement to an algorithm that, say, recommends books at Amazon), or a different business model (such as an offer of free shipping). Whatever aspect of operations companies care most about—be it sales, repeat usage, click-through rates, or time users spend on a site—they can use online A/B tests to learn how to optimize it.
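
For readers who want to see the mechanics, here is a minimal sketch in Python of how such an assignment and comparison might work. It assumes users are identified by an ID and that hashing the ID together with an experiment name yields a stable, effectively random 50/50 split; the user IDs, click counts, and experiment name are hypothetical.

```python
import hashlib
from statistics import mean

def assign_variant(user_id: str, experiment: str) -> str:
    """Deterministically assign a user to control ("A") or treatment ("B").

    Hashing the user ID with an experiment-specific salt gives a stable,
    effectively random split: the same user always sees the same variant
    within a given experiment.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

# Hypothetical per-user values of a key metric (clicks), bucketed by variant.
metrics = {"A": [], "B": []}
for user_id, clicks in [("u1", 3), ("u2", 0), ("u3", 5), ("u4", 2)]:
    metrics[assign_variant(user_id, "new-ad-headline")].append(clicks)

# Compare the key metric between the champion (A) and the challenger (B).
for variant, values in metrics.items():
    if values:
        print(variant, mean(values))
```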

Any company that has at least a few thousand daily active users can conduct these tests. The ability to access large customer samples, to automatically collect huge amounts of data about user interactions on websites and apps, and to run concurrent experiments gives companies an unprecedented opportunity to evaluate many ideas quickly, with great precision, and at a negligible cost per incremental experiment. That allows organizations to iterate rapidly, fail fast, and pivot.

Recognizing these virtues, some leading tech companies have dedicated entire groups to building, managing, and improving an experimentation infrastructure that can be employed by many product teams. Such a capability can be an important competitive advantage—provided you know how to use it. Here's what managers need to understand:

Tiny changes can have a big impact.

People commonly assume that the greater an investment they make, the larger an impact they'll see. But things rarely work that way online, where success is more about getting many small changes right. Though the business world glorifies big, disruptive ideas, in reality most progress is achieved by implementing hundreds or thousands of minor improvements.

Putting credit card offers on the shopping cart page boosted profits by millions.

Consider the following example, once again from Microsoft. (While most of the examples in this article come from Microsoft, where Ron heads experimentation, they illustrate lessons drawn from many companies.) In 2008 an employee in the United Kingdom made a seemingly small suggestion: Have a new tab (or a new window in older browsers) automatically open whenever a user clicks on the Hotmail link on the MSN home page, instead of opening Hotmail in the same tab. A test was run with about 900,000 UK users, and the results were highly encouraging: The engagement of users who opened Hotmail increased by an impressive 8.9%, as measured by the number of clicks they made on the MSN home page. (Most changes to engagement have an effect smaller than 1%.) Nevertheless, the idea was controversial because few sites at the time were opening links in new tabs, so the change was released only in the UK.

In June 2010 the experiment was replicated with 2.7 million users in the United States, producing similar results, so the change was rolled out worldwide. Then, to see what effect the idea might have elsewhere, Microsoft explored the possibility of having people who initiated a search on MSN open the results in a new tab. In an experiment with more than 12 million users in the United States, clicks per user increased by 5%. Opening links in new tabs is one of the best ways to increase user engagement that Microsoft has ever introduced, and all it required was changing a few lines of code. Today many websites, including Facebook.com and Twitter.com, use this technique.

Microsoft's experience is hardly unique. Amazon's experiments, for example, revealed that moving credit card offers from its home page to the shopping cart page boosted profits by tens of millions of dollars annually. Clearly, small investments can yield large returns. Large investments, however, may have little or no payoff. Integrating Bing with social media—so that content from Facebook and Twitter opened in a third pane on the search results page—cost Microsoft more than $25 million to develop and produced negligible increases in engagement and revenue.

Experiments can guide investment decisions.

Online tests can help managers figure out how much investment in a potential improvement is optimal. This was a decision Microsoft faced when it was looking at reducing the time it took Bing to display search results. Of course, faster is better, but could the value of an improvement be quantified? Should there be three, 10, or perhaps 50 people working on that performance enhancement? To answer those questions, the company conducted a series of A/B tests in which artificial delays were added to study the effects of small differences in loading speed. The data showed that every 100-millisecond difference in performance had a 0.6% impact on revenue. With Bing's yearly revenue surpassing $3 billion, a 100-millisecond speedup is worth $18 million in annual incremental revenue—enough to fund a sizable team.
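
The back-of-the-envelope arithmetic behind those figures can be restated directly in code, using only the numbers quoted above.

```python
# Figures quoted in the article: every 100 ms of latency moves revenue by
# about 0.6%, and Bing's yearly revenue surpasses $3 billion.
annual_revenue = 3_000_000_000
revenue_impact_per_100ms = 0.006

speedup_ms = 100
incremental_revenue = annual_revenue * revenue_impact_per_100ms * (speedup_ms / 100)
print(f"${incremental_revenue:,.0f}")  # -> $18,000,000 per year
```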

The test results also helped Bing make important trade-offs, specifically about features that might improve the relevance of search results but slow the software's response time. Bing wanted to avoid a situation in which many small features cumulatively led to a significant degradation in performance. So the release of individual features that slowed the response by more than a few milliseconds was delayed until the team improved either their performance or the performance of another component.

Build a Large-Scale Capability

More than a century ago, the department store owner John Wanamaker reportedly coined the marketing adage "Half the money I spend on advertising is wasted; the trouble is that I don't know which half." We've found something similar to be true of new ideas: The vast majority of them fail in experiments, and even experts often misjudge which ones will pay off. At Google and Bing, only about 10% to 20% of experiments generate positive results. At Microsoft as a whole, one-third prove effective, one-third have neutral results, and one-third have negative results. All this goes to show that companies need to kiss a lot of frogs (that is, perform a massive number of experiments) to find a prince.

Any figure that looks interesting or different is usually wrong.

It's key to experiment with everything to make sure that changes are neither degrading the experience nor having unexpected effects. At Bing nearly 80% of proposed changes are first run as controlled experiments. (Some low-risk bug fixes and machine-level changes like operating system upgrades are excluded.)

Scientifically testing nearly every proposed idea requires an infrastructure: instrumentation (to record such things as clicks, mouse hovers, and event times), data pipelines, and data scientists. Several third-party tools and services make it easy to try experiments, but if you want to scale things up, you must tightly integrate the capability into your processes. That will drive down the cost of each experiment and increase its reliability. On the other hand, a lack of infrastructure will keep the marginal costs of testing high and could make senior managers reluctant to call for more experimentation.

Microsoft provides a good example of a substantial testing infrastructure—though a smaller enterprise or one whose business is not as dependent on experimentation could make do with less, of course. Microsoft's Analysis & Experimentation team consists of more than 80 people who on any given day help run hundreds of online controlled experiments on various products, including Bing, Cortana, Exchange, MSN, Office, Skype, Windows, and Xbox. Each experiment exposes hundreds of thousands—and sometimes even tens of millions—of users to a new feature or change. The team runs rigorous statistical analyses on all these tests, automatically generating scorecards that check hundreds to thousands of metrics and flag significant changes.

A company's experimentation personnel can be organized in three ways:

Centralized model.

In this approach a team of data scientists serves the entire company. The advantage is that they can focus on long-term projects, such as building better experimentation tools and developing more-advanced statistical algorithms. One major drawback is that the business units using the group may have different priorities, which could lead to conflicts over the allocation of resources and costs. Another con is that data scientists may feel like outsiders when dealing with the businesses and thus be less attuned to the units' goals and domain knowledge, which could make it harder for them to connect the dots and share relevant insights. Moreover, the data scientists may lack the clout to persuade senior management to invest in building the necessary tools or to get corporate and business unit managers to trust the experiments' results.

Decentralized model.

Another approach is distributing data scientists throughout the different business units. The benefit of this model is that the data scientists can become experts in each business domain. The chief disadvantage is the lack of a clear career path for these professionals, who also may not receive the peer feedback and mentoring that help them develop. And experiments in individual units may not have the critical mass to justify building the required tools.

Center-of-excellence model.

A third option is to have some data scientists in a centralized function and others within the different business units. (Microsoft uses this approach.) A center of excellence focuses mostly on the design, execution, and analysis of controlled experiments. It significantly lowers the time and resources those tasks require by building a companywide experimentation platform and related tools. It can also spread best testing practices throughout the organization by hosting classes, labs, and conferences. The main disadvantages are a lack of clarity about what the center of excellence owns and what the product teams own, who should pay for hiring more data scientists when various units increase their experiments, and who is responsible for investments in alerts and checks that indicate results aren't trustworthy.

There is no right or wrong model. Small companies typically start with the centralized model or use a third-party tool and then, after they've grown, switch to one of the other models. In companies with multiple businesses, managers who consider testing a priority may not want to wait until corporate leaders develop a coordinated organizational approach; in those cases, a decentralized model might make sense, at least in the beginning. And if online experimentation is a corporate priority, a company may want to build expertise and develop standards in a central unit before rolling them out in the business units.

Address the Definition of Success

Every business group must define a suitable (commonly composite) evaluation metric for experiments that aligns with its strategic goals. That might sound simple, but determining which short-term metrics are the best predictors of long-term outcomes is difficult. Many companies get it wrong. Getting it right—coming up with an overall evaluation criterion (OEC)—takes thoughtful consideration and often extensive internal debate. It requires close cooperation between senior executives who understand the strategy and data analysts who understand metrics and trade-offs. And it's not a onetime exercise: We recommend that the OEC be adjusted annually.

Arriving at an OEC isn't straightforward, as Bing's experience shows. Its fundamental long-term goals are increasing its share of search-engine queries and its advertising revenue. Interestingly, decreasing the relevance of search results will cause users to issue more queries (thus increasing query share) and click more on ads (thus increasing revenue). Obviously, such gains would only be short-lived, because people would eventually switch to other search engines. So which short-term metrics do predict long-term improvements to query share and revenue? In their discussion of the OEC, Bing's executives and data analysts decided that they wanted to minimize the number of user queries for each task or session and maximize the number of tasks or sessions that users conducted.
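
As an illustration only, the sketch below computes those two components, sessions per user and queries per session, from a handful of hypothetical log rows and combines them into a single composite score. Bing's actual OEC formula and weights are not public; the weights here are invented for the example.

```python
from collections import defaultdict

# Hypothetical log rows: (user_id, session_id, query).
log = [
    ("u1", "s1", "harry potter"), ("u1", "s1", "harry potter movies"),
    ("u1", "s2", "weather"), ("u2", "s3", "flights to rome"),
]

sessions_per_user = defaultdict(set)
queries_per_session = defaultdict(int)
for user, session, _query in log:
    sessions_per_user[user].add(session)
    queries_per_session[session] += 1

avg_sessions = sum(len(s) for s in sessions_per_user.values()) / len(sessions_per_user)
avg_queries = sum(queries_per_session.values()) / len(queries_per_session)

# Composite score: reward more sessions per user, penalize more queries per
# session. The 1.0 and 0.5 weights are illustrative assumptions, not Bing's.
oec = 1.0 * avg_sessions - 0.5 * avg_queries
print(round(oec, 2))
```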

It's also important to break down the components of an OEC and track them, since they typically provide insights into why an idea was successful. For example, if the number of clicks is integral to the OEC, it's critical to measure which parts of a page were clicked on. Looking at different metrics is crucial because it helps teams discover whether an experiment has an unanticipated impact on another area. For example, a team making a change to the related search queries shown (a search on, say, "Harry Potter" will show queries about Harry Potter books, Harry Potter movies, the casts of those movies, and so on) may not realize that it's altering the distribution of queries (by increasing searches for the related queries), which could affect revenue positively or negatively.

Over time the process of building and adjusting the OEC and understanding causes and effects becomes easier. By running experiments, debugging the results (which we will discuss in a bit), and interpreting them, companies will not only gain valuable experience with what metrics work best for certain types of tests but also develop new metrics. Over the years, Bing has created more than 6,000 metrics experimenters can use, which are grouped into templates by the area the tests involve (web search, image search, video search, changes to ads, and so on).

Beware of Low-Quality Data

It doesn't matter how good your evaluation criteria are if people don't trust experiments' results. Getting numbers is easy; getting numbers you can trust is hard! You need to allocate time and resources to validating the experimentation system and setting up automated checks and safeguards. One method is to run rigorous A/A tests—that is, test something against itself to ensure that about 95% of the time the system correctly identifies no statistically significant difference. This simple approach has helped Microsoft identify hundreds of invalid experiments and improper applications of formulas (such as using a formula that assumes all measurements are independent when they are not).
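
To see why roughly 95% is the right bar: if both groups are drawn from the same population, a correctly configured system should flag a "significant" difference only about 5% of the time at the conventional 0.05 threshold. The sketch below simulates A/A tests with a standard t-test to check that false-positive rate; the sample sizes and distribution are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

# A/A simulation: "control" and "treatment" come from the same distribution,
# so any statistically significant difference is a false positive. A healthy
# system should flag roughly 5% of runs at alpha = 0.05.
rng = np.random.default_rng(0)
alpha, runs, false_positives = 0.05, 1000, 0

for _ in range(runs):
    a = rng.normal(loc=10.0, scale=3.0, size=5000)  # same population...
    b = rng.normal(loc=10.0, scale=3.0, size=5000)  # ...for both groups
    _, p_value = stats.ttest_ind(a, b, equal_var=False)
    false_positives += p_value < alpha

print(f"False-positive rate: {false_positives / runs:.3f}")  # expect ~0.05
```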

We've learned that the best data scientists are skeptics and follow Twyman's law: Any figure that looks interesting or different is usually wrong. Surprising results should be replicated—both to make sure they're valid and to quell people's doubts. In 2013, for example, Bing ran a set of experiments with the colors of various text that appeared on its search results page, including titles, links, and captions. Though the color changes were subtle, the results were unexpectedly positive: They showed that users who saw slightly darker blues and greens in titles and a slightly lighter black in captions were successful in their searches a larger percentage of the time and that those who found what they wanted did so in significantly less time.

Since the color differences are barely perceptible, the results were understandably viewed with skepticism by multiple disciplines, including the design experts. (For years, Microsoft, like many other companies, had relied on expert designers—rather than the behavior of actual users—to define corporate style guides and colors.) So the experiment was rerun with a much larger sample of 32 million users, and the results were similar. Analysis indicated that when rolled out to all users, the color changes would increase revenue by more than $10 million annually.

If you want results to be trustworthy, you must ensure that high-quality data is used. Outliers may need to be excluded, collection errors identified, and so on. In the online world this issue is especially important, for several reasons. Take internet bots. At Bing more than 50% of requests come from bots. That data can skew results or add "noise," which makes it harder to detect statistical significance. Another problem is the prevalence of outlier data points. Amazon, for instance, discovered that certain individual users made massive book orders that could skew an entire A/B test; it turned out they were library accounts.

Managers should also beware when some segments experience much larger or smaller effects than others do (a phenomenon statisticians call "heterogeneous treatment effects"). In certain cases a single good or bad segment can skew the average enough to invalidate the overall results. This happened in one Microsoft experiment in which one segment, Internet Explorer 7 users, couldn't click on the results of Bing searches because of a JavaScript bug, and the overall results, which were otherwise positive, turned negative. An experimentation platform should find such unusual segments; if it doesn't, experimenters looking at an average effect may dismiss a good idea as a bad one.
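
One simple safeguard is to break the treatment effect down by segment instead of looking only at the average. The sketch below illustrates the idea with hypothetical rows; a real platform would also test whether each segment's difference is statistically significant before raising an alert.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (segment, variant, metric value per user).
rows = [
    ("chrome", "A", 2.0), ("chrome", "B", 2.4),
    ("ie7", "A", 1.8), ("ie7", "B", 0.0),  # e.g., clicks lost to a JS bug
]

by_segment = defaultdict(lambda: defaultdict(list))
for segment, variant, value in rows:
    by_segment[segment][variant].append(value)

# A large negative delta confined to one segment suggests a broken segment
# rather than a genuinely bad idea.
for segment, variants in by_segment.items():
    delta = mean(variants["B"]) - mean(variants["A"])
    print(segment, f"{delta:+.2f}")
```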

Results may also be biased if companies reuse control and treatment populations from one experiment to another. That practice leads to "carryover effects," in which people's experience in an experiment alters their future behavior. To avoid this phenomenon, companies should "shuffle" users between experiments.
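
One common way to achieve that reshuffling, already visible in the assignment sketch earlier, is to include an experiment-specific name or salt in the hash, so the same user lands in an independently drawn bucket from one experiment to the next. The snippet below restates the idea; the user ID and experiment names are hypothetical.

```python
import hashlib

def bucket(user_id: str, experiment: str) -> str:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

# Because the experiment name is part of the hash, the same user is
# re-randomized for each new experiment, so an earlier control/treatment
# assignment cannot carry over into the next test.
print(bucket("u42", "colors-2013"), bucket("u42", "ad-size-2014"))
```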

Another common check Microsoft's experimentation platform performs is validating that the percentages of users in the control and treatment groups in the actual experiment match the experimental design. When these differ, there is a "sample ratio mismatch," which often voids the results. For example, a ratio of 50.2/49.8 (821,588 versus 815,482 users) diverges enough from an expected 50/50 ratio that the probability that it happened by chance is less than 1 in 500,000. Such mismatches occur regularly (usually weekly), and teams need to be diligent in understanding why and resolving them.
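
The alarm rests on a standard goodness-of-fit test against the designed split. The sketch below applies a chi-square test to the counts quoted above; Microsoft's platform may compute the probability differently, but the order of magnitude comes out the same.

```python
from scipy import stats

# Sample ratio mismatch check: under a 50/50 design, the observed split of
# 821,588 vs. 815,482 users is tested against equal expected counts.
observed = [821_588, 815_482]
chi2, p_value = stats.chisquare(observed)  # expected counts default to equal
print(f"chi2 = {chi2:.1f}, p = {p_value:.2e}")
# p is on the order of 1e-6, i.e., less than 1 in 500,000, so the mismatch is
# very unlikely to be chance and the experiment's results should be audited.
```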

Avoid Assumptions About Causality

Because of the hype over big data, some executives mistakenly believe that causality isn't important. In their minds all they need to do is establish correlation, and causality can be inferred. Wrong!

The following two examples illustrate why—and also highlight the shortcomings of experiments that lack control groups. The first concerns two teams that conducted separate observational studies of two advanced features for Microsoft Office. Each concluded that the new feature it was assessing reduced attrition. In fact, nearly any advanced feature will show such a correlation, because people who will try an advanced feature tend to be heavy users, and heavy users tend to have lower attrition. So while a new advanced feature might be correlated with lower attrition, it doesn't necessarily cause it. Office users who get error messages also have lower attrition, because they too tend to be heavy users. But does that mean that showing users more error messages will reduce attrition? Hardly.

The second case concerns a study Yahoo did to assess whether display ads for a brand, shown on Yahoo sites, could increase searches for the brand name or related keywords. The observational part of the study estimated that the ads increased the number of searches by 871% to 1,198%. But when Yahoo ran a controlled experiment, the increase was only 5.4%. If not for the control, the company might have concluded that the ads had a huge impact and wouldn't have realized that the increase in searches was due to other variables that changed during the observation period.

Some executives believe that all they need to do is establish correlation. Wrong!

Clearly, observational studies cannot establish causality. This is well known in medicine, which is why the U.S. Food and Drug Administration mandates that companies conduct randomized clinical trials to prove that their drugs are safe and effective.

Including too many variables in tests also makes it hard to learn about causality. With such tests it's hard to disentangle results and interpret them. Ideally, an experiment should be simple enough that cause-and-effect relationships can be easily understood. Another downside of complex designs is that they make experiments much more vulnerable to bugs. If a new feature has a 10% chance of triggering an egregious problem that requires aborting its test, then the probability that a change involving seven new features will have a fatal issue is more than 50%.
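
The arithmetic is straightforward: the chance that at least one of several independent features fails is one minus the chance that all of them are fine.

```python
# If each of seven new features independently has a 10% chance of a fatal
# bug, the chance that at least one forces the test to be aborted is:
p_bug, features = 0.10, 7
p_abort = 1 - (1 - p_bug) ** features
print(f"{p_abort:.0%}")  # ~52%, i.e., more than 50%
```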

What if you can determine that one thing causes another, but you don't know why? Should you try to understand the causal mechanism? The short answer is yes.

Between 1500 and 1800, about two million sailors died of scurvy. Today we know that scurvy is caused by a lack of vitamin C in the diet, which sailors experienced because they didn't have adequate supplies of fruit on long voyages. In 1747, Dr. James Lind, a surgeon in the Royal Navy, decided to do an experiment to test six possible cures. On one voyage he gave some sailors oranges and lemons, and others alternative remedies like vinegar. The experiment showed that citrus fruits could prevent scurvy, though no one knew why. Lind mistakenly believed that the acidity of the fruit was the cure and tried to create a less-perishable remedy by heating the citrus juice into a concentrate, which destroyed the vitamin C. It wasn't until 50 years later, when unheated lemon juice was added to sailors' daily rations, that the Royal Navy finally eliminated scurvy among its crews. Presumably, the cure could have come much earlier and saved many lives if Lind had run a controlled experiment with heated and unheated lemon juice.

That said, we should point out that you don't always have to know the "why" or the "how" to benefit from knowledge of the "what." This is especially true when it comes to the behavior of users, whose motivations can be difficult to determine. At Bing some of the biggest breakthroughs were made without an underlying theory. For example, even though Bing was able to improve the user experience with those subtle changes in the colors of the type, there are no well-established theories about color that could help it understand why. Here the evidence took the place of theory.

Conclusion

The online world is often viewed as turbulent and full of peril, but controlled experiments can help us navigate it. They can point us in the right direction when answers aren't obvious or people have conflicting opinions or are uncertain about the value of an idea.

Several years ago, Bing was debating whether to make ads larger so that advertisers could include links to specific landing pages in them. (For example, a loan company might provide links like "compare rates" and "about the company" instead of only one to a home page.) A downside was that larger ads obviously would take up more screen real estate, which is known to increase user dissatisfaction and churn. The people considering the idea were split. So the Bing team experimented with increasing the ads' size while keeping the overall screen space allotted for ads constant, which meant showing fewer of them. The result was that showing fewer but larger ads led to a big improvement: Revenue increased by more than $50 million annually without hurting the key aspects of the user experience.

If you really want to understand the value of an experiment, look at the difference between its expected outcome and its actual result. If you thought something was going to happen and it happened, then you haven't learned much. If you thought something was going to happen and it didn't, then you've learned something important. And if you thought something minor was going to happen, and the results are a major surprise and lead to a breakthrough, you've learned something highly valuable.

By combining the power of software with the scientific rigor of controlled experiments, your company can create a learning lab. The returns you reap—in cost savings, new revenue, and improved user experience—can be huge. If you want to gain a competitive advantage, your firm should build an experimentation capability and master the science of conducting online tests.

A version of this article appeared in the September–October 2017 issue (pp. 74–82) of Harvard Business Review.

Source: https://hbr.org/2017/09/the-surprising-power-of-online-experiments
