
Experiments at Airbnb
Airbnb [1] is an online two-sided marketplace that matches people who rent out their
homes (hosts) with people who are looking for a place to stay (guests). We use
controlled experiments to learn and make decisions at every step of product
development, from design to algorithms. They are equally important in shaping the
user experience.
While the basic principles behind controlled experiments are relatively straightforward,
using experiments in a complex online ecosystem like Airbnb during fast-paced product
development can lead to a number of common pitfalls. Some, like stopping an
experiment too soon, are relevant to most experiments. Others, like the issue of
introducing bias on a marketplace level, start becoming relevant for a more specialized
application like Airbnb. We hope that by sharing the pitfalls we've experienced and
learned to avoid, we can help you to design and conduct better, more reliable
experiments for your own application.
Why experiments?
Experiments provide a clean and simple way to make causal inference. It's often
surprisingly hard to tell the impact of something you do by simply doing it and seeing
what happens, as illustrated in Figure 1.
Figure 1 [2]: It's hard to tell the effect of this product launch.
The outside world often has a much larger effect on metrics than product changes do.
Users can behave very differently depending on the day of week, the time of year, the
weather (especially in the case of a travel company like Airbnb), or whether they
learned about the website through an online ad or found the site organically.
Controlled experiments isolate the impact of the product change while controlling for
the aforementioned external factors. In Figure 2, you can see an example of a new
feature that we tested and rejected this way. We thought of a new way to select what
prices you want to see on the search page, but users ended up engaging less with it
than the old filter, so we did not launch it.
Figure 2 [4]: Example of a new feature that we tested and rejected.
When you test a single change like this, the methodology is often called A/B testing or
split testing. This post will not go into the basics of how to run an A/B test. There are a number of companies that provide out-of-the-box solutions for running basic A/B tests, and a couple of bigger tech companies have open sourced their internal systems for others to use. See Cloudera's Gertrude [5], Etsy's Feature [6], and Facebook's PlanOut [7], for example.
The case of Airbnb
At Airbnb we have built our own A/B testing framework to run experiments; you will be able to read more about the details of its implementation in an upcoming blog post. There are a couple of features of our business that make experimentation more involved than a typical button-color change, and that's why we decided to create our own testing framework.
First, users can browse when not logged in or signed up, making it more difficult to tie
a user to actions. People often switch devices (between web and mobile) in the midst of
booking. Also, given that bookings can take a few days to confirm, we need to wait for those results. Finally, successful bookings often depend on available inventory and the responsiveness of hosts, factors that are out of our control.
Our booking flow is also complex. First, a visitor has to make a search. The next step is for the searcher to actually contact a host about a listing. Then the host has to accept the inquiry, and finally the guest has to book the place. In addition, we have multiple flows that can lead to a booking: a guest can instantly book some listings without a contact, and can also make a booking request that goes straight to a booking. This four-step flow is visualized in Figure 3. We look at the process of going through these four stages, but the overall conversion rate from searching to booking is our main metric.
Figure 3 [8]: Example of an experiment result broken down by booking flow steps.
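To make the funnel concrete, here is a minimal sketch in Python of how per-step conversion and the overall search-to-booking rate can be computed; the stage counts are invented for illustration, not Airbnb's actual figures.

    # Hypothetical stage counts for the four-step funnel described above.
    funnel = [
        ("search", 100_000),
        ("contact", 12_000),
        ("accept", 6_000),
        ("book", 3_000),
    ]

    # Conversion from each stage to the next.
    for (stage, n), (next_stage, n_next) in zip(funnel, funnel[1:]):
        print(f"{stage} -> {next_stage}: {n_next / n:.1%}")

    # The main metric: overall conversion from searching to booking.
    print(f"search -> book overall: {funnel[-1][1] / funnel[0][1]:.1%}")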
How long do you need to run an experiment?
A very common source of confusion in online controlled experiments is how much time
you need to make a conclusion about the results of an experiment. The problem with
the naive method of using the p-value as a stopping criterion is that the statistical
test that gives you a p-value assumes that you designed the experiment with a sample
and effect size in mind. If you continuously monitor the development of a test and the
resulting p-value, you are very likely to see an effect, even if there is none. Another
common error is to stop an experiment too early, before an effect becomes visible.
Here is an example of an actual experiment we ran. We tested changing the maximum value of the price filter on the search page from $300 to $1,000, as displayed below.
Figure 4 [10]: Example experiment testing the maximum value of the price filter.
In Figure 5 we show the development of the experiment over time. The top graph shows the treatment effect (Treatment / Control - 1) and the bottom graph shows the p-value over time. As you can see, the p-value curve hits the commonly used significance level of 0.05 after 7 days, at which point the effect size is 4%. If we had stopped there, we would have concluded that the treatment had a strong and significant effect on the likelihood of booking. But we kept the experiment running and found that, actually, the experiment ended up neutral. The final effect size was practically null, with the p-value indicating that whatever remaining effect size there was should be regarded as noise.
Figure 5 [12]: Result of the price filter experiment over time.
How did we know not to stop when the p-value hit 0.05? It turns out that this pattern of hitting significance early and then converging back to a neutral result is actually quite common in our system. There are various reasons for this. Users often take a long time to book, so the early converters have a disproportionately large influence at the beginning of the experiment. Also, even small sample sizes in online experiments are massive by the standards of the classical statistics in which these methods were developed. Since the statistical test is a function of the sample and effect sizes, if an early effect size is large through natural variation, it is likely for the p-value to drop below 0.05 early on. But the most important reason is that you are performing a statistical test every time you compute a p-value, and the more you do it, the more likely you are to find an effect.
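To see why repeated peeking inflates false positives, here is a rough simulation sketch; the traffic levels and baseline rate are invented, not Airbnb's. It runs many A/B tests with no true effect and compares a single fixed-horizon test against stopping the first time the p-value dips below 0.05.

    # Simulate A/B tests with NO true difference and compare false-positive rates:
    # a single test at the end of the experiment vs. checking the p-value daily
    # and declaring a result as soon as it drops below 0.05.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    n_sims, days, users_per_day, p_base = 2000, 30, 500, 0.05

    def two_prop_pvalue(x1, n1, x2, n2):
        # Standard two-proportion z-test.
        p_pool = (x1 + x2) / (n1 + n2)
        se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
        return 1.0 if se == 0 else 2 * norm.sf(abs(x1 / n1 - x2 / n2) / se)

    fixed_hits = peeking_hits = 0
    for _ in range(n_sims):
        c = t = n = 0
        stopped = False
        for _ in range(days):
            c += rng.binomial(users_per_day, p_base)
            t += rng.binomial(users_per_day, p_base)
            n += users_per_day
            stopped = stopped or two_prop_pvalue(t, n, c, n) < 0.05
        peeking_hits += stopped
        fixed_hits += two_prop_pvalue(t, n, c, n) < 0.05

    print(f"false positives, single test at the end: {fixed_hits / n_sims:.1%}")    # around 5%
    print(f"false positives, peeking every day:      {peeking_hits / n_sims:.1%}")  # much higher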
As a side note, people familiar with our website might notice that, at the time of writing,
we did in fact launch the increased max price filter, even though the result was neutral.
We found that certain users like the ability to search for high-end places and decided to
accommodate them, given there was no dip in the metrics.
How long should experiments run for, then? To prevent a false negative (a Type II error), the best practice is to determine the minimum effect size that you care about and to compute, before you start the experiment, how long to run it for, based on the sample size (the number of new samples that come in every day) and the certainty you want. Here [14] is a resource that helps with that computation. Setting the time in advance also minimizes the likelihood of finding a result where there is none.
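As a rough sketch of that computation, the standard two-proportion power formula can be wrapped in a few lines; the baseline rate, minimum detectable effect, and daily traffic below are made-up inputs, not Airbnb's.

    # Required experiment duration from a power calculation: the standard
    # two-proportion sample-size formula, divided by daily traffic per group.
    from math import ceil
    from statistics import NormalDist

    def required_days(p_base, min_rel_effect, daily_users_per_group,
                      alpha=0.05, power=0.80):
        p_alt = p_base * (1 + min_rel_effect)
        z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
        z_beta = NormalDist().inv_cdf(power)            # desired power
        variance = p_base * (1 - p_base) + p_alt * (1 - p_alt)
        n_per_group = (z_alpha + z_beta) ** 2 * variance / (p_alt - p_base) ** 2
        return ceil(n_per_group / daily_users_per_group)

    # e.g. 5% baseline booking rate, 5% relative lift, 2,000 new users per group per day
    print(required_days(p_base=0.05, min_rel_effect=0.05, daily_users_per_group=2000))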
One problem, though, is that we often don't have a good idea of the size, or even the direction, of the treatment effect. It could be that a change is actually hugely successful and major profits are being lost by not launching the winning variant sooner. Or, on the other hand, sometimes an experiment introduces a bug, which makes it much better to stop the experiment early before more users are alienated.
The moment when an experiment "dabbles" in the otherwise-significant region can be an interesting one, even when the pre-allotted time has not passed yet. In the case of the price filter experiment, you can see that when significance was first reached, the graph clearly did not look like it had converged yet. We have found this heuristic very helpful in judging whether or not a result looks stable: it is important to inspect the development of the relevant metrics over time, rather than to consider only a single effect size with a p-value.
We can use this insight to be a bit more formal about when to stop an experiment before the allotted time. This can be useful if you want to make an automated judgment call on whether the change you're testing is performing particularly well or particularly poorly, which is helpful when you're running many experiments at the same time and cannot manually inspect them all systematically. The intuition is that you should be more skeptical of early results, so the threshold under which to call a result is very low at the beginning. As more data comes in, you can increase the threshold, because the likelihood of finding a false positive is much lower later in the game.
We solved the problem of figuring out the p-value threshold at which to stop an experiment by running simulations and deriving a curve that gives us a dynamic (in time) p-value threshold for determining whether an early result is worth investigating. We wrote code to simulate our ecosystem and used it to run many simulations with varying values for parameters like the real effect size, the variance, and different levels of certainty. This gives us an indication of how likely we are to see false positives or false negatives, and also of how far off the estimated effect size is in the case of a true positive. In Figure 6 we show an example decision boundary.
Figure 6 [15]: An example of a dynamic p-value curve.
It should be noted that this curve is very particular to our system and the parameters that we used for this experiment. We share the graph as an example of the kind of curve you can derive for your own analysis.
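The curve in Figure 6 comes from Airbnb's own simulations, so the sketch below is only a generic illustration of the approach, not their implementation: simulate many experiments with zero true effect, pick a family of thresholds that starts strict and loosens over time, and scale it until the chance of a false early stop stays near a target rate. All parameters here are assumptions.

    # Calibrate a time-varying early-stopping threshold by simulation.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    n_sims, days, target_fp = 5000, 30, 0.05

    # Under the null, the sequentially monitored z-statistic behaves like scaled
    # Brownian motion: cumulative N(0,1) increments divided by sqrt(elapsed days).
    z = np.cumsum(rng.normal(size=(n_sims, days)), axis=1) / np.sqrt(np.arange(1, days + 1))
    pvals = 2 * norm.sf(np.abs(z))   # daily p-values for each simulated null experiment

    def false_stop_rate(thresholds):
        # Fraction of null experiments that would ever cross the threshold curve.
        return np.mean((pvals < thresholds[None, :]).any(axis=1))

    # Threshold family: very strict early, relaxing linearly over time; the scale
    # factor is tuned so the overall false-positive rate stays near the target.
    base_curve = 0.05 * np.arange(1, days + 1) / days
    scale = 1.0
    while false_stop_rate(scale * base_curve) > target_fp and scale > 1e-3:
        scale *= 0.9
    thresholds = scale * base_curve

    for d in (1, 7, 14, 30):
        print(f"day {d:2d}: flag an early result only if p < {thresholds[d - 1]:.4f}")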
Understanding results in context
A second pitfall is failing to understand results in their full context. In general, it is
good practice to evaluate the success of an experiment based on a single metric of
interest. This is to prevent cherry-picking of significant results in the midst of a sea of
neutral ones. However, by just looking at a single metric you lose a lot of context that
could inform your understanding of the effects of an experiment.
Let's go through an example. Last year we embarked on a journey to redesign our
search page. Search is a fundamental component of the Airbnb ecosystem. It is the
main interface to our inventory and the most common way for users to engage with our
website. So, it was important for us to get it right. In Figure 7 you can see the before and
after stages of the project. The new design puts more emphasis on pictures of the
listings (one of our assets since we offer professional photography to our hosts) and the
map that displays where listings are located. You can read about the design and implementation process in another blog post here [17].
Figure 7 [18]: Before and after a full redesign of the search page.
A lot of work went into the project, and we all thought it was clearly better; our users
agreed in qualitative user studies. Despite this, we wanted to evaluate the new design
quantitatively with an experiment. This can be hard to argue for, especially when testing a big new product like this; it can feel like a missed marketing opportunity if we don't launch to everyone at the same time. However, in keeping with the spirit of our testing culture, we did test the new design to measure the actual impact and, more importantly, to gather knowledge about which aspects did and didn't work.
After waiting for enough time to pass, as calculated with the methodology described in
the previous section, we ended up with a neutral result. The change in the global
metric was tiny and the p-value indicated that it was basically a null effect. However,
we decided to look into the context and break down the result to see if we could figure out why this was the case. When we did, we found that the new
design was actually performing fine in most cases, except for Internet Explorer. We then
realized that the new design broke an important click-through action for certain older
versions of IE, which obviously had a big negative impact on the overall results. When
we fixed this, IE displayed similar results to the other browsers, a boost of more than
2%.
Figure 8 [20]: Results of the new search design.
Apart from teaching us to pay more attention to QA for IE, this was a good example of
what lessons you can learn about the impact of your change in different contexts. You
can break results down by many factors like browser, country and user type. It should be
noted that doing this within the classic A/B testing framework requires some care. If you test breakdowns individually as if they were independent, you run a big risk of finding effects where there aren't any, just like in the example of continuously monitoring the effect in the previous section. It's very common to look at a neutral experiment, break it down many ways, and find a single significant effect. Declaring victory for that particular group is likely to be incorrect. The reason is that you are performing multiple tests under the assumption that they are all independent, which they are not. One way of dealing with this problem is to decrease the p-value threshold at which you decide the effect is real; read more about this approach here [22]. Another way is to model the effects on all breakdowns directly with a more advanced method like logistic regression.
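One concrete way to "decrease the p-value" across breakdowns is a standard multiple-comparison correction such as Holm-Bonferroni. The sketch below uses invented segment names and p-values; it shows how a single nominally significant segment in an otherwise neutral experiment usually does not survive the adjustment.

    # Holm-Bonferroni adjustment for per-segment p-values from one experiment.
    segment_pvalues = {
        "Chrome": 0.60, "Firefox": 0.45, "Safari": 0.20,
        "IE": 0.03, "mobile web": 0.70, "new users": 0.04,
    }

    def holm_significant(pvalues, alpha=0.05):
        # Compare the k-th smallest p-value (0-indexed) against alpha / (m - k);
        # stop at the first segment that fails the comparison.
        m = len(pvalues)
        winners = {}
        for k, (name, p) in enumerate(sorted(pvalues.items(), key=lambda kv: kv[1])):
            if p > alpha / (m - k):
                break
            winners[name] = p
        return winners

    print(holm_significant(segment_pvalues))  # {}: no segment survives the correction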
Assuming the system works
The third and final pitfall is assuming that the system works the way you think or hope it does. This should be a concern whether you build your own system to evaluate experiments or you use a third-party tool. In either case, it's possible that what the system tells you does not reflect reality, either because it's faulty or because you're not using it correctly. One way to evaluate the system, and your interpretation of it, is to formulate hypotheses and then verify them.
Figure 9 [23]: Results of an example dummy experiment.
Another way of looking at this is the observation that results that seem too good to be true have a
higher likelihood of being false. When you encounter results like this, it is good
practice to be skeptical of them and scrutinize them in whatever way you can think of,
before you consider them to be accurate.
A simple example of this process is to run an experiment where the treatment is equal
to the control. These are called A/A or dummy experiments. In a perfect world the
system would return a neutral result (most of the time). What does your system return?
We ran many experiments like this (see an example run in Figure 9) and identified a
number of issues within our own system as a result. In one case, we ran a number of
dummy experiments with varying sizes of control and treatment groups. A number of
them were evenly split, for example with a 50% control and a 50% treatment group
(where everybody saw exactly the same website). We also added cases like a 75% control
and a 25% treatment group. The results that we saw for these dummy experiments are
displayed in Figure 10.
Figure 10 [25]: Results of a number of dummy experiments.
You can see that in the experiments where the control and treatment groups are the
same size, the results look neutral, as expected (it's a dummy experiment, so the treatment is actually the same as the control). But in the cases where the group sizes are different, there is a massive bias against the treatment group.
We investigated why this was the case, and uncovered a serious issue with the way we assigned logged-out visitors to treatment groups. The issue is particular to
our system, but the general point is that verifying that the system works the way you
think it does is worthwhile and will probably lead to useful insights.
One thing to keep in mind when you run dummy experiments is that you should expect some results to come out as non-neutral, simply because of the way the p-value works. For example, if you run a dummy experiment and look at its performance broken down by 100 different countries, you should expect, on average, 5 of them to give you a non-neutral result. Keep this in mind when you're scrutinizing a third-party tool!
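A quick back-of-the-envelope check of that expectation, under the assumption of 100 independent segments and a 0.05 threshold:

    # With 100 independent breakdowns of a dummy experiment at alpha = 0.05,
    # the number of "significant" segments is roughly Binomial(100, 0.05).
    from math import comb

    n_segments, alpha = 100, 0.05

    def binom_pmf(k, n, p):
        return comb(n, k) * p**k * (1 - p) ** (n - k)

    expected = n_segments * alpha
    prob_10_or_more = sum(binom_pmf(k, n_segments, alpha) for k in range(10, n_segments + 1))
    print(f"expected false positives: {expected:.0f}")
    print(f"chance of 10 or more by luck alone: {prob_10_or_more:.1%}")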
Conclusion
Controlled experiments are a great way to inform decisions around product
development. Hopefully, the lessons in this post will help prevent some common A/B
testing errors.
First, the best way to determine how long you should run an experiment is to compute, in advance, the sample size you need to make an inference. Second, if the system gives you an early result, you can try to make a heuristic judgment on whether or not the trends have converged; it's generally good to be conservative in this scenario. Finally, if you do need to make procedural launch and stopping decisions, it's good to be extra careful by employing a dynamic p-value threshold to determine how certain you can be about a result. The system we use at Airbnb to evaluate experiments employs all three ideas to help us with our decision-making around product changes.
It is important to consider results in context. Break them down into meaningful cohorts
and try to deeply understand the impact of the change you made. In general,
experiments should be run to make good decisions about how to improve the product,
rather than to aggressively optimize for a metric. Optimizing is not impossible, but it
often leads to opportunistic decisions for short-term gains. By focusing on learning
about the product you set yourself up for better future decisions and more effective
tests.
Finally, it is good to be scientific about your relationship with the reporting system. If something doesn't seem right, or if it seems too good to be true, investigate it. A simple way of doing this is to run dummy experiments, but any knowledge about how the system behaves is useful for interpreting results. At Airbnb we have found a number of bugs and counter-intuitive behaviors in our system by doing this.
Together with Will Moss, I gave a public talk on this topic in April 2014. You can watch a video recording of it here [27]. We hope this post was insightful for those who want to improve their own experimentation.
1. http://www.airbnb.com/
2. http://nerds.airbnb.com/wp-content/uploads/2014/05/img1_launch.png
3. http://nerds.airbnb.com/wp-content/uploads/2014/05/img1_launch.png
4. http://nerds.airbnb.com/wp-content/uploads/2014/05/img2_price.png
5. https://github.com/cloudera/gertrude
6. https://github.com/etsy/feature
7. http://facebook.github.io/planout/
8. http://nerds.airbnb.com/wp-content/uploads/2014/05/img3_flow.png
9. http://nerds.airbnb.com/wp-content/uploads/2014/05/img3_flow.png
10. http://nerds.airbnb.com/wp-content/uploads/2014/05/img4_max_price.png
11. http://nerds.airbnb.com/wp-content/uploads/2014/05/img4_max_price.png
12. http://nerds.airbnb.com/wp-content/uploads/2014/05/img5_max_price_results.png
13. http://nerds.airbnb.com/wp-content/uploads/2014/05/img5_max_price_results.png
14. http://www.evanmiller.org/ab-testing/sample-size.html
15. http://nerds.airbnb.com/wp-content/uploads/2014/05/img6_dynamic_p.png
16. http://nerds.airbnb.com/wp-content/uploads/2014/05/img6_dynamic_p.png
17. http://nerds.airbnb.com/redesigning-search/
18. http://nerds.airbnb.com/wp-content/uploads/2014/05/img7_magellan.png
19. http://nerds.airbnb.com/wp-content/uploads/2014/05/img7_magellan.png
20. http://nerds.airbnb.com/wp-content/uploads/2014/05/img8_magellan_results.png
21. http://nerds.airbnb.com/wp-content/uploads/2014/05/img8_magellan_results.png
22. http://www.evanmiller.org/how-not-to-run-an-ab-test.html
23. http://nerds.airbnb.com/wp-content/uploads/2014/05/img9_dummy.png
24. http://nerds.airbnb.com/wp-content/uploads/2014/05/img9_dummy.png
25. http://nerds.airbnb.com/wp-content/uploads/2014/05/img9a_dummy_results.png
26. http://nerds.airbnb.com/wp-content/uploads/2014/05/img9a_dummy_results.png
27. https://www.youtube.com/watch?v=lVTIcf6IhY4
