Randomized experiments have taken the fields of policy evaluation and empirical economics by storm. Hailed by their adherents as bringing about a new era of “evidence-based policymaking,” and a “credibility revolution” in economics and related fields, they have changed the world of policy research for good.
The standard way of doing experiments
In their simplest and most common form, randomized experiments work as follows. Suppose you are interested in a policy that is administered on an individual level, such as some training program, cash transfer, or health intervention. Your randomized experiment to test this policy would run like this: You first find a sample of individuals, where the sample is hopefully representative of the population of interest. Then you randomly select half of the individuals in this sample to be in the treatment group, and the other half to be in the control group. The policy is administered, and after the experiment you collect data on outcomes of interest—employment, for example. The policy is then declared a success if (1) the average employment in the treated group is higher than the average employment in the control group, and (2) the difference is “statistically significant.”
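The standard procedure above can be sketched in a few lines of simulation code. This is an illustrative sketch only: the sample size, employment rates, and treatment effect are hypothetical, and the test shown is a simple one-sided z-test for a difference in proportions.

```python
# A minimal sketch of the standard two-group experiment described above.
# All numbers (sample size, employment rates) are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n = 500  # participants per group (hypothetical)

# Simulate binary employment outcomes: suppose the treatment raises
# the employment rate from 40% to 48% (assumed for illustration).
control = rng.binomial(1, 0.40, size=n)
treated = rng.binomial(1, 0.48, size=n)

diff = treated.mean() - control.mean()

# One-sided two-sample z-test for a difference in proportions.
pooled = np.concatenate([treated, control]).mean()
se = np.sqrt(pooled * (1 - pooled) * (2 / n))
z = diff / se

# Declare the policy a success if the difference is positive
# and statistically significant at the 5% level.
success = diff > 0 and z > 1.645
print(f"difference: {diff:.3f}, z: {z:.2f}, success: {success}")
```

The success criterion in the last step mirrors conditions (1) and (2) above: a positive difference in means, plus statistical significance.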
But is this the best way to go about things? Can we maybe make better policy choices, with smaller experimental budgets, by doing things a little differently? This is the question that Anja Sautmann and I address in our new paper on “Adaptive experiments for policy choice.” A preliminary draft of this work is available here.
The goal of many experiments is policy choice
The ultimate goal of many experiments is to gather information to inform a policy choice. For example, we are currently evaluating a variety of interventions—providing information sessions, psychological counseling, or financial support—to help Syrian refugees in Jordan find jobs in the formal labor market.1 The goal of this project—like the goal of many experiments in policy, medicine, or business—is to understand which of these interventions is the most helpful.
Trying to identify the best policy is different from estimating the precise impact of every individual policy: as long as we can identify the best policy, we do not care about the precise impacts of inferior policies. Yet most experiments follow protocols that are designed to estimate the impact of every policy, even the obviously inferior ones.
If we wanted to understand the impact of all potential treatments, the best thing to do would be to split participants equally across the treatment groups. This is what traditional experimental design recommends, because it yields a precise effect estimate for every treatment. However, suppose, in the context of our experiment on finding jobs for refugees, that providing information is an ineffective policy. In this case, splitting the sample equally at the outset will lead us to “waste” sample size learning about the precise impact of a treatment that is clearly suboptimal: we would have preferred to put more effort into running a horse race between the other policies, providing counseling vs. providing financial resources.
A better design for policy choice
In our paper, we propose a framework that does a better job of identifying the best policy. The key to our proposal is staging: rather than running the experiment all at once, we propose that researchers start by running a first round of the experiment with a smaller number of participants. Based on this first round, you will be able to identify which treatments are clearly not likely to be the best. You can then go on to run another round of the experiment where you focus attention on those treatments that performed well in the first round. This way you will end up with a lot more observations to distinguish between the best performing treatments. And if you can run the experiment in several rounds, you can do even better.
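A stylized two-round version of this staged design can be sketched as follows. To be clear, this is an illustration of the general idea, not the exact assignment algorithm from the paper; the true success rates, sample sizes, and the rule of keeping the top two arms are all hypothetical choices made for the example.

```python
# A stylized two-round staged experiment: run a small first round,
# drop the weakest treatments, and spend the rest of the budget on
# the front-runners. All numbers are hypothetical.
import numpy as np

rng = np.random.default_rng(1)

true_rates = [0.30, 0.35, 0.50, 0.52]  # hypothetical true success rates
K = len(true_rates)
n1 = 100        # observations per arm in round 1
n2_total = 400  # total budget for round 2

# Round 1: split the sample equally across all treatments.
successes = np.array([rng.binomial(n1, p) for p in true_rates])
round1_rates = successes / n1

# Keep only the two best-performing treatments for round 2.
survivors = np.argsort(round1_rates)[-2:]
n2_each = n2_total // len(survivors)

# Round 2: split the remaining budget equally among the survivors,
# pooling the new data with the round-1 data.
for arm in survivors:
    successes[arm] += rng.binomial(n2_each, true_rates[arm])

counts = np.full(K, n1)
counts[survivors] += n2_each
final_rates = successes / counts

chosen = int(np.argmax(final_rates))
print(f"chosen policy: {chosen}, estimated rates: {np.round(final_rates, 3)}")
```

The payoff is that the two closely matched front-runners end up with three times as many observations each as they would under an equal split of the full budget across all four arms, which is exactly where the extra precision matters for picking the best policy.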
We show (both theoretically and in applications) that such a procedure, where you shift towards better treatments at the right speed, gives you a much better chance of picking the best policy after the experiment, which is what your goal was, after all. In contrast to the traditional method, ours allows you to achieve better outcomes with smaller sample sizes.
Bandits: related, but different
Some readers might be reminded of so-called bandit problems and the corresponding algorithms. Bandit algorithms are related to, but different from, what we are aiming for here. They are often used for targeted advertising by large internet companies, and they are based on continuous experimentation, where the goal is to maximize outcomes during the experiment itself. This is different from our goal, which is to pick the best policy after the experiment. (See this app, which allows you to calculate treatment assignment probabilities in this way.)
The difference matters. Consider again the approach outlined above, where you run the experiment in two rounds. If you have a bandit objective, you will assign all participants in the second round to the one treatment that performed best in the first round. This maximizes your overall outcomes during the experiment, in expectation. But it does not maximize your chance of picking the best policy after the experiment, because it does not allow you to continue learning about the runners-up. It would, for instance, be strictly better to split the second round of the experiment equally between the two best-performing treatments from the first round.
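This contrast can be made concrete with a small Monte Carlo comparison of the two round-2 rules: assigning everyone to the round-1 winner ("bandit") versus splitting round 2 between the two round-1 leaders ("split"). Again, this is a stylized sketch rather than anything from the paper; the success rates, sample sizes, and number of simulations are hypothetical.

```python
# Monte Carlo sketch: probability of choosing the best policy after the
# experiment, under two rules for allocating the second round.
# All rates, sample sizes, and simulation counts are hypothetical.
import numpy as np

def prob_correct(rule, true_rates, n1=100, n2=200, sims=2000, seed=2):
    rng = np.random.default_rng(seed)
    K = len(true_rates)
    best = int(np.argmax(true_rates))
    correct = 0
    for _ in range(sims):
        # Round 1: equal split across all arms.
        s = np.array([rng.binomial(n1, p) for p in true_rates], dtype=float)
        n = np.full(K, float(n1))
        order = np.argsort(s / n)
        if rule == "bandit":
            # All of round 2 goes to the round-1 winner.
            arms, shares = [order[-1]], [n2]
        else:
            # Round 2 is split between the two round-1 leaders.
            arms, shares = order[-2:], [n2 // 2, n2 // 2]
        for arm, m in zip(arms, shares):
            s[arm] += rng.binomial(m, true_rates[arm])
            n[arm] += m
        # After the experiment, pick the arm with the highest mean.
        correct += int(np.argmax(s / n) == best)
    return correct / sims

rates = [0.30, 0.35, 0.50, 0.55]  # hypothetical; the top two arms are close
p_bandit = prob_correct("bandit", rates)
p_split = prob_correct("split", rates)
print(f"P(correct choice) - bandit: {p_bandit:.3f}, split: {p_split:.3f}")
```

The intuition to look for in the output is the one from the text: when the top two arms are close, the bandit rule leaves the runner-up with only its noisy round-1 estimate, while the split rule keeps learning about both front-runners.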
To summarize, when we set out to run experiments (as in life more generally), it is a good idea to take a moment and think about what our ultimate goal is. Since the goal of many experimenters is, at least implicitly, to aid policy choice, we should design experiments accordingly. The procedure that Anja Sautmann and I have proposed helps us do that, by adapting over time and focusing attention on the best-performing alternatives.