Generalizing from Lab Experiments

Introduction

This term paper was written as part of the “Social Preference” course at Humboldt-Universität zu Berlin. It addresses a relatively recent controversy in empirical economics with wide-ranging consequences for the way future research will be conducted. The debate centers on publications by Steven D. Levitt and John A. List, which have been perceived as critical of laboratory setups for economic experiments. Colin Camerer answered Levitt and List, laying out various arguments in favor of lab experiments. Camerer’s publication is the focus of this analysis. The first part summarizes the position of Levitt and List. The second part summarizes Camerer’s critique of their position. The third part assesses the validity of that critique, and the fourth part connects both papers to our Social Preference course.

Summary of the Levitt & List position

Camerer seeks to address the claims made by Steven D. Levitt and John A. List (LL) in three of their publications [@Levitt.2007; @Levitt.2007b; @Levitt.2008]. He primarily focuses on one of these papers, titled “Viewpoint: On the Generalizability of Lab Behavior to the Field” [@Levitt.2007]. Hence, I, too, will focus on the arguments brought forward in this publication. The authors describe the goal of this work as being to “summarize, in a provocative manner, some of the important factors at work when extrapolating results from laboratory experiments to the field.” In particular, they focus on four aspects in which the lab environment typically differs from the real world and on how these differences might influence the generalizability of lab results. To think about these four differences in a structured way, they propose a model of decision-making. Agents maximize a utility function:

\[U_i(a, v, n, s) = M_i(a, v, n, s) + W_i(a, v) - c \tag{1}\]

Their utility depends on their actions (a) via two channels. One is the wealth effect ($W_i$), which depends on the action and is an increasing function of the stakes of the decision (v). The second effect is the non-monetary moral cost of the action ($M_i$). This effect is a function of the selected action as well as the magnitude of the negative impact the decision has on others (v), the set of social norms (n), and scrutiny (s). Furthermore, in the model, cognitive costs (c) are assumed to exist and to increase with the difficulty of decision-making.
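
To make the roles of the arguments concrete, the following minimal sketch evaluates the utility function for a dictator-game-style choice of a giving share. The functional forms of `wealth` and `moral_payoff`, the weight `k`, and all numbers are assumptions chosen purely for illustration; LL do not commit to specific functional forms.

```python
# Illustrative sketch of U_i = M_i + W_i - c.
# All functional forms and parameter values below are assumed, not taken from LL.

def wealth(a, v):
    """W_i(a, v): money kept when giving share a of the stake v (increasing in v)."""
    return v * (1.0 - a)

def moral_payoff(a, v, n, s, k=1.0):
    """M_i(a, v, n, s): hypothetical moral term, zero when a matches the norm share n.

    The penalty for deviating from n grows with the stakes v and with
    scrutiny s; k is an arbitrary scaling weight.
    """
    return -k * s * v * (n - a) ** 2

def utility(a, v, n, s, c=0.0):
    """Total utility: moral term plus wealth term minus cognitive cost c."""
    return moral_payoff(a, v, n, s) + wealth(a, v) - c

def best_action(v, n, s, steps=100):
    """Grid search for the utility-maximizing giving share a in [0, 1]."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda a: utility(a, v, n, s))

low_scrutiny = best_action(v=10, n=0.5, s=1.0)
high_scrutiny = best_action(v=10, n=0.5, s=5.0)
print(low_scrutiny, high_scrutiny)
```

Under these assumed forms, the utility-maximizing giving share rises when scrutiny increases, which is the channel through which the lab environment could shift behaviour in the model.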

LL name four key differences between lab and reality, which are summarized below:

Stakes

Levitt and List note that when the stakes of real-world situations cannot be replicated in the lab, one cannot necessarily assume that lab results generalize to non-experimental situations. In contrast to other disciplines, it is common for economics experiments to pay participants a monetary amount that depends on their choices. The model assumes that people respond to (monetary) incentives. In the utility equation (1), we see that the situation’s stakes (v) are a factor in utility optimization.

Pleasing the experimenter

Unlike in the real world, in lab experiments, participants know that an experimenter monitors their actions. This alters the scrutiny felt in the situation, hence changing the utility function. This systematically different level of scrutiny would result in systematically different actions by the participants as opposed to agents in real-world situations.

Learning effects

There is a practical limit on the duration of lab experiments. As the utility function (1) states, decision-making is associated with cognitive costs. For one-time decisions, this cognitive cost might be too high to justify searching for the theoretically optimal action. In real-world situations, however, agents are often repeatedly confronted with similar situations, so marginal cognitive costs decrease, which leads to decisions closer to the optimum. Levitt and List argue that when lab experiments cannot replicate the possibility of accumulating learning effects, we should not expect lab results to be necessarily equivalent to the behaviour observed in the real world.

Selection effects

In economics, lab experiments are typically performed in a university setting with students as participants. The specific forms and weights of utility functions differ from person to person. This does not inhibit the generalizability of results as long as the utility functions of the people tested do not differ systematically from the group one might want to generalize to. LL state that for most questions, this assumption will not hold. The subset of students participating in economics experiments is significantly different from the whole student population, let alone the population at large.

Conclusion

Levitt and List advise caution when generalizing lab experiments for these four main reasons. They predict that “behaviour will converge across situations as the economically and psychologically relevant factors converge” while warning that “relevant factors will rarely converge across the lab and many field settings.” They conclude that “at a minimum, lab experiments can provide a crucial first understanding of qualitative effects, suggest underlying mechanisms that might be at work when certain data patterns are observed, provide insights into what can happen, and evoke empirical puzzles” [@Levitt.2007].

Camerer’s Critique

Camerer’s critique [-@Camerer.2011] features three main arguments: (1) generalizability is not a main goal of lab experiments; (2) most features that, according to Levitt and List, might compromise the generalizability of lab findings are not unique to lab experiments; and (3) the literature shows that lab-field generalizability is often quite good. In the following, I will examine the merit of each of those claims and show the connections to the claims made by Levitt and List.

1. Generalizability is not a primary concern for lab experiments

Camerer proposes two viewpoints on experimental economics. The scientific view is that “all empirical studies contribute evidence about the general way in which [economic factors] […] influence economic behaviour.” The policy view stresses generalizability, as it aims to use the knowledge for policy actions. Camerer asserts that Levitt and List subscribe to the policy view, while most experimentalists hold the scientific view.

2. Field experiments suffer from the same flaws in generalizability

Camerer’s second main argument asserts that factors that might limit the generalizability of lab experiments to the field also create problems for generalizing from field results to other field applications (2.1). He further states that none of these factors (except for obtrusive observation) is a necessary part of lab experiments, and that they do not necessarily impact generalizability to the field (2.2).

3. The empirical evidence for differences in lab and field is weak

This argument has three parts, all of which engage with the current literature on generalizability. First, the initial study that sought to create similar setups for experiments both in the field and the lab to compare results [@List.2006] observed significant differences between lab and field behaviour. Based on a new, previously unreported analysis, Camerer claims that this finding is not statistically reliable (3.1). Second, among other experiments that try to create similar situations to compare behaviour in the lab with behaviour in the field, just one study gives conclusive evidence in favor of differences in behaviour (3.2). Third, among papers that compare lab and field results without closely matching the circumstances, more than 20 studies find good comparability, while only 2 find very different results in lab and field (3.3).

Validity of Critiques

(1) Arguing that LL do not adhere to the scientific view described by Camerer misses the core of their argument. In their writings, Levitt and List do not entertain the thought that external validity should be a prerequisite for lab experiments or that experiments aiming to find general principles not applicable in natural environments should not exist. They provide a model to consider which factors in the experimental setup might promote or inhibit generalizability to the field. Their writings do not imply that every experiment has to have perfect external validity. It is unclear whether Camerer really thinks that contemplating external validity is “distracting” and should be avoided because it is a net negative for scientific progress. We may assume this is not the case, as a quick search of his publications shows him frequently contemplating the external validity of his experiments.¹

(2.1) When arguing that field-field generalizability is similarly problematic to lab-field generalizability, @Camerer.2011 presents an example regarding dictator games and charitable giving. Dictator games are a widely used setup in lab experiments, the results of which show substantial selfless giving. These results have been criticized as they seem to be at odds with the much lower levels of charitable giving observed in the real world. Camerer rejects that critique by arguing that even though people might have interpreted the results of dictator game lab experiments as altruism, comparing these results with charitable giving of earned income in the real world was never reasonable. He then argues in favor of lab experiments: “The nature of entitlements, deservingness, stakes, and obtrusiveness […] can all be controlled much more carefully than in most field settings.” This is true, and it supports the point LL make: factors like stakes (v) and scrutiny (s) should be actively considered when setting up lab experiments and when generalizing them from the lab to the field. Suppose one is trying to observe social preferences for charitable giving in the real world (as Camerer assumes in this example). In that case, a lab experiment is unlikely to give a useful result. Factors like the level of scrutiny can be varied in the lab, but this does not provide any information about the quantitative results one can expect in the real world, because the real-world level of scrutiny is unknown. Hence, only field experiments could generate results that predict further field behaviour, even if the generalizability is quite narrow.

(2.1.1) On scrutiny specifically, Camerer makes some arguments that serve as a useful example of problems in other parts of the paper. For the factor of scrutiny to be impactful, Camerer argues, subjects would have to “(a) have a view of what hypothesis the experimenter favors (or ‘demands’); and (b) be willing to sacrifice money to help prove the experimenter’s hypothesis.” He goes on to argue that “condition (a) is just is [sic] not likely to hold because subjects have no consistent view about what the experimenter expects.” Even if we accept, for the sake of argument, the proposition that subjects’ views are inconsistent, these views could still impact the results. Going back to the example of dictator games, imagine that no subject has any intrinsic desire to give in a dictator game; 40% of participants think the experimenter expects them to give half their endowment, while 60% believe the experimenter expects them to give nothing. In this case, the participants have inconsistent views of what the experimenter might expect, and still, a minority of participants would skew the results dramatically. This setup would explain the observed giving in dictator games without the need for consistent views of the experimenter’s expectation. Hence, condition (a) does not have to hold for scrutiny to affect the results of an experiment. The argumentation is reminiscent of the popular debate tactic of “false premise setting.” The argument focuses on whether expectations of experimenter demand are consistent across participants while pretending that consistency is required to change the results, which it is not. This specific form of straw-manning is emblematic of the paper: Camerer persistently argues against the claim that lab experiments are worthless, which is a much more extreme and less nuanced claim than any argument made by LL.
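
The arithmetic behind this thought experiment can be written out in a few lines. The shares and gift sizes below are the hypothetical values from the example above, not empirical data, and the helper name `mean_giving` is chosen only for illustration.

```python
# Hypothetical belief groups from the thought experiment (not empirical data).
belief_groups = [
    # (share of participants, fraction of endowment they give)
    (0.40, 0.5),  # believe the experimenter expects an equal split
    (0.60, 0.0),  # believe the experimenter expects no giving
]

def mean_giving(groups):
    """Average fraction of the endowment given across all belief groups."""
    return sum(share * gift for share, gift in groups)

average = mean_giving(belief_groups)
print(f"Average giving: {average:.0%} of the endowment")
```

In this sketch, average giving comes to a fifth of the endowment even though no participant has any intrinsic desire to give and participants disagree about what the experimenter expects.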

(2.1.2) Regarding condition (b), Camerer argues that if subjects prefer to fulfill experimenters’ expectations, the effect will shrink with increasing stakes. He cites @Camerer.1999c, arguing that raising stakes has little effect. After extensively studying the paper, I found that most of the analyzed papers study the effect of increased financial incentives on the performance of cognitive or physical tasks. The difference from the example of the dictator game is that in these cases, the motivation to do well to please the experimenter and the motivation to do well to earn more money are aligned, whereas in the dictator game pleasing the experimenter might come at the cost of personal financial gain. These experiments do not indicate whether an increase in financial rewards for good performance reduces the impact of the preference to please the experimenter. In fact, among the 74 papers considered by Camerer and Hogarth, two observe the change in behaviour through increased financial stakes in a dictator game. Both find significantly less social giving with increased stakes [@Forsythe.1994; @Sefton.1992]. The papers on task performance show that in situations where performance and experimenter expectations are aligned, increasing financial rewards often reduces performance. This result is explained by the financial rewards reframing the situation and crowding out the (stronger) intrinsic motivation, but it could also be explained by the financial reward reframing the situation and crowding out the preference to please the experimenter. With increasing rewards, the preference to please the experimenter might decline, making these results artefacts of laboratory conditions. To summarize, the cited meta-analysis is not only inapplicable in large parts, but the small subset of analysed papers that speak to Camerer’s argument explicitly contradict his thesis, showing the exact opposite of what his thesis would predict.
Beyond that, taking experimenter demand effects seriously calls into question the interpretation of the whole meta-analysis, as they provide an explanation that competes with intrinsic motivation. This example of a quite selective reading of the literature is especially egregious considering that Camerer is the lead author of the cited paper, which is why it is addressed in such detail here.

(2.2) It might technically be true that the aspects of lab design LL criticize are not necessary components of lab experiments, but even so, this is hardly a critique of LL’s argument. Camerer himself describes what he calls the “common design” of lab experiments as follows: “Typically behavior is observed obtrusively, decisions are described abstractly, subjects are self-selected volunteers from convenience samples (e.g., college students), and per-hour financial incentives are modest.” Even if those characteristics are not linked to lab experiments by definition, they are the current standard and part of the vast majority of experiments, making Camerer’s thesis theoretically valid but pragmatically ineffectual. LL argue that those characteristics create problems for generalizability and explicitly promote the creation of lab experiments whose characteristics fit the real world more closely. They do not argue that lab experiments are inherently bad but raise awareness of specific factors in lab design that might practically impact generalizability.

(3.1) We established that thinking about external validity can be worthwhile and that the presence of an experimenter, the added level of scrutiny, the atypical demography of participants, and the generally low stakes are factors that might reasonably be considered when generalizing from the lab to the field. What remains to be examined is what the literature indicates about how large the differences between lab and field results might be. Camerer focuses on a single paper, which tried to create analogous experiments in a lab and a field setting to observe differences in behaviour [@List.2006]. The experiment observes the interaction of buyers and sellers of sports cards, both in the lab and in the field. Buyers are instructed to approach sellers and request the best possible card for a determined price. After re-examining List’s data, Camerer claims two new findings. He claims that the appropriate variable to focus on is the difference in price sensitivity for non-local traders between the lab and the field, that is, the difference in offered card quality for a given increase in offered price. As his first new finding, Camerer reports that these effects are not statistically different between the two settings. @AlUbaydli.2013 argue that this was not the focus of the study, which tried to observe gift-giving, not reciprocity. This seems correct to me but misses the larger point. Camerer argues that one should not generalize to a field setting but to the general behaviour function, which he assumes to be parallel in lab and field. LL reject this assumption and argue that some characteristics of lab experiments influence behaviour in specific and biased ways that do not occur outside the lab, making lab experiments less suitable for generalization to the general behaviour function governing behaviour in all situations.

In LL’s view, lab results give us information about human behaviour in labs but not necessarily much else, as long as we do not specifically engage with the differences between the lab and every other setting. So what do the results of @List.2006 show? Are there differences in behaviour between lab and field? Yes, quite a few. Gift-giving is only observed in the lab, not the field. The impact of sellers being local or non-local differs systematically between settings. We should be aware of the burden of proof required to support the thesis that lab and field data do not vary systematically. Given a significance level (say 5%), one would need to show that fewer than 5% of results in comparisons between lab and field show significant differences. Camerer does not provide a systematic account of that (which might be difficult due to the small sample size), and his unsystematic list of results is not sufficient to support this claim. To illustrate the point, I requested the raw data from Prof. List, added a dummy variable for the lab setting, and ran an OLS regression on the data from the direct comparison setup between lab and field. This very simple analysis shows whether the setting plays a significant role. The results imply that it does, at the 1% significance level.
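
The kind of regression described above can be sketched as follows. This is a minimal illustration under strong assumptions: a single outcome variable, the lab dummy as the only regressor, and classical OLS standard errors. The function name `ols_with_dummy` is hypothetical, and the numbers are made up for illustration; they are not the data from @List.2006, and the specification actually estimated may have differed.

```python
import math

def ols_with_dummy(y, lab):
    """OLS of y on a constant and a 0/1 lab dummy.

    Returns (b0, b1, t_b1): the intercept, the coefficient on the dummy,
    and the t-statistic of that coefficient.
    """
    n = len(y)
    x_bar = sum(lab) / n
    y_bar = sum(y) / n
    s_xx = sum((x - x_bar) ** 2 for x in lab)
    s_xy = sum((x - x_bar) * (yi - y_bar) for x, yi in zip(lab, y))
    b1 = s_xy / s_xx              # estimated lab-versus-field difference
    b0 = y_bar - b1 * x_bar       # estimated mean outcome in the field (lab = 0)
    residuals = [yi - b0 - b1 * x for x, yi in zip(lab, y)]
    sigma2 = sum(e ** 2 for e in residuals) / (n - 2)  # residual variance
    se_b1 = math.sqrt(sigma2 / s_xx)
    return b0, b1, b1 / se_b1

# Made-up outcome values for illustration only (NOT data from List's study).
quality = [2.0, 3.0, 2.5, 3.5, 4.0, 5.0, 4.5, 5.5]
is_lab = [0, 0, 0, 0, 1, 1, 1, 1]

b0, b1, t_stat = ols_with_dummy(quality, is_lab)
print(f"field mean = {b0:.2f}, lab effect = {b1:.2f}, t = {t_stat:.2f}")
```

With a dummy as the only regressor, the coefficient equals the difference between the lab and field group means. Under the classical assumptions, the t-statistic would be compared against a t distribution with n - 2 degrees of freedom to judge significance.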

(3.2) & (3.3) The final part of Camerer’s analysis does not directly respond to LL but surveys the literature to assess whether LL’s theoretical concerns make a difference in practice. Beyond @List.2006, Camerer identifies six studies comparing lab and field setups directly and more than 20 studies comparing field setups with vaguely similar lab experiments. He reports finding only one study with differing results in the closely matched setups and two studies with differing results for the less closely matched comparisons. These observations lead him to conclude that lab experiments produce data that reliably coincide with field findings, calling into question the warnings about lab generalizability. This conclusion suffers from two flaws discussed above. The first is the selective reading of the literature (similar to section 2.1.2). The first paper Camerer describes as a close match between lab and field is @list2009, which finds cheating behaviour to be more common in the field setting than in the corresponding lab setting. Camerer dismisses this finding as statistically insignificant, pointing out that explicit reporting of the significance level was missing. This is indeed correct; in contrast to the more central finding of the paper, List gave only point estimates comparing lab and field behaviour. However, just as for @List.2006, the raw data are available on request, and an OLS regression estimating the effect of the setting shows significant results at the 5% level. As with this paper, many of the results Camerer dismisses as not showing significant differences appear to give much stronger evidence for differing results than Camerer’s characterisation of them indicates.

The second flaw in this conclusion is that it implies the burden of proof should be on the side warning about the generalizability of lab experiments, while it should be on the side defending it. Even if only one in six closely matched setups produces different results between lab and field, this would be a reason to be cautious about generalizing without explicitly addressing potentially significant differences between lab and field that might influence results. Level effects and differences in effect size can be very important for policy considerations.

Missing critiques

Beyond the problems with the critiques Camerer has brought forth, there are also some critiques that should have been made but were not. The most prominent among them concerns LL’s reliance on lab results to motivate their decision-making model. The results leading to the inclusion of variables like scrutiny or stakes in the model came not from the field, or from a field-lab comparison, but from the lab. One example of this is the factor of scrutiny and the closely connected idea of experimenter demand effects. LL’s argument for including scrutiny in the utility function leans on the work of Orne [@orne1959demand; @orne1959nature; @orne19621962]. Orne showed that altering experimental settings in a way that increases participants’ awareness of the experimenters’ preferences leads to behaviour that is more in line with the experimenters’ demands. LL demonstrate by their argumentation that this lab result is significant in and of itself. It does not need field validation, and the existence of the effect itself, not its size, is the valuable information. Many such cases exist, where lab results alone have the capacity to move science forward in valuable, practical ways.

@Camerer.2011 and @Levitt.2007 are highly connected to various parts of the course. The most obvious connection is the discussion of general criticisms of lab experiments, which heavily features both the LL paper and the Camerer paper as a response. In addition to addressing LL and Camerer directly, Chapter 6 presented various studies relating to problems LL addressed.

Connections to LL

First, the lecture addresses the work of @Hoffman.1996, which explores the impact of scrutiny on behaviour in dictator games and shows that reducing the scrutiny in dictator games significantly decreases giving. This is linked to the concept of scrutiny in LL and to experimenter demand effects, which LL cite as a potential problem in lab experiments. Closely related to these results, @Berg.1995 show that the impact of reducing scrutiny is much weaker in trust games. Another piece of literature the course presents on experimenter demand effects is the work by @Bardsley.2008, which shows that obfuscating the experiment’s aim also leads to less giving behaviour. This, too, serves as evidence for significant experimenter demand effects.

Another criticism by LL regarding lab experiments is that their setup is often quite different from real-world situations; beyond scrutiny, this includes potential differences in behaviour due to participants being endowed with money instead of risking their own money. Specifically, receiving an endowment from an experimenter might be associated with different social norms than deciding about one’s own money (“n” in the model). In the course, we learned about the work of @Cherry.2002 on this topic. @Cherry.2002 let one group of dictators work for their money and compared their giving behaviour with that of dictators receiving an endowment. They found a sharp decrease in giving behaviour when dictators had to work for their money. Under the lens of LL’s work, these results show significant differences in perceived social norms between lab experiments where participants receive an endowment and other situations in which people decide about their own money.

Another key component of LL’s critique is the pool of participants used in lab experiments. They argue that differences between the demography of participants and that of the people whose real-world actions one tries to predict may lead to biased results.
This topic was extensively covered in Chapter 8 of the course. First, we were introduced to the work of @Roth.1991, which examines cross-country differences in behaviour in ultimatum games and finds modest differences in both offers and acceptance rates. Building on that, @Henrich.2001 ran a variety of games (including ultimatum games) in various small-scale traditional societies, finding more pronounced differences in behaviour. Offers are, on average, lower and the acceptance rate higher, so that observed offers exceed the profit-maximizing level. Results of dictator and public goods games also showed significant differences from the usual student samples. Further evidence for the significance of cultural differences is provided by @Herrmann.2008, who show different reactions to punishment across cultures. In addition to cross-cultural and cross-country differences, there may be systematic differences in the behaviour of various demographic groups within a country. Many experiments are performed with university students; @Cappelen.2015 examined whether students’ behaviour differs significantly from that of the general population. They found students to be less prosocial and to exhibit smaller gender differences. Furthermore, motives like efficiency, equality, or reciprocity differed significantly from those of the general population. This literature on the impact of culture and demography on behaviour aligns with LL’s warnings concerning the generalizability of results obtained from one group (e.g., students in Western countries) to others.

Connections to Camerer

The first part of Camerer’s critique develops the concepts of the scientific view and the policy view on economic research and accuses LL of (wrongly) taking the policy view, which stresses the importance of deriving real-world predictions from research. This difference between LL and Camerer speaks directly to the question raised in lecture 10: “But do they [social-preference models] help economics or economic policy?” This is connected to the first critique, where Camerer described this question as “distracting”. The course gives examples of how models that include social preferences can make different predictions for outcomes than neo-classical models [e.g., @ReyBiel.2008; @Dufwenberg.2011] and how this might affect predicted policy outcomes.

Acknowledgments

I thank Prof. List for kindly providing the raw data for @List.2006 and @list2009. Data supporting this study’s findings are available upon reasonable request to the corresponding author.

References

  1. One of many examples is his work on “reference group neglect” in [@Camerer.1999], where he writes extensively on the real-world implications of lab findings.