Introduction
This term paper was written as part of the “Social Preference” course at Humboldt-Universität zu Berlin. It addresses a relatively recent controversy in empirical economics with wide-ranging consequences for the way future research will be conducted. The debate centers on publications by Steven D. Levitt and John A. List, which have been perceived as critical of laboratory setups for economic experiments. Colin Camerer answered Levitt and List, laying out various arguments in favor of lab experiments. Camerer’s publication is the focus of this analysis. The first part summarizes the positions of Levitt and List. The second part summarizes Camerer’s critique of their position. The third part assesses the validity of that critique, and the fourth part connects both papers to our Social Preference course.
Summary of the Levitt & List position
Camerer seeks to address the claims made by Steven D. Levitt and John A. List (LL) in three of their publications [@Levitt.2007; @Levitt.2007b; @Levitt.2008]. He primarily focuses on one of these papers, titled “Viewpoint: On the Generalizability of Lab Behavior to the Field” [@Levitt.2007]. Hence, I, too, will focus on the arguments brought forward in this publication. The authors describe the goal of this work as being to “summarize, in a provocative manner, some of the important factors at work when extrapolating results from laboratory experiments to the field.” In particular, they focus on four aspects in which the lab environment typically differs from the real world and on how these differences might influence the generalizability of lab results. To think about these four differences in a structured way, they propose a model of decision-making. Agents maximize a utility function:
\[U_i(a, v, n, s) = M_i(a, v, n, s) + W_i(a, v) - c \tag{1}\]
Their utility depends on their actions (a) via two channels. One is the wealth effect ($W_i$), which depends on the action and is an increasing function of the stakes of the decision (v). The second is the non-monetary moral cost of the action ($M_i$). This effect is a function of the selected action as well as the magnitude of the negative impact the decision has on others (which grows with the stakes v), the set of social norms (n), and scrutiny (s). Furthermore, the model assumes that cognitive costs (c) exist and increase with the difficulty of decision-making.
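To make the structure of the model concrete, the following is a minimal sketch of how a change in scrutiny could shift the utility-maximizing action. It is not LL's specification: the action space (the share of the stakes an agent keeps), the linear wealth term, the quadratic moral cost, and all parameter values are assumptions chosen purely for illustration.

```python
def utility(a, v, n, s, c=0.0):
    """Toy utility U = M + W - c for an action a in [0, 1].

    a: share of the stakes the agent keeps (hypothetical action space)
    v: stakes, n: strength of social norms, s: scrutiny, c: cognitive cost
    The functional forms below are assumptions, not taken from LL.
    """
    wealth = v * a                    # W: increasing in the stakes v
    moral = -(n + s) * v * a ** 2     # M: moral cost rises with n, s and v
    return moral + wealth - c


def best_action(v, n, s, steps=100):
    """Grid search for the utility-maximizing action on [0, 1]."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda a: utility(a, v, n, s))


# Same stakes and norms, different scrutiny (all numbers are made up).
unobserved = best_action(v=10.0, n=0.5, s=0.0)
observed = best_action(v=10.0, n=0.5, s=0.5)
print(unobserved, observed)
```

Under these assumed forms, the agent keeps the whole amount when unobserved and only half of it when scrutiny is added, which is the kind of systematic shift the model is meant to capture.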
LL name four key differences between lab and reality, which are summarized below:
Stakes
Levitt and List note that when the stakes of real-world situations cannot be replicated in the lab, one cannot necessarily assume that lab results generalize to non-experimental situations. In contrast to other disciplines, it is common for economics experiments to offer participants a monetary pay-out that depends on their choices. The model assumes that people respond to (monetary) incentives. In the utility equation (1), we see that the situation’s stakes (v) are a factor in utility optimization.
Pleasing the experimenter
Unlike in the real world, in lab experiments, participants know that an experimenter monitors their actions. This alters the scrutiny felt in the situation, hence changing the utility function. This systematically different level of scrutiny would result in systematically different actions by the participants as opposed to agents in real-world situations.
Learning effects
There is a practical limit on the duration of lab experiments. As the utility function (1) states, decision-making is associated with cognitive costs. For one-time decisions, this cognitive cost might be too high to justify searching for the theoretically optimal action. In real-world situations, however, agents are often repeatedly confronted with similar situations, so marginal cognitive costs decrease, which leads to decisions that are closer to optimal. Levitt and List argue that when lab experiments cannot replicate the possibility of accumulating learning effects, we should not expect lab results to be necessarily equivalent to the behaviour observed in the real world.
Selection effects
In economics, lab experiments are typically performed in a university setting with students as participants. The specific forms and weights of utility functions differ from person to person. This does not inhibit the generalizability of results as long as the utility functions of the people tested do not differ systematically from the group one might want to generalize to. LL state that for most questions, this assumption will not hold. The subset of students participating in economics experiments is significantly different from the whole student population, let alone the population at large.
Conclusion
Levitt and List advise caution when generalizing from lab experiments for these four main reasons. They predict that “behaviour will converge across situations as the economically and psychologically relevant factors converge” while warning that “relevant factors will rarely converge across the lab and many field settings.” They conclude that “at a minimum, lab experiments can provide a crucial first understanding of qualitative effects, suggest underlying mechanisms that might be at work when certain data patterns are observed, provide insights into what can happen, and evoke empirical puzzles” [@Levitt.2007].
Camerer’s Critique
Camerer’s critique [-@Camerer.2011] features three main arguments: (1) generalizability is not a main goal of lab experiments; (2) most features that, according to Levitt & List, might compromise the generalizability of lab findings are not unique to lab experiments; and (3) the literature shows that lab-field generalizability is often quite good. In the following, I will examine the merit of each of these claims and show how they connect to the claims made by Levitt and List.
1. Generalizability is not a primary concern for lab experiments
Camerer proposes two viewpoints on experimental economics. The scientific view is that “all empirical studies contribute evidence about the general way in which [economic factors] […] influence economic behaviour.” The policy view stresses generalizability as it aims to use the knowledge for policy actions. Camerer asserts that Levitt and List subscribe to the policy view, while most experimentalists hold the scientific view.
2. Field experiments suffer from the same flaws in generalizability
Camerer’s second main argument asserts that factors that might limit the generalizability of lab experiments to the field also create problems for generalizing from field results to other field applications (2.1). He further states that none of these factors (except for obtrusive observation) is a necessary part of lab experiments or necessarily impairs generalizability to the field (2.2).
3. The empirical evidence for differences in lab and field is weak
This argument has three parts, all of which engage with the current literature on generalizability. First (3.1), the initial study that created similar setups in the field and in the lab to compare results [@List.2006] observed significant differences between lab and field behaviour. Camerer claims, based on a new, previously unreported analysis, that this finding is not statistically reliable. Second (3.2), among the other experiments that create similar situations to compare behaviour in the lab with behaviour in the field, just one study gives conclusive evidence of differences in behaviour. Third (3.3), among papers that compare lab and field results without closely matching the circumstances, more than 20 studies find good comparability, while only 2 find very different results in lab and field.
Validity of Critiques
(1) Arguing that LL do not adhere to the scientific view described by
Camerer misses the core of their argument. In their writings, Levitt and
List do not entertain the thought that external validity should be a
prerequisite for lab experiments or that experiments aiming to find
general principles not applicable in natural environments should not
exist. They provide a model to consider which factors in the
experimental setup might promote or inhibit generalizability to the
field. Their writings do not imply that every experiment has to have
perfect external validity. It is unclear if Camerer thinks contemplating
external validity is really "distracting" and should be avoided, as it
is a net negative for scientific progress. We may assume this is not the
case, as a quick search of his publications shows him frequently
contemplating external validity of his experiments.1
(2.1) When arguing that field-field generalizability is as problematic
as lab-field generalizability, @Camerer.2011 presents an example
regarding dictator games and charitable giving. Dictator games
are a widely used setup in lab experiments, the results of which show
substantial selfless giving. These results have been criticized as they
seem to be at odds with the much lower levels of charitable giving
observed in the real world. He rejects that critique by arguing that
even though people might have interpreted the results of dictator game
lab experiments as altruism, comparing these results with charitable
giving of earned income in the real world was never reasonable. He then
argues in favor of lab experiments: “The nature of entitlements,
deservingness, stakes, and obtrusiveness […] can all be controlled
much more carefully than in most field settings.” This is true and
supports the point LL make. Factors like stakes (v) and scrutiny (s)
should be actively considered when setting up lab experiments and
generalizing them from the lab to the field. Suppose one is trying to
observe social preferences for charitable giving in the real world (as
Camerer assumes in this example). In that case, a lab experiment is
unlikely to give a useful result. Factors like the level of scrutiny
can be varied in the lab, but this does not provide any information
about the quantitative results one can expect in the real world, as the
real-world level of scrutiny is unknown. Hence, only field experiments
could generate results that predict further field behaviour, even if
their generalizability is quite narrow.
(2.1.1) On scrutiny specifically, Camerer makes some arguments that
serve as a useful example of problems in other parts of the paper. For
the factor of scrutiny to be impactful, Camerer argues, subjects would
have to “(a) have a view of what hypothesis the experimenter favors (or
“demands”); and (b) be willing to sacrifice money to help prove the
experimenter’s hypothesis.” He goes on to argue that “condition (a) is
just is [sic] not likely to hold because subjects have no consistent
view about what the experimenter expects.” Even if we accept, for the
sake of argument, the proposition that subjects’ views are inconsistent,
these views could still impact the results. Going back to the example of dictator
games, imagine no subject has any intrinsic desire to give in a dictator
game; 40% of participants think the experimenter expects them to give
half their endowment, while 60% believe the experimenter expects them to
give nothing. In this case, the participants have inconsistent views of
what the experimenter might expect, and still, a minority of
participants would skew the results dramatically. This setup would
explain the observed giving in dictator games without the need for
consistent views of the experimenter’s expectation. Hence, condition (a)
does not have to hold for scrutiny to affect the results of an
experiment. The argumentation is reminiscent of the popular debate
tactic of "false premise setting." The argument focuses on whether
expectations of experimenter demand are consistent across participants
while pretending that consistency is required to change the results,
which it is not. This specific form of straw-manning is emblematic of
the paper: Camerer persistently argues against the claim that lab
experiments are worthless, which is a much more extreme and less nuanced
claim than any argument written by LL.
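The arithmetic behind this thought experiment can be sketched in a few lines. The belief shares (40% and 60%) are the ones used above; the endowment of 10 units and the assumption that every subject gives exactly what they believe the experimenter expects are illustrative simplifications.

```python
def mean_giving(belief_shares, endowment=10.0):
    """Average amount given when each subject gives exactly what they
    believe the experimenter expects (no intrinsic desire to give).

    belief_shares maps the believed expected fraction of the endowment
    to the share of subjects holding that belief.
    """
    if abs(sum(belief_shares.values()) - 1.0) > 1e-9:
        raise ValueError("population shares must sum to 1")
    return sum(share * fraction * endowment
               for fraction, share in belief_shares.items())


# 40% believe half the endowment is expected, 60% believe nothing is.
beliefs = {0.5: 0.4, 0.0: 0.6}
average = mean_giving(beliefs)
print(average / 10.0)  # mean giving as a fraction of the endowment
```

In this hypothetical population, average giving amounts to a fifth of the endowment even though no subject has any intrinsic preference for giving and beliefs about the experimenter's expectations are inconsistent across subjects.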
(2.1.2) Regarding condition (b), Camerer argues that if subjects prefer
to fulfill experimenters’ expectations, the effect will shrink with
increasing stakes. He cites @Camerer.1999c, arguing that raising stakes
has little effect. After extensively studying the paper, I found that
most of the analyzed papers study the effect of increased financial
incentives on the performance of cognitive or physical tasks. The
difference from the example of the dictator game is that in these cases,
the motivation to do well to please the experimenter and the motivation
to do well to earn more money are in line with each other, as opposed to
the dictator game where pleasing the experimenter might come at the cost
of personal financial gain. These experiments do not indicate if the
increase in financial rewards for good performance reduces the impact of
the preference to please the experimenter. In fact, among the 74 papers
considered by Camerer and Hogarth, two observe the change in behaviour
through increased financial stakes in a dictator game. Both find
significantly less social giving with increased stakes
[@Forsythe.1994; @Sefton.1992]. The task-performance papers, in turn,
show that in situations where performance and experimenter expectations
are aligned, increasing financial rewards often reduces performance.
This result is usually explained by the financial rewards reframing the
situation and crowding out the (stronger) intrinsic motivation, but it
could also be explained by the financial reward reframing the situation
and crowding out the preference to please the experimenter. With
increasing rewards, the preference to please the experimenter might
decline, making these results artefacts of laboratory conditions. To
summarize, the cited meta-analysis is not only inapplicable in large
parts, but the small subset of analysed papers that speak to Camerer’s
argument explicitly contradict his thesis, showing the exact opposite of
what it would predict. Beyond that, taking experimenter demand effects
seriously calls into question the interpretation of the whole
meta-analysis, as such effects offer an explanation that competes with
intrinsic motivation. This example of a quite selective reading of the
literature is especially egregious, considering that Camerer is the lead
author of the paper cited, and thus deserves to be addressed in more
detail.
(2.2) It might technically be true that the aspects of lab design LL
criticize are not necessary components of lab experiments, but even so,
this is hardly a critique of LL’s argument. Camerer himself describes
what he calls the “common design” of lab experiments as follows: “Typically
behavior is observed obtrusively, decisions are described abstractly,
subjects are self-selected volunteers from convenience samples (e.g.,
college students), and per-hour financial incentives are modest.” Even
if those characteristics are not by definition linked to lab
experiments, they are the current standard and part of the vast majority
of experiments, making Camerer’s thesis theoretically valid but
pragmatically ineffectual. LL argue that those characteristics create
problems for generalizability and explicitly promote the creation of lab
experiments whose characteristics come closer to the real world. They do
not argue that lab experiments are inherently bad but raise awareness
of specific factors in lab design that might practically impact
generalizability.
(3.1) We established that thinking about external validity can be
worthwhile and that factors like the existence of an experimenter, the
added level of scrutiny, and the atypical demography of participants, as
well as the generally low stakes, are factors that might reasonably be
considered when generalizing from the lab to the field. The remaining
question is what the literature indicates about how large the
differences between lab and field results might be. Camerer focuses on a
single paper that tried to create analogous experiments in a lab and a
field setting to observe differences in behaviour [@List.2006]. After
re-examining List’s data, Camerer claims two new findings. The
experiment observes the interaction
of buyers and sellers of playing cards, both in the lab and the field.
Buyers are instructed to go to sellers and request the best possible
card for a predetermined price. Camerer claims that the appropriate
variable to focus on is the difference between lab and field in the
price sensitivity of non-local traders, that is, in the change in
offered card quality for a given increase in offered price. As his
first new finding, Camerer reports that these effects are not
statistically different between the two settings. @AlUbaydli.2013 argue
that this was not the focus of the study, which tried to observe
gift-giving, not reciprocity. This seems correct
to me but misses the larger point. Camerer argues that one should not
generalize to a field setting but to the general behaviour function,
which he assumes to be parallel in lab and field. LL reject this
assumption and argue that some characteristics of lab experiments
influence behaviour in specific and biased ways that do not occur
outside the lab, making lab experiments less suitable for generalization
to the general behaviour function governing behaviour in all situations.
In LL’s view, lab results give us information about human behaviour in
labs but not necessarily much else as long as we do not specifically
engage with the differences between the lab and every other setting. So
what do the results of @List.2006 show? Are there behaviour differences
in the lab and field? Yes, quite a few. Gift-giving is only observed in
the lab, not the field. The impact of sellers being local or foreign
differs systematically between settings. We should be aware of the
burden of proof required to support the thesis that lab and field data
do not vary systematically. Given a significance level (say 5%), one
would need to show that less than 5% of results in comparisons between
lab and field show significant differences. Camerer does not provide a
systematic account of that (which might be difficult due to the lack of
sample size), but his unsystematic list of results is not sufficient to
support this claim. To illustrate the point, I requested the raw data
from Prof. List, added a dummy variable for lab settings, and ran an
OLS-regression analysis on the data from the direct comparison setup
between lab and field. This very simple analysis shows whether the
setting factor plays a significant role. The results imply it does - at
a 1% significance level.
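The structure of such a regression can be sketched as follows. This is not the analysis reported above and does not use List's data: the observations are synthetic, the variable names (`price`, `lab`, `quality`) and the data-generating parameters are hypothetical, and the estimator is a plain normal-equations OLS written with the standard library only.

```python
import math
import random


def ols(X, y):
    """Ordinary least squares via the normal equations.

    Returns (coefficients, standard errors, t statistics).
    """
    n, k = len(X), len(X[0])
    # Build X'X and X'y.
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    # Invert X'X by Gauss-Jordan elimination on [X'X | I].
    aug = [xtx[i] + [1.0 if i == j else 0.0 for j in range(k)]
           for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [val / p for val in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    inv = [row[k:] for row in aug]
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [yi - sum(b * x for b, x in zip(beta, r)) for r, yi in zip(X, y)]
    sigma2 = sum(e * e for e in resid) / (n - k)  # residual variance
    se = [math.sqrt(sigma2 * inv[i][i]) for i in range(k)]
    t = [b / s for b, s in zip(beta, se)]
    return beta, se, t


# Synthetic data with an assumed lab effect of 0.5 (purely illustrative).
rng = random.Random(0)
X, y = [], []
for _ in range(200):
    lab = rng.choice([0.0, 1.0])        # 1 = lab setting, 0 = field
    price = rng.choice([20.0, 65.0])    # two arbitrary price levels
    quality = 2.0 + 0.01 * price + 0.5 * lab + rng.gauss(0.0, 0.5)
    X.append([1.0, price, lab])         # intercept, price, lab dummy
    y.append(quality)

beta, se, t = ols(X, y)
print(f"lab coefficient = {beta[2]:.3f}, t statistic = {t[2]:.2f}")
```

The coefficient on the `lab` dummy estimates the level shift between settings holding the offered price fixed, and its t statistic indicates whether that shift is distinguishable from zero. A fuller analysis would likely also interact the dummy with price and use robust standard errors.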
(3.2) & (3.3) The final part of Camerer’s analysis does not directly
respond to LL but surveys the literature to assess if LL’s theoretical
concerns make a difference in practice. Beyond @List.2006, Camerer
identifies six studies comparing lab and field setups directly and more
than 20 studies comparing field setups with vaguely similar lab
experiments. He reports finding only one study with differing results
in the closely matched setups and two studies with differing results for
the less closely matched comparisons. These observations lead him to
conclude that lab experiments produce data that reliably coincide with
field findings, calling into question the warnings about lab
generalizability. This conclusion suffers from two flaws discussed
above. The first is the selective reading of the literature (similar to
section 2.1.2). The first paper Camerer describes as a close match
between lab and field is @list2009, which finds cheating behaviour to be
more common in the field setting than in the corresponding lab setting.
Camerer dismisses this finding as statistically insignificant, pointing
out that explicit reporting of the significance level is missing. This
is indeed correct; in contrast to the more central findings of the
paper, List gives only point estimates comparing lab and field
behaviour. However, just as for @List.2006, the raw data are available
on request, and an OLS regression estimating the effect of the setting
shows significant results at the 5% level. As with this paper, many of
the results Camerer dismisses as not showing significant differences
appear to give much stronger evidence for differing results than
Camerer’s characterisation of them indicates.
The second flaw in this conclusion is that it implies the burden of proof should be on the side warning about the generalizability of lab experiments, while it should be on the side defending it. Even if only one in six closely matched setups produces different results between lab and field, this would be a reason to be cautious about generalizing without explicitly addressing potentially significant differences between lab and field that might influence results. Level effects and differences in effect size can be very important for policy considerations.
Missing critiques
Beyond the problems with the critiques Camerer has brought forth, there are also some critiques that should have been made but were not. The most prominent among them is LL’s reliance on lab results to motivate their decision-making model. The results leading to the inclusion of variables like scrutiny or stakes in the model came not from the field, or from a field-lab comparison, but from the lab. One example of this is the factor of scrutiny and the closely connected idea of experimenter demand effects. LL’s argument for including scrutiny in the utility function leans on the work of Orne [@orne1959demand; @orne1959nature; @orne19621962]. Orne showed that altering experimental settings in a way that increases participants’ awareness of the experimenters’ preferences leads to behaviour that is more in line with the experimenters’ demands. LL demonstrate by their own argumentation that this lab result is significant in and of itself. It does not need field validation: the valuable information is the existence of the effect itself, not its size. Many such cases exist where lab results alone have the capacity to move science forward in valuable, practical ways.
Links to the course
@Camerer.2011 and @Levitt.2007 are highly connected to various parts of the course. The most obvious connection is the discussion of general criticisms of lab experiments, which heavily features both the LL paper and the Camerer paper as a response. In addition to addressing LL and Camerer directly, Chapter 6 presented various studies relating to problems LL addressed.
Connections to LL
First, the lecture addresses the work of @Hoffman.1996, which explores the impact of scrutiny on behaviour in dictator games and shows that reducing the scrutiny in dictator games significantly decreases giving. This is linked to the concept of scrutiny in LL and to experimenter demand effects, which LL cite as a potential problem in lab experiments. Closely related to these results, @Berg.1995 show that the impact of reducing scrutiny is much weaker in trust games. Another piece of literature the course presents on experimenter demand effects is the work by @Bardsley.2008, which studies how obfuscating the experiment’s aim also leads to less giving behaviour. This, too, serves as evidence for significant experimenter demand effects.
Another criticism by LL regarding lab experiments is that their setup is often quite different from real-world situations; this includes the level of scrutiny as well as potential differences in behaviour due to participants being endowed with money instead of risking their own. Specifically, receiving an endowment from an experimenter might be associated with different social norms than deciding about one’s own money (“n” in the model). In the course, we learned about the work of @Cherry.2002 on this topic. @Cherry.2002 let one group of dictators work for their money and compared their giving behaviour with that of dictators receiving an endowment. They found a sharp decrease in giving behaviour when dictators had to work for their money. Through the lens of LL’s work, these results show significant differences in perceived social norms between lab experiments in which participants receive an endowment and other situations in which people decide about their own money.
Another key component of LL’s critique is the pool of participants used in lab experiments. They argue that differences between the demography of participants and that of the people whose real-world actions one tries to predict may lead to biased results.
This topic was extensively covered in chapter 8 of the course. First, we were introduced to the work of @Roth.1991, which examines cross-country differences in behaviour in ultimatum games and finds modest differences in both offers and acceptance rates. Building on that, @Henrich.2001 ran a variety of games (including ultimatum games) with various small-scale traditional societies, finding more pronounced differences in behaviour. Offers are, on average, lower and acceptance rates higher, which means that offers are higher than profit-maximizing behaviour would imply. Results of dictator and public good games also showed significant differences from the usual student samples. Further evidence for the significance of cultural differences is provided by @Herrmann.2008, who show different reactions to punishment across cultures. In addition to cross-cultural and cross-country differences, there may be systematic differences in the behaviour of various demographic groups within a country. Many experiments are performed with university students; @Cappelen.2015 examined whether students’ behaviour differs significantly from that of the general population. They found students to be less prosocial and to exhibit smaller gender differences. Furthermore, motives like efficiency, equality, or reciprocity differed significantly from those of the general population. This literature on the impact of culture and demography on behaviour aligns with LL’s warnings concerning the generalizability of results obtained from one group (e.g., students in Western countries) to others.
Connections to Camerer
The first part of Camerer’s critique develops the concepts of the scientific view and the policy view on economic research and accuses LL of (wrongly) taking the policy view, which argues for the importance of deriving real-world predictions from research. This difference between LL and Camerer speaks directly to the question raised in lecture 10: “But do they [social-preference models] help economics or economic policy?” This is connected to the first critique, where Camerer described this question as “distracting”. The course gives examples of how models that include social preferences can make different predictions for outcomes than neo-classical models (e.g. @ReyBiel.2008; @Dufwenberg.2011) and how this might affect predicted policy outcomes.
Acknowledgments
I thank Prof. List for kindly providing the raw data for [@List.2006; @list2009]. Data supporting this study’s findings are available upon reasonable request to the corresponding author.
References
1. One of many examples is his work on “reference group neglect” in [@Camerer.1999], where he writes extensively on the real-world implications of lab findings.