Behavior Bench (Beta)

How do AI models behave compared to humans?

Behavior Bench is an AI benchmark modeled after classic experiments from behavioral economics. Rather than testing reasoning or factual recall, it measures the preferences AI models reveal when making decisions under uncertainty, trade-offs across time, and choices involving others.

The benchmark recreates the methodology of Falk et al. (2018), whose Global Preferences Survey (GPS) measured six fundamental behavioral dimensions across 80,000 people in 76 countries. The six dimensions are:

  • Risk Tolerance — willingness to take gambles vs. prefer certainty
  • Patience — preference for delayed vs. immediate rewards
  • Positive Reciprocity — tendency to return favors
  • Negative Reciprocity — tendency to retaliate against unfair treatment
  • Altruism — willingness to give to others at personal cost
  • Trust — baseline willingness to trust strangers

Each model was run through the same incentivized choice tasks used with human participants. Scores are normalized to 0–10 (matching the original GPS scale).


Models ranked by score. Error bars = ±1 SD across simulated participants. Dashed line = human world average (GPS, Falk et al. 2018).
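A minimal sketch of how a score and its error bar could be computed from per-participant results. It assumes each model is sampled as a set of simulated participants and that "SD" means the sample standard deviation; the source does not say which estimator is used. The function name and the example scores are hypothetical.

```python
from statistics import mean, stdev


def score_with_error_bar(participant_scores):
    """Return (mean, lower, upper) for a list of 0-10 scores.

    Assumption: the error bar is mean +/- one *sample* standard deviation.
    """
    if len(participant_scores) < 2:
        raise ValueError("need at least two simulated participants")
    m = mean(participant_scores)
    sd = stdev(participant_scores)
    return m, m - sd, m + sd


# Made-up scores for one model on one dimension, for illustration only.
example_scores = [4.0, 5.0, 6.0]
center, lower, upper = score_with_error_bar(example_scores)
print(center, lower, upper)
```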

Lab scores are averages across all models from that lab.
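The lab-level aggregation could look like the sketch below. It assumes an unweighted mean over each lab's models; the model and lab names are placeholders, not benchmark entries.

```python
from collections import defaultdict
from statistics import mean


def lab_averages(model_scores, model_lab):
    """Average one dimension's scores over all models belonging to each lab.

    model_scores: {model_name: score}
    model_lab:    {model_name: lab_name}
    Assumption: every model counts equally (unweighted mean).
    """
    by_lab = defaultdict(list)
    for model, score in model_scores.items():
        by_lab[model_lab[model]].append(score)
    return {lab: mean(scores) for lab, scores in by_lab.items()}


# Placeholder names and values, for illustration only.
scores = {"model-a": 4.0, "model-b": 6.0, "model-c": 7.0}
labs = {"model-a": "Lab X", "model-b": "Lab X", "model-c": "Lab Y"}
print(lab_averages(scores, labs))
```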

Top 3 closest countries per model by Euclidean distance in 6-dimensional GPS score space.
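The nearest-country lookup can be sketched as follows. Each profile is a list of six scores in a fixed dimension order; the country labels and numbers here are invented placeholders, and the function name is hypothetical.

```python
import math

# Fixed order of the six GPS dimensions (labels chosen for this sketch).
DIMENSIONS = ["risk", "patience", "pos_recip", "neg_recip", "altruism", "trust"]


def closest_countries(model_profile, country_profiles, k=3):
    """Return the k countries nearest to a model by Euclidean distance.

    model_profile:    list of six scores, ordered as in DIMENSIONS
    country_profiles: {country_name: list of six scores}
    """
    distances = [
        (math.dist(model_profile, profile), country)
        for country, profile in country_profiles.items()
    ]
    distances.sort()  # smallest distance first; ties broken by name
    return [(country, d) for d, country in distances[:k]]


# Invented profiles, for illustration only.
countries = {
    "Country A": [5, 5, 5, 5, 5, 5],
    "Country B": [6, 5, 5, 5, 5, 5],
    "Country C": [5, 5, 5, 5, 5, 8],
    "Country D": [0, 0, 0, 0, 0, 0],
}
print(closest_countries([5, 5, 5, 5, 5, 5], countries))
```

Because all six dimensions share the same 0–10 scale, no per-dimension weighting is applied in this sketch.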

Scale 0–1. Each axis is independently rescaled: 0 (raw) → 0, reference → 0.5, 10 (raw) → 1.
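One way to realize this three-anchor mapping is piecewise-linear interpolation, sketched below. The source only gives the three anchor points, so the linear behavior between them is an assumption, as is the function name. The reference value is passed in per axis.

```python
def rescale_axis(raw, reference, lo=0.0, hi=10.0):
    """Map a raw 0-10 score to 0-1 so that the reference lands at 0.5.

    Anchors from the text: lo -> 0, reference -> 0.5, hi -> 1.
    Assumption: values between anchors are interpolated linearly.
    """
    if not lo < reference < hi:
        raise ValueError("reference must lie strictly between lo and hi")
    if raw <= reference:
        return 0.5 * (raw - lo) / (reference - lo)
    return 0.5 + 0.5 * (raw - reference) / (hi - reference)


# With a hypothetical reference of 4.0 on one axis:
for raw in (0.0, 2.0, 4.0, 7.0, 10.0):
    print(raw, rescale_axis(raw, reference=4.0))
```

Rescaling each axis separately puts the reference at the same position on every axis, which means equal distances on the chart do not correspond to equal raw-score differences across axes.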

PCA projection into 2 dimensions. Each point represents a behavioral profile in the 6-dimensional GPS score space. Key finding: PC1 separates AI models (right) from humans (left), which indicates that models systematically differ from all countries.
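A sketch of a 2-dimensional PCA projection using only the standard library. The source does not say how the PCA was computed or whether scores were standardized first. This version only mean-centers the data and finds the top two components by power iteration with deflation, a technique chosen here for self-containment. All function names are hypothetical.

```python
import math


def _matvec(matrix, v):
    return [sum(a * b for a, b in zip(row, v)) for row in matrix]


def _top_eigenvector(matrix, iterations=200):
    """Power iteration: dominant eigenvector and eigenvalue of a symmetric matrix."""
    v = [1.0 / (i + 1) for i in range(len(matrix))]  # deterministic start
    for _ in range(iterations):
        w = _matvec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # matrix has no remaining variance
            break
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, _matvec(matrix, v)))
    return v, eigenvalue


def project_2d(rows):
    """Project rows (one profile per row) onto their first two principal components."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    centered = [[r[j] - means[j] for j in range(dims)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]
    pc1, lam1 = _top_eigenvector(cov)
    # Deflation: remove the PC1 direction, then repeat to obtain PC2.
    deflated = [
        [cov[i][j] - lam1 * pc1[i] * pc1[j] for j in range(dims)]
        for i in range(dims)
    ]
    pc2, _ = _top_eigenvector(deflated)
    return [
        (sum(a * b for a, b in zip(r, pc1)), sum(a * b for a, b in zip(r, pc2)))
        for r in centered
    ]
```

Calling `project_2d` on a list that stacks country profiles and model profiles (each a list of six scores) returns one `(PC1, PC2)` pair per row. The sign of each component is arbitrary, so which group appears on the left or right depends on the implementation.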