Behavior Bench (Beta)

How do AI models behave compared to humans?

Behavior Bench is an AI benchmark modeled after classic experiments from behavioral economics. Rather than testing reasoning or factual recall, it measures the preferences AI models reveal when making decisions under uncertainty, trade-offs across time, and choices involving others.

The benchmark recreates the methodology of Falk et al. (2018), whose Global Preferences Survey (GPS) measured six fundamental behavioral dimensions across 80,000 people in 76 countries. The six dimensions are:

Risk Tolerance — willingness to take gambles vs. prefer certainty
Patience — preference for delayed vs. immediate rewards
Positive Reciprocity — tendency to return favors
Negative Reciprocity — tendency to retaliate against unfair treatment
Altruism — willingness to give to others at personal cost
Trust — baseline willingness to trust strangers

Each model was run through the same incentivized choice tasks used with human participants. Scores are normalized to 0–10 (matching the original GPS scale).

Scores on a 0–10 scale matching the GPS (Global Preferences Survey) instrument.

Models are ranked by score. Error bars show ±1 SD across simulated participants; the dashed line marks the human world average (GPS, Falk et al. 2018).
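The summary behind such a chart can be sketched as follows. This is a minimal illustration, not the benchmark's own code: the model names and score values are hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption, since the page does not say which is used.

```python
from statistics import mean, stdev

# Hypothetical per-participant scores (0-10) on one dimension.
# Names and values are illustrative only, not benchmark results.
scores = {
    "model_a": [6.0, 7.0, 8.0],
    "model_b": [4.0, 5.0, 6.0, 5.0],
}

def summarize(samples):
    """Return (mean, sample SD) over one model's simulated participants."""
    # Assumption: sample SD (n - 1 denominator); the source does not specify.
    return mean(samples), stdev(samples)

summary = {name: summarize(vals) for name, vals in scores.items()}

# Rank models from highest to lowest mean score.
ranking = sorted(summary, key=lambda name: summary[name][0], reverse=True)

for name in ranking:
    m, sd = summary[name]
    print(f"{name}: {m:.2f} ± {sd:.2f}")
```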

Lab scores are averages across all models from that lab.
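A lab-level average of this kind might be computed as in the sketch below. The row layout, lab names, and scores are hypothetical, and an unweighted mean over models is assumed.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (lab, model, score) rows for a single dimension.
rows = [
    ("lab_x", "model_a", 7.0),
    ("lab_x", "model_b", 5.0),
    ("lab_y", "model_c", 6.5),
]

def lab_averages(rows):
    """Average model scores within each lab (unweighted; an assumption)."""
    by_lab = defaultdict(list)
    for lab, _model, score in rows:
        by_lab[lab].append(score)
    return {lab: mean(vals) for lab, vals in by_lab.items()}

print(lab_averages(rows))
```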

For each model, the three closest countries by Euclidean distance in the 6-dimensional GPS score space.
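The nearest-country lookup can be sketched as below. The function name, country labels, and profile values are hypothetical; each profile is assumed to list the six dimension scores in a fixed order.

```python
import math

def closest_countries(model_profile, country_profiles, k=3):
    """Return the k country names nearest to model_profile (Euclidean distance)."""
    distances = [
        (math.dist(model_profile, profile), name)
        for name, profile in country_profiles.items()
    ]
    # Sort by distance; ties fall back to alphabetical order of the name.
    return [name for _dist, name in sorted(distances)[:k]]

# Hypothetical 6-dimensional profiles on the 0-10 scale.
model = [5, 5, 5, 5, 5, 5]
countries = {
    "country_a": [5, 5, 5, 5, 5, 6],
    "country_b": [4, 4, 4, 4, 4, 4],
    "country_c": [5, 5, 5, 5, 7, 5],
    "country_d": [0, 0, 0, 0, 0, 0],
}
print(closest_countries(model, countries))
```

Because all six axes share the same 0–10 scale, an unweighted distance treats each dimension equally; a dimension with larger spread across countries will still contribute more to the ranking.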

Reference baseline: mapped to 0.5 on every axis.

Scale 0–1. Each axis is independently rescaled: 0 (raw) → 0, reference → 0.5, 10 (raw) → 1.
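The three anchor points above fix the mapping only at 0, the reference, and 10. One way to fill in the rest is piecewise-linear interpolation, sketched below; the interpolation choice and the function name `rescale` are assumptions, not something the page states.

```python
def rescale(raw, reference):
    """Map a raw 0-10 score to 0-1 so that the reference value lands on 0.5.

    Assumption: linear interpolation on each side of the reference.
    """
    if not 0 < reference < 10:
        raise ValueError("reference must lie strictly between 0 and 10")
    if raw <= reference:
        # Lower segment: [0, reference] -> [0, 0.5]
        return 0.5 * raw / reference
    # Upper segment: [reference, 10] -> [0.5, 1]
    return 0.5 + 0.5 * (raw - reference) / (10 - reference)

# With a hypothetical reference of 4 on some axis:
print(rescale(0, 4), rescale(4, 4), rescale(7, 4), rescale(10, 4))
```

Under this scheme the two segments generally have different slopes, so equal raw gaps above and below the reference do not map to equal distances on the chart.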

PCA projection into 2 dimensions. Each point represents a behavioral profile in the 6-dimensional GPS score space. Key finding: PC1 separates AI models (right) from humans (left); models differ systematically from all countries.
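A 2-D PCA projection of such profiles can be sketched with the standard library alone, using power iteration with deflation on the covariance matrix. This is an illustrative stand-in, not the page's implementation: the function names are hypothetical, and the sketch only centers the data, since the source does not say whether scores are standardized first.

```python
import math

def center(rows):
    """Subtract the per-dimension mean from every profile."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_eigenvector(matrix, iterations=500):
    """Power iteration: dominant eigenvalue and unit eigenvector."""
    d = len(matrix)
    v = [1.0 / (j + 1) for j in range(d)]  # fixed start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return eigenvalue, v

def pca_2d(rows):
    """Project each profile onto the first two principal components."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(2):
        eigenvalue, v = top_eigenvector(cov)
        components.append(v)
        # Deflate: remove the component just found before the next pass.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return [[sum(r[j] * c[j] for j in range(d)) for c in components]
            for r in centered]

# Hypothetical profiles: most variance on axis 0, some on axis 1.
profiles = [
    [0, 0, 0, 0, 0, 0],
    [10, 0, 0, 0, 0, 0],
    [0, 2, 0, 0, 0, 0],
    [10, 2, 0, 0, 0, 0],
]
points = pca_2d(profiles)
```

The sign of each principal component is arbitrary, so which group appears on the left or right of PC1 depends on the implementation; flipping a component's sign mirrors the plot without changing the separation.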