Alignment Makes Language Models Normative, Not Descriptive

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether alignment training improves language models' accuracy in predicting human behavior, particularly in multi-round strategic interactions. Comparing 120 pairs of base and aligned models on more than 10,000 real human decisions across diverse strategic settings, including bargaining, persuasion, negotiation, and repeated matrix games, and under varied prompting strategies, the work reveals for the first time that alignment biases models toward normative rather than descriptive reasoning. Aligned models outperform base models on single-round, non-strategic, or normative tasks, yet are substantially less accurate in multi-round strategic interactions, where base models win by a margin of nearly 10:1. This finding challenges the prevailing assumption that alignment inherently improves a model's capacity to model actual human behavior.

📝 Abstract
Post-training alignment optimizes language models to match human preference signals, but this objective is not equivalent to modeling observed human behavior. We compare 120 base-aligned model pairs on more than 10,000 real human decisions in multi-round strategic games: bargaining, persuasion, negotiation, and repeated matrix games. In these settings, base models outperform their aligned counterparts in predicting human choices by nearly 10:1, robustly across model families, prompt formulations, and game configurations. This pattern reverses, however, in settings where human behavior is more likely to follow normative predictions: aligned models dominate on one-shot textbook games across all 12 types tested, on non-strategic lottery choices, and even within the multi-round games themselves at round one, before interaction history develops. This boundary-condition pattern suggests that alignment induces a normative bias: it improves prediction when human behavior is relatively well captured by normative solutions, but hurts prediction in multi-round strategic settings, where behavior is shaped by descriptive dynamics such as reciprocity, retaliation, and history-dependent adaptation. These results reveal a fundamental trade-off between optimizing models for human use and using them as proxies for human behavior.
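The abstract describes the evaluation only at a high level: prompt each model with a game and its interaction history, parse a predicted human action, and score it against the action the human actually took, separately by round so that round-one accuracy can be compared with later rounds. Below is a minimal sketch of one plausible such protocol, not the authors' code; every name here (GameRecord, predict_action, accuracy_by_round) is hypothetical, and the real harness, prompts, and parsing are unspecified in the source.

```python
# Hypothetical round-by-round evaluation sketch; data format and model
# interface are assumptions, not taken from the paper.
from dataclasses import dataclass, field


@dataclass
class GameRecord:
    """One multi-round game played by a human subject."""
    game_type: str                # e.g. "bargaining", "repeated_matrix"
    human_actions: list[str]      # the human's choice at each round
    histories: list[list[str]] = field(default_factory=list)
    # histories[r] = whatever interaction context was shown before round r


def predict_action(model, game_type: str, history: list[str]) -> str:
    """Hypothetical stand-in: prompt `model` with the game description and
    the interaction history, then parse its predicted next human action."""
    raise NotImplementedError


def accuracy_by_round(model, records: list[GameRecord]) -> dict[int, float]:
    """Fraction of correctly predicted human choices at each round index,
    so round-1 (pre-history) accuracy can be contrasted with later rounds."""
    hits: dict[int, int] = {}
    totals: dict[int, int] = {}
    for rec in records:
        for r, action in enumerate(rec.human_actions):
            history = rec.histories[r] if r < len(rec.histories) else []
            pred = predict_action(model, rec.game_type, history)
            totals[r] = totals.get(r, 0) + 1
            hits[r] = hits.get(r, 0) + (pred == action)
    return {r: hits[r] / totals[r] for r in totals}


# For each of the 120 base/aligned pairs, the comparison then reduces to:
#   base_acc = accuracy_by_round(base_model, records)
#   aligned_acc = accuracy_by_round(aligned_model, records)
# and checking which model predicts better at round 1 versus later rounds.
```

Splitting accuracy by round index is the part that matters for the paper's boundary-condition claim: the same multi-round data can show aligned models ahead at round one and base models ahead once history-dependent dynamics develop.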
Problem

Research questions and friction points this paper is trying to address.

alignment
language models
human behavior prediction
normative bias
strategic games
Innovation

Methods, ideas, or system contributions that make the work stand out.

alignment
normative bias
human behavior prediction
strategic games
language models