🤖 AI Summary
This study systematically evaluates 45 large language models (LLMs) across eight classic cognitive biases, including anchoring, availability, and confirmation bias. We introduce the first scalable, reproducible, mechanism-oriented evaluation framework, built upon a psychologist-curated dataset of 220 decision-making scenarios (yielding 2.8 million model responses). The framework integrates multiple-choice judgment tasks, human-designed prompt templates with automated augmentation, and controlled-variable prompting techniques. Results show that LLMs exhibit statistically significant bias-consistent behavior in 17.8%-57.3% of cases. Scaling model parameters beyond 32B reduces bias in 39.5% of scenarios, whereas increasing prompt specificity yields at most a 14.9% reduction. This work establishes the first rigorous, large-scale assessment of cognitive biases in LLMs, offering both theoretical foundations and methodological tools for developing trustworthy AI decision-making systems.
📝 Abstract
As Large Language Models (LLMs) are increasingly embedded in real-world decision-making processes, it becomes crucial to examine the extent to which they exhibit cognitive biases. Cognitive biases, extensively studied in psychology, are systematic distortions commonly observed in human judgment. This paper presents a large-scale evaluation of eight well-established cognitive biases across 45 LLMs, analyzing over 2.8 million LLM responses generated through controlled prompt variations. To achieve this, we introduce a novel evaluation framework based on multiple-choice tasks, hand-curate a dataset of 220 decision scenarios targeting fundamental cognitive biases in collaboration with psychologists, and propose a scalable approach for generating diverse prompts from human-authored scenario templates. Our analysis shows that LLMs exhibit bias-consistent behavior in 17.8%-57.3% of instances across a range of judgment and decision-making contexts targeting anchoring, availability, confirmation, framing, interpretation, overattribution, prospect-theory, and representativeness biases. We find that both model size and prompt specificity play a significant role in bias susceptibility: larger models (>32B parameters) show reduced bias in 39.5% of cases, while higher prompt detail reduces most biases by up to 14.9%, with one exception (overattribution), which is exacerbated by up to 8.8%.
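The controlled-variable prompting described above can be sketched in miniature: the same scenario template is rendered in a biased and a neutral condition, the model answers the identical multiple-choice question in both, and the shift in answer distributions is measured. The scenario wording, option labels, function names, and mock answers below are illustrative assumptions, not the paper's actual dataset or code.

```python
# Hypothetical sketch of controlled-variable prompting for anchoring bias.
# One scenario template is rendered twice: with a bias-inducing anchor cue
# and without. Comparing how often the anchor-consistent option is chosen
# in each condition estimates the bias effect for that scenario.

def render(condition: str) -> str:
    """Render the anchored or neutral variant of one scenario."""
    cue = "A colleague first suggested $900. " if condition == "anchored" else ""
    return (
        cue
        + "You are pricing a used laptop. Which listing price is most reasonable?\n"
        "(A) $300  (B) $500  (C) $700  (D) $900"
    )

def bias_shift(neutral_answers, anchored_answers, anchor_option="D"):
    """Increase in the rate of the anchor-consistent choice when the cue is shown."""
    rate = lambda xs: sum(a == anchor_option for a in xs) / len(xs)
    return rate(anchored_answers) - rate(neutral_answers)

# Mock model outputs; a real run would query an LLM once per rendered prompt.
neutral = ["B", "B", "C", "B", "C"]
anchored = ["D", "C", "D", "B", "D"]
print(render("anchored"))
print(f"anchor-consistent shift: {bias_shift(neutral, anchored):.1%}")
```

Repeating this comparison across many scenario renderings and models, and testing whether the shift differs significantly from zero, corresponds to the bias-consistency rates reported above.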