🤖 AI Summary
This study investigates whether large language models (LLMs) possess human-like, or even normative (Bayesian-optimal), causal reasoning capabilities. Using probabilistic inference tasks based on collider graphs, we systematically evaluate GPT-4o, Claude, Gemini-Pro, and GPT-3.5 on canonical causal judgments such as "explaining away," benchmarking their performance against human participants. Our results reveal, for the first time, a continuum of causal reasoning behavior across LLMs, ranging from human-like to normative: GPT-4o and Claude most closely approximate Bayesian-optimal inference, with Claude exhibiting the smallest systematic deviation. All models significantly outperform humans in effect prediction accuracy, yet all show consistent biases when reasoning about the independence of causes. This work establishes an empirical benchmark and provides theoretical insights into the mechanisms and boundaries of causal understanding in LLMs.
📝 Abstract
Causal reasoning is a core component of intelligence. Large language models (LLMs) have shown impressive capabilities in generating human-like text, raising questions about whether their responses reflect true understanding or merely statistical patterns. We compared causal reasoning in humans and four LLMs using tasks based on collider graphs, in which agents rated the likelihood of a query variable given evidence about the other variables. We find that LLMs reason causally along a spectrum from human-like to normative inference, with alignment varying by model, context, and task. Overall, GPT-4o and Claude showed the most normative behavior, including "explaining away," whereas Gemini-Pro and GPT-3.5 did not. Although all agents deviated from the expected independence of causes (Claude the least), they exhibited strong associative reasoning and predictive inference when assessing the likelihood of the effect given its causes. These findings underscore the need to assess AI biases as these systems increasingly assist human decision-making.
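To make the "explaining away" benchmark concrete, the sketch below computes the normative (Bayesian) answer on a collider graph C1 → E ← C2 by brute-force enumeration. The priors and noisy-OR parameterization here are illustrative assumptions, not the parameters used in the study; the point is only to show the qualitative pattern the models are judged against: observing the effect raises belief in a cause, and additionally observing the other cause lowers it again.

```python
# Illustrative sketch (hypothetical parameters, not the paper's task values):
# exact inference on a binary collider graph C1 -> E <- C2, demonstrating the
# normative "explaining away" pattern.
from itertools import product

P_C1 = 0.3        # assumed prior probability that cause 1 is present
P_C2 = 0.3        # assumed prior probability that cause 2 is present
LEAK = 0.05       # assumed probability of the effect with neither cause present
STRENGTH = 0.8    # assumed probability that each present cause produces the effect

def p_effect(c1, c2):
    """Noisy-OR likelihood P(E=1 | C1=c1, C2=c2)."""
    p_no_effect = 1 - LEAK
    if c1:
        p_no_effect *= 1 - STRENGTH
    if c2:
        p_no_effect *= 1 - STRENGTH
    return 1 - p_no_effect

def posterior_c1(evidence):
    """P(C1=1 | evidence) by enumerating the full joint distribution."""
    num = den = 0.0
    for c1, c2, e in product([0, 1], repeat=3):
        p = (P_C1 if c1 else 1 - P_C1) \
            * (P_C2 if c2 else 1 - P_C2) \
            * (p_effect(c1, c2) if e else 1 - p_effect(c1, c2))
        assignment = {"C1": c1, "C2": c2, "E": e}
        if all(assignment[k] == v for k, v in evidence.items()):
            den += p
            if c1 == 1:
                num += p
    return num / den

print(f"P(C1=1)             = {posterior_c1({}):.3f}")                 # prior
print(f"P(C1=1 | E=1)       = {posterior_c1({'E': 1}):.3f}")           # rises above the prior
print(f"P(C1=1 | E=1, C2=1) = {posterior_c1({'E': 1, 'C2': 1}):.3f}")  # falls again: explaining away
```

Under these assumed numbers the posterior for C1 moves from 0.30 to roughly 0.57 when the effect is observed, and back down to roughly 0.34 once the alternative cause is also known to be present; models that track this non-monotonic update are the ones described above as behaving normatively.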