🤖 AI Summary
This study investigates whether large language models (LLMs) possess human-like, or even normative (Bayesian-optimal), causal reasoning capabilities. Using probabilistic inference tasks based on collider graphs, we systematically evaluate GPT-4o, Claude, Gemini-Pro, and GPT-3.5 on canonical causal judgments such as "explaining away," benchmarking their performance against human participants. Our results reveal, for the first time, a continuum of causal reasoning behavior across LLMs, ranging from human-like to normative: GPT-4o and Claude most closely approximate Bayesian-optimal inference, with Claude exhibiting the smallest systematic deviation. All models significantly outperform humans in effect prediction accuracy, yet all show consistent biases when reasoning about the independence of causes. This work establishes an empirical benchmark and provides theoretical insights into the mechanisms and boundaries of causal understanding in LLMs.
📝 Abstract
Causal reasoning is a core component of intelligence. Large language models (LLMs) have shown impressive capabilities in generating human-like text, raising questions about whether their responses reflect true understanding or merely statistical patterns. We compared causal reasoning in humans and four LLMs using tasks based on collider graphs, in which agents rated the likelihood of a query variable given evidence about the other variables. We find that LLMs reason causally along a spectrum from human-like to normative inference, with alignment varying by model, context, and task. Overall, GPT-4o and Claude showed the most normative behavior, including "explaining away," whereas Gemini-Pro and GPT-3.5 did not. Although all agents deviated from the expected independence of causes (Claude the least), they exhibited strong associative reasoning and predictive inference when assessing the likelihood of the effect given its causes. These findings underscore the need to assess AI biases as these systems increasingly assist human decision-making.
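To make the "explaining away" benchmark concrete, the sketch below computes the normative (Bayesian) answer on a collider graph C1 → E ← C2 by brute-force enumeration. The priors and noisy-OR parameterization here are illustrative assumptions, not the parameters used in the study; the point is only to show the qualitative pattern the models are judged against: observing the effect raises belief in a cause, and additionally observing the other cause lowers it again.

```python
# Illustrative sketch (hypothetical parameters, not the paper's task values):
# exact inference on a binary collider graph C1 -> E <- C2, demonstrating the
# normative "explaining away" pattern.
from itertools import product

P_C1 = 0.3        # assumed prior probability that cause 1 is present
P_C2 = 0.3        # assumed prior probability that cause 2 is present
LEAK = 0.05       # assumed probability of the effect with neither cause present
STRENGTH = 0.8    # assumed probability that each present cause produces the effect

def p_effect(c1, c2):
    """Noisy-OR likelihood P(E=1 | C1=c1, C2=c2)."""
    p_no_effect = 1 - LEAK
    if c1:
        p_no_effect *= 1 - STRENGTH
    if c2:
        p_no_effect *= 1 - STRENGTH
    return 1 - p_no_effect

def posterior_c1(evidence):
    """P(C1=1 | evidence) by enumerating the full joint distribution."""
    num = den = 0.0
    for c1, c2, e in product([0, 1], repeat=3):
        p = (P_C1 if c1 else 1 - P_C1) \
            * (P_C2 if c2 else 1 - P_C2) \
            * (p_effect(c1, c2) if e else 1 - p_effect(c1, c2))
        assignment = {"C1": c1, "C2": c2, "E": e}
        if all(assignment[k] == v for k, v in evidence.items()):
            den += p
            if c1 == 1:
                num += p
    return num / den

print(f"P(C1=1)             = {posterior_c1({}):.3f}")                 # prior
print(f"P(C1=1 | E=1)       = {posterior_c1({'E': 1}):.3f}")           # rises above the prior
print(f"P(C1=1 | E=1, C2=1) = {posterior_c1({'E': 1, 'C2': 1}):.3f}")  # falls again: explaining away
```

Under these assumed numbers the posterior for C1 moves from 0.30 to roughly 0.57 when the effect is observed, and back down to roughly 0.34 once the alternative cause is also known to be present; models that track this non-monotonic update are the ones described above as behaving normatively.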