🤖 AI Summary
Existing causal reasoning benchmarks rely heavily on synthetic data and exhibit narrow domain coverage, limiting their ability to assess large language models' (LLMs') understanding of real-world causal relationships. Method: We introduce the first multi-domain causal benchmark grounded in empirical studies from top-tier economics and finance journals, spanning health, environment, technology, law, and culture. It comprises 40,379 authentic causal instances extracted via rigorous econometric methods—including instrumental variables, difference-in-differences, and regression discontinuity—and supports five distinct causal reasoning tasks. Contribution/Results: This benchmark overcomes the limitations of synthetic data and enables the first systematic evaluation of LLMs' real-world causal identification capability. Experiments across eight state-of-the-art models reveal a maximum accuracy of only 57.6%; model scale shows no significant correlation with performance, indicating that even advanced LLMs lack robust causal identification capacity.
📝 Abstract
Causal reasoning is fundamental for Large Language Models (LLMs) to understand genuine cause-and-effect relationships beyond pattern matching. Existing benchmarks suffer from critical limitations such as reliance on synthetic data and narrow domain coverage. We introduce a novel benchmark constructed from causally identified relationships extracted from top-tier economics and finance journals, drawing on rigorous methodologies including instrumental variables, difference-in-differences, and regression discontinuity designs. Our benchmark comprises 40,379 evaluation items covering five task types across domains such as health, environment, technology, law, and culture. Experimental results on eight state-of-the-art LLMs reveal substantial limitations, with the best model achieving only 57.6% accuracy. Moreover, model scale does not consistently translate to superior performance, and even advanced reasoning models struggle with fundamental causal relationship identification. These findings underscore a critical gap between current LLM capabilities and the demands of reliable causal reasoning in high-stakes applications.