🤖 AI Summary
Existing causal reasoning benchmarks rely heavily on synthetic data and exhibit narrow domain coverage, limiting their ability to assess large language models' (LLMs') understanding of real-world causal relationships. Method: We introduce the first multi-domain causal benchmark grounded in empirical studies from top-tier economics and finance journals, spanning health, environment, technology, law, and culture. It comprises 40,379 authentic causal instances extracted via rigorous econometric methods—including instrumental variables, difference-in-differences, and regression discontinuity—and supports five distinct causal reasoning tasks. Contribution/Results: This benchmark overcomes the limitations of synthetic data and enables the first systematic evaluation of LLMs' real-world causal identification capability. Experiments across eight state-of-the-art models reveal a maximum accuracy of only 57.6%; model scale shows no significant correlation with performance, indicating that even advanced LLMs lack robust causal identification capacity.
📝 Abstract
Causal reasoning is fundamental for Large Language Models (LLMs) to understand genuine cause-and-effect relationships beyond pattern matching. Existing benchmarks suffer from critical limitations such as reliance on synthetic data and narrow domain coverage. We introduce a novel benchmark constructed from causally identified relationships extracted from top-tier economics and finance journals, drawing on rigorous methodologies including instrumental variables, difference-in-differences, and regression discontinuity designs. Our benchmark comprises 40,379 evaluation items covering five task types across domains such as health, environment, technology, law, and culture. Experimental results on eight state-of-the-art LLMs reveal substantial limitations, with the best model achieving only 57.6% accuracy. Moreover, model scale does not consistently translate to superior performance, and even advanced reasoning models struggle with fundamental causal relationship identification. These findings underscore a critical gap between current LLM capabilities and the demands of reliable causal reasoning in high-stakes applications.