🤖 AI Summary
It remains unclear whether the high scores of large language models (LLMs) on reasoning benchmarks reflect genuine causal reasoning or merely statistical memorization of training data.
Method: We propose RE-IMAGINE, the first framework to systematically integrate Pearl’s three-level causal hierarchy (association, intervention, and counterfactual reasoning) into LLM evaluation. It automatically generates memorization-resistant question variants via a symbolic intermediate representation, covering mathematics, programming, and logic. The approach combines symbolic program synthesis, intermediate-representation transformation, and cross-domain adaptation to overcome the limitations of static benchmarks.
Contribution/Results: Applied to four major reasoning benchmarks across diverse LLM families, RE-IMAGINE reveals substantial performance degradation at the intervention and counterfactual levels, demonstrating that current high scores rely heavily on statistical memorization rather than genuine causal reasoning. This work establishes a novel, hierarchically grounded standard for rigorously assessing compositional and causal reasoning in LLMs.
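To make the variant-generation idea concrete, here is a minimal, hypothetical sketch (not the paper's actual pipeline): a word problem is lifted into a symbolic form, and mutations of that form regenerate both the surface text and the gold answer, so a memorized answer to the original no longer applies. All names and the problem template below are illustrative assumptions.

```python
# Hypothetical sketch of RE-IMAGINE-style variant generation.
# A problem is represented symbolically; mutating the symbolic form
# regenerates the question text and its gold answer together.
from dataclasses import dataclass


@dataclass
class SymbolicProblem:
    # Two variables and a formula stand in for the paper's
    # intermediate symbolic representation (illustrative only).
    apples: int
    eaten: int

    def answer(self) -> int:
        return self.apples - self.eaten

    def render(self) -> str:
        return (f"Tom has {self.apples} apples and eats {self.eaten}. "
                f"How many apples are left?")


# Level 1 (association): the original problem, possibly memorized.
original = SymbolicProblem(apples=10, eaten=3)

# Level 2 (intervention): change a value in the symbolic form;
# text and gold answer are regenerated automatically.
intervened = SymbolicProblem(apples=23, eaten=3)

# Level 3 (counterfactual): override part of the original story,
# e.g. "suppose Tom had instead eaten 7 apples".
counterfactual = SymbolicProblem(apples=10, eaten=7)

for p in (original, intervened, counterfactual):
    print(p.render(), "->", p.answer())
```

Because the mutation operates on the symbolic representation rather than the text, arbitrarily many variants can be sampled per seed problem, which is what makes the evaluation robust to memorization.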
📝 Abstract
Recent Large Language Models (LLMs) have reported high accuracy on reasoning benchmarks. However, it is still unclear whether the observed results arise from true reasoning or from statistical recall of the training set. Inspired by the ladder of causation (Pearl, 2009) and its three levels (associations, interventions and counterfactuals), this paper introduces RE-IMAGINE, a framework to characterize a hierarchy of reasoning ability in LLMs, alongside an automated pipeline to generate problem variations at different levels of the hierarchy. By altering problems in an intermediate symbolic representation, RE-IMAGINE generates arbitrarily many problems that are not solvable using memorization alone. Moreover, the framework is general and can work across reasoning domains, including math, code, and logic. We demonstrate our framework on four widely-used benchmarks to evaluate several families of LLMs, and observe reductions in performance when the models are queried with problem variations. These assessments indicate a degree of reliance on statistical recall for past performance, and open the door to further research targeting skills across the reasoning hierarchy.