🤖 AI Summary
Large language models often rely on semantic associations rather than causal reasoning in complex decision-making, leading to errors in high-stakes scenarios. To address this limitation, this work introduces CausalFlip, a benchmark that decouples semantic memory from genuine causal understanding through causal judgment pairs built over event triples and a noisy-prefix evaluation mechanism. The benchmark combines causal graph structures (confounder, chain, and collider patterns) with contrastive question construction, and the accompanying internalized causal reasoning training substantially outperforms conventional answer supervision and explicit chain-of-thought prompting. Experimental results demonstrate superior robustness against semantic interference and improved accuracy of causal judgments.
📝 Abstract
As large language models (LLMs) are increasingly deployed in complex, high-stakes decision-making scenarios, it becomes imperative to ground their reasoning in causality rather than spurious correlations. However, strong performance on traditional reasoning benchmarks does not guarantee true causal reasoning ability in LLMs, as high accuracy may still arise from memorizing semantic patterns instead of analyzing the underlying causal structures. To bridge this critical gap, we propose a new causal reasoning benchmark, CausalFlip, designed to encourage the development of new LLM paradigms or training algorithms that ground LLM reasoning in causality rather than semantic correlation. CausalFlip consists of causal judgment questions built over event triples that can form confounder, chain, and collider relations. For each event triple, we construct pairs of semantically similar questions that reuse the same events but yield opposite causal answers, so that models relying heavily on semantic matching are systematically driven toward incorrect predictions. To further probe models' reliance on semantic patterns, we introduce a noisy-prefix evaluation that prepends causally irrelevant text before intermediate causal reasoning steps without altering the underlying causal relations or the logic of the reasoning process. We evaluate LLMs under multiple training paradigms, including answer-only training, explicit Chain-of-Thought (CoT) supervision, and a proposed internalized causal reasoning approach that aims to mitigate explicit reliance on correlation during reasoning. Our results show that explicit CoT can still be misled by spurious semantic correlations, whereas internalizing reasoning steps yields substantially improved causal grounding, suggesting a promising direction for eliciting the latent causal reasoning capabilities of base LLMs.
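To make the construction concrete, the sketch below illustrates the two benchmark mechanisms the abstract describes: contrastive question pairs that reuse one event triple but flip the gold answer depending on the causal graph (chain vs. collider), and a noisy-prefix transform that prepends irrelevant text to reasoning steps. All function names, question templates, and data structures here are hypothetical illustrations, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class CausalItem:
    """One causal judgment question with its gold yes/no answer."""
    question: str
    answer: bool

def build_contrastive_pair(a: str, b: str, c: str) -> tuple[CausalItem, CausalItem]:
    """Build two semantically similar questions over the same event triple.

    In a chain a -> b -> c, event a causally influences c (answer: yes).
    In a collider a -> b <- c, a and c are causes of b but causally
    unrelated to each other (answer: no). Both questions reuse the same
    surface events, so semantic matching alone cannot separate them.
    """
    template = (f"Events: '{a}', '{b}', '{c}'. "
                f"Does '{a}' causally influence '{c}'?")
    chain = CausalItem(question=f"[graph: {a} -> {b} -> {c}] {template}", answer=True)
    collider = CausalItem(question=f"[graph: {a} -> {b} <- {c}] {template}", answer=False)
    return chain, collider

def add_noisy_prefix(reasoning_steps: list[str], noise: str) -> list[str]:
    """Prepend causally irrelevant text before each intermediate reasoning
    step, leaving the steps themselves (and the causal logic) unchanged."""
    return [f"{noise} {step}" for step in reasoning_steps]
```

A model grounded in the causal graph answers the two questions oppositely, while a purely semantic matcher, seeing nearly identical text, tends to answer them the same way; the noisy-prefix transform then tests whether irrelevant context degrades the reasoning.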