🤖 AI Summary
Large language models (LLMs) are susceptible to internal biases on knowledge-intensive tasks, which can lead to erroneous reasoning. To address this, we propose CFD-Prompting, a conditional front-door adjustment prompting framework that integrates retrieval-augmented generation (RAG), chain-of-thought (CoT) reasoning, and causal inference. Its core innovation is the use of counterfactual external knowledge to estimate causal effects without explicit intervention variables, enabling indirect causal intervention on a fixed query under weaker assumptions than the standard front-door adjustment. By simulating responses under varying knowledge conditions, CFD-Prompting mitigates the model's inherent biases. Extensive experiments across multiple state-of-the-art LLMs and benchmark datasets show that CFD-Prompting consistently outperforms existing RAG, CoT, and causally enhanced baselines in both accuracy and robustness.
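For context, the classical front-door adjustment (Pearl) identifies the causal effect of a query $Q$ on an answer $A$ through a mediator $R$ (here, plausibly the chain-of-thought reasoning). The conditional form shown second is one plausible reading of the paper's variant, conditioning every term on the external knowledge $K$; the symbols $Q$, $R$, $A$, $K$ are illustrative and need not match the paper's notation:

$$
P(a \mid \mathrm{do}(q)) \;=\; \sum_{r} P(r \mid q) \sum_{q'} P(a \mid r, q')\, P(q')
$$

$$
P(a \mid \mathrm{do}(q), k) \;=\; \sum_{r} P(r \mid q, k) \sum_{q'} P(a \mid r, q', k)\, P(q' \mid k)
$$

Intuitively, summing over alternative queries $q'$ is what makes the estimate unbiased with respect to the model's internal preference for particular query phrasings, and conditioning on $k$ lets retrieved (or counterfactual) knowledge play the role of the varying context.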
📝 Abstract
Large Language Models (LLMs) have shown impressive capabilities in natural language processing but still struggle on knowledge-intensive tasks that require deep reasoning and the integration of external knowledge. Although methods such as Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) have been proposed to enhance LLMs with external knowledge, they remain vulnerable to the internal bias of LLMs, which often leads to incorrect answers. In this paper, we propose a novel causal prompting framework, Conditional Front-Door Prompting (CFD-Prompting), which enables the unbiased estimation of the causal effect between the query and the answer, conditional on external knowledge, while mitigating internal bias. By constructing counterfactual external knowledge, our framework simulates how the query behaves under varying contexts, addressing the challenge that the query is fixed and not amenable to direct causal intervention. Compared to the standard front-door adjustment, the conditional variant operates under weaker assumptions, enhancing both the robustness and generalisability of the reasoning process. Extensive experiments across multiple LLMs and benchmark datasets demonstrate that CFD-Prompting significantly outperforms existing baselines in both accuracy and robustness.
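The pipeline described above can be sketched schematically: construct counterfactual variants of the retrieved knowledge, query the model under each knowledge condition, and aggregate the resulting answers. This is a minimal illustration, not the paper's actual estimator or prompts; `answer_llm`, `cfd_prompt`, and the majority-vote aggregation are hypothetical stand-ins.

```python
from collections import Counter

def answer_llm(query, context):
    # Stub standing in for an LLM call; a real system would prompt the
    # model with the query plus the (possibly counterfactual) knowledge.
    return "Paris" if "capital of France" in context else "unknown"

def cfd_prompt(query, retrieved_knowledge, make_counterfactuals):
    # 1. Construct counterfactual variants of the retrieved knowledge,
    #    so the fixed query is observed under varying contexts.
    contexts = [retrieved_knowledge] + make_counterfactuals(retrieved_knowledge)
    # 2. Query the model once per knowledge condition.
    answers = [answer_llm(query, c) for c in contexts]
    # 3. Aggregate answers across conditions (majority vote here), a crude
    #    proxy for marginalising over knowledge states in the adjustment.
    return Counter(answers).most_common(1)[0][0]

if __name__ == "__main__":
    knowledge = "Paris is the capital of France."
    # Hypothetical counterfactual generator: one faithful copy plus an
    # "empty retrieval" condition.
    cfs = lambda k: [k, "No relevant facts retrieved."]
    print(cfd_prompt("What is the capital of France?", knowledge, cfs))
```

The design choice worth noting is that aggregation happens over knowledge conditions rather than over sampled reasoning chains alone, which is what distinguishes this style of causal prompting from plain self-consistency decoding.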