AC-Reason: Towards Theory-Guided Actual Causality Reasoning with Large Language Models

📅 2025-05-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing large language model (LLM) approaches to actual causality (AC) reasoning lack grounding in formal AC theory, resulting in poor interpretability and low faithfulness. Method: We propose AC-Reason—a semi-formal framework that deeply integrates formal AC theory into LLM reasoning—by identifying causal events, quantifying sufficiency, necessity, and normality, and designing axiom-driven query solving and controllable variable manipulation mechanisms, without requiring explicit causal graphs. We further construct AC-Bench, a benchmark of ~1K fine-grained, expert-annotated instances for faithful, fine-grained evaluation. Results: GPT-4 augmented with AC-Reason achieves 75.04% accuracy on BBH-CJ (surpassing the human average of 69.60%) and 71.82% on AC-Bench. Ablation studies confirm that formal theory integration contributes most to the performance gains. Moreover, we are the first to uncover shortcut reliance in GPT-4, while newer models (e.g., GPT-4o) exhibit significantly higher reasoning faithfulness.

📝 Abstract
Actual causality (AC), a fundamental aspect of causal reasoning (CR), underpins attribution and responsibility assignment in real-world scenarios. However, existing LLM-based methods lack grounding in formal AC theory, resulting in limited interpretability. Therefore, we propose AC-Reason, a semi-formal reasoning framework that identifies causally relevant events within an AC scenario, infers the values of their formal causal factors (e.g., sufficiency, necessity, and normality), and answers AC queries via a theory-guided algorithm with explanations. While AC-Reason does not explicitly construct a causal graph, it operates over variables in the underlying causal structure to support principled reasoning. To enable comprehensive evaluation, we introduce AC-Bench, a new benchmark built upon and substantially extending Big-Bench Hard Causal Judgment (BBH-CJ). AC-Bench comprises ~1K carefully annotated samples, each with detailed reasoning steps, and focuses solely on actual causation. Our case study shows that the synthesized samples in AC-Bench pose greater challenges for LLMs. Extensive experiments on BBH-CJ and AC-Bench show that AC-Reason consistently improves LLM performance over baselines. On BBH-CJ, all tested LLMs surpass the average human rater accuracy of 69.60%, with GPT-4 + AC-Reason achieving 75.04%. On AC-Bench, GPT-4 + AC-Reason again achieves the highest accuracy of 71.82%. AC-Bench further enables fine-grained analysis of reasoning faithfulness, revealing that only Qwen-2.5-72B-Instruct, Claude-3.5-Sonnet, and GPT-4o exhibit faithful reasoning, whereas GPT-4 tends to exploit shortcuts. Finally, our ablation study shows that integrating AC theory into LLMs is highly effective, with the proposed algorithm contributing the most significant performance gains.
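The pipeline described in the abstract (identify causally relevant events, infer each event's sufficiency, necessity, and normality, then answer the AC query with a theory-guided rule) can be sketched roughly as follows. Note this is a minimal toy illustration, not the paper's actual algorithm: the `CausalEvent` fields and the decision rule below are hypothetical stand-ins for AC-Reason's axiom-driven query solving.

```python
from dataclasses import dataclass

@dataclass
class CausalEvent:
    """One candidate event with its inferred formal causal factors."""
    name: str
    sufficient: bool  # would the event on its own bring about the outcome?
    necessary: bool   # would the outcome fail to occur without the event?
    normal: bool      # is the event typical/expected in this scenario?

def is_actual_cause(event: CausalEvent) -> bool:
    # Hypothetical decision rule: an event is judged an actual cause only
    # if it was necessary for the outcome, and normal background enabling
    # conditions are filtered out unless they were also sufficient.
    if not event.necessary:
        return False
    return (not event.normal) or event.sufficient

# Classic match-vs-oxygen illustration: both are necessary for a fire,
# but only the abnormal event (striking the match) is judged a cause.
match = CausalEvent("striking the match", sufficient=False,
                    necessary=True, normal=False)
oxygen = CausalEvent("presence of oxygen", sufficient=False,
                     necessary=True, normal=True)
```

The normality filter encodes the norm-sensitivity of human causal judgment that benchmarks like BBH-CJ probe: among equally necessary conditions, people single out the atypical one as the cause.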
Problem

Research questions and friction points this paper is trying to address.

Lack of formal AC theory grounding in LLM-based causality methods
Need for interpretable actual causality reasoning framework
Absence of comprehensive benchmark for evaluating AC reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-formal reasoning framework for causality
Theory-guided algorithm with explanations
New benchmark AC-Bench for evaluation