🤖 AI Summary
Existing large language model (LLM) approaches to actual causality (AC) reasoning lack grounding in formal AC theory, resulting in poor interpretability and low faithfulness. Method: We propose AC-Reason, a semi-formal framework that deeply integrates formal AC theory into LLM reasoning by identifying causally relevant events, assessing their sufficiency, necessity, and normality, and applying axiom-driven query solving with controllable variable manipulation, all without requiring explicit causal graphs. We further construct AC-Bench, a benchmark of ~1K fine-grained, expert-annotated instances for faithful, fine-grained evaluation. Results: GPT-4 augmented with AC-Reason achieves 75.04% accuracy on BBH-CJ (surpassing the human average of 69.60%) and 71.82% on AC-Bench. Ablation studies confirm that formal theory integration contributes most to the performance gains. Moreover, we uncover, for the first time, shortcut reliance in GPT-4, while newer models (e.g., GPT-4o) exhibit significantly higher reasoning faithfulness.
📝 Abstract
Actual causality (AC), a fundamental aspect of causal reasoning (CR), underpins attribution and responsibility assignment in real-world scenarios. However, existing LLM-based methods lack grounding in formal AC theory, resulting in limited interpretability. We therefore propose AC-Reason, a semi-formal reasoning framework that identifies causally relevant events within an AC scenario, infers the values of their formal causal factors (e.g., sufficiency, necessity, and normality), and answers AC queries via a theory-guided algorithm with explanations. While AC-Reason does not explicitly construct a causal graph, it operates over variables in the underlying causal structure to support principled reasoning. To enable comprehensive evaluation, we introduce AC-Bench, a new benchmark built upon and substantially extending Big-Bench Hard Causal Judgment (BBH-CJ). AC-Bench comprises ~1K carefully annotated samples, each with detailed reasoning steps, and focuses solely on actual causation. A case study shows that the synthesized samples in AC-Bench pose greater challenges for LLMs. Extensive experiments on BBH-CJ and AC-Bench show that AC-Reason consistently improves LLM performance over baselines. On BBH-CJ, all tested LLMs surpass the average human rater accuracy of 69.60%, with GPT-4 + AC-Reason achieving 75.04%. On AC-Bench, GPT-4 + AC-Reason again achieves the highest accuracy, 71.82%. AC-Bench further enables fine-grained analysis of reasoning faithfulness, revealing that only Qwen-2.5-72B-Instruct, Claude-3.5-Sonnet, and GPT-4o exhibit faithful reasoning, whereas GPT-4 tends to exploit shortcuts. Finally, our ablation study confirms that integrating AC theory into LLMs is highly effective, with the proposed algorithm contributing the most significant performance gains.
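To make the pipeline concrete, the sketch below illustrates the general shape of factor-based actual-causality judgment: per-event causal factors (sufficiency, necessity, normality) feed a rule-based decision. The boolean encoding, the `CausalFactors` container, and the specific decision rule are illustrative assumptions, not the paper's actual algorithm, which the abstract does not spell out.

```python
from dataclasses import dataclass


@dataclass
class CausalFactors:
    """Formal causal factors inferred per candidate event.

    The boolean encoding is a simplifying assumption for illustration;
    the paper's factors may be graded rather than binary.
    """
    sufficient: bool  # would the event by itself bring about the outcome?
    necessary: bool   # would the outcome disappear without the event?
    normal: bool      # is the event typical/expected in this scenario?


def judge_actual_cause(f: CausalFactors) -> bool:
    """Toy theory-guided rule: count an event as an actual cause if it is
    necessary for the outcome, or if it is sufficient and abnormal
    (norm-violating events are preferentially selected as causes).
    A simplified stand-in for AC-Reason's axiom-driven algorithm."""
    if f.necessary:
        return True
    return f.sufficient and not f.normal
```

In a framework like AC-Reason, an LLM would first extract the candidate events from the scenario text and infer each event's factor values; the final judgment step, as in this sketch, can then be deterministic and explainable.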