🤖 AI Summary
Existing large language model (LLM) approaches to actual causality (AC) reasoning lack grounding in formal AC theory, resulting in poor interpretability and low faithfulness. Method: We propose AC-Reason, a semi-formal framework that deeply integrates formal AC theory into LLM reasoning by identifying causally relevant events, assessing their sufficiency, necessity, and normality, and applying axiom-driven query solving with controllable variable manipulation, all without requiring explicit causal graphs. We further construct AC-Bench, a benchmark of ~1K fine-grained, expert-annotated instances for faithful, fine-grained evaluation. Results: GPT-4 augmented with AC-Reason achieves 75.04% accuracy on BBH-CJ (surpassing the human average of 69.60%) and 71.82% on AC-Bench. Ablation studies confirm that formal theory integration contributes most to the performance gains. Moreover, we uncover, for the first time, shortcut reliance in GPT-4, while newer models (e.g., GPT-4o) exhibit significantly higher reasoning faithfulness.
📝 Abstract
Actual causality (AC), a fundamental aspect of causal reasoning (CR), underpins attribution and responsibility assignment in real-world scenarios. However, existing LLM-based methods lack grounding in formal AC theory, resulting in limited interpretability. We therefore propose AC-Reason, a semi-formal reasoning framework that identifies causally relevant events within an AC scenario, infers the values of their formal causal factors (e.g., sufficiency, necessity, and normality), and answers AC queries via a theory-guided algorithm with explanations. While AC-Reason does not explicitly construct a causal graph, it operates over variables in the underlying causal structure to support principled reasoning. To enable comprehensive evaluation, we introduce AC-Bench, a new benchmark built upon and substantially extending Big-Bench Hard Causal Judgment (BBH-CJ). AC-Bench comprises ~1K carefully annotated samples, each with detailed reasoning steps, and focuses solely on actual causation. A case study shows that the synthesized samples in AC-Bench pose greater challenges for LLMs. Extensive experiments on BBH-CJ and AC-Bench show that AC-Reason consistently improves LLM performance over baselines. On BBH-CJ, all tested LLMs surpass the average human rater accuracy of 69.60%, with GPT-4 + AC-Reason achieving 75.04%. On AC-Bench, GPT-4 + AC-Reason again achieves the highest accuracy, 71.82%. AC-Bench further enables fine-grained analysis of reasoning faithfulness, revealing that only Qwen-2.5-72B-Instruct, Claude-3.5-Sonnet, and GPT-4o exhibit faithful reasoning, whereas GPT-4 tends to exploit shortcuts. Finally, our ablation study confirms that integrating AC theory into LLMs is highly effective, with the proposed algorithm contributing the most significant performance gains.
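To make the pipeline concrete, the sketch below illustrates the general shape of factor-based actual-causality judgment: per-event causal factors (sufficiency, necessity, normality) feed a rule-based decision. The boolean encoding, the `CausalFactors` container, and the specific decision rule are illustrative assumptions, not the paper's actual algorithm, which the abstract does not spell out.

```python
from dataclasses import dataclass


@dataclass
class CausalFactors:
    """Formal causal factors inferred per candidate event.

    The boolean encoding is a simplifying assumption for illustration;
    the paper's factors may be graded rather than binary.
    """
    sufficient: bool  # would the event by itself bring about the outcome?
    necessary: bool   # would the outcome disappear without the event?
    normal: bool      # is the event typical/expected in this scenario?


def judge_actual_cause(f: CausalFactors) -> bool:
    """Toy theory-guided rule: count an event as an actual cause if it is
    necessary for the outcome, or if it is sufficient and abnormal
    (norm-violating events are preferentially selected as causes).
    A simplified stand-in for AC-Reason's axiom-driven algorithm."""
    if f.necessary:
        return True
    return f.sufficient and not f.normal
```

In a framework like AC-Reason, an LLM would first extract the candidate events from the scenario text and infer each event's factor values; the final judgment step, as in this sketch, can then be deterministic and explainable.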