🤖 AI Summary
Initial errors in LLM reasoning chains tend to propagate and undermine the reliability of the final conclusion; existing detection methods fall short because they neglect this propagation mechanism. To address it, we propose ARES, a framework that evaluates each reasoning step incrementally against only previously verified premises and uses an inductive probabilistic assessment to produce a statistically grounded, fine-grained soundness score per step, avoiding the fragility of binary correctness labels. By modeling autoregressive reasoning entailment stability and judging each step independently of unverified context, ARES isolates error propagation paths. On four standard benchmarks, ARES achieves 72.1% Macro-F1 (+8.2 points over prior work); on detecting propagated errors in long reasoning chains, it attains 90.3% F1 (+27.6 points), significantly outperforming state-of-the-art approaches.
📝 Abstract
In reasoning chains generated by large language models (LLMs), initial errors often propagate and undermine the reliability of the final conclusion. Current LLM-based error detection methods often fail to detect propagated errors because they do not properly account for how earlier errors can corrupt judgments of downstream reasoning. To better detect such propagated errors, we introduce Autoregressive Reasoning Entailment Stability (ARES), a novel probabilistic framework that prevents error propagation by judging each claim against only previously assessed sound premises. This inductive method yields a nuanced score for each step and provides certified statistical guarantees of its soundness, rather than a brittle binary label. ARES achieves state-of-the-art performance across four benchmarks (72.1% Macro-F1, +8.2 points) and demonstrates superior robustness on very long synthetic reasoning chains, where it excels at detecting propagated errors (90.3% F1, +27.6 points).
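The core idea of the abstract can be sketched in a few lines: maintain a set of verified premises and judge each step only against that set, so an unsound step never becomes a premise for later judgments. The sketch below is an illustration under stated assumptions, not the paper's implementation; `judge_entailment` is a hypothetical stand-in for an LLM judge (here a deterministic toy rule), and the sampling/threshold values are made up for demonstration.

```python
import random

def judge_entailment(verified, step, rng):
    """Hypothetical stand-in for an LLM judge (assumption, not the
    paper's method): a step is sound iff every premise index it
    depends on is already in the verified set."""
    deps, _text = step
    return all(d in verified for d in deps)

def ares_sketch(chain, n_samples=20, threshold=0.9, seed=0):
    """Minimal sketch of ARES-style autoregressive evaluation: each
    step is judged only against previously verified premises, so an
    early error cannot corrupt downstream judgments. Soundness is
    estimated as the fraction of positive judge samples, giving a
    fine-grained per-step score instead of a binary label."""
    rng = random.Random(seed)
    verified = set()   # indices of steps judged sound so far
    scores = []
    for i, step in enumerate(chain):
        votes = [judge_entailment(verified, step, rng) for _ in range(n_samples)]
        score = sum(votes) / n_samples
        scores.append(score)
        if score >= threshold:
            verified.add(i)  # only sound steps serve as premises later
    return scores

# Toy chain: each step is (premise_indices, claim_text).
# Step 1 cites a nonexistent premise (unsound); step 2 builds on step 1,
# so its error is a *propagated* one and is also flagged.
chain = [((), "axiom"), ((99,), "bad step"), ((1,), "builds on bad step")]
print(ares_sketch(chain))
```

Because step 1 never enters the verified set, step 2 is scored against an empty premise for its dependency and flagged as well, which is exactly the propagation-isolation behavior the abstract describes.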