🤖 AI Summary
Initial errors in LLM reasoning chains tend to propagate and undermine the reliability of the final conclusion; existing detection methods fall short because they neglect this propagation mechanism. To address it, we propose ARES, a framework that evaluates each reasoning step incrementally against only previously verified premises and uses an inductive probabilistic assessment to produce a statistically grounded, fine-grained soundness score per step, avoiding the fragility of binary correctness labels. By modeling autoregressive reasoning entailment stability and judging each step independently of unverified context, ARES isolates error propagation paths. On four standard benchmarks, ARES achieves 72.1% Macro-F1 (+8.2 points over prior work); on detecting propagated errors in long reasoning chains, it attains 90.3% F1 (+27.6 points), significantly outperforming state-of-the-art approaches.
📝 Abstract
In reasoning chains generated by large language models (LLMs), initial errors often propagate and undermine the reliability of the final conclusion. Current LLM-based error detection methods often fail to detect propagated errors because they do not properly account for how earlier errors can corrupt judgments of downstream reasoning. To better detect such propagated errors, we introduce Autoregressive Reasoning Entailment Stability (ARES), a novel probabilistic framework that prevents error propagation by judging each claim against only previously assessed sound premises. This inductive method yields a nuanced score for each step and provides certified statistical guarantees of its soundness, rather than a brittle binary label. ARES achieves state-of-the-art performance across four benchmarks (72.1% Macro-F1, +8.2 points) and demonstrates superior robustness on very long synthetic reasoning chains, where it excels at detecting propagated errors (90.3% F1, +27.6 points).
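The core idea of the abstract can be sketched in a few lines: maintain a set of verified premises and judge each step only against that set, so an unsound step never becomes a premise for later judgments. The sketch below is an illustration under stated assumptions, not the paper's implementation; `judge_entailment` is a hypothetical stand-in for an LLM judge (here a deterministic toy rule), and the sampling/threshold values are made up for demonstration.

```python
import random

def judge_entailment(verified, step, rng):
    """Hypothetical stand-in for an LLM judge (assumption, not the
    paper's method): a step is sound iff every premise index it
    depends on is already in the verified set."""
    deps, _text = step
    return all(d in verified for d in deps)

def ares_sketch(chain, n_samples=20, threshold=0.9, seed=0):
    """Minimal sketch of ARES-style autoregressive evaluation: each
    step is judged only against previously verified premises, so an
    early error cannot corrupt downstream judgments. Soundness is
    estimated as the fraction of positive judge samples, giving a
    fine-grained per-step score instead of a binary label."""
    rng = random.Random(seed)
    verified = set()   # indices of steps judged sound so far
    scores = []
    for i, step in enumerate(chain):
        votes = [judge_entailment(verified, step, rng) for _ in range(n_samples)]
        score = sum(votes) / n_samples
        scores.append(score)
        if score >= threshold:
            verified.add(i)  # only sound steps serve as premises later
    return scores

# Toy chain: each step is (premise_indices, claim_text).
# Step 1 cites a nonexistent premise (unsound); step 2 builds on step 1,
# so its error is a *propagated* one and is also flagged.
chain = [((), "axiom"), ((99,), "bad step"), ((1,), "builds on bad step")]
print(ares_sketch(chain))
```

Because step 1 never enters the verified set, step 2 is scored against an empty premise for its dependency and flagged as well, which is exactly the propagation-isolation behavior the abstract describes.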