When Shallow Wins: Silent Failures and the Depth-Accuracy Paradox in Latent Reasoning

πŸ“… 2026-03-03
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the critical issue of "silent failures" and computational instability in large language models for mathematical reasoning, where high benchmark accuracy can mask unreliable inference processes. The authors propose a novel metric for reasoning fidelity, integrating relevance analysis, parameter-scale comparisons (1.5B vs. 7B), and chain-of-thought clustering. Their analysis reveals that only 18.4% of Qwen2.5-Math-7B's correct answers (61% overall accuracy) stem from stable reasoning, while the remaining 81.6% arise through inconsistent pathways; a further 8.8% of all predictions are silent failures, i.e. confident but incorrect outputs. Despite a 4.7× increase in model size from 1.5B to 7B parameters, accuracy does not improve, and reasoning quality exhibits a weak negative correlation with correctness (r = –0.21). These findings motivate the "depth–accuracy paradox," challenging prevailing evaluation paradigms for mathematical reasoning benchmarks.

πŸ“ Abstract
Mathematical reasoning models are widely deployed in education, automated tutoring, and decision support systems despite exhibiting fundamental computational instabilities. We demonstrate that state-of-the-art models (Qwen2.5-Math-7B) achieve 61% accuracy through a mixture of reliable and unreliable reasoning pathways: 18.4% of correct predictions employ stable, faithful reasoning while 81.6% emerge through computationally inconsistent pathways. Additionally, 8.8% of all predictions are silent failures -- confident yet incorrect outputs. Through comprehensive analysis using novel faithfulness metrics, we reveal: (1) reasoning quality shows weak negative correlation with correctness (r=-0.21, p=0.002), reflecting a binary classification threshold artifact rather than a monotonic inverse relationship; (2) scaling from 1.5B to 7B parameters (4.7x increase) provides zero accuracy benefit on our evaluated subset (6% of GSM8K), requiring validation on the complete benchmark; and (3) latent reasoning employs diverse computational strategies, with ~20% sharing CoT-like patterns. These findings highlight that benchmark accuracy can mask computational unreliability, demanding evaluation reforms measuring stability beyond single-sample metrics.
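The decomposition described in the abstract (stable vs. unstable correct predictions, plus "silent failures" defined as confident but wrong outputs) can be illustrated with a minimal sketch. This is not the paper's actual metric; the `Prediction` fields, the confidence threshold, and the notion of stability (same answer reproduced across resampled reasoning paths) are all illustrative assumptions.

```python
# Hypothetical sketch of an accuracy decomposition in the spirit of the paper:
# split correct predictions into stable vs. unstable, and flag silent failures
# (high-confidence wrong answers). Names and threshold are illustrative only.
from dataclasses import dataclass

@dataclass
class Prediction:
    correct: bool      # final answer matches the reference
    confidence: float  # model-reported confidence in [0, 1]
    stable: bool       # same answer reproduced across resampled reasoning paths

def decompose(preds, conf_threshold=0.8):
    n = len(preds)
    correct = [p for p in preds if p.correct]
    stable_correct = sum(p.stable for p in correct)
    silent_failures = sum(
        (not p.correct) and p.confidence >= conf_threshold for p in preds
    )
    return {
        "accuracy": len(correct) / n,
        "stable_share_of_correct": stable_correct / max(len(correct), 1),
        "silent_failure_rate": silent_failures / n,
    }

# Toy data: 3 correct (1 stable), 1 confident-wrong, 1 unconfident-wrong.
preds = [
    Prediction(True, 0.90, True),
    Prediction(True, 0.70, False),
    Prediction(True, 0.90, False),
    Prediction(False, 0.95, False),  # silent failure: confident yet incorrect
    Prediction(False, 0.30, False),  # ordinary (low-confidence) failure
]
stats = decompose(preds)
```

On this toy data, accuracy is 0.6, one of three correct answers is stable, and one of five predictions is a silent failure, mirroring how a headline accuracy number can hide both instability and confident errors.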
Problem

Research questions and friction points this paper is trying to address.

mathematical reasoning
computational instability
silent failures
reasoning faithfulness
accuracy paradox
Innovation

Methods, ideas, or system contributions that make the work stand out.

faithfulness metrics
silent failures
reasoning stability
depth-accuracy paradox
computational inconsistency
πŸ”Ž Similar Papers
No similar papers found.