AI Summary
This work addresses the critical issue of "silent failures" and computational instability in large language models for mathematical reasoning, where high accuracy often masks unreliable inference processes. The authors propose a novel metric for reasoning fidelity, integrating relevance analysis, parameter-scale comparisons (1.5B vs. 7B), and chain-of-thought clustering. Their analysis reveals that only 18.4% of Qwen2.5-Math-7B's 61% accuracy stems from stable reasoning, while 81.6% relies on inconsistent paths, with 8.8% of all predictions attributable to silent failures. Despite a 4.7× increase in model size, accuracy does not improve on the evaluated subset, and reasoning quality exhibits a weak negative correlation with correctness (r = -0.21). These findings introduce the "depth-accuracy paradox," challenging prevailing evaluation paradigms in mathematical reasoning benchmarks.
Abstract
Mathematical reasoning models are widely deployed in education, automated tutoring, and decision support systems despite exhibiting fundamental computational instabilities. We demonstrate that a state-of-the-art model (Qwen2.5-Math-7B) achieves 61% accuracy through a mixture of reliable and unreliable reasoning pathways: 18.4% of correct predictions employ stable, faithful reasoning while 81.6% emerge through computationally inconsistent pathways. Additionally, 8.8% of all predictions are silent failures -- confident yet incorrect outputs. Through comprehensive analysis using novel faithfulness metrics, we reveal: (1) reasoning quality shows a weak negative correlation with correctness (r=-0.21, p=0.002), reflecting a binary classification threshold artifact rather than a monotonic inverse relationship; (2) scaling from 1.5B to 7B parameters (a 4.7x increase) provides zero accuracy benefit on our evaluated subset (6% of GSM8K), requiring validation on the complete benchmark; and (3) latent reasoning employs diverse computational strategies, with ~20% sharing CoT-like patterns. These findings highlight that benchmark accuracy can mask computational unreliability, demanding evaluation reforms that measure stability beyond single-sample metrics.
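The stability decomposition above (stable vs. inconsistent correct predictions, plus silent failures) can be sketched with repeated sampling per problem. This is a minimal illustration, not the paper's actual metric: the function name and the operationalization of "silent failure" as an answer that is consistently reproduced across samples yet wrong are assumptions for the sketch.

```python
def stability_report(samples, gold):
    """Decompose single-shot accuracy into stable vs. unstable correct
    predictions and flag silent failures.

    samples: list of per-problem answer lists (repeated generations)
    gold:    list of reference answers, one per problem

    Assumed operationalizations (not from the paper):
      - a prediction is "correct" if the first sample matches the reference,
        mirroring standard single-sample evaluation;
      - it is "stable" if every resample reproduces the correct answer;
      - a "silent failure" is a problem where all samples agree on the
        same wrong answer (confident consistency without correctness).
    """
    correct = stable = silent = 0
    for answers, ref in zip(samples, gold):
        if answers[0] == ref:
            correct += 1
            if all(a == ref for a in answers):
                stable += 1  # reasoning path is reproducible and faithful
        elif len(set(answers)) == 1:
            silent += 1      # consistently wrong: invisible to accuracy alone
    n = len(gold)
    return {
        "accuracy": correct / n,
        "stable_share_of_correct": stable / correct if correct else 0.0,
        "silent_failure_rate": silent / n,
    }
```

Under this sketch, a high `accuracy` paired with a low `stable_share_of_correct` reproduces the paper's headline pattern: most correct answers come from pathways that do not survive resampling.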