AI Summary
This work addresses the critical issue of "silent failures" and computational instability in large language models for mathematical reasoning, where high accuracy often masks unreliable inference processes. The authors propose a novel metric for reasoning fidelity, integrating relevance analysis, parameter-scale comparisons (1.5B vs. 7B), and chain-of-thought clustering. Their analysis reveals that only 18.4% of Qwen2.5-Math-7B's 61% accuracy stems from stable reasoning, while 81.6% relies on inconsistent paths, with 8.8% of all predictions attributable to silent failures. Despite a 4.7× increase in model size, accuracy does not improve on the evaluated subset, and reasoning quality exhibits a weak negative correlation with correctness (r = -0.21). These findings introduce the "depth-accuracy paradox," challenging prevailing evaluation paradigms in mathematical reasoning benchmarks.
Abstract
Mathematical reasoning models are widely deployed in education, automated tutoring, and decision support systems despite exhibiting fundamental computational instabilities. We demonstrate that a state-of-the-art model (Qwen2.5-Math-7B) achieves 61% accuracy through a mixture of reliable and unreliable reasoning pathways: 18.4% of correct predictions employ stable, faithful reasoning while 81.6% emerge through computationally inconsistent pathways. Additionally, 8.8% of all predictions are silent failures -- confident yet incorrect outputs. Through comprehensive analysis using novel faithfulness metrics, we reveal: (1) reasoning quality shows a weak negative correlation with correctness (r=-0.21, p=0.002), reflecting a binary classification threshold artifact rather than a monotonic inverse relationship; (2) scaling from 1.5B to 7B parameters (a 4.7x increase) provides zero accuracy benefit on our evaluated subset (6% of GSM8K), requiring validation on the complete benchmark; and (3) latent reasoning employs diverse computational strategies, with ~20% sharing CoT-like patterns. These findings highlight that benchmark accuracy can mask computational unreliability, demanding evaluation reforms that measure stability beyond single-sample metrics.
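The stability decomposition above (stable vs. inconsistent correct predictions, plus silent failures) can be sketched with repeated sampling per problem. This is a minimal illustration, not the paper's actual metric: the function name and the operationalization of "silent failure" as an answer that is consistently reproduced across samples yet wrong are assumptions for the sketch.

```python
def stability_report(samples, gold):
    """Decompose single-shot accuracy into stable vs. unstable correct
    predictions and flag silent failures.

    samples: list of per-problem answer lists (repeated generations)
    gold:    list of reference answers, one per problem

    Assumed operationalizations (not from the paper):
      - a prediction is "correct" if the first sample matches the reference,
        mirroring standard single-sample evaluation;
      - it is "stable" if every resample reproduces the correct answer;
      - a "silent failure" is a problem where all samples agree on the
        same wrong answer (confident consistency without correctness).
    """
    correct = stable = silent = 0
    for answers, ref in zip(samples, gold):
        if answers[0] == ref:
            correct += 1
            if all(a == ref for a in answers):
                stable += 1  # reasoning path is reproducible and faithful
        elif len(set(answers)) == 1:
            silent += 1      # consistently wrong: invisible to accuracy alone
    n = len(gold)
    return {
        "accuracy": correct / n,
        "stable_share_of_correct": stable / correct if correct else 0.0,
        "silent_failure_rate": silent / n,
    }
```

Under this sketch, a high `accuracy` paired with a low `stable_share_of_correct` reproduces the paper's headline pattern: most correct answers come from pathways that do not survive resampling.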