🤖 AI Summary
Existing evaluation of mathematical reasoning relies excessively on answer accuracy, failing to distinguish genuine logical reasoning from superficial pattern matching. Method: We propose a four-dimensional diagnostic framework, assessing forward-backward consistency, transitivity coverage, counterfactual sensitivity, and perturbation robustness, to systematically disentangle deductive reasoning from behavioral mimicry. Contribution/Results: Empirical analysis of Qwen3-0.6B on MenatQA reveals a critical disconnect: while answer accuracy exceeds 70%, backward consistency is only 15% and transitivity coverage merely 32.2%, exposing severe reasoning fragility in this small-scale model. The framework shifts evaluation from outcome correctness to verifiable properties of the reasoning process; although the empirical study covers a single 600M-parameter model, the diagnostics themselves are model-agnostic, establishing a process-aware paradigm for trustworthy mathematical reasoning assessment.
📝 Abstract
Current evaluation of mathematical reasoning in language models relies primarily on answer accuracy, potentially masking fundamental failures in logical computation. We introduce a diagnostic framework that distinguishes genuine mathematical reasoning from superficial pattern matching through four complementary axes: forward-backward consistency, transitivity coverage, counterfactual sensitivity, and perturbation robustness. Through a case study applying this framework to Qwen3-0.6B on the MenatQA dataset, we reveal a striking disconnect between surface performance and reasoning fidelity. While the model achieves reasonable answer accuracy (70%+), it demonstrates poor backward consistency (15%), limited transitivity coverage (32.2%), and brittle sensitivity to perturbations. Our diagnostics expose reasoning failures invisible to traditional accuracy metrics, suggesting that this small model relies heavily on pattern matching rather than genuine logical computation. While our empirical findings are based on a single 600M-parameter model, the diagnostic framework itself is model-agnostic and generalizable. We release our evaluation protocols to enable the research community to assess reasoning fidelity across different model scales and architectures, moving beyond surface-level accuracy toward verifiable mathematical reasoning.
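As a concrete illustration of one diagnostic axis, a backward-consistency check can be sketched as below. This is a minimal sketch, not the paper's actual protocol: the `ask` model interface, the item format, and the exact-match scoring are all assumptions introduced here for clarity.

```python
from typing import Callable, List, Tuple

def backward_consistency(
    items: List[Tuple[str, str]],   # (inverted_question, original_value) pairs (hypothetical format)
    ask: Callable[[str], str],      # model query function (hypothetical interface)
) -> float:
    """Fraction of items where the model, asked the inverted (backward)
    question, recovers the original input value. A low score relative to
    forward accuracy suggests pattern matching rather than reasoning."""
    if not items:
        return 0.0
    hits = sum(
        1 for bwd_q, expected in items
        if ask(bwd_q).strip() == expected.strip()
    )
    return hits / len(items)
```

In practice, exact string match would likely be replaced by numeric normalization or an answer-extraction step; the sketch only conveys the shape of the metric.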