🤖 AI Summary
Existing evaluation of mathematical reasoning relies excessively on answer accuracy, failing to distinguish genuine logical reasoning from superficial pattern matching. Method: We propose a four-dimensional diagnostic framework, assessing forward-backward consistency, transitivity coverage, counterfactual sensitivity, and perturbation robustness, to systematically disentangle deductive reasoning from behavioral mimicry. Contribution/Results: Empirical analysis of Qwen3-0.6B on MenatQA reveals a critical disconnect: while answer accuracy exceeds 70%, backward consistency is only 15% and transitivity coverage merely 32.2%, exposing severe reasoning fragility in this small-scale model. The framework shifts evaluation from outcome correctness to verifiable properties of the reasoning process; although the empirical study covers a single 600M-parameter model, the diagnostics themselves are model-agnostic, establishing a process-aware paradigm for trustworthy mathematical reasoning assessment.
📝 Abstract
Current evaluation of mathematical reasoning in language models relies primarily on answer accuracy, potentially masking fundamental failures in logical computation. We introduce a diagnostic framework that distinguishes genuine mathematical reasoning from superficial pattern matching through four complementary axes: forward-backward consistency, transitivity coverage, counterfactual sensitivity, and perturbation robustness. Through a case study applying this framework to Qwen3-0.6B on the MenatQA dataset, we reveal a striking disconnect between surface performance and reasoning fidelity. While the model achieves reasonable answer accuracy (70%+), it demonstrates poor backward consistency (15%), limited transitivity coverage (32.2%), and brittle sensitivity to perturbations. Our diagnostics expose reasoning failures invisible to traditional accuracy metrics, suggesting that this small model relies heavily on pattern matching rather than genuine logical computation. While our empirical findings are based on a single 600M-parameter model, the diagnostic framework itself is model-agnostic and generalizable. We release our evaluation protocols to enable the research community to assess reasoning fidelity across different model scales and architectures, moving beyond surface-level accuracy toward verifiable mathematical reasoning.
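As a concrete illustration of one diagnostic axis, a backward-consistency check can be sketched as below. This is a minimal sketch, not the paper's actual protocol: the `ask` model interface, the item format, and the exact-match scoring are all assumptions introduced here for clarity.

```python
from typing import Callable, List, Tuple

def backward_consistency(
    items: List[Tuple[str, str]],   # (inverted_question, original_value) pairs (hypothetical format)
    ask: Callable[[str], str],      # model query function (hypothetical interface)
) -> float:
    """Fraction of items where the model, asked the inverted (backward)
    question, recovers the original input value. A low score relative to
    forward accuracy suggests pattern matching rather than reasoning."""
    if not items:
        return 0.0
    hits = sum(
        1 for bwd_q, expected in items
        if ask(bwd_q).strip() == expected.strip()
    )
    return hits / len(items)
```

In practice, exact string match would likely be replaced by numeric normalization or an answer-extraction step; the sketch only conveys the shape of the metric.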