🤖 AI Summary
Existing mathematical benchmarks over-rely on final-answer accuracy, failing to distinguish genuine reasoning capability from superficial pattern matching. To address this, we propose the first multidimensional diagnostic evaluation framework for assessing large language models' mathematical problem-solving abilities, independently quantifying four orthogonal dimensions: comprehension, reasoning, computation, and reflection & correction. Our method introduces a self-generation and self-verification mechanism to construct a high-fidelity, scalable benchmark, coupled with a prompt-engineering-driven, dimension-decoupled assessment paradigm and an automated pipeline for question generation and answer validation. Experiments on 21 mainstream LLMs reveal substantial heterogeneity across dimensions and demonstrate that final-answer accuracy alone is highly misleading. The framework effectively identifies model-specific weaknesses, enabling more comprehensive, interpretable, and fine-grained evaluation of mathematical reasoning capabilities.
📝 Abstract
Large Language Models have achieved remarkable results on a variety of mathematical benchmarks. However, concerns remain as to whether these successes reflect genuine mathematical reasoning or superficial pattern recognition. Common evaluation metrics, such as final-answer accuracy, fail to disentangle the underlying competencies involved, offering limited diagnostic value. To address these limitations, we introduce SMART: a Self-Generating and Self-Validating Multi-Dimensional Assessment Framework. SMART decomposes mathematical problem solving into four distinct dimensions: understanding, reasoning, arithmetic, and reflection & refinement. Each dimension is evaluated independently through tailored tasks, enabling interpretable and fine-grained analysis of LLM behavior. Crucially, SMART integrates an automated self-generating and self-validating mechanism to produce and verify benchmark data, ensuring both scalability and reliability. We apply SMART to 21 state-of-the-art open- and closed-source LLMs, uncovering significant discrepancies in their abilities across different dimensions. Our findings demonstrate the inadequacy of final-answer accuracy as a sole metric and motivate a new holistic metric that better captures true problem-solving capability. Code and benchmarks will be released upon acceptance.