SMART: Self-Generating and Self-Validating Multi-Dimensional Assessment for LLMs' Mathematical Problem Solving

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing mathematical benchmarks over-rely on final-answer accuracy, failing to distinguish genuine reasoning capability from superficial pattern matching. To address this, we propose a multi-dimensional diagnostic evaluation framework for assessing large language models' mathematical problem-solving abilities, independently quantifying four dimensions: understanding, reasoning, arithmetic, and reflection & refinement. Our method introduces a self-generating and self-validating mechanism to construct a high-fidelity, scalable benchmark, coupled with a prompt-driven, dimension-decoupled assessment paradigm and an automated pipeline for question generation and answer validation. Experiments across 21 mainstream LLMs reveal substantial heterogeneity across dimensions and show that final-answer accuracy alone is misleading. The framework identifies model-specific weaknesses, enabling more comprehensive, interpretable, and fine-grained evaluation of mathematical reasoning capabilities.

📝 Abstract
Large Language Models have achieved remarkable results on a variety of mathematical benchmarks. However, concerns remain as to whether these successes reflect genuine mathematical reasoning or superficial pattern recognition. Common evaluation metrics, such as final answer accuracy, fail to disentangle the underlying competencies involved, offering limited diagnostic value. To address these limitations, we introduce SMART: a Self-Generating and Self-Validating Multi-Dimensional Assessment Framework. SMART decomposes mathematical problem solving into four distinct dimensions: understanding, reasoning, arithmetic, and reflection & refinement. Each dimension is evaluated independently through tailored tasks, enabling interpretable and fine-grained analysis of LLM behavior. Crucially, SMART integrates an automated self-generating and self-validating mechanism to produce and verify benchmark data, ensuring both scalability and reliability. We apply SMART to 21 state-of-the-art open- and closed-source LLMs, uncovering significant discrepancies in their abilities across different dimensions. Our findings demonstrate the inadequacy of final answer accuracy as a sole metric and motivate a new holistic metric to better capture true problem-solving capabilities. Code and benchmarks will be released upon acceptance.
Problem

Research questions and friction points this paper is trying to address.

Assessing genuine mathematical reasoning in LLMs beyond pattern recognition
Overcoming limitations of final answer accuracy as a diagnostic metric
Providing interpretable multi-dimensional evaluation of LLM problem-solving skills
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposes math problem solving into four dimensions: understanding, reasoning, arithmetic, and reflection & refinement
Uses automated self-generating and self-validating mechanism
Evaluates models with tailored tasks for each dimension
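The dimension-decoupled idea above can be sketched as follows: each dimension gets its own tailored task prompt, and accuracy is aggregated per dimension rather than collapsed into a single final-answer score. This is a minimal, hypothetical illustration; the prompt templates, dimension names, and function signatures are assumptions for exposition, not the paper's actual implementation.

```python
# Hypothetical sketch of dimension-decoupled evaluation.
# Each dimension is probed with its own tailored prompt template
# (templates here are illustrative placeholders).
DIMENSIONS = {
    "understanding": "Restate what the problem asks for: {problem}",
    "reasoning":     "Outline the solution steps without computing: {problem}",
    "arithmetic":    "Carry out this computation exactly: {problem}",
    "reflection":    "This solution may contain an error; find and fix it: {problem}",
}

def evaluate(model, items):
    """Return per-dimension accuracy for (dimension, problem, checker) items.

    `model` is any callable mapping a prompt string to an answer string;
    `checker` is a per-item predicate that validates the answer.
    Dimensions with no items score None instead of a misleading 0.
    """
    totals = {d: 0 for d in DIMENSIONS}
    correct = {d: 0 for d in DIMENSIONS}
    for dim, problem, checker in items:
        prompt = DIMENSIONS[dim].format(problem=problem)
        answer = model(prompt)
        totals[dim] += 1
        correct[dim] += int(checker(answer))
    return {d: (correct[d] / totals[d] if totals[d] else None)
            for d in DIMENSIONS}
```

A toy model that only handles arithmetic would then score 1.0 on that dimension and 0.0 on reasoning, which is exactly the kind of cross-dimension discrepancy a single final-answer metric would hide.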
Yujie Hou
School of Artificial Intelligence, Beijing Normal University, Beijing, China
Ting Zhang
School of Artificial Intelligence, Beijing Normal University, Beijing, China
Mei Wang
Beijing Normal University
face recognition, fairness in AI, domain adaptation
Xuetao Ma
School of Artificial Intelligence, Beijing Normal University, Beijing, China
Hu Huang
University of Science and Technology of China
Social Computing, Stance Detection