🤖 AI Summary
Existing mathematical benchmarks over-rely on final-answer accuracy, failing to distinguish genuine reasoning capability from superficial pattern matching. To address this, we propose the first multidimensional diagnostic evaluation framework for assessing large language models' mathematical problem-solving abilities, independently quantifying four orthogonal dimensions: comprehension, reasoning, computation, and reflection & correction. Our method introduces a self-generation and self-verification mechanism to construct a high-fidelity, scalable benchmark, coupled with a prompt-engineering-driven, dimension-decoupled assessment paradigm and an automated pipeline for question generation and answer validation. Experiments on 21 mainstream LLMs reveal substantial heterogeneity across dimensions and demonstrate that final-answer accuracy alone is highly misleading. The framework effectively identifies model-specific weaknesses, enabling more comprehensive, interpretable, and fine-grained evaluation of mathematical reasoning capabilities.
📝 Abstract
Large Language Models have achieved remarkable results on a variety of mathematical benchmarks. However, concerns remain as to whether these successes reflect genuine mathematical reasoning or superficial pattern recognition. Common evaluation metrics, such as final-answer accuracy, fail to disentangle the underlying competencies involved, offering limited diagnostic value. To address these limitations, we introduce SMART: a Self-Generating and Self-Validating Multi-Dimensional Assessment Framework. SMART decomposes mathematical problem solving into four distinct dimensions: understanding, reasoning, arithmetic, and reflection & refinement. Each dimension is evaluated independently through tailored tasks, enabling interpretable and fine-grained analysis of LLM behavior. Crucially, SMART integrates an automated self-generating and self-validating mechanism to produce and verify benchmark data, ensuring both scalability and reliability. We apply SMART to 21 state-of-the-art open- and closed-source LLMs, uncovering significant discrepancies in their abilities across different dimensions. Our findings demonstrate the inadequacy of final-answer accuracy as a sole metric and motivate a new holistic metric that better captures true problem-solving capability. Code and benchmarks will be released upon acceptance.