Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of evaluating large language models (LLMs) on mathematical reasoning solely by final-answer accuracy, which does not capture strategic diversity or flexibility. The authors propose a strategy-level evaluation framework built on 217 reference strategy families derived from the Art of Problem Solving (AoPS). Applying the framework to 80 AMC 10/12 and AIME problems, they combine dual-AI annotation, human adjudication, and a repeated-run robustness check to annotate the identity, validity, and correctness of model-generated strategies and compare them with the human reference set. Under single-solution prompting, four frontier models reach 95%–100% accuracy, yet under multiple-strategy prompting they recover substantially fewer strategies than the human reference set. In the repeated-run check on 20 problems, the strongest model recovers only 39 of 55 reference strategies (71%) after three runs. The models also collectively produce 50 valid strategies that are absent from the benchmark's reference set, so high answer accuracy coexists with incomplete strategy coverage.

📝 Abstract

Large language models now achieve high final-answer accuracy on mathematical reasoning benchmarks, but accuracy alone does not capture reasoning flexibility. We introduce a strategy-level evaluation framework instantiated on 80 AMC 10/12 and AIME problems with 217 AoPS-derived reference strategy families. Model outputs are annotated for strategy identity, validity, and correctness using dual-AI coding with human adjudication. Across four frontier models, we find a pronounced decoupling between answer accuracy and strategy diversity. Under a single-solution prompt, all models achieve high accuracy (95%-100%), but under a multiple-strategy prompt they recover substantially fewer strategies than the human reference set. Gemini, DeepSeek, GPT, and Claude generate 184, 152, 151, and 110 distinct valid strategies, respectively, with the largest gaps in Geometry and Number Theory. The models collectively produce 50 benchmark-novel valid strategies, indicating both incomplete coverage of human strategies and some capacity for alternative reasoning. A repeated-run robustness check on 20 problems shows diminishing gains in discovered strategies, with the strongest model recovering only 39 of 55 AoPS-reference strategies (71%) after three runs. These findings position strategy diversity as a complementary dimension for evaluating mathematical reasoning beyond answer correctness.
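To make the coverage and novelty figures concrete, here is a minimal Python sketch of how such set-based quantities could be computed. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the strategy labels, and the toy per-problem data are all hypothetical, and only the 39-of-55 figure comes from the abstract.

```python
from typing import Set


def strategy_coverage(found: Set[str], reference: Set[str]) -> float:
    """Fraction of reference strategies that also appear among the found ones."""
    if not reference:
        raise ValueError("reference set must be non-empty")
    return len(found & reference) / len(reference)


def novel_strategies(found: Set[str], reference: Set[str]) -> Set[str]:
    """Valid strategies the model produced that are absent from the reference set."""
    return found - reference


# Hypothetical toy data for a single problem (labels are illustrative only).
reference = {"casework", "complementary counting", "recursion"}
model_valid = {"casework", "recursion", "generating functions"}

coverage = strategy_coverage(model_valid, reference)
novel = novel_strategies(model_valid, reference)

# Figure reported in the abstract: 39 of 55 reference strategies after three runs.
reported_coverage = 39 / 55

print(f"toy coverage: {coverage:.2f}, novel: {sorted(novel)}")
print(f"reported coverage: {reported_coverage:.1%}")
```

Keeping coverage (the intersection with the reference set) separate from novelty (the set difference) mirrors the abstract's distinction between recovering human strategies and producing benchmark-novel ones. The per-model counts of distinct valid strategies (184, 152, 151, 110) should not be divided by 217 to obtain coverage, because the abstract does not say how many of each model's strategies belong to the reference set.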

Problem

Research questions and friction points this paper is trying to address.

strategy diversity

mathematical reasoning

large language models

evaluation framework

problem-solving strategies

Innovation

Methods, ideas, or system contributions that make the work stand out.

strategy diversity

mathematical reasoning

large language models