🤖 AI Summary
Large language models (LLMs) exhibit systematic reasoning flaws in mathematical problem-solving—including skipped inference steps, cyclic redundancy, and premise misuse—yet conventional evaluation relies solely on final-answer accuracy, failing to assess reasoning process quality.
Method: We propose MAPLE, the first multidimensional, interpretable scoring framework that jointly quantifies error rate, logical validity, and redundancy in reasoning chains. MAPLE leverages fine-grained step-level annotation and formal verification to enable rigorous, process-aware evaluation.
Contribution/Results: Experiments demonstrate that MAPLE effectively uncovers latent reasoning biases across state-of-the-art LLMs. Its scores correlate strongly with human expert judgments and significantly outperform traditional accuracy-based metrics in diagnostic fidelity. By grounding evaluation in verifiable logical structure rather than outcome alone, MAPLE establishes a new paradigm for fine-grained, process-oriented assessment of mathematical reasoning capability.
📝 Abstract
Large language models (LLMs) demonstrate considerable potential across natural language tasks but face significant challenges in mathematical reasoning, particularly in executing precise, multi-step logic. However, current evaluation frameworks judge performance solely by accuracy, which considers only the final answer. This study addresses these shortcomings with a novel evaluation framework: we propose the MAPLE score, a metric that holistically quantifies reasoning misalignment by integrating error rates, redundancy, and validity.
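The abstract describes MAPLE as integrating three per-chain diagnostics (error rate, redundancy, validity) into one misalignment score, but does not state the exact formula. The sketch below is a hypothetical illustration only: the function name `maple_score`, the equal weights, and the linear combination are assumptions, not the paper's method.

```python
# Hypothetical sketch of a MAPLE-style score. The exact formula is not given
# in the source; the equal weights and linear combination are assumptions.

def maple_score(error_rate: float, redundancy: float, validity: float,
                weights: tuple = (1/3, 1/3, 1/3)) -> float:
    """Combine three per-chain diagnostics into one misalignment score.

    error_rate -- fraction of reasoning steps flagged as erroneous (0..1)
    redundancy -- fraction of steps that repeat earlier work (0..1)
    validity   -- fraction of steps that are logically valid (0..1)

    Higher scores indicate greater misalignment, so validity enters
    inverted as (1 - validity).
    """
    for v in (error_rate, redundancy, validity):
        if not 0.0 <= v <= 1.0:
            raise ValueError("all inputs must lie in [0, 1]")
    w_e, w_r, w_v = weights
    return w_e * error_rate + w_r * redundancy + w_v * (1.0 - validity)
```

Under this sketch, a chain with no errors, no redundancy, and fully valid steps scores 0.0 (perfect alignment), while a chain that fails on all three axes scores 1.0.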