🤖 AI Summary
This work addresses two key limitations in current LLM mathematical reasoning evaluation: narrow numerical ranges and coarse-grained error attribution. To tackle the first, we introduce GSM-Ranges, a systematically perturbed extension of GSM8K, where numeric values span multiple orders of magnitude, enabling robustness assessment across diverse scales. To address the second, we propose a fine-grained scoring framework that disentangles logical errors (e.g., broken reasoning chains) from non-logical errors (e.g., arithmetic miscalculations), enabling structured, step-level error attribution. Our approach establishes the first scalable evaluation paradigm for numerical range diversity. Empirical analysis reveals that logical error rates increase by up to 14 percentage points with rising numerical complexity, and that models exhibit significantly weaker reasoning performance on embedded word problems compared to pure arithmetic tasks. This work provides both a new benchmark and a diagnostic tool for rigorous, interpretable assessment of mathematical reasoning capabilities.
📄 Abstract
Mathematical reasoning in Large Language Models (LLMs) is often evaluated using benchmarks with limited numerical ranges, failing to reflect real-world problem-solving across diverse scales. Furthermore, most existing evaluation methods only compare model outputs to ground-truth answers, obscuring insights into reasoning processes. To address these limitations, we introduce GSM-Ranges, a dataset generator derived from GSM8K that systematically perturbs numerical values in math problems to assess model robustness across varying numerical scales. Additionally, we propose a novel grading methodology that distinguishes between logical and non-logical errors, offering a more precise evaluation of reasoning processes beyond computational accuracy. Our experiments with various models reveal a significant increase in logical error rates (up to 14 percentage points) as numerical complexity rises, demonstrating a general weakness in reasoning with out-of-distribution numerical values. Moreover, while models demonstrate high accuracy on standalone arithmetic tasks, their performance deteriorates substantially when computations are embedded within word problems. These findings provide a comprehensive evaluation of LLMs' mathematical reasoning capabilities and inform future research directions for improving numerical generalization in language models.
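The core idea of a perturbation-based generator like GSM-Ranges can be illustrated with a minimal sketch: rewrite each number in a GSM8K-style word problem to a random value at a chosen order of magnitude. This is a hypothetical illustration, not the paper's actual implementation; the function name `perturb_numbers` and its parameters are assumptions, and a real generator must also recompute the ground-truth answer so the perturbed problem remains self-consistent.

```python
import random
import re

def perturb_numbers(problem: str, magnitude: int, seed: int = 0) -> str:
    """Replace every integer in a word problem with a random value of
    the given order of magnitude (hypothetical sketch, not the paper's
    actual generator).

    magnitude=0 yields 1-digit values, magnitude=4 yields 5-digit
    values, etc., letting one probe robustness across numerical scales.
    """
    rng = random.Random(seed)  # fixed seed for reproducible perturbations

    def replace(match: re.Match) -> str:
        # Draw uniformly from [10^magnitude, 10^(magnitude+1) - 1].
        return str(rng.randint(10 ** magnitude, 10 ** (magnitude + 1) - 1))

    return re.sub(r"\d+", replace, problem)

# Example: scale a toy problem's numbers up to five digits.
original = "Ann has 3 apples and buys 4 more. How many does she have?"
perturbed = perturb_numbers(original, magnitude=4, seed=1)
```

Note that naive substitution like this can break semantic constraints (e.g., divisibility or quantities that must stay smaller than others), which is why a systematic generator would perturb a problem template and re-derive the answer rather than edit surface text.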