From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics

📅 2026-01-30
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge that large language models struggle to perform reliable mathematical reasoning in real-world scenarios, particularly when the mathematical problem must first be modeled from a descriptive context. The authors propose the first systematic evaluation framework for this setting, introducing the ContextMATH benchmark, which reformulates abstract problems from AIME and MATH-500 into two task types: Scenario Grounding (SG) and Complexity Scaling (CS). Through a comprehensive evaluation of 61 models, error attribution analysis, and fine-tuning experiments, they identify problem modeling as the key bottleneck, with formulation accuracy declining as original problem difficulty increases. Training solely on modeling skills proves ineffective; full-context fine-tuning is necessary. Results show both open- and closed-source models suffer average performance drops of 13 to 34 points on the SG/CS tasks, with fine-tuning offering only partial recovery, underscoring contextualized mathematical reasoning as a significant unresolved challenge.

๐Ÿ“ Abstract
Large language models now solve many benchmark math problems at near-expert levels, yet this progress has not fully translated into reliable performance in real-world applications. We study this gap through contextual mathematical reasoning, where the mathematical core must be formulated from descriptive scenarios. We introduce ContextMATH, a benchmark that repurposes AIME and MATH-500 problems into two contextual settings: Scenario Grounding (SG), which embeds abstract problems into realistic narratives without increasing reasoning complexity, and Complexity Scaling (CS), which transforms explicit conditions into sub-problems to capture how constraints often appear in practice. Evaluating 61 proprietary and open-source models, we observe sharp drops: on average, open-source models decline by 13 and 34 points on SG and CS, while proprietary models drop by 13 and 20. Error analysis shows that errors are dominated by incorrect problem formulation, with formulation accuracy declining as original problem difficulty increases. Correct formulation emerges as a prerequisite for success, and its sufficiency improves with model scale, indicating that larger models advance in both understanding and reasoning. Nevertheless, formulation and reasoning remain two complementary bottlenecks that limit contextual mathematical problem solving. Finally, we find that fine-tuning with scenario data improves performance, whereas formulation-only training is ineffective. However, performance gaps are only partially alleviated, highlighting contextual mathematical reasoning as a central unsolved challenge for LLMs.
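To make the two reformulation styles concrete, here is a minimal sketch with an invented toy problem; the wording and numbers are illustrative assumptions, not examples taken from the ContextMATH benchmark itself:

```python
# Toy illustration (invented, not drawn from the ContextMATH paper) of the
# two reformulation styles the benchmark applies to abstract problems.

# An abstract source problem, as it might appear in a MATH-style dataset:
abstract_problem = "Solve for x: 3x + 7 = 22."

# Scenario Grounding (SG): the same mathematical core embedded in a
# realistic narrative, without increasing reasoning complexity.
sg_version = (
    "A bakery charges a flat $7 delivery fee plus $3 per loaf. "
    "An order costs $22 in total. How many loaves were ordered?"
)

# Complexity Scaling (CS): an explicit condition (the $7 fee) is replaced
# by a sub-problem the solver must resolve first.
cs_version = (
    "A bakery charges a flat delivery fee, in dollars, equal to the number "
    "of days in a week, plus $3 per loaf. An order costs $22 in total. "
    "How many loaves were ordered?"
)

# In all three variants the underlying formulation is 3x + 7 = 22, so the
# model must recover the same equation before it can reason to the answer.
x = (22 - 7) / 3
print(x)  # 5.0
```

The point of the sketch is that SG and CS leave the formulated equation identical; only the path from text to equation changes, which is exactly where the paper locates the dominant errors.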
Problem

Research questions and friction points this paper is trying to address.

contextual mathematical reasoning
large language models
problem formulation
mathematical problem solving
real-world applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

contextual mathematical reasoning
problem formulation
ContextMATH benchmark
scenario grounding
complexity scaling