How Far Are We? Systematic Evaluation of LLMs vs. Human Experts in Mathematical Contest in Modeling

📅 2026-04-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study systematically evaluates the capability of large language models (LLMs) to solve end-to-end complex mathematical modeling tasks and quantifies their performance gap relative to human experts. To this end, we introduce the first problem-oriented, stage-wise evaluation framework that integrates phased automated scoring, double-blind expert review, error attribution, and cross-stage consistency validation, ensuring high alignment between assessment criteria and expert judgment. Our findings reveal that while current LLMs perform adequately in the problem understanding phase, they exhibit significant and persistent deficiencies in execution-intensive stages—particularly model formulation, code implementation, and result interpretation. Notably, this bottleneck persists despite increases in model scale, highlighting a critical capability gap between comprehension and execution in complex reasoning tasks.
📝 Abstract
Large language models (LLMs) have achieved strong performance on reasoning benchmarks, yet their ability to solve real-world problems requiring end-to-end workflows remains unclear. Mathematical modeling competitions provide a stringent testbed for evaluating such end-to-end problem-solving capability. We propose a problem-oriented, stage-wise evaluation framework that assesses LLM performance across modeling stages using expert-verified criteria. We validate the framework's reliability by comparing automatic scores with independent human expert judgments on problems from the China Postgraduate Mathematical Contest in Modeling, demonstrating substantially stronger alignment than existing evaluation schemes. Using this framework, we reveal a comprehension-execution gap in state-of-the-art LLMs: while they perform well in early stages such as problem identification and formulation, they exhibit persistent deficiencies in execution-oriented stages including model solving, code implementation, and result analysis. These gaps persist even with increased model scale. We further trace these failures to insufficient specification, missing verification, and lack of validation, with errors propagating across stages without correction. Our findings suggest that bridging this gap requires approaches beyond model scaling, offering insights for applying LLMs to complex real-world problem solving.
Problem

Research questions and friction points this paper is trying to address.

large language models
mathematical modeling
end-to-end problem solving
evaluation framework
human expert comparison
Innovation

Methods, ideas, or system contributions that make the work stand out.

stage-wise evaluation
mathematical modeling
LLM limitations
comprehension-execution gap
expert-aligned assessment
🔎 Similar Papers
No similar papers found.
Yuhang Liu
Yuhang Liu
The University of Adelaide
Representation LearningLLMsLatent Variable ModelsResponsible AI
H
Heyan Huang
Beijing Institute of Technology, Beijing, China; Southeast Academy of Information Technology, Putian, China
Yizhe Yang
Yizhe Yang
Beijing Institute of Technology
NLPDialogue
H
Hongyan Zhao
Beijing Institute of Technology, Beijing, China
Z
Zhizhuo Zeng
Beijing Institute of Technology, Beijing, China; Southeast Academy of Information Technology, Putian, China
Yang Gao
Yang Gao
Beijing Institute of Technology
Large Language ModelSummarizationIntelligent Applications