Unveiling the Mathematical Reasoning in DeepSeek Models: A Comparative Study of Large Language Models

📅 2025-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of systematic evaluation and cross-model comparability in assessing large language models’ (LLMs) mathematical reasoning capabilities. We conduct the first comprehensive, multi-dimensional evaluation of seven prominent LLMs—including DeepSeek-R1—across three canonical mathematical benchmarks: GSM8K, MATH, and AMC. Our methodology integrates accuracy measurement, response latency analysis, and attribution analysis of architectural design and training strategies. Key contributions are: (1) revealing that DeepSeek-R1 achieves state-of-the-art performance on GSM8K and MATH, whereas distilled variants suffer significant degradation; (2) demonstrating that model compression—particularly knowledge distillation—severely impairs mathematical reasoning; and (3) identifying three fundamental bottlenecks: weak symbolic manipulation, unstable chain-of-thought reasoning, and insufficient coverage of mathematical concepts in training data—thereby proposing architecture- and training-oriented improvements specifically for enhancing mathematical reasoning.

📝 Abstract
With the rapid evolution of Artificial Intelligence (AI), Large Language Models (LLMs) have reshaped the frontiers of various fields, spanning healthcare, public health, engineering, science, agriculture, education, arts, humanities, and mathematical reasoning. Among these advancements, DeepSeek models have emerged as noteworthy contenders, demonstrating promising capabilities that set them apart from their peers. While previous studies have conducted comparative analyses of LLMs, few have delivered a comprehensive evaluation of mathematical reasoning across a broad spectrum of LLMs. In this work, we aim to bridge this gap through an in-depth comparative study of the strengths and limitations of DeepSeek models relative to their leading counterparts. In particular, we systematically evaluate the mathematical reasoning performance of two DeepSeek models alongside five prominent LLMs across three independent benchmark datasets. The findings reveal several key insights: (1) DeepSeek-R1 consistently achieved the highest accuracy on two of the three datasets, demonstrating strong mathematical reasoning capabilities. (2) Distilled variants significantly underperformed compared to their peers, highlighting potential drawbacks of distillation techniques. (3) In terms of response time, Gemini 2.0 Flash demonstrated the fastest processing speed, outperforming other models in efficiency, which is a crucial factor for real-time applications. Beyond these quantitative assessments, we delve into how architecture, training, and optimization impact LLMs' mathematical reasoning. Moreover, our study goes beyond mere performance comparison by identifying key areas for future advancements in LLM-driven mathematical reasoning. This research enhances our understanding of LLMs' mathematical reasoning and lays the groundwork for future advancements.
Problem

Research questions and friction points this paper is trying to address.

Comparative evaluation of mathematical reasoning in DeepSeek and other LLMs.
Impact of architecture, training, and optimization on LLMs' mathematical reasoning.
Identification of key areas for future advancements in LLM-driven mathematical reasoning.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comparative study of DeepSeek models' mathematical reasoning
Evaluation across three benchmark datasets for accuracy
Analysis of architecture, training, and optimization impacts
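The evaluation protocol outlined above combines accuracy measurement with response-latency analysis over benchmark problems. A minimal sketch of such a harness is shown below; the `query_model` stub, the two sample problems, and the exact-match answer rule are illustrative placeholders, not the authors' actual benchmark code, which would call real model APIs over GSM8K, MATH, and AMC items.

```python
import time

# Hypothetical mini-benchmark in the style of GSM8K word problems:
# each item pairs a problem statement with a gold answer string.
PROBLEMS = [
    {"question": "If a pen costs 3 dollars and a notebook costs 5 dollars, what is the total cost?",
     "answer": "8"},
    {"question": "What is 12 * 7?", "answer": "84"},
]

def query_model(question):
    """Stub standing in for an LLM API call.

    A real harness would send the prompt to a model (e.g., DeepSeek-R1)
    and parse the final answer out of its chain-of-thought response.
    """
    canned = {
        "If a pen costs 3 dollars and a notebook costs 5 dollars, what is the total cost?": "8",
        "What is 12 * 7?": "84",
    }
    return canned[question]

def evaluate(problems, model_fn):
    """Return (accuracy, mean response latency in seconds) over a problem set."""
    correct, latencies = 0, []
    for item in problems:
        start = time.perf_counter()
        prediction = model_fn(item["question"])
        latencies.append(time.perf_counter() - start)
        if prediction.strip() == item["answer"]:
            correct += 1
    return correct / len(problems), sum(latencies) / len(latencies)

accuracy, mean_latency = evaluate(PROBLEMS, query_model)
print(f"accuracy={accuracy:.2f}, mean_latency={mean_latency:.6f}s")
```

Exact-match scoring is only reliable for short numeric answers; benchmarks like MATH typically require normalizing LaTeX expressions before comparison.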
Afrar Jahin
Graduate Research Assistant, Augusta University
Large Language Model, Machine Learning, Deep Learning
Arif Hassan Zidan
Graduate Research Assistant
Brain Imaging, Computational Neuroscience, Artificial Intelligence, Machine Learning
Yu Bao
Department of Graduate Psychology, James Madison University, Harrisonburg, VA, USA
Shizhe Liang
Institute of Plant Breeding, Genetics & Genomics, University of Georgia, Athens, GA, USA
Tianming Liu
Distinguished Research Professor of Computer Science, University of Georgia
Brain, Brain-Inspired AI, LLM, Artificial General Intelligence, Quantum AI
Wei Zhang
School of Computer and Cyber Sciences, Augusta University, Augusta, GA, USA