Unveiling the Mathematical Reasoning in DeepSeek Models: A Comparative Study of Large Language Models

📅 2025-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of systematic evaluation and cross-model comparability in assessing large language models’ (LLMs) mathematical reasoning capabilities. We conduct the first comprehensive, multi-dimensional evaluation of seven prominent LLMs—including DeepSeek-R1—across three canonical mathematical benchmarks: GSM8K, MATH, and AMC. Our methodology integrates accuracy measurement, response latency analysis, and attribution analysis of architectural design and training strategies. Key contributions are: (1) revealing that DeepSeek-R1 achieves state-of-the-art performance on GSM8K and MATH, whereas distilled variants suffer significant degradation; (2) demonstrating that model compression—particularly knowledge distillation—severely impairs mathematical reasoning; and (3) identifying three fundamental bottlenecks: weak symbolic manipulation, unstable chain-of-thought reasoning, and insufficient coverage of mathematical concepts in training data—thereby proposing architecture- and training-oriented improvements specifically for enhancing mathematical reasoning.

📝 Abstract
With the rapid evolution of Artificial Intelligence (AI), Large Language Models (LLMs) have reshaped the frontiers of various fields, spanning healthcare, public health, engineering, science, agriculture, education, arts, humanities, and mathematical reasoning. Among these advancements, DeepSeek models have emerged as noteworthy contenders, demonstrating promising capabilities that set them apart from their peers. While previous studies have conducted comparative analyses of LLMs, few have delivered a comprehensive evaluation of mathematical reasoning across a broad spectrum of LLMs. In this work, we aim to bridge this gap through an in-depth comparative study of the strengths and limitations of DeepSeek models relative to their leading counterparts. In particular, we systematically evaluate the mathematical reasoning performance of two DeepSeek models alongside five prominent LLMs across three independent benchmark datasets. The findings reveal several key insights: (1) DeepSeek-R1 consistently achieved the highest accuracy on two of the three datasets, demonstrating strong mathematical reasoning capabilities. (2) Distilled variants significantly underperformed compared to their peers, highlighting potential drawbacks of distillation techniques. (3) In terms of response time, Gemini 2.0 Flash demonstrated the fastest processing speed, outperforming other models in efficiency, which is a crucial factor for real-time applications. Beyond these quantitative assessments, we delve into how architecture, training, and optimization impact LLMs' mathematical reasoning. Moreover, our study goes beyond mere performance comparison by identifying key areas for future advancements in LLM-driven mathematical reasoning. This research enhances our understanding of LLMs' mathematical reasoning and lays the groundwork for future advancements.
Problem

Research questions and friction points this paper is trying to address.

Comparative evaluation of mathematical reasoning in DeepSeek and other LLMs.
Impact of architecture, training, and optimization on LLMs' mathematical reasoning.
Identification of key areas for future advancements in LLM-driven mathematical reasoning.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comparative study of DeepSeek models' mathematical reasoning
Evaluation across three benchmark datasets for accuracy
Analysis of architecture, training, and optimization impacts
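The evaluation protocol outlined above combines accuracy measurement with response-latency analysis over benchmark problems. A minimal sketch of such a harness is shown below; the `query_model` stub, the two sample problems, and the exact-match answer rule are illustrative placeholders, not the authors' actual benchmark code, which would call real model APIs over GSM8K, MATH, and AMC items.

```python
import time

# Hypothetical mini-benchmark in the style of GSM8K word problems:
# each item pairs a problem statement with a gold answer string.
PROBLEMS = [
    {"question": "If a pen costs 3 dollars and a notebook costs 5 dollars, what is the total cost?",
     "answer": "8"},
    {"question": "What is 12 * 7?", "answer": "84"},
]

def query_model(question):
    """Stub standing in for an LLM API call.

    A real harness would send the prompt to a model (e.g., DeepSeek-R1)
    and parse the final answer out of its chain-of-thought response.
    """
    canned = {
        "If a pen costs 3 dollars and a notebook costs 5 dollars, what is the total cost?": "8",
        "What is 12 * 7?": "84",
    }
    return canned[question]

def evaluate(problems, model_fn):
    """Return (accuracy, mean response latency in seconds) over a problem set."""
    correct, latencies = 0, []
    for item in problems:
        start = time.perf_counter()
        prediction = model_fn(item["question"])
        latencies.append(time.perf_counter() - start)
        if prediction.strip() == item["answer"]:
            correct += 1
    return correct / len(problems), sum(latencies) / len(latencies)

accuracy, mean_latency = evaluate(PROBLEMS, query_model)
print(f"accuracy={accuracy:.2f}, mean_latency={mean_latency:.6f}s")
```

Exact-match scoring is only reliable for short numeric answers; benchmarks like MATH typically require normalizing LaTeX expressions before comparison.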
Afrar Jahin
Graduate Research Assistant, Augusta University
Large Language Model, Machine Learning, Deep Learning
Arif Hassan Zidan
Graduate Research Assistant
Brain Imaging, Computational Neuroscience, Artificial Intelligence, Machine Learning
Yu Bao
Department of Graduate Psychology, James Madison University, Harrisonburg, VA, USA
Shizhe Liang
Institute of Plant Breeding, Genetics & Genomics, University of Georgia, Athens, GA, USA
Tianming Liu
Distinguished Research Professor of Computer Science, University of Georgia
Brain, Brain-Inspired AI, LLM, Artificial General Intelligence, Quantum AI
Wei Zhang
School of Computer and Cyber Sciences, Augusta University, Augusta, GA, USA