DeepSeek vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization?

📅 2025-04-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work systematically evaluates the effectiveness of reasoning-oriented large language models (e.g., o3-mini, DeepSeek-R1) in automatic evaluation of machine translation and text summarization. Using the WMT23 and SummEval benchmarks, we compare eight models spanning 8B–70B parameters and three architectural categories—reasoning, distilled, and non-reasoning—via correlation analysis and controlled-variable experiments. Our key findings are: (1) reasoning capability yields task-conditional gains in NLG evaluation—o3-mini consistently improves with stronger reasoning, whereas DeepSeek-R1 underperforms its non-reasoning baseline on most MT tasks; (2) reasoning token usage positively correlates with evaluation quality; (3) 32B distilled models robustly preserve performance, while 8B variants suffer significant degradation. This study is the first to reveal the conditional efficacy and scale dependency of reasoning mechanisms in NLG evaluation.

📝 Abstract
Reasoning-enabled large language models (LLMs) have recently demonstrated impressive performance in complex logical and mathematical tasks, yet their effectiveness in evaluating natural language generation remains unexplored. This study systematically compares reasoning-based LLMs (DeepSeek-R1 and OpenAI o3-mini) with their non-reasoning counterparts across machine translation (MT) and text summarization (TS) evaluation tasks. We evaluate eight models across three architectural categories, including state-of-the-art reasoning models, their distilled variants (ranging from 8B to 70B parameters), and equivalent conventional, non-reasoning LLMs. Our experiments on the WMT23 and SummEval benchmarks reveal that the benefits of reasoning capabilities are highly model- and task-dependent: while OpenAI o3-mini models show consistent performance improvements with increased reasoning intensity, DeepSeek-R1 underperforms compared to its non-reasoning variant, with the exception of certain aspects of TS evaluation. Correlation analysis demonstrates that increased reasoning token usage positively correlates with evaluation quality in o3-mini models. Furthermore, our results show that distillation of reasoning capabilities maintains reasonable performance in medium-sized models (32B) but degrades substantially in smaller variants (8B). This work provides the first comprehensive assessment of reasoning LLMs for NLG evaluation and offers insights into their practical use.
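The meta-evaluation described above works by correlating an LLM judge's quality scores with human judgments over the same outputs. Below is a minimal, stdlib-only sketch of one such correlation measure (Spearman's rank correlation); the paper's exact choice of coefficient may differ (e.g. Kendall or Pearson), and the scores shown are illustrative, not taken from the paper.

```python
def ranks(xs):
    """Average 1-based ranks, handling ties by averaging tied positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of tied positions, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical example: LLM-judge scores vs. human scores for five segments.
llm_scores = [78, 92, 65, 88, 70]
human_scores = [75, 95, 60, 90, 72]
print(round(spearman(llm_scores, human_scores), 3))  # → 1.0 (identical rankings)
```

A higher correlation means the LLM judge ranks outputs more like the human annotators do, which is the quantity the paper compares across reasoning, distilled, and non-reasoning models.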
Problem

Research questions and friction points this paper is trying to address.

Evaluating reasoning LLMs' effectiveness as automatic evaluators for NLG
Comparing reasoning vs. non-reasoning LLMs on MT and TS evaluation
Assessing how model size affects distillation of reasoning capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

First systematic assessment of reasoning LLMs as MT and summarization evaluators
Controlled comparison of reasoning, distilled, and non-reasoning LLMs
Evidence that distillation preserves evaluation performance at 32B but not at 8B