M-MAD: Multidimensional Multi-Agent Debate Framework for Fine-grained Machine Translation Evaluation

📅 2024-12-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current LLM-as-a-judge approaches exhibit insufficient accuracy and reliability in machine translation quality evaluation. To address this, we propose a multidimensional multi-agent debate framework: first, fine-grained dimensions—including semantics, grammar, and terminology—are decoupled based on the MQM standard; second, specialized LLM agents engage in structured, cross-dimensional collaborative debate following a predefined protocol; finally, a hierarchical weighted aggregation mechanism produces an interpretable and robust holistic score. This work introduces the novel “multidimensional decoupling + debate-based coordination + hierarchical aggregation” paradigm, overcoming inherent limitations of single-model judgment. Experiments demonstrate that our framework consistently outperforms existing LLM-based evaluators across multiple benchmarks, achieving performance on par with state-of-the-art reference-based metrics (e.g., COMET), while maintaining strong efficacy even on lightweight models such as GPT-4o mini.
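
To make the three-stage pipeline concrete, here is a minimal Python sketch of how the stages might compose. It is illustrative only, not the authors' released code: `call_llm`, the prompts, the two-debater setup, and the dimension names are assumed placeholders (the repository linked in the abstract contains the actual implementation).

```python
# Minimal, hypothetical sketch of the M-MAD three-stage pipeline.
# NOT the authors' code: call_llm, prompts, and dimension names are placeholders.

# Stage 1 input: MQM criteria decoupled into fine-grained dimensions (assumed subset).
DIMENSIONS = ["accuracy", "fluency", "terminology"]

def call_llm(prompt: str) -> str:
    """Placeholder for a real chat-completion API call (e.g., to GPT-4o mini)."""
    return "Mistranslation of 'Bank' as 'bench'. MQM penalty: -5"

def debate_dimension(src: str, hyp: str, dim: str, rounds: int = 2) -> str:
    """Stage 2: two agents debate errors within one dimension, then a judge rules."""
    transcript: list[tuple[str, str]] = []
    for _ in range(rounds):
        for role in ("affirmative", "negative"):
            prompt = (
                f"You are the {role} debater for the MQM dimension '{dim}'.\n"
                f"Source: {src}\nTranslation: {hyp}\n"
                f"Debate so far: {transcript}\nGive your next argument."
            )
            transcript.append((role, call_llm(prompt)))
    judge_prompt = (
        f"As the judge for '{dim}', read this debate: {transcript}\n"
        f"Output the dimension-level MQM assessment for the translation."
    )
    return call_llm(judge_prompt)

def evaluate(src: str, hyp: str) -> str:
    """Stages 1-3: run per-dimension debates, then synthesize a final judgment."""
    verdicts = {dim: debate_dimension(src, hyp, dim) for dim in DIMENSIONS}
    final_prompt = (
        "Synthesize these dimension-level verdicts into a single overall "
        f"MQM score with a short justification: {verdicts}"
    )
    return call_llm(final_prompt)

if __name__ == "__main__":
    print(evaluate("Er ging zur Bank.", "He went to the bench."))
```

Swapping the `call_llm` stub for a real API client reproduces the LLM-as-a-judge setting; scoring each dimension in an isolated debate before aggregation is what makes the final judgment interpretable.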

📝 Abstract
Recent advancements in large language models (LLMs) have given rise to the LLM-as-a-judge paradigm, showcasing their potential to deliver human-like judgments. However, in the field of machine translation (MT) evaluation, current LLM-as-a-judge methods fall short of learned automatic metrics. In this paper, we propose Multidimensional Multi-Agent Debate (M-MAD), a systematic LLM-based multi-agent framework for advanced LLM-as-a-judge MT evaluation. Our findings demonstrate that M-MAD achieves significant advancements by (1) decoupling heuristic MQM criteria into distinct evaluation dimensions for fine-grained assessments; (2) employing multi-agent debates to harness the collaborative reasoning capabilities of LLMs; (3) synthesizing dimension-specific results into a final evaluation judgment to ensure robust and reliable outcomes. Comprehensive experiments show that M-MAD not only outperforms all existing LLM-as-a-judge methods but also competes with state-of-the-art reference-based automatic metrics, even when powered by a suboptimal model like GPT-4o mini. Detailed ablations and analysis highlight the superiority of our framework design, offering a fresh perspective for the LLM-as-a-judge paradigm. Our code and data are publicly available at https://github.com/SU-JIAYUAN/M-MAD.
Problem

Research questions and friction points this paper is trying to address.

Machine Translation
Quality Assessment
Language Models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-dimensional Evaluation
Multi-Agent Debate
Machine Translation Quality Assessment
Authors
Zhaopeng Feng
Zhejiang University
Jiayuan Su
Zhejiang University
LLM, Post-Training, Reasoning
Jiamei Zheng
Zhejiang University
Jiahan Ren
Zhejiang University
Yan Zhang
National University of Singapore
Jian Wu
Zhejiang University
Hongwei Wang
Zhejiang University
Zuozhu Liu
Assistant Professor, Zhejiang University/University of Illinois Urbana-Champaign
deep learning, vision-language models, medical AI