🤖 AI Summary
Current LLM-as-a-judge approaches exhibit insufficient accuracy and reliability in machine translation quality evaluation. To address this, we propose a multidimensional multi-agent debate framework: first, fine-grained dimensions—including semantics, grammar, and terminology—are decoupled based on the MQM standard; second, specialized LLM agents engage in structured, cross-dimensional collaborative debate following a predefined protocol; finally, a hierarchical weighted aggregation mechanism produces an interpretable and robust holistic score. This work introduces the novel “multidimensional decoupling + debate-based coordination + hierarchical aggregation” paradigm, overcoming inherent limitations of single-model judgment. Experiments demonstrate that our framework consistently outperforms existing LLM-based evaluators across multiple benchmarks, achieving performance on par with state-of-the-art reference-based metrics (e.g., COMET), while maintaining strong efficacy even on lightweight models such as GPT-4o mini.
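The final stage described above, combining per-dimension verdicts into one holistic score, can be sketched as a simple weighted aggregation. The dimension names, the 0–100 score scale, and the weights below are illustrative assumptions, not values from the paper:

```python
# Illustrative sketch of the hierarchical weighted aggregation stage.
# Dimensions and weights are hypothetical examples, not M-MAD's actual values.

def aggregate(dimension_scores: dict[str, float],
              weights: dict[str, float]) -> float:
    """Combine per-dimension debate outcomes into one holistic score."""
    total_weight = sum(weights[d] for d in dimension_scores)
    return sum(dimension_scores[d] * weights[d]
               for d in dimension_scores) / total_weight

# Example: scores (0-100) produced by the per-dimension debates.
scores = {"semantics": 85.0, "grammar": 92.0, "terminology": 78.0}
weights = {"semantics": 0.5, "grammar": 0.3, "terminology": 0.2}
print(aggregate(scores, weights))  # prints 85.7 (weighted mean)
```

Normalizing by the total weight keeps the holistic score on the same scale as the per-dimension scores, which is what makes the aggregate interpretable alongside its components.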
📝 Abstract
Recent advancements in large language models (LLMs) have given rise to the LLM-as-a-judge paradigm, showcasing their potential to deliver human-like judgments. However, in the field of machine translation (MT) evaluation, current LLM-as-a-judge methods fall short of learned automatic metrics. In this paper, we propose Multidimensional Multi-Agent Debate (M-MAD), a systematic LLM-based multi-agent framework for advanced LLM-as-a-judge MT evaluation. Our findings demonstrate that M-MAD achieves significant advancements by (1) decoupling heuristic MQM criteria into distinct evaluation dimensions for fine-grained assessments; (2) employing multi-agent debates to harness the collaborative reasoning capabilities of LLMs; (3) synthesizing dimension-specific results into a final evaluation judgment to ensure robust and reliable outcomes. Comprehensive experiments show that M-MAD not only outperforms all existing LLM-as-a-judge methods but also competes with state-of-the-art reference-based automatic metrics, even when powered by a suboptimal model like GPT-4o mini. Detailed ablations and analysis highlight the superiority of our framework design, offering a fresh perspective on the LLM-as-a-judge paradigm. Our code and data are publicly available at https://github.com/SU-JIAYUAN/M-MAD.
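The per-dimension multi-agent debate in step (2) can be schematically sketched as agents taking turns appending arguments to a shared transcript for a fixed number of rounds. The agent roles, round count, and data structure here are illustrative assumptions, not the paper's exact protocol:

```python
# Schematic of one per-dimension debate; agent roles and round count
# are hypothetical, not M-MAD's actual protocol.
from dataclasses import dataclass, field

@dataclass
class DebateState:
    dimension: str
    transcript: list[str] = field(default_factory=list)

def run_debate(dimension: str, agents, rounds: int = 2) -> DebateState:
    """Agents take turns arguing about one evaluation dimension."""
    state = DebateState(dimension)
    for _ in range(rounds):
        for agent in agents:
            # In a real run each agent would be an LLM call conditioned on
            # the source, translation, and transcript so far (assumption).
            state.transcript.append(agent(state))
    return state

# Toy stand-in agents; an actual system would replace these with LLM calls.
affirm = lambda s: f"{s.dimension}: translation preserves meaning"
critic = lambda s: f"{s.dimension}: possible error flagged"

state = run_debate("semantics", [affirm, critic], rounds=2)
print(len(state.transcript))  # 2 rounds x 2 agents = 4 turns
```

One such debate would run per MQM dimension, and its outcome would feed the aggregation stage that produces the final judgment.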