Beyond Single-Reward: Multi-Pair, Multi-Perspective Preference Optimization for Machine Translation

📅 2025-10-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Direct Preference Optimization (DPO) in machine translation faces two key challenges: (1) reward signals from quality evaluation models are often narrow, failing to capture critical errors such as translation hallucinations; and (2) reliance on a single win-loss pair underutilizes the rich preference signals embedded across multiple candidate translations. To address these, we propose M²PO, a Multi-Pair, Multi-Perspective Preference Optimization framework. M²PO introduces a robust, multi-faceted reward function that integrates a factuality (hallucination) penalty with a dynamic quality score adaptively fusing external quality evaluations and the model's own self-assessment. It further employs a systematic multi-pair construction strategy to extract fine-grained preference relations among candidates. Evaluated on the WMT21–22 benchmarks, M²PO significantly outperforms existing DPO methods, achieving translation fidelity and overall quality competitive with leading proprietary large language models.

📝 Abstract
Direct Preference Optimization (DPO) is a powerful paradigm for aligning Large Language Models (LLMs) to human preferences in Machine Translation (MT), but current methods are hindered by two fundamental challenges: (1) flawed reward signals from Quality Estimation (QE) models that overlook critical errors like translation hallucination, and (2) inefficient data utilization that discards valuable learning signals by selecting only a single win-loss pair. To address these limitations, we introduce M^2PO: Multi-Pair, Multi-Perspective Preference Optimization. Our framework integrates a multi-perspective reward engine that creates a more robust signal by combining two key viewpoints: a new hallucination penalty for factuality, and an innovative dynamic quality score that adaptively fuses external evaluations with the model's own evolving judgment. This is synergistically paired with a multi-pair construction strategy that systematically creates a comprehensive set of preference pairs from the entire pool of translation candidates. This synergistic approach ensures the model learns from a richer spectrum of quality trade-offs, leading to more robust and faithful translations. On challenging WMT21-22 benchmarks, M^2PO substantially outperforms existing preference optimization methods and demonstrates highly competitive performance against leading proprietary LLMs.
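The multi-perspective reward engine described above can be sketched as a simple fusion rule. This is a minimal illustration, not the paper's implementation: the function name, the mixing weight `alpha`, and the assumption that all inputs are pre-normalized scalar scores are hypothetical.

```python
def multi_perspective_reward(qe_score: float,
                             self_score: float,
                             halluc_penalty: float,
                             alpha: float = 0.5) -> float:
    """Hypothetical sketch of the multi-perspective reward.

    Fuses an external QE score with the model's own self-assessment via a
    dynamic weight `alpha` (which the paper adapts during training), then
    subtracts a factuality penalty for detected hallucinations.
    """
    fused = alpha * qe_score + (1.0 - alpha) * self_score
    return fused - halluc_penalty
```

In practice `alpha` would shift over training as the model's own judgment becomes more reliable; here it is a fixed constant purely for illustration.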
Problem

Research questions and friction points this paper is trying to address.

Addresses flawed reward signals in translation quality estimation
Improves data utilization by creating multiple preference pairs
Enhances translation faithfulness by penalizing hallucination errors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-perspective reward engine with a hallucination penalty for factuality
Dynamic quality score fusing external and internal evaluations
Multi-pair construction strategy utilizing all translation candidates
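The multi-pair construction strategy above can be sketched as follows: instead of keeping only the single best/worst pair, rank all candidates by reward and emit every win-loss pair whose quality gap clears a margin. The function name, the margin heuristic, and the `(text, reward)` tuple representation are illustrative assumptions, not the paper's exact procedure.

```python
from itertools import combinations


def build_preference_pairs(candidates, margin=0.1):
    """Hypothetical sketch of multi-pair construction.

    `candidates` is a list of (translation, reward) tuples. Candidates are
    ranked by reward, and every ordered (winner, loser) pair with a reward
    gap of at least `margin` is kept as a DPO training pair.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    pairs = []
    for (win_text, win_r), (lose_text, lose_r) in combinations(ranked, 2):
        if win_r - lose_r >= margin:
            pairs.append((win_text, lose_text))
    return pairs
```

With n candidates this yields up to n·(n-1)/2 pairs rather than one, which is the "richer spectrum of quality trade-offs" the abstract refers to.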