🤖 AI Summary
Direct Preference Optimization (DPO) in machine translation faces two key challenges: (1) reward signals from quality estimation models are often narrow, failing to capture critical errors such as translation hallucinations; and (2) reliance on a single win-loss pair underutilizes the rich preference signals embedded across multiple candidate translations. To address these, we propose M²PO, a Multi-Pair, Multi-Perspective Preference Optimization framework. M²PO introduces a robust, multi-perspective reward function that combines a factuality (hallucination) penalty with a dynamic quality score fusing external quality evaluations and the model's own self-assessment. It further employs a systematic multi-pair construction strategy to extract fine-grained preference relations among candidates. Evaluated on the WMT21-22 benchmarks, M²PO significantly outperforms existing DPO methods, achieving translation fidelity and overall quality competitive with leading proprietary large language models.
📝 Abstract
Direct Preference Optimization (DPO) is a powerful paradigm for aligning Large Language Models (LLMs) with human preferences in Machine Translation (MT), but current methods are hindered by two fundamental challenges: (1) flawed reward signals from Quality Estimation (QE) models that overlook critical errors like translation hallucination, and (2) inefficient data utilization that discards valuable learning signals by selecting only a single win-loss pair. To address these limitations, we introduce M²PO: Multi-Pair, Multi-Perspective Preference Optimization. Our framework integrates a multi-perspective reward engine that creates a more robust signal by combining two key viewpoints: a new hallucination penalty for factuality, and an innovative dynamic quality score that adaptively fuses external evaluations with the model's own evolving judgment. This reward engine is paired with a multi-pair construction strategy that systematically creates a comprehensive set of preference pairs from the entire pool of translation candidates. Together, these components ensure the model learns from a richer spectrum of quality trade-offs, leading to more robust and faithful translations. On challenging WMT21-22 benchmarks, M²PO substantially outperforms existing preference optimization methods and demonstrates highly competitive performance against leading proprietary LLMs.
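To make the two ideas concrete, here is a minimal sketch of how a multi-perspective reward and multi-pair construction could be combined. All names, the fusion weight `alpha`, and the `margin` threshold are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: fuse an external QE score with the model's
# self-assessment, subtract a hallucination penalty, then build
# every sufficiently separated (win, loss) pair from the candidate
# pool instead of keeping only the single best/worst pair.

def multi_perspective_reward(qe_score, self_score, halluc_penalty, alpha=0.5):
    """Weighted fusion of external and self-assessed quality,
    minus a factuality (hallucination) penalty."""
    return alpha * qe_score + (1.0 - alpha) * self_score - halluc_penalty

def build_preference_pairs(candidates, margin=0.1):
    """From (translation, reward) candidates, emit all (win, loss)
    pairs whose reward gap exceeds `margin`."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    pairs = []
    for i in range(len(ranked)):
        for j in range(i + 1, len(ranked)):
            if ranked[i][1] - ranked[j][1] > margin:
                pairs.append((ranked[i][0], ranked[j][0]))
    return pairs

pool = [
    ("cand_a", multi_perspective_reward(0.9, 0.8, 0.0)),  # reward 0.85
    ("cand_b", multi_perspective_reward(0.7, 0.6, 0.0)),  # reward 0.65
    ("cand_c", multi_perspective_reward(0.8, 0.9, 0.5)),  # hallucinated: 0.35
]
pairs = build_preference_pairs(pool)
# Yields three pairs rather than the single (best, worst) pair.
```

Note how the hallucination penalty demotes `cand_c` below `cand_b` even though its raw quality scores are higher, and how all three candidates contribute preference pairs rather than just the extremes.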