🤖 AI Summary
Synthetic data in Machine Translation Quality Estimation (QE) suffers from distributional shift: a mismatch between pseudo-translations and authentic translations, and a misalignment between pseudo-labels and human preferences. Method: We propose ADSQE, a novel framework that integrates constrained beam search, multi-model collaborative generation, reference-guided word-level annotation, and error-propagation-based phrase-level label inference. Crucially, it prohibits translation models from self-evaluating their own outputs, avoiding circular bias. Contribution/Results: ADSQE is the first to leverage reference translations to guide both synthetic data generation and fine-grained annotation, and it introduces a shortest-error-phrase identification mechanism aligned with human annotator behavior. Experiments demonstrate that ADSQE consistently outperforms state-of-the-art methods, including COMET, on both supervised and unsupervised QE benchmarks. Moreover, it significantly enhances the efficacy of synthetic data for reward model training.
📝 Abstract
Quality Estimation (QE) models evaluate the quality of machine translations without reference translations, serving as reward models for the translation task. Due to data scarcity, synthetic data generation has emerged as a promising solution. However, synthetic QE data often suffers from distribution shift, which can manifest as discrepancies between pseudo and real translations, or as pseudo labels that do not align with human preferences. To tackle this issue, we introduce ADSQE, a novel framework for alleviating distribution shift in synthetic QE data. To reduce the difference between pseudo and real translations, we employ the constrained beam search algorithm and enhance translation diversity through the use of distinct generation models. ADSQE uses references, i.e., translation supervision signals, to guide both the generation and annotation processes, enhancing the quality of word-level labels. ADSQE further identifies the shortest phrase covering consecutive error tokens, mimicking human annotation behavior, to assign the final phrase-level labels. Notably, we underscore that a translation model cannot accurately annotate its own translations. Extensive experiments demonstrate that ADSQE outperforms SOTA baselines like COMET in both supervised and unsupervised settings. Further analysis offers insights into synthetic data generation that could benefit reward models for other tasks.
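The phrase-level labeling step described above (finding the shortest phrase covering consecutive error tokens) can be sketched roughly as follows. This is a minimal illustration under assumptions, not the paper's actual implementation: the function name, the WMT-style `OK`/`BAD` word-level tags, and the span representation are all placeholders for whatever ADSQE uses internally.

```python
def error_phrase_spans(word_labels):
    """Group runs of consecutive "BAD" word-level tags into the shortest
    contiguous phrase spans, mimicking how a human annotator marks a
    minimal error phrase rather than scattered single tokens.

    word_labels: list of "OK"/"BAD" tags, one per translation token
                 (assumed format; the paper may use a different scheme).
    Returns a list of (start, end) token index pairs, end-exclusive.
    """
    spans = []
    start = None  # start index of the currently open error phrase, if any
    for i, tag in enumerate(word_labels):
        if tag == "BAD" and start is None:
            start = i  # open a new error phrase at the first bad token
        elif tag == "OK" and start is not None:
            spans.append((start, i))  # close the phrase before this OK token
            start = None
    if start is not None:  # phrase running to the end of the sentence
        spans.append((start, len(word_labels)))
    return spans
```

Each returned span is by construction the shortest interval covering its run of error tokens, since it opens at the first `BAD` tag and closes immediately at the next `OK` tag.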