You Are What You Train: Effects of Data Composition on Training Context-aware Machine Translation Models

📅 2025-09-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates how the sparsity of context-rich samples in training data limits the performance of context-aware machine translation (CMT) models. Using controllably constructed datasets with varying proportions of context-dependent instances, we identify context-sample sparsity as a critical bottleneck to effective context utilization; moreover, gains on one contextual phenomenon do not generalize to others, and cross-lingual transfer is not significantly stronger between languages of the same sub-family. To address this, we propose two training strategies: context-aware data reweighting and phased context reinforcement training. Evaluated on the ctxPro benchmark, our methods yield accuracy improvements of up to 6 and 8 percentage points in monolingual and multilingual settings, respectively. These results demonstrate substantially improved discourse-level context understanding and integration, and our approach provides a reproducible recipe for CMT data curation and model training.
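The listing names the two strategies but gives no implementation details. The Python sketch below shows one plausible reading, assuming a corpus whose examples carry a context-dependence flag; the `context_dependent` field, the `context_weight` value, and the phase schedule are illustrative assumptions, not the paper's actual method.

```python
import random

# Hypothetical training examples; the `context_dependent` flag marks samples
# whose correct translation requires preceding context (e.g. an ambiguous
# pronoun). Field names and examples are assumptions for illustration.
corpus = [
    {"src": "Sie ist hier.", "tgt": "She is here.", "context_dependent": True},
    {"src": "Guten Morgen.", "tgt": "Good morning.", "context_dependent": False},
]

def reweight(examples, context_weight=4.0):
    """Context-aware data reweighting: sample context-dependent examples
    more often to counteract their sparsity in the corpus."""
    weights = [context_weight if ex["context_dependent"] else 1.0
               for ex in examples]
    return random.choices(examples, weights=weights, k=len(examples))

def phased_context_reinforcement(examples, phases=((0.1, 2), (0.3, 2), (0.5, 1))):
    """Phased context reinforcement: each later phase draws epochs with a
    higher proportion of context-dependent examples. `phases` pairs a target
    proportion with a number of epochs (an assumed schedule)."""
    ctx = [ex for ex in examples if ex["context_dependent"]]
    plain = [ex for ex in examples if not ex["context_dependent"]]
    for target_ratio, n_epochs in phases:
        for _ in range(n_epochs):
            n_ctx = int(target_ratio * len(examples))
            epoch = (random.choices(ctx, k=n_ctx)
                     + random.choices(plain, k=len(examples) - n_ctx))
            random.shuffle(epoch)
            yield epoch  # each yielded epoch would feed the MT trainer
```

Sampling with replacement keeps epoch size constant while raising the effective frequency of the sparse context-dependent examples; an alternative reading of "reweighting" would be loss-level weighting inside the trainer.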

📝 Abstract
Achieving human-level translations requires leveraging context to ensure coherence and handle complex phenomena like pronoun disambiguation. Sparsity of contextually rich examples in the standard training data has been hypothesized as the reason for the difficulty of context utilization. In this work, we systematically validate this claim in both single- and multilingual settings by constructing training datasets with controlled proportions of contextually relevant examples. We demonstrate a strong association between training data sparsity and model performance, confirming sparsity as a key bottleneck. Importantly, we reveal that improvements in one contextual phenomenon do not generalize to others. While we observe some cross-lingual transfer, it is not significantly higher between languages within the same sub-family. Finally, we propose and empirically evaluate two training strategies designed to leverage the available data. These strategies improve context utilization, resulting in accuracy gains of up to 6 and 8 percentage points on the ctxPro evaluation in single- and multilingual settings, respectively.
Problem

Research questions and friction points this paper is trying to address.

Investigating the impact of context-sample sparsity on context-aware machine translation
Examining why gains on one contextual phenomenon fail to generalize to others
Developing training strategies that make better use of the available context-rich data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Controlled dataset construction with varying proportions of context-dependent examples (see the sketch after this list)
Two training strategies, context-aware data reweighting and phased context reinforcement, that improve context utilization
Empirical evaluation on ctxPro showing accuracy gains of up to 6 and 8 percentage points
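As a companion to the first bullet, here is a minimal sketch of how a training set with a controlled proportion of context-dependent examples might be assembled; the pool names, sizes, and proportion sweep are assumptions for illustration, not the paper's actual construction procedure.

```python
import random

def build_controlled_dataset(ctx_pool, plain_pool, size, ctx_proportion, seed=0):
    """Assemble `size` training examples in which a fixed `ctx_proportion`
    are context-dependent, so model performance can be charted against
    context sparsity. Pools must be large enough to sample without
    replacement; all names here are illustrative."""
    rng = random.Random(seed)
    n_ctx = round(size * ctx_proportion)
    dataset = rng.sample(ctx_pool, n_ctx) + rng.sample(plain_pool, size - n_ctx)
    rng.shuffle(dataset)
    return dataset

# A sweep over proportions would yield one training set per sparsity level:
# datasets = {p: build_controlled_dataset(ctx_pool, plain_pool, 100_000, p)
#             for p in (0.0, 0.05, 0.10, 0.25, 0.50)}
```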