🤖 AI Summary
This study investigates how the sparsity of context-rich samples in training data limits the performance of context-aware machine translation (CMT) models. Using datasets constructed with controlled proportions of context-dependent instances, the authors identify context sample sparsity as a critical bottleneck for effective contextual utilization; moreover, gains on one contextual phenomenon do not generalize to others, and cross-lingual transfer between languages within the same sub-family remains limited. To address this, two training strategies are proposed: context-aware data reweighting and phased context reinforcement training. On the ctxPro benchmark, these methods yield accuracy improvements of up to 6 and 8 percentage points in single- and multilingual settings, respectively, demonstrating improved discourse-level contextual understanding and offering a reproducible approach to CMT data curation and model training.
📝 Abstract
Achieving human-level translation requires leveraging context to ensure coherence and to handle complex phenomena such as pronoun disambiguation. The sparsity of contextually rich examples in standard training data has been hypothesized as the reason context utilization is difficult. In this work, we systematically validate this claim in both single- and multilingual settings by constructing training datasets with controlled proportions of contextually relevant examples. We demonstrate a strong association between training data sparsity and model performance, confirming sparsity as a key bottleneck. Importantly, we reveal that improvements in one contextual phenomenon do not generalize to others. While we observe some cross-lingual transfer, it is not significantly higher between languages within the same sub-family. Finally, we propose and empirically evaluate two training strategies designed to better leverage the available data. These strategies improve context utilization, resulting in accuracy gains of up to 6 and 8 percentage points on the ctxPro evaluation in single- and multilingual settings, respectively.
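The abstract names "context-aware data reweighting" without detailing the mechanism; a common way to realize such a strategy is to upweight the rare context-dependent pairs when sampling training batches. The sketch below is a minimal illustration of that general idea, not the paper's actual implementation: the `boost` factor, the `is_context_dependent` predicate, and the toy corpus are all hypothetical.

```python
import random

def reweight_sample(corpus, is_context_dependent, boost=4.0, k=8):
    """Draw a training batch with context-dependent pairs upweighted.

    corpus: list of (source, target) sentence pairs.
    is_context_dependent: predicate flagging pairs whose translation
        needs discourse context (e.g. an ambiguous pronoun).
    boost: hypothetical sampling weight applied to context-rich pairs.
    """
    weights = [boost if is_context_dependent(pair) else 1.0 for pair in corpus]
    return random.choices(corpus, weights=weights, k=k)

# Toy corpus: the German "sie" is ambiguous (she/they/it), so pairs
# containing it are treated as context-dependent here.
random.seed(0)  # for reproducibility of the sketch
corpus = [("Sie ist da.", "She is there."), ("Hallo.", "Hello.")] * 4
needs_ctx = lambda pair: "Sie" in pair[0]
batch = reweight_sample(corpus, needs_ctx, boost=4.0, k=8)
print(sum(needs_ctx(p) for p in batch), "context-dependent pairs in batch")
```

In expectation, the boosted pairs appear roughly `boost` times as often as unboosted ones, counteracting their sparsity in the raw corpus without discarding any data.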