🤖 AI Summary
Existing research indicates that preference optimization in machine translation is prone to catastrophic forgetting of easy samples due to suboptimal training data ordering, which degrades overall performance. To address this issue, this work proposes CLewR, a curriculum learning strategy with a restart mechanism that, for the first time, systematically incorporates an easy-to-hard, multi-cycle data scheduling approach into preference optimization. This method effectively mitigates catastrophic forgetting and seamlessly integrates with mainstream preference optimization algorithms such as DPO and IPO. Consistent performance improvements are demonstrated across multiple large language models, including Gemma2, Qwen2.5, and Llama3.1. The implementation code has been made publicly available.
📝 Abstract
Large language models (LLMs) have demonstrated competitive performance in zero-shot multilingual machine translation (MT). Some follow-up works further improved MT performance via preference optimization, but they leave a key aspect largely underexplored: the order in which data samples are presented during training. We address this topic by integrating curriculum learning into various state-of-the-art preference optimization algorithms to boost MT performance. We introduce a novel curriculum learning strategy with restarts (CLewR), which repeats the easy-to-hard curriculum multiple times during training to effectively mitigate the catastrophic forgetting of easy examples. We demonstrate consistent gains across several model families (Gemma2, Qwen2.5, Llama3.1) and preference optimization techniques. We publicly release our code at https://github.com/alexandra-dragomir/CLewR.
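The core scheduling idea — an easy-to-hard pass over the preference data, repeated across multiple cycles so that easy examples are revisited — can be sketched as below. This is a minimal illustration, not the paper's implementation: the function name `clewr_schedule`, the `difficulty` callback, and the use of a reward margin as a difficulty proxy are all assumptions made for the example.

```python
def clewr_schedule(samples, difficulty, num_restarts=3):
    """Order samples easy-to-hard and repeat the curriculum num_restarts
    times. A sketch of curriculum learning with restarts; names and the
    difficulty scoring are illustrative assumptions, not CLewR's exact code."""
    # Sort once by the provided difficulty score (lower = easier).
    ordered = sorted(samples, key=difficulty)
    # Restart the easy-to-hard pass so easy examples are seen again
    # in later cycles, mitigating catastrophic forgetting.
    schedule = []
    for _ in range(num_restarts):
        schedule.extend(ordered)
    return schedule

# Example: preference pairs scored by a (hypothetical) reward margin,
# treating a larger chosen-vs-rejected margin as an easier pair.
pairs = [
    {"id": "a", "margin": 0.9},
    {"id": "b", "margin": 0.2},
    {"id": "c", "margin": 0.5},
]
schedule = clewr_schedule(pairs, difficulty=lambda p: -p["margin"], num_restarts=2)
print([p["id"] for p in schedule])  # easy-to-hard, twice: a, c, b, a, c, b
```

In practice the difficulty signal could come from any per-sample score (e.g. model loss or reward-margin statistics); the key property is that each restart begins again from the easiest samples rather than continuing only with hard ones.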