🤖 AI Summary
This work addresses the "alignment tax" commonly observed when extending large language models to low-resource languages, where gains in the target language often come at the cost of catastrophic forgetting of general capabilities. To mitigate this, the authors propose a reinforcement learning framework built on semantic rewards: a semantic-space alignment paradigm that replaces the token-level surface imitation of conventional supervised fine-tuning (SFT) with embedding-level semantic rewards. The approach uses Group Relative Policy Optimization (GRPO) instead of token-wise likelihood maximization. Evaluated on Tibetan–Chinese machine translation and Tibetan headline generation, the method markedly mitigates the alignment tax and preserves general capabilities more effectively than SFT. Despite less rigid surface overlap, it yields higher semantic quality and preference in open-ended generation, and few-shot transfer results indicate more transferable and robust representations under limited supervision.
📝 Abstract
Extending large language models (LLMs) to low-resource languages often incurs an "alignment tax": improvements in the target language come at the cost of catastrophic forgetting in general capabilities. We argue that this trade-off arises from the rigidity of supervised fine-tuning (SFT), which enforces token-level surface imitation on narrow and biased data distributions. To address this limitation, we propose a semantic-space alignment paradigm powered by Group Relative Policy Optimization (GRPO), where the model is optimized using embedding-level semantic rewards rather than likelihood maximization. This objective encourages meaning preservation through flexible realizations, enabling controlled updates that reduce destructive interference with pretrained knowledge. We evaluate our approach on Tibetan-Chinese machine translation and Tibetan headline generation. Experiments show that our method acquires low-resource capabilities while markedly mitigating alignment tax, preserving general competence more effectively than SFT. Despite producing less rigid surface overlap, semantic RL yields higher semantic quality and preference in open-ended generation, and few-shot transfer results indicate that it learns more transferable and robust representations under limited supervision. Overall, our study demonstrates that reinforcement learning with semantic rewards provides a safer and more reliable pathway for inclusive low-resource language expansion.
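The core idea of the abstract, scoring sampled outputs by embedding-level similarity and comparing them within a group, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the encoder, the exact reward, or the normalization. The sketch assumes cosine similarity between a candidate embedding and a reference embedding as the reward, and the commonly described GRPO advantage (reward minus group mean, divided by group standard deviation). The function names and the toy three-dimensional vectors are hypothetical and stand in for real sentence embeddings.

```python
import math
from statistics import mean, pstdev


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_rewards(candidate_embeddings, reference_embedding):
    """Assumed reward: similarity of each sampled output to the reference, in embedding space."""
    return [cosine_similarity(c, reference_embedding) for c in candidate_embeddings]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the choice of estimator is an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy example: one prompt, a group of three sampled outputs (hypothetical embeddings).
reference = [1.0, 0.0, 0.0]
group = [
    [1.0, 0.0, 0.0],  # same meaning as the reference
    [0.6, 0.8, 0.0],  # partially related
    [0.0, 1.0, 0.0],  # unrelated
]

rewards = semantic_rewards(group, reference)
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.3f} advantage={a:+.3f}")
```

Because the reward depends only on where an output lands in embedding space, two differently worded outputs with the same meaning receive similar rewards, which is the "flexible realizations" property the abstract contrasts with token-level imitation in SFT.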