CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations

📅 2026-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of aligning multilingual large language models with human preferences when no multilingual preference annotations are available. The authors use model-generated multilingual responses to construct cross-lingual contrastive preference pairs, ranking the responses within each language with a reward model trained only on English preferences (atop a multilingual base). The English-only reward model produces useful within-language rankings across most of the high- and low-resource languages tested, which enables cross-lingual preference transfer while avoiding the catastrophic forgetting seen with supervised fine-tuning. The gains depend on on-policy data: off-policy responses reduce the benefit. On structured tasks, the method matches or exceeds the base model in 6 of 7 languages for EuroLLM-9B and in 4 of 7 settings for Aya-3B; on open-ended generation, both tuned models win against their respective base models across the 11 evaluated languages.
📝 Abstract
Prior work establishes that controlled contrastiveness between self-generated responses from large language models, set via reward scores, improves downstream preference tuning in English. We extend this method to multiple languages and evaluate two models across a total of 14 high- and low-resource languages on a diverse set of tasks. Our central finding is that cross-lingual contrastive preference tuning on self-generations (CroCo) transfers without language-specific preference annotation. A reward model trained on English preferences (atop a multilingual base) produces useful within-language rankings across most languages, and pairing in either a monolingual or multilingual setting improves over each base model on the majority of setups while preventing the catastrophic forgetting of supervised fine-tuning. We observe that the gains require on-policy data. Off-policy responses reduce the benefit, and online preference optimization fails to improve over the offline variant. Specifically, on structured tasks, our method matches or exceeds the base in 6/7 languages for EuroLLM-9B and 4/7 settings for Aya-3B. On open-ended generation, both tuned models win against their respective base models across 11 evaluated languages. Overall, we show promising directions for multilingual preference tuning.
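To make the pairing idea in the abstract concrete, here is a minimal sketch of how within-language preference pairs might be built from reward-scored self-generations. It is an illustration under stated assumptions, not the authors' implementation: the `Candidate` type, the `build_pairs` function, the `target_gap` parameter, and the rule "take the top-scored response as chosen and pick the rejected response whose reward gap is closest to `target_gap`" are all hypothetical, since the abstract does not say how contrastiveness is controlled. The reward values are made up.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    prompt_id: str
    language: str
    response: str
    reward: float  # score from a reward model; the values below are invented


def build_pairs(candidates, target_gap):
    """Build one (chosen, rejected) pair per (prompt, language) group.

    Assumption: responses are ranked only against other responses to the
    same prompt in the same language. The chosen response is the top-scored
    one; the rejected response is the one whose reward gap to the chosen
    response is closest to `target_gap`.
    """
    groups = defaultdict(list)
    for cand in candidates:
        groups[(cand.prompt_id, cand.language)].append(cand)

    pairs = []
    for group in groups.values():
        if len(group) < 2:
            continue  # a pair needs at least two responses
        ranked = sorted(group, key=lambda c: c.reward, reverse=True)
        chosen = ranked[0]
        rejected = min(
            ranked[1:],
            key=lambda c: abs((chosen.reward - c.reward) - target_gap),
        )
        pairs.append((chosen, rejected))
    return pairs


if __name__ == "__main__":
    samples = [
        Candidate("p1", "de", "Antwort A", 0.9),
        Candidate("p1", "de", "Antwort B", 0.5),
        Candidate("p1", "de", "Antwort C", 0.1),
        Candidate("p1", "en", "Answer X", 0.8),
        Candidate("p1", "en", "Answer Y", 0.7),
    ]
    for chosen, rejected in build_pairs(samples, target_gap=0.4):
        print(chosen.language, "|", chosen.response, ">", rejected.response)
```

In this sketch, a monolingual training set would keep only the pairs for one language, while a multilingual set would pool the pairs from all languages; whether that matches the paper's two settings is an assumption.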
Problem

Research questions and friction points this paper is trying to address.

multilingual preference tuning
cross-lingual transfer
preference optimization
language models
self-generations
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-lingual preference tuning
contrastive learning
self-generation
multilingual LLMs
reward modeling