🤖 AI Summary
Cross-lingual open-ended generation -- where the input and output languages differ -- has long lacked systematic evaluation benchmarks and high-quality training data. To address this, the paper proposes: (1) XL-AlpacaEval, the first benchmark designed specifically for cross-lingual open-ended generation; (2) XL-Instruct, a synthetic data generation method that combines iterative backtranslation with large models and multi-dimensional quality filtering to produce high-fidelity, cross-lingually aligned instruction data; and (3) fine-tuning on only 8K synthetic instructions, which raises the win rate against GPT-4o-Mini on XL-AlpacaEval from 7.4% to 21.5%, with significant gains on several fine-grained metrics. The resulting models also show strong zero-shot cross-lingual and cross-task generalization.
📝 Abstract
Cross-lingual open-ended generation -- i.e., generating responses in a desired language different from that of the user's query -- is an important yet understudied problem. We introduce XL-AlpacaEval, a new benchmark for evaluating the cross-lingual generation capabilities of Large Language Models (LLMs), and propose XL-Instruct, a high-quality synthetic data generation method. Fine-tuning with just 8K XL-Instruct-generated instructions significantly improves model performance, increasing the win rate against GPT-4o-Mini from 7.4% to 21.5% and yielding gains on several fine-grained quality metrics. Additionally, models fine-tuned on XL-Instruct exhibit strong zero-shot transfer to both English-only and multilingual generation tasks. Given its consistent gains across the board, we strongly recommend incorporating XL-Instruct into the post-training pipelines of future multilingual LLMs. To facilitate further research, we will publicly and freely release the XL-Instruct and XL-AlpacaEval datasets, which are among the few cross-lingual resources currently available in the literature.
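The headline metric here is a pairwise win rate against a reference model (GPT-4o-Mini). As a rough illustrative sketch only -- not the paper's exact evaluation protocol -- a win rate can be computed from a list of per-example judge verdicts, with ties counted as half a win (one common convention, assumed here):

```python
def win_rate(judgments: list[str]) -> float:
    """Return the win rate (%) from judge verdicts against a reference model.

    judgments: per-example verdicts, each "win", "tie", or "loss".
    Ties count as half a win (an assumed convention for this sketch).
    """
    score = sum(
        1.0 if j == "win" else 0.5 if j == "tie" else 0.0
        for j in judgments
    )
    return 100.0 * score / len(judgments)


# Toy example: 1 win, 1 tie, 2 losses out of 4 comparisons -> 37.5%
print(win_rate(["win", "tie", "loss", "loss"]))
```

In practice, benchmarks like AlpacaEval obtain these verdicts from an LLM judge that compares the candidate's and reference model's responses to the same prompt.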