XL-Instruct: Synthetic Data for Cross-Lingual Open-Ended Generation

📅 2025-03-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Cross-lingual open-ended generation, where the input and output languages differ, has long lacked systematic evaluation benchmarks and high-quality training data. To address this, the paper proposes: (1) XL-AlpacaEval, a new benchmark designed specifically for evaluating cross-lingual open-ended generation; (2) XL-Instruct, a synthetic data generation method that combines LLM-based iterative backtranslation with multi-dimensional quality filtering to produce high-fidelity, linguistically aligned instruction data across languages; and (3) fine-tuning on only 8K of the resulting synthetic instructions, which raises the win rate against GPT-4o-Mini on XL-AlpacaEval from 7.4% to 21.5%, with consistent gains across several fine-grained metrics. The fine-tuned models also show strong zero-shot generalization to English-only and multilingual generation tasks.
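The generate-then-filter idea behind such a pipeline can be sketched as below. This is only an illustrative sketch, not the paper's actual implementation: the function names, prompts, and the 0.7 threshold are hypothetical placeholders, and the real XL-Instruct method's backtranslation and scoring details are not reproduced here.

```python
# Hypothetical sketch of a generate-then-filter synthetic data pipeline.
# All names and the quality threshold are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Example:
    instruction: str  # instruction in the source language
    response: str     # response in the target language
    quality: float    # aggregate quality score in [0, 1]

def generate_cross_lingual_data(
    seed_texts: list[str],
    instruct_fn: Callable[[str], str],      # seed text -> synthetic instruction
    respond_fn: Callable[[str], str],       # instruction -> target-language response
    score_fn: Callable[[str, str], float],  # (instruction, response) -> quality score
    threshold: float = 0.7,
) -> list[Example]:
    """Generate instruction/response pairs, keeping only high-quality ones."""
    kept = []
    for text in seed_texts:
        instruction = instruct_fn(text)
        response = respond_fn(instruction)
        score = score_fn(instruction, response)
        if score >= threshold:  # multi-dimensional filtering collapsed to one score
            kept.append(Example(instruction, response, score))
    return kept
```

In practice each callable would wrap an LLM call, and the single score would aggregate several quality dimensions; the sketch only shows the control flow.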

📝 Abstract
Cross-lingual open-ended generation -- i.e. generating responses in a desired language different from that of the user's query -- is an important yet understudied problem. We introduce XL-AlpacaEval, a new benchmark for evaluating cross-lingual generation capabilities in Large Language Models (LLMs), and propose XL-Instruct, a high-quality synthetic data generation method. Fine-tuning with just 8K XL-Instruct-generated instructions significantly improves model performance, increasing the win rate against GPT-4o-Mini from 7.4% to 21.5%, and improving on several fine-grained quality metrics. Additionally, models fine-tuned on XL-Instruct exhibit strong zero-shot transfer to both English-only and multilingual generation tasks. Given its consistent gains across the board, we strongly recommend incorporating XL-Instruct in the post-training pipeline of future multilingual LLMs. To facilitate further research, we will publicly and freely release the XL-Instruct and XL-AlpacaEval datasets, which constitute two of the few cross-lingual resources currently available in the literature.
Problem

Research questions and friction points this paper is trying to address.

Lack of benchmarks for cross-lingual open-ended generation in LLMs
Scarcity of high-quality cross-lingual instruction-tuning data
Weak zero-shot transfer to multilingual generation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

XL-AlpacaEval, a benchmark for evaluating cross-lingual open-ended generation
XL-Instruct, a method for generating high-quality synthetic cross-lingual instruction data
Fine-tuning on 8K XL-Instruct examples, raising the win rate against GPT-4o-Mini from 7.4% to 21.5% and enabling zero-shot transfer to English-only and multilingual tasks