A method for improving multilingual quality and diversity of instruction fine-tuning datasets

📅 2025-09-18
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Multilingual instruction fine-tuning (IFT) is hindered by the scarcity of high-quality, semantically diverse training data; existing approaches rely on English-centric heuristics with poor cross-lingual generalizability. To address this, we propose M-DaQ, the first language-agnostic, general-purpose data selection framework for multilingual IFT. M-DaQ systematically investigates the Superficial Alignment Hypothesis (SAH) in the multilingual setting by jointly modeling multilingual embedding spaces, quantifying semantic diversity, and performing quality-aware clustering to ensure cross-lingually consistent data filtering. Experiments across 18 languages demonstrate that models trained on M-DaQ-curated datasets achieve an average win rate exceeding 60% in pairwise comparisons. Human evaluation further confirms significant improvements in the cultural appropriateness and semantic richness of model responses. This work establishes a scalable, language-independent paradigm for constructing high-fidelity instruction data in multilingual settings.
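The summary describes a pipeline of embedding samples, quantifying semantic diversity, and selecting by quality. The paper's actual algorithm is not reproduced here, so the following is only a minimal illustrative sketch under stated assumptions: random vectors stand in for multilingual sentence embeddings, `quality` is a placeholder per-sample quality score, and a simple greedy rule trades off quality against distance to already-selected samples. None of these names or choices come from the paper itself.

```python
import numpy as np

def select_diverse_quality(embeddings, quality, k):
    """Greedily pick k samples, balancing a quality score against
    diversity (distance to the nearest already-selected sample).
    Illustrative sketch only; not the M-DaQ algorithm."""
    # Start from the single highest-quality sample.
    selected = [int(np.argmax(quality))]
    # dist[i] = distance from sample i to its nearest selected sample.
    dist = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    while len(selected) < k:
        # Combined score: high quality AND far from what we already kept.
        score = quality * dist
        score[selected] = -np.inf  # never re-pick a selected sample
        nxt = int(np.argmax(score))
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return selected

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 8))   # stand-in for multilingual embeddings
qual = rng.uniform(size=100)      # stand-in for a quality score in [0, 1)
picked = select_diverse_quality(emb, qual, k=10)
print(picked)
```

A clustering-based variant (e.g. k-means over the embeddings, keeping the highest-quality sample per cluster) would serve the same purpose; the greedy form above is just the shortest self-contained way to show the quality-plus-diversity trade-off.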

πŸ“ Abstract
Multilingual Instruction Fine-Tuning (IFT) is essential for enabling large language models (LLMs) to generalize effectively across diverse linguistic and cultural contexts. However, the scarcity of high-quality multilingual training data and corresponding building method remains a critical bottleneck. While data selection has shown promise in English settings, existing methods often fail to generalize across languages due to reliance on simplistic heuristics or language-specific assumptions. In this work, we introduce Multilingual Data Quality and Diversity (M-DaQ), a novel method for improving LLMs multilinguality, by selecting high-quality and semantically diverse multilingual IFT samples. We further conduct the first systematic investigation of the Superficial Alignment Hypothesis (SAH) in multilingual setting. Empirical results across 18 languages demonstrate that models fine-tuned with M-DaQ method achieve significant performance gains over vanilla baselines over 60% win rate. Human evaluations further validate these gains, highlighting the increment of cultural points in the response. We release the M-DaQ code to support future research.
Problem

Research questions and friction points this paper is trying to address.

Addresses scarcity of high-quality multilingual instruction fine-tuning data
Improves multilingual quality and diversity in LLM training datasets
Generalizes data selection methods across diverse linguistic contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

M-DaQ method for multilingual data selection
Selects high-quality diverse instruction samples
Systematically investigates Superficial Alignment Hypothesis
Chunguang Zhao
Huawei, Beijing, China
Yilun Liu
Huawei, Beijing, China
Pufan Zeng
Huawei, Beijing, China; University of Science and Technology of China, Hefei, China
Yuanchang Luo
2012 Lab, Huawei
Shimin Tao
2012 Lab, Huawei Co., Ltd.
Machine Translation · AIOps · Log Analysis
Minggui He
Huawei, Beijing, China
Weibin Meng
Huawei, Beijing, China
Song Xu
JD AI Research
Natural Language Processing · Text Generation · Recommender Systems
Ziang Chen
Huawei, Beijing, China
Chen Liu
Huawei, Beijing, China
Hongxia Ma
Huawei, Beijing, China
Li Zhang
Huawei, Beijing, China
Boxing Chen
Huawei Technologies Canada
Natural Language Processing · Artificial Intelligence
Daimeng Wei
Huawei, Beijing, China