sPhinX: Sample Efficient Multilingual Instruction Fine-Tuning Through N-shot Guided Prompting

📅 2024-07-13
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
To address the significant performance degradation of multilingual large language models (LLMs) on non-English languages relative to English, this paper introduces sPhinX: an efficient multilingual instruction-tuning dataset and an accompanying N-shot guided fine-tuning paradigm. Methodologically, sPhinX is built by selectively translating high-quality instruction-response pairs from English into 50 languages, combining translation filtering, N-shot example injection, and synthetic data distillation to enhance cross-lingual generalization while preserving English benchmark performance. Its core contributions are an N-shot guided prompt-based fine-tuning strategy and a cost-effective, highly diverse synthetic data construction recipe. Experiments show that fine-tuning with sPhinX yields average improvements of 5 percentage points over the base variants of Mistral-7B and Phi-Small; incorporating N-shot examples in each fine-tuning sample further boosts performance by 9 and 4 percentage points, respectively, over vanilla fine-tuning. Against a direct-translation baseline, sPhinX gains 7 and 4 percentage points on the two models, respectively, outperforming existing open-source multilingual instruction datasets in both efficiency and diversity.

📝 Abstract
Despite the remarkable success of LLMs in English, there is a significant gap in performance in non-English languages. In order to address this, we introduce a novel recipe for creating a multilingual synthetic instruction tuning dataset, sPhinX, which is created by selectively translating instruction-response pairs from English into 50 languages. We test the effectiveness of sPhinX by using it to fine-tune two state-of-the-art models, Mistral-7B and Phi-Small, and then evaluating them across a comprehensive suite of multilingual benchmarks that test reasoning, question answering, reading comprehension and machine translation. Our results show that Mistral-7B and Phi-Small fine-tuned with sPhinX perform better on average by 5%pt when compared to the base variants of these models. We also devise a strategy to incorporate N-shot examples in each fine-tuning sample, which further boosts the performance of these models by 9%pt and 4%pt respectively compared to vanilla fine-tuning. To show the efficacy of our data curation approach, we also directly translate our original dataset to the target languages, and observe an increase of 7%pt and 4%pt on the two models respectively. sPhinX outperforms other multilingual instruction tuning datasets in both efficiency and diversity, reducing dataset creation costs. It also maintains strong performance on standard English LLM benchmarks, with minimal regression.
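The abstract's strategy of "incorporating N-shot examples in each fine-tuning sample" can be sketched as assembling a prompt that prepends N exemplar instruction-response pairs before the target instruction. This is a minimal illustration of the idea; the function name, prompt template, and field names are assumptions, not the paper's released code.

```python
# Hypothetical sketch of N-shot guided sample construction: prepend N
# exemplar instruction-response pairs to each fine-tuning sample, so the
# model learns to answer the target instruction in context.

def build_nshot_sample(target, exemplars, n=2):
    """Build one fine-tuning example with n in-context exemplars.

    target:    dict with "instruction" and "response" keys.
    exemplars: list of dicts in the same format (same language as target).
    """
    parts = []
    for ex in exemplars[:n]:
        parts.append(f"Instruction: {ex['instruction']}\nResponse: {ex['response']}")
    # The target instruction comes last; the model is trained to emit its response.
    parts.append(f"Instruction: {target['instruction']}\nResponse:")
    return {
        "prompt": "\n\n".join(parts),
        "completion": " " + target["response"],
    }
```

With this shape, vanilla fine-tuning is simply the n=0 case, which makes the reported gains from N-shot injection directly comparable.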
Problem

Research questions and friction points this paper is trying to address.

Bridging performance gap in non-English multilingual LLMs
Enhancing multilingual instruction dataset diversity strategically
Improving model performance via context-aware N-shot fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual synthetic dataset construction
N-shot guided fine-tuning strategy
Selective augmentation with translations
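The "selective augmentation with translations" idea can be sketched as a filter pass: translate English pairs into a target language and keep only those whose translation clears a quality threshold. The translator and scoring function below are placeholders, and the threshold is illustrative; the paper's actual selection criteria are not specified here.

```python
# Illustrative sketch of selective translation filtering. `translate` and
# `score` are caller-supplied placeholders (e.g. an MT system and a
# translation-quality metric); only pairs scoring at or above `threshold`
# survive into the fine-tuning dataset.

def filter_translations(pairs, translate, score, threshold=0.8):
    """Translate instruction-response pairs and keep high-quality ones.

    pairs:     list of dicts mapping field names to English text.
    translate: callable str -> str for the target language.
    score:     callable (original_pair, translated_pair) -> float in [0, 1].
    """
    kept = []
    for pair in pairs:
        translated = {key: translate(text) for key, text in pair.items()}
        if score(pair, translated) >= threshold:
            kept.append(translated)
    return kept
```

Running this per target language, rather than translating the full dataset wholesale, is one way the abstract's cost and diversity claims could be realized.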