A Recipe of Parallel Corpora Exploitation for Multilingual Large Language Models

📅 2024-06-29
🏛️ arXiv.org
📈 Citations: 1 (influential: 0)
🤖 AI Summary
This study systematically investigates optimal strategies for leveraging parallel corpora to enhance multilingual large language models (MLLMs), focusing on the impact of corpus quality and scale, training objectives, and model parameter count on both bilingual tasks (e.g., machine translation) and general cross-lingual tasks (e.g., text classification). Method: a parallel corpus selection mechanism based on noise filtering, which bypasses error-prone language-identification preprocessing, combined with supervised fine-tuning on a pure machine translation (MT) objective integrated with multilingual pretraining. Contribution/Results: merely ~10K high-quality parallel sentence pairs achieve performance comparable to large-scale corpora; the MT-only objective substantially outperforms mixed multitask objectives; and larger models benefit more markedly from parallel data. Experiments demonstrate consistent improvements across 12 languages and five cross-lingual task categories, establishing a reusable, efficient paradigm for parallel corpus utilization in MLLM development.
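The summary does not pin down the filtering mechanism itself, so the following is a minimal sketch of one plausible instantiation: score each sentence pair with a cross-lingual embedding model and drop low-similarity pairs, with no language-identification step. The LaBSE model and the 0.7 threshold are illustrative assumptions, not the paper's setup.

```python
# Illustrative noise-filtering pass for parallel corpus selection.
# ASSUMPTION: the paper's exact quality scorer is not given in this summary;
# cross-lingual embedding similarity (LaBSE) and the 0.7 cutoff are stand-ins.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

def filter_noisy_pairs(pairs, threshold=0.7):
    """Keep (source, target) pairs whose embeddings are sufficiently close."""
    src = model.encode([s for s, _ in pairs], normalize_embeddings=True)
    tgt = model.encode([t for _, t in pairs], normalize_embeddings=True)
    scores = np.sum(src * tgt, axis=1)  # cosine similarity (vectors normalized)
    return [p for p, score in zip(pairs, scores) if score >= threshold]

pairs = [
    ("The weather is nice today.", "Das Wetter ist heute schön."),  # aligned
    ("The weather is nice today.", "Ich mag Katzen."),              # noisy
]
print(filter_noisy_pairs(pairs))  # the misaligned pair should be dropped
```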

📝 Abstract
Recent studies have highlighted the potential of exploiting parallel corpora to enhance multilingual large language models, improving performance in both bilingual tasks, e.g., machine translation, and general-purpose tasks, e.g., text classification. Building upon these findings, our comprehensive study aims to identify the most effective strategies for leveraging parallel corpora. We investigate the impact of parallel corpora quality and quantity, training objectives, and model size on the performance of multilingual large language models enhanced with parallel corpora across diverse languages and tasks. Our analysis reveals several key insights: (i) filtering noisy translations is essential for effectively exploiting parallel corpora, while language identification and short sentence filtering have little effect; (ii) even a corpus with just 10K parallel sentences can yield results comparable to those obtained from much larger datasets; (iii) employing only the machine translation objective yields the best results among various training objectives and their combinations; (iv) larger multilingual language models benefit more from parallel corpora than smaller models. Our study offers valuable insights into the optimal utilization of parallel corpora to enhance multilingual large language models, extending the generalizability of previous findings from limited languages and tasks to a broader range of scenarios.
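Finding (iii) favors a machine-translation-only training objective. Below is a minimal sketch of what such supervised fine-tuning can look like for a causal LM, with the loss computed only on target-side tokens; the base model, prompt template, and language pair are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of an MT-only supervised fine-tuning objective (finding iii).
# ASSUMPTIONS: the prompt template, base model, and language pair are
# illustrative; the paper's exact formatting is not given here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-560m"  # stand-in multilingual causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def mt_step(src, tgt, src_lang="English", tgt_lang="German"):
    prompt = f"Translate {src_lang} to {tgt_lang}:\n{src}\n"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(tgt + tokenizer.eos_token, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # loss only on target-side tokens
    return model(input_ids=input_ids, labels=labels).loss

loss = mt_step("The weather is nice today.", "Das Wetter ist heute schön.")
loss.backward()  # an optimizer step over many such pairs completes SFT
```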
Problem

Research questions and friction points this paper is trying to address.

How to exploit parallel corpora most effectively for multilingual LLMs
How corpus quality and quantity, training objectives, and model size affect bilingual and general cross-lingual tasks
Whether prior findings from limited languages and tasks generalize to broader scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Noise-filtering-based corpus selection that skips error-prone language identification
A pure machine translation objective that outperforms multitask mixtures
Evidence that larger multilingual models gain more from parallel data
👥 Authors
Peiqin Lin
LMU Munich
Natural Language Processing, Multilinguality, Language Modeling, Sentiment Analysis
André F. T. Martins
Instituto Superior Técnico, Universidade de Lisboa (Lisbon ELLIS Unit); Instituto de Telecomunicações; Unbabel
Hinrich Schütze
University of Munich
Natural Language Processing