🤖 AI Summary
This study addresses the limited multilingual capability of large language models (e.g., Llama) for low-resource languages, using Persian as a case study, by proposing a staged, parameter-efficient fine-tuning paradigm: monolingual pretraining → bilingual representation alignment → instruction tuning. Its three key contributions are: (1) establishing a progressive language-expansion framework that identifies the initial model's foundational multilingual capacity as a critical bottleneck for low-resource adaptation; (2) empirically demonstrating that cross-lingual alignment yields negligible gains under extreme data scarcity, challenging the prevailing assumption of its necessity; and (3) achieving substantial improvements in Persian classification accuracy using only lightweight adapters (e.g., LoRA or Adapter modules) while preserving, and sometimes slightly improving, English task performance. Knowledge transfer from English to Persian proves marginal, benefiting mainly simple classification tasks.
📝 Abstract
Large language models (LLMs) have made great progress in classification and text generation tasks. However, they are mainly trained on English data and often struggle with low-resource languages. In this study, we explore adding a new language, namely Persian, to Llama, a model with limited understanding of Persian, using parameter-efficient fine-tuning. We employ a multi-stage approach involving pretraining on monolingual Persian data, aligning representations through bilingual pretraining and instruction datasets, and instruction tuning with task-specific datasets. We evaluate the model's performance at each stage on generation and classification tasks. Our findings suggest that incorporating the Persian language through bilingual data alignment can enhance classification accuracy for Persian tasks, with no adverse impact, and sometimes even improvements, on English tasks. Additionally, the results highlight the model's initial strength as a critical factor when working with limited training data, with cross-lingual alignment offering minimal benefits for the low-resource language. Knowledge transfer from English to Persian has a marginal effect, primarily benefiting simple classification tasks.
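The abstract's central ingredient, parameter-efficient fine-tuning with lightweight adapters such as LoRA, can be illustrated with a minimal sketch. This is not the authors' code: the class name `LoRALinear`, the dimensions, and the rank/scaling choices (`r=8`, `alpha=16`) are illustrative assumptions; the idea is only that a frozen weight `W` is augmented with a trainable low-rank update `B @ A`, so far fewer parameters are trained than in full fine-tuning.

```python
import numpy as np

class LoRALinear:
    """Minimal sketch of a LoRA-adapted linear layer (illustrative, not the paper's code).
    The base weight W is frozen; only the low-rank factors A (r x d_in) and
    B (d_out x r) are trained, shrinking the trainable parameter count from
    d_out * d_in to r * (d_in + d_out)."""

    def __init__(self, d_in, d_out, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)  # frozen base weight
        self.A = rng.standard_normal((r, d_in)) * 0.01               # trainable
        self.B = np.zeros((d_out, r))                                # trainable, zero-initialized
        self.scale = alpha / r

    def __call__(self, x):
        # y = x W^T + scale * x A^T B^T; since B starts at zero, the adapted
        # layer is initially identical to the frozen base layer.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(d_in=64, d_out=64, r=8)
x = np.ones((1, 64))
# With B initialized to zero, the LoRA path contributes nothing yet:
assert np.allclose(layer(x), x @ layer.W.T)

full_params = 64 * 64            # parameters in the frozen base weight
lora_params = 8 * 64 + 64 * 8    # parameters actually trained
print(f"trainable fraction: {lora_params / full_params:.3f}")  # → 0.250
```

The zero initialization of `B` is the standard LoRA choice: training starts from the base model's behavior and only gradually learns a low-rank correction, which matches the paper's setting of adapting Llama without disturbing its existing English capability.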