🤖 AI Summary
This study addresses the suboptimal in-context learning (ICL) performance of large language models (LLMs) in low-resource languages. To tackle cross-lingual adaptation, we systematically evaluate five approaches (few-shot prompting, test-time translation, fine-tuning in both full-parameter and LoRA-based variants, embedding reinitialization, and instruction tuning) across five low-resource languages, three base multilingual LLMs, and seven downstream tasks. We propose a novel evaluation metric, Valid Output Recall (VOR), which quantifies how reliably a model produces task-relevant outputs. Our analysis attributes the poor generalization of gradient-based methods to catastrophic forgetting. Empirically, the zero-training combination of few-shot prompting and test-time translation tends to outperform all fine-tuning variants across languages and tasks. All experimental data, code, and trained models are publicly released.
📝 Abstract
LLMs are typically trained on high-resource languages, and in-context learning performance on tasks in lower-resource languages tends to lag behind that of their higher-resource counterparts. Despite the large body of work on prompting settings, it remains unclear how LLMs should be adapted cross-lingually for in-context learning in low-resource target languages. We perform a comprehensive study spanning five diverse target languages, three base LLMs, and seven downstream tasks, totaling over 4,100 GPU training hours (9,900+ TFLOPs) across various adaptation techniques: few-shot prompting, translate-test, fine-tuning, embedding re-initialization, and instruction fine-tuning. Our results show that the few-shot prompting and translate-test settings tend to heavily outperform the gradient-based adaptation methods. To better understand this discrepancy, we design a novel metric, Valid Output Recall (VOR), and analyze model outputs to empirically attribute the degradation of these trained models to catastrophic forgetting. To the best of our knowledge, this is the largest study of in-context learning for low-resource languages in terms of training compute and the number of adaptation techniques considered. We make all our datasets and trained models available for public use.
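The abstract does not give the formal definition of Valid Output Recall, but a plausible reading is the fraction of model generations that are valid, task-relevant outputs (e.g., members of a task's label set). A minimal sketch under that assumption follows; the function name, normalization, and closed-label-set setting are illustrative, not the paper's exact formulation:

```python
def valid_output_recall(outputs: list[str], valid_labels: list[str]) -> float:
    """Hypothetical sketch of Valid Output Recall (VOR): the fraction of
    model outputs that parse as valid, task-relevant labels.

    Assumes a classification-style task with a closed label set; the
    paper's actual definition may differ.
    """
    if not outputs:
        return 0.0
    valid = {label.strip().lower() for label in valid_labels}
    hits = sum(1 for out in outputs if out.strip().lower() in valid)
    return hits / len(outputs)


# A model suffering catastrophic forgetting often drifts off the task
# format, emitting free-form text instead of labels, which lowers VOR.
outputs = ["positive", "negative", "Je ne sais pas quoi dire...", "Positive"]
print(valid_output_recall(outputs, ["positive", "negative"]))  # 0.75
```

Under this reading, a low VOR after fine-tuning would indicate that the model has lost the ability to produce well-formed task outputs at all, rather than merely choosing the wrong label, which is consistent with the catastrophic-forgetting explanation in the abstract.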