🤖 AI Summary
Weak cross-lingual transfer, particularly to low-resource languages, remains a key limitation of decoder-only large language models (LLMs). To address this, we propose a pretraining strategy built on an active forgetting mechanism. This work is the first to introduce active forgetting regularization into decoder-only multilingual pretraining, combining multilingual mixed-data training with representation-learning analysis. The approach substantially improves zero-shot cross-lingual generalization to unseen languages. Experiments show consistent gains over same-scale baselines across multilingual downstream tasks, including XNLI and XQuAD, with especially pronounced improvements for low-resource languages. Notably, the resulting decoder-only models match the cross-lingual transfer performance of strong encoder-based multilingual models such as XLM-RoBERTa. The method advances the viability of decoder-only architectures for inclusive, resource-agnostic multilingual language understanding.
📝 Abstract
Large Language Models (LLMs) demonstrate exceptional capabilities across a multitude of NLP tasks. However, their efficacy on languages other than English is often limited. Prior work has shown that encoder-only models such as BERT and XLM-RoBERTa exhibit impressive cross-lingual transfer of their capabilities from English to other languages. In this work, we propose a pretraining strategy that uses active forgetting to achieve similar cross-lingual transfer in decoder-only LLMs. We show that LLMs pretrained with active forgetting adapt highly effectively to new and unseen languages. Through extensive experimentation, we find that LLMs pretrained with active forgetting learn better multilingual representations, which translates into better performance on many downstream tasks.
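As commonly described in prior work on active forgetting, the core idea is to periodically re-initialize the token-embedding layer during pretraining while the transformer body keeps its learned weights, forcing the body to learn language-agnostic representations. The minimal sketch below illustrates only this reset schedule; the function name, the placeholder "body" parameter, and the toy update rule are hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np

def active_forgetting_pretrain(num_steps, reset_every, embed_shape, rng_seed=0):
    """Sketch of active-forgetting pretraining: the token-embedding matrix is
    re-initialized every `reset_every` steps, while the rest of the model
    (here a placeholder `body` vector) trains without interruption."""
    rng = np.random.default_rng(rng_seed)
    embeddings = rng.normal(scale=0.02, size=embed_shape)  # token embeddings
    body = np.zeros(embed_shape[1])  # stand-in for transformer-body weights
    resets = 0
    for step in range(1, num_steps + 1):
        # placeholder "gradient updates": both parts train between resets
        body += 1e-3
        embeddings += 1e-3
        if step % reset_every == 0:
            # active forgetting: wipe the embeddings, keep the body intact
            embeddings = rng.normal(scale=0.02, size=embed_shape)
            resets += 1
    return resets, body

resets, body = active_forgetting_pretrain(num_steps=1000, reset_every=100,
                                          embed_shape=(32, 8))
print(resets)  # → 10 resets over 1000 steps
```

In an actual pretraining run the reset would apply to the embedding parameters of the LLM (and typically their optimizer state), with the reset frequency treated as a hyperparameter.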