🤖 AI Summary
In multilingual translation, fine-tuning foundation models often induces catastrophic forgetting—degrading performance on unseen languages—yet the precise triggering conditions remain unclear. This work investigates the impact of model architecture, training data scale, and fine-tuning methodology on forgetting via controlled experiments on machine translation benchmarks. Key findings are: (1) the relative scale between model capacity and target-language data volume is the primary determinant of forgetting; (2) instruction-following capability exerts greater influence than architectural choices; and (3) cross-lingual alignment substantially mitigates forgetting and enables positive transfer. Experiments span full-parameter fine-tuning and multiple parameter-efficient fine-tuning (PEFT) methods, revealing no universal advantage of PEFT over standard fine-tuning. To our knowledge, this is the first study to empirically delineate the boundary conditions of forgetting in multilingual fine-tuning, providing reproducible evidence and actionable strategies for preserving and enhancing multilingual generalization.
📝 Abstract
Fine-tuning multilingual foundation models on specific languages often induces catastrophic forgetting, degrading performance on languages unseen during fine-tuning. While this phenomenon is widely documented, the literature presents fragmented results about when forgetting occurs. To address this ambiguity, we conduct a systematic empirical study using machine translation as a testbed to identify the conditions that trigger catastrophic forgetting in multilingual fine-tuning. Through controlled experiments across different model architectures, data scales, and fine-tuning approaches, we reveal that the relative scale between model capacity and data size is a primary determinant of forgetting. Moreover, we demonstrate that a model's instruction-following ability is more critical for retaining multilingual knowledge than its architecture. Contrary to common assumptions, parameter-efficient fine-tuning offers no clear advantage over full fine-tuning in mitigating forgetting. Lastly, we show that cross-lingual alignment can mitigate forgetting while also facilitating positive transfer to unseen target languages.
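The abstract frames forgetting as performance degradation on languages unseen during fine-tuning, and positive transfer as the opposite. The paper's exact metric is not given here, but a common way to quantify this is the relative drop in a translation score (e.g. BLEU) per held-out language; a minimal sketch with hypothetical scores:

```python
def relative_forgetting(baseline: dict, finetuned: dict) -> dict:
    """Relative score drop per language after fine-tuning.

    Positive values indicate catastrophic forgetting; negative values
    indicate positive transfer (the fine-tuned model improved)."""
    return {lang: (baseline[lang] - finetuned[lang]) / baseline[lang]
            for lang in baseline}

# Hypothetical BLEU scores on languages *unseen* during fine-tuning.
baseline  = {"de": 30.0, "fr": 35.0, "zh": 20.0}
finetuned = {"de": 24.0, "fr": 28.0, "zh": 22.0}

drops = relative_forgetting(baseline, finetuned)
# de and fr degrade by 20% (forgetting); zh improves by 10% (transfer).
```

The language codes and scores are illustrative only; the same ratio works with any quality metric (BLEU, chrF, COMET) reported on a fixed held-out test set.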