The Rise and Down of Babel Tower: Investigating the Evolution Process of Multilingual Code Large Language Model

📅 2024-12-10
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates how multilingual capability evolves during the pre-training of multilingual code large language models (Code LLMs), proposing the Babel Tower Hypothesis: languages initially share a single knowledge system dominated by a primary language (e.g., English) and gradually differentiate into language-specific subsystems over training. Methodologically, the authors track the model's internal states with neuron-level interpretability techniques, identifying the model's working languages and its language-transferring neurons. Experiments show that the model's internal state changes align closely with the hypothesis. Building on this mechanistic insight, the authors construct an optimized pre-training corpus that yields substantial improvements in multilingual code understanding and generation across programming languages, significantly outperforming models trained on the original corpus. The study thus advances both the theoretical understanding of multilingual representation learning in Code LLMs and practical strategies for multilingual pre-training data design.
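The summary above mentions tracking internal states to identify the model's working language at each layer. One common way to probe this (a minimal sketch, not necessarily the authors' exact procedure) is a logit-lens style readout: decode each layer's hidden state through the output embedding and check which language's tokens dominate. The model name and the token-to-language map below are hypothetical placeholders.

```python
# Minimal logit-lens style probe of a model's per-layer "working language".
# Hypothetical setup: "gpt2" and token_lang are stand-ins; the paper's
# actual identification procedure is not reproduced here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper studies multilingual code LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def working_language_per_layer(prompt, token_lang):
    """token_lang: toy dict mapping token ids to a language label."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    lm_head = model.get_output_embeddings()  # project states into vocab space
    labels = []
    # One hidden state per layer; index 0 is the embedding layer, so skip it.
    # (A faithful logit lens would also apply the model's final layer norm.)
    for h in out.hidden_states[1:]:
        logits = lm_head(h[0, -1])  # decode the last position's state
        top_ids = logits.topk(20).indices.tolist()
        votes = [token_lang[i] for i in top_ids if i in token_lang]
        labels.append(max(set(votes), key=votes.count) if votes else None)
    return labels  # majority language label (or None) for each layer
```

Under the Babel Tower Hypothesis, such a probe would be expected to show intermediate layers working in the dominant language early in training, with language-specific behavior emerging only later.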

📝 Abstract
Large language models (LLMs) have shown significant multilingual capabilities. However, the mechanisms underlying the development of these capabilities during pre-training are not well understood. In this paper, we use code LLMs as an experimental platform to explore the evolution of multilingual capabilities in LLMs during the pre-training process. Based on our observations, we propose the Babel Tower Hypothesis, which describes the entire process of LLMs acquiring new language capabilities. During the learning process, multiple languages initially share a single knowledge system dominated by the primary language and gradually develop language-specific knowledge systems. We then validate the above hypothesis by tracking the internal states of the LLMs through identifying working languages and language transferring neurons. Experimental results show that the internal state changes of the LLM are consistent with our Babel Tower Hypothesis. Building on these insights, we propose a novel method to construct an optimized pre-training corpus for multilingual code LLMs, which significantly outperforms LLMs trained on the original corpus. The proposed Babel Tower Hypothesis provides new insights into designing pre-training data distributions to achieve optimal multilingual capabilities in LLMs.
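The abstract's neuron-level validation can be illustrated with a generic selective-activation analysis: record how often each feed-forward neuron fires on inputs from different programming languages, then flag neurons with large activation-rate gaps. This is a minimal sketch of that general technique (the model internals and code snippets are assumed stand-ins); it is not the paper's precise criterion for language-transferring neurons.

```python
# Flag FFN neurons whose activation rate differs sharply across languages.
# Hypothetical illustration using GPT-2's module layout; the paper's exact
# definition of "language transferring neurons" is not reproduced here.
import torch
from collections import defaultdict
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

acts = defaultdict(list)  # (layer, language) -> per-neuron activation rates

def make_hook(layer, lang):
    def hook(_module, _inputs, output):
        # Fraction of token positions at which each FFN neuron is active (> 0).
        acts[(layer, lang)].append((output[0] > 0).float().mean(dim=0))
    return hook

def activation_rates(samples_by_lang):
    for lang, texts in samples_by_lang.items():
        handles = [
            block.mlp.act.register_forward_hook(make_hook(i, lang))
            for i, block in enumerate(model.transformer.h)
        ]
        with torch.no_grad():
            for text in texts:
                model(**tok(text, return_tensors="pt"))
        for handle in handles:
            handle.remove()
    return {key: torch.stack(v).mean(dim=0) for key, v in acts.items()}

rates = activation_rates({
    "python": ["def add(a, b):\n    return a + b"],
    "java": ["int add(int a, int b) { return a + b; }"],
})
# Neurons with a large activation-rate gap at a given layer are candidates
# for language-specific (or language-transferring) behavior.
gap = (rates[(0, "python")] - rates[(0, "java")]).abs()
print(gap.topk(5).indices.tolist())
```

A real analysis would use many samples per language and a trained multilingual code model; the single snippets here only demonstrate the bookkeeping.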
Problem

Research questions and friction points this paper is trying to address.

Understanding how multilingual capabilities evolve in LLMs during pre-training.
Explaining the process by which LLMs acquire new language capabilities.
Constructing pre-training corpora that yield optimal multilingual performance in code LLMs.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes the Babel Tower Hypothesis for multilingual capability evolution
Tracks internal states by identifying working languages and language-transferring neurons
Optimizes the pre-training corpus for multilingual code LLMs (a toy reweighting sketch follows below)
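At a high level, the corpus optimization above amounts to reweighting how often each language is sampled during pre-training. The sketch below is a toy, assumption-laden illustration of such a stage-dependent sampling schedule; the schedule shape, language names, and parameters are invented for illustration and are not the paper's actual corpus-construction method.

```python
# Toy language-sampling schedule: shift corpus weight from a dominant
# language toward target languages as training progresses. Purely
# illustrative; the paper's actual reweighting method is not reproduced here.
import random

def sampling_weights(step, total_steps, dominant="python",
                     targets=("java", "rust"), start=0.8, end=0.4):
    """Linearly decay the dominant language's sampling share from start to end."""
    frac = step / max(total_steps, 1)
    dom_share = start + (end - start) * frac
    rest = (1.0 - dom_share) / len(targets)
    return {dominant: dom_share, **{t: rest for t in targets}}

def sample_language(step, total_steps, rng=random.Random(0)):
    weights = sampling_weights(step, total_steps)
    langs, probs = zip(*weights.items())
    return rng.choices(langs, weights=probs, k=1)[0]

# Early batches mostly come from the dominant language; later ones
# increasingly sample the target languages.
print([sample_language(s, 100) for s in (0, 50, 99)])
```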
👥 Authors

Jiawei Chen
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China

Wentao Chen
Shanghai Jiao Tong University
Natural Language Processing · Machine Learning · Representation Learning

Jing Su
ByteDance

Jingjing Xu
ByteDance

Hongyu Lin
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China

Mengjie Ren
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China

Yaojie Lu
Institute of Software, Chinese Academy of Sciences
Information Extraction · Large Language Models

Xianpei Han
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China

Le Sun
Institute of Software, CAS
Information Retrieval · Natural Language Processing