AI Summary
The internal representation mechanisms of multilingual large language models (MLLMs) remain poorly understood, in particular whether they construct a shared cross-lingual semantic space, and why strong monolingual biases persist. To address this, we apply cross-layer transcoders (CLTs) and attribution graph analysis to systematically trace how representations evolve across layers. Our findings reveal: (1) hidden layers employ highly shared "hub-language" representations, with semantic information tightly aligned across languages; (2) language identity is not encoded in the early embeddings but is linearly decoded from shallow to deep layers, relying on a small set of high-frequency language-specific features for efficient identification; (3) targeted intervention on this linear decoding pathway enables precise switching of the output language. This work is the first to uncover a two-stage "shared representation, then linear decoding" mechanism in MLLMs, demonstrating that dominant-language bias originates from feature-dependent decoding rather than representation misalignment, thereby providing both theoretical grounding and practical methods for controllable multilingual generation and representation alignment.
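The "linearly decoded" claim above can be illustrated with a toy linear probe. The sketch below is an assumption-laden stand-in, not the paper's code: hidden states are simulated as shared semantic noise plus a language-specific offset, and a least-squares readout onto one-hot language labels plays the role of the linear decoder.

```python
import numpy as np

rng = np.random.default_rng(0)
n_langs, n_per_lang, d = 3, 200, 32  # hypothetical sizes

# Simulated hidden states: shared semantic noise plus a language-specific
# offset standing in for the "high-frequency language features".
lang_dirs = 3.0 * rng.normal(size=(n_langs, d))
labels = np.repeat(np.arange(n_langs), n_per_lang)
hidden = rng.normal(size=(n_langs * n_per_lang, d)) + lang_dirs[labels]

# Linear probe: least-squares regression onto one-hot language labels,
# i.e. a purely linear readout of language identity from hidden states.
onehot = np.eye(n_langs)[labels]
W, *_ = np.linalg.lstsq(hidden, onehot, rcond=None)
pred = (hidden @ W).argmax(axis=1)
acc = (pred == labels).mean()
print(f"linear probe accuracy: {acc:.2f}")
```

In real experiments such a probe would be fit per layer on frozen model activations; high accuracy at a given depth indicates that language identity is linearly accessible there.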
Abstract
Multilingual Large Language Models (LLMs) can process many languages, yet how they internally represent this diversity remains unclear. Do they form shared multilingual representations with language-specific decoding, and if so, why does performance still favor the dominant training language? To address this, we train a series of LLMs on different mixtures of multilingual data and analyze their internal mechanisms using cross-layer transcoders (CLTs) and attribution graphs. Our results provide strong evidence for pivot language representations: the model employs nearly identical representations across languages, while language-specific decoding emerges in later layers. Attribution analyses reveal that decoding relies in part on a small set of high-frequency language features in the final layers, which linearly read out language identity encoded in the model's earliest layers. By intervening on these features, we can suppress one language and substitute another in the model's outputs. Finally, we study how the dominant training language influences these mechanisms across attribution graphs and decoding pathways. We argue that understanding this pivot-language mechanism is crucial for improving multilingual alignment in LLMs.
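The intervention described above (suppress one language, substitute another) can be sketched geometrically. Everything below is a toy assumption, not the paper's method: two orthogonalized unit vectors stand in for the language-identity features, and a dot-product comparison stands in for the linear readout.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64  # hypothetical hidden size

# Two unit "language feature" directions, orthogonalized for clarity.
dir_en = rng.normal(size=d)
dir_en /= np.linalg.norm(dir_en)
dir_de = rng.normal(size=d)
dir_de -= (dir_de @ dir_en) * dir_en
dir_de /= np.linalg.norm(dir_de)

# A hidden state: small shared-semantics noise plus a strong English feature.
h = 0.1 * rng.normal(size=d) + 4.0 * dir_en

def readout(vec):
    # Toy linear readout: whichever language direction is more active wins.
    return "en" if vec @ dir_en > vec @ dir_de else "de"

# Intervention: project out the English feature, inject the German one.
h_swapped = h - (h @ dir_en) * dir_en + 4.0 * dir_de

print(readout(h), "->", readout(h_swapped))  # prints "en -> de"
```

The key design point mirrors the abstract: the shared-semantics component of `h` is untouched, and only the low-dimensional language-identity subspace is edited, which is why such interventions can switch the output language without destroying the content.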