AI Summary
The internal representation mechanisms of multilingual large language models (MLLMs) remain poorly understood, in particular whether they construct a shared cross-lingual semantic space, and why strong monolingual biases persist. To address this, we apply cross-layer transcoders (CLTs) and attribution graph analysis to systematically trace how representations evolve across layers. Our findings reveal: (1) hidden layers employ highly shared "hub-language" representations, with semantic information tightly aligned across languages; (2) language identity is not encoded in the early embeddings but is linearly decoded from shallow to deep layers, relying on a small set of high-frequency language-specific features for efficient identification; (3) targeted intervention on this linear decoding pathway enables precise switching of the output language. This work is the first to uncover a two-stage "shared representation, then linear decoding" mechanism in MLLMs, demonstrating that dominant-language bias originates from feature-dependent decoding rather than representation misalignment, thereby providing both theoretical grounding and practical methods for controllable multilingual generation and representation alignment.
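The "linearly decoded" claim above can be illustrated with a toy linear probe. The sketch below is an assumption-laden stand-in, not the paper's code: hidden states are simulated as shared semantic noise plus a language-specific offset, and a least-squares readout onto one-hot language labels plays the role of the linear decoder.

```python
import numpy as np

rng = np.random.default_rng(0)
n_langs, n_per_lang, d = 3, 200, 32  # hypothetical sizes

# Simulated hidden states: shared semantic noise plus a language-specific
# offset standing in for the "high-frequency language features".
lang_dirs = 3.0 * rng.normal(size=(n_langs, d))
labels = np.repeat(np.arange(n_langs), n_per_lang)
hidden = rng.normal(size=(n_langs * n_per_lang, d)) + lang_dirs[labels]

# Linear probe: least-squares regression onto one-hot language labels,
# i.e. a purely linear readout of language identity from hidden states.
onehot = np.eye(n_langs)[labels]
W, *_ = np.linalg.lstsq(hidden, onehot, rcond=None)
pred = (hidden @ W).argmax(axis=1)
acc = (pred == labels).mean()
print(f"linear probe accuracy: {acc:.2f}")
```

In real experiments such a probe would be fit per layer on frozen model activations; high accuracy at a given depth indicates that language identity is linearly accessible there.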
Abstract
Multilingual Large Language Models (LLMs) can process many languages, yet how they internally represent this diversity remains unclear. Do they form shared multilingual representations with language-specific decoding, and if so, why does performance still favor the dominant training language? To address this, we train a series of LLMs on different mixtures of multilingual data and analyze their internal mechanisms using cross-layer transcoders (CLTs) and attribution graphs. Our results provide strong evidence for pivot language representations: the model employs nearly identical representations across languages, while language-specific decoding emerges in later layers. Attribution analyses reveal that decoding relies in part on a small set of high-frequency language features in the final layers, which linearly read out language identity encoded in the model's earliest layers. By intervening on these features, we can suppress one language and substitute another in the model's outputs. Finally, we study how the dominant training language influences these mechanisms across attribution graphs and decoding pathways. We argue that understanding this pivot-language mechanism is crucial for improving multilingual alignment in LLMs.
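The intervention described above (suppress one language, substitute another) can be sketched geometrically. Everything below is a toy assumption, not the paper's method: two orthogonalized unit vectors stand in for the language-identity features, and a dot-product comparison stands in for the linear readout.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64  # hypothetical hidden size

# Two unit "language feature" directions, orthogonalized for clarity.
dir_en = rng.normal(size=d)
dir_en /= np.linalg.norm(dir_en)
dir_de = rng.normal(size=d)
dir_de -= (dir_de @ dir_en) * dir_en
dir_de /= np.linalg.norm(dir_de)

# A hidden state: small shared-semantics noise plus a strong English feature.
h = 0.1 * rng.normal(size=d) + 4.0 * dir_en

def readout(vec):
    # Toy linear readout: whichever language direction is more active wins.
    return "en" if vec @ dir_en > vec @ dir_de else "de"

# Intervention: project out the English feature, inject the German one.
h_swapped = h - (h @ dir_en) * dir_en + 4.0 * dir_de

print(readout(h), "->", readout(h_swapped))  # prints "en -> de"
```

The key design point mirrors the abstract: the shared-semantics component of `h` is untouched, and only the low-dimensional language-identity subspace is edited, which is why such interventions can switch the output language without destroying the content.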