🤖 AI Summary
This study investigates how shared conceptual spaces form in multilingual large language models and how they affect cross-lingual transfer, both of which remain poorly understood. Focusing on the pretraining dynamics of EuroLLM, the authors combine activation patching with the isolation of cross-lingual concept representations to trace the emergence and evolution of language-agnostic representations. They further assess the causal influence of these representations on translation behavior by injecting them into translation prompts. The findings reveal that a shared conceptual space forms early in training and undergoes continuous refinement, with its alignment quality varying by language. Notably, observed improvements in translation performance often stem from shifts in decoding behavior -- such as altered word-sense selection or handling of homographs -- rather than genuine gains in cross-lingual capability. This work offers novel insights into the training dynamics and causal interpretability of multilingual models.
📝 Abstract
Training Large Language Models (LLMs) with high multilingual coverage is becoming increasingly important -- especially when monolingual resources are scarce. Recent studies have found that LLMs process multilingual inputs in shared concept spaces, thought to support generalization and cross-lingual transfer. However, these prior studies often do not use causal methods, lack deeper error analysis, or focus only on the final model, leaving open how these spaces emerge during training. We investigate the development of language-agnostic concept spaces during the pretraining of EuroLLM through the causal interpretability method of activation patching. We isolate cross-lingual concept representations, then inject them into a translation prompt to investigate how consistently translations can be altered, independently of the language. We find that shared concept spaces emerge early and continue to refine, but that alignment with them is language-dependent. Furthermore, in contrast to prior work, our fine-grained manual analysis reveals that some apparent gains in translation quality reflect shifts in behavior -- like selecting senses for polysemous words or translating instead of copying cross-lingual homographs -- rather than improved translation ability. Our findings offer new insight into the training dynamics of cross-lingual alignment and the conditions under which causal interpretability methods offer meaningful insights in multilingual contexts.
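The patching procedure the abstract describes -- cache a concept representation from one forward pass, then overwrite the hidden state at the same layer in a translation prompt's forward pass -- can be illustrated with a minimal toy sketch. Everything here is hypothetical: a two-layer NumPy "model" stands in for EuroLLM, and `src`/`tgt` stand in for the concept-evoking input and the translation prompt; only the patching mechanics mirror the method.

```python
import numpy as np

# Toy 2-layer network standing in for a transformer layer stack.
# Weights and inputs are random placeholders, not EuroLLM parameters.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))   # input -> hidden (the "patched" layer)
W2 = rng.normal(size=(8, 3))   # hidden -> output

def forward(x, patch=None):
    """Run the toy model; if `patch` is given, overwrite the hidden
    state with it (the activation-patching intervention)."""
    h = np.tanh(x @ W1)
    if patch is not None:
        h = patch              # inject the cached concept representation
    return h @ W2

src = rng.normal(size=4)       # input evoking the source concept
tgt = rng.normal(size=4)       # translation prompt to be steered

concept = np.tanh(src @ W1)    # step 1: isolate the concept representation
baseline = forward(tgt)        # unpatched run
steered = forward(tgt, patch=concept)  # step 2: patched run

# If the patch is causally effective, the output changes.
print(np.allclose(baseline, steered))
```

Comparing `baseline` and `steered` is the causal test: the output differs exactly insofar as the injected representation, rather than the prompt's own hidden state, drives the prediction.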