🤖 AI Summary
Existing studies of multilingual mechanisms in large language models (LLMs) predominantly analyze individual neurons, whose polysemantic nature makes it difficult to isolate language-specific units within cross-lingual representations. To address this, we propose SAE-LAPE: a method that applies sparse autoencoders (SAEs) to feed-forward layer activations and identifies monosemantic, interpretable, language-specific features via their activation probability across languages. Many of these features concentrate in the middle to final layers of the model, and they influence both multilingual performance and language output, substantially enhancing interpretability. On language identification, SAE-LAPE matches fastText's performance while offering fine-grained semantic readability and mechanistic transparency.
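To make the scoring step concrete, the following is a minimal sketch of LAPE-style feature selection: compute, for each SAE feature, the probability that it activates on tokens of each language, then rank features by the entropy of that distribution (low entropy means the feature fires almost exclusively for one language). The function names and the entropy criterion here are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def lape_scores(acts, lang_ids, n_langs, threshold=0.0):
    """Score SAE features for language specificity.

    acts:     (n_tokens, n_features) SAE feature activations
    lang_ids: (n_tokens,) integer language index per token
    Returns per-language activation probabilities p[l, j] and a
    per-feature entropy; low entropy = language-specific feature.
    """
    n_feats = acts.shape[1]
    p = np.zeros((n_langs, n_feats))
    for lang in range(n_langs):
        mask = lang_ids == lang
        # P(feature j active | token is in language `lang`)
        p[lang] = (acts[mask] > threshold).mean(axis=0)
    # normalize across languages, then take the entropy per feature
    q = p / np.clip(p.sum(axis=0, keepdims=True), 1e-12, None)
    entropy = -(q * np.log(np.clip(q, 1e-12, None))).sum(axis=0)
    return p, entropy
```

A feature that activates only on one language gets entropy near 0, while a language-agnostic feature that fires equally across k languages gets entropy near log(k).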
📝 Abstract
Understanding the multilingual mechanisms of large language models (LLMs) provides insight into how they process different languages, yet this remains challenging. Existing studies often focus on individual neurons, but their polysemantic nature makes it difficult to isolate language-specific units from cross-lingual representations. To address this, we explore sparse autoencoders (SAEs) for their ability to learn monosemantic features that represent concrete and abstract concepts across languages in LLMs. While some of these features are language-independent, the presence of language-specific features remains underexplored. In this work, we introduce SAE-LAPE, a method based on feature activation probability, to identify language-specific features within the feed-forward network. We find that many such features predominantly appear in the middle to final layers of the model and are interpretable. These features influence the model's multilingual performance and language output, and can be used for language identification with performance comparable to fastText while offering greater interpretability. Our code is available at https://github.com/LyzanderAndrylie/language-specific-features.
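The language-identification use mentioned above can be sketched as follows: once a set of language-specific features has been selected, a text is assigned to the language whose features carry the most activation mass. This is a hypothetical illustration of the idea, assuming the feature-index sets per language are already available; it is not the paper's evaluation code.

```python
import numpy as np

def identify_language(feature_acts, lang_feature_sets):
    """Classify a text by its SAE feature activations.

    feature_acts:      (n_tokens, n_features) SAE activations for one text
    lang_feature_sets: dict mapping language name -> list of feature
                       indices previously identified as specific to it
    Returns the predicted language and the per-language scores.
    """
    scores = {}
    for lang, idxs in lang_feature_sets.items():
        # mean activation on that language's specific features
        scores[lang] = float(feature_acts[:, idxs].mean())
    return max(scores, key=scores.get), scores
```

Unlike a black-box classifier such as fastText, each prediction here is traceable to named, interpretable features, which is the transparency advantage the abstract points to.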