🤖 AI Summary
Existing studies of multilingual mechanisms in large language models (LLMs) predominantly analyze individual neurons, whose polysemantic nature makes it difficult to isolate language-specific units within cross-lingual representations. To address this, we propose SAE-LAPE: a method that applies sparse autoencoders (SAEs) to feed-forward layer activations and identifies monosemantic, interpretable, language-specific features via their activation probability across languages. Many of these features concentrate in the middle to final layers of the model, and they influence both multilingual performance and language output, substantially enhancing interpretability. On language identification, SAE-LAPE matches fastText's performance while offering fine-grained semantic readability and mechanistic transparency.
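To make the scoring step concrete, the following is a minimal sketch of LAPE-style feature selection: compute, for each SAE feature, the probability that it activates on tokens of each language, then rank features by the entropy of that distribution (low entropy means the feature fires almost exclusively for one language). The function names and the entropy criterion here are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def lape_scores(acts, lang_ids, n_langs, threshold=0.0):
    """Score SAE features for language specificity.

    acts:     (n_tokens, n_features) SAE feature activations
    lang_ids: (n_tokens,) integer language index per token
    Returns per-language activation probabilities p[l, j] and a
    per-feature entropy; low entropy = language-specific feature.
    """
    n_feats = acts.shape[1]
    p = np.zeros((n_langs, n_feats))
    for lang in range(n_langs):
        mask = lang_ids == lang
        # P(feature j active | token is in language `lang`)
        p[lang] = (acts[mask] > threshold).mean(axis=0)
    # normalize across languages, then take the entropy per feature
    q = p / np.clip(p.sum(axis=0, keepdims=True), 1e-12, None)
    entropy = -(q * np.log(np.clip(q, 1e-12, None))).sum(axis=0)
    return p, entropy
```

A feature that activates only on one language gets entropy near 0, while a language-agnostic feature that fires equally across k languages gets entropy near log(k).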
📝 Abstract
Understanding the multilingual mechanisms of large language models (LLMs) provides insight into how they process different languages, yet this remains challenging. Existing studies often focus on individual neurons, but their polysemantic nature makes it difficult to isolate language-specific units from cross-lingual representations. To address this, we explore sparse autoencoders (SAEs) for their ability to learn monosemantic features that represent concrete and abstract concepts across languages in LLMs. While some of these features are language-independent, the presence of language-specific features remains underexplored. In this work, we introduce SAE-LAPE, a method based on feature activation probability, to identify language-specific features within the feed-forward network. We find that many such features predominantly appear in the middle to final layers of the model and are interpretable. These features influence the model's multilingual performance and language output, and can be used for language identification with performance comparable to fastText while offering greater interpretability. Our code is available at https://github.com/LyzanderAndrylie/language-specific-features.
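The language-identification use mentioned above can be sketched as follows: once a set of language-specific features has been selected, a text is assigned to the language whose features carry the most activation mass. This is a hypothetical illustration of the idea, assuming the feature-index sets per language are already available; it is not the paper's evaluation code.

```python
import numpy as np

def identify_language(feature_acts, lang_feature_sets):
    """Classify a text by its SAE feature activations.

    feature_acts:      (n_tokens, n_features) SAE activations for one text
    lang_feature_sets: dict mapping language name -> list of feature
                       indices previously identified as specific to it
    Returns the predicted language and the per-language scores.
    """
    scores = {}
    for lang, idxs in lang_feature_sets.items():
        # mean activation on that language's specific features
        scores[lang] = float(feature_acts[:, idxs].mean())
    return max(scores, key=scores.get), scores
```

Unlike a black-box classifier such as fastText, each prediction here is traceable to named, interpretable features, which is the transparency advantage the abstract points to.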