Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders

📅 2025-05-08
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
Understanding the neural mechanisms underlying multilingual capabilities in large language models (LLMs) remains challenging, particularly when it comes to identifying, quantifying, and controllably using language-specific features. Method: We introduce the first monolinguality metric for LLMs, derived from sparse autoencoder (SAE) activation decomposition. Leveraging SAEs, we identify and characterize decoupled, cluster-distributed language-specific neural features, and empirically validate their synergistic effects. We further perform targeted feature ablation and generation steering, intervening exclusively on the SAE features corresponding to a language, to achieve selective language control. Contribution/Results: Our approach enables precise suppression of target-language generation via single-feature intervention, while joint intervention across multiple language-specific features significantly improves the stability and accuracy of language-directed generation. This work establishes an interpretable, actionable paradigm for analyzing and modulating multilingual representations in LLMs.
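
The monolinguality metric itself is not spelled out in this summary, so the following is only a minimal sketch of one plausible formulation: score a feature by how much its mean activation on one language exceeds its strongest mean activation on any other language. The function name, array layout, and contrast-of-means definition are assumptions for illustration, not the paper's actual metric.

```python
# Hypothetical monolinguality score for SAE features (not the paper's definition).
# Assumption: acts[lang] holds SAE feature activations for texts in that
# language, with shape (n_tokens, n_features).
import numpy as np

def monolinguality(acts: dict[str, np.ndarray], feature: int, lang: str) -> float:
    """Mean activation on `lang` minus the max mean activation on any other language."""
    mean_on_lang = acts[lang][:, feature].mean()
    mean_on_others = max(
        acts[other][:, feature].mean() for other in acts if other != lang
    )
    return float(mean_on_lang - mean_on_others)
```

A feature scoring high under such a contrast fires mostly on one language, which is the property the ablation and steering experiments below rely on.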

📝 Abstract
The mechanisms behind multilingual capabilities in Large Language Models (LLMs) have been examined using neuron-based or internal-activation-based methods. However, these methods often face challenges such as superposition and layer-wise activation variance, which limit their reliability. Sparse Autoencoders (SAEs) offer a more fine-grained analysis by decomposing the activations of LLMs into sparse linear combinations of SAE features. We introduce a novel metric to assess the monolinguality of features obtained from SAEs, and discover that some features are strongly tied to specific languages. Additionally, we show that ablating these SAE features significantly reduces the LLM's abilities in only one language, leaving the others almost unaffected. Interestingly, we find that some languages have multiple synergistic SAE features, and ablating them together yields a greater effect than ablating each individually. Moreover, we leverage these SAE-derived language-specific features to enhance steering vectors, achieving control over the language generated by the LLM.
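
For readers unfamiliar with the decomposition the abstract refers to, here is a minimal sketch of a standard SAE: activations h are encoded into sparse, non-negative feature activations and reconstructed as a linear combination of learned decoder directions. The ReLU encoder and layer shapes follow common SAE practice and are assumptions about, not a description of, this paper's exact architecture.

```python
# Minimal standard-practice SAE sketch (assumed, not the paper's exact setup).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # h -> feature activations
        self.decoder = nn.Linear(n_features, d_model)  # features -> reconstruction

    def forward(self, h: torch.Tensor):
        f = torch.relu(self.encoder(h))  # sparse, non-negative feature activations
        h_hat = self.decoder(f)          # h ≈ sum_i f_i · d_i + b
        return h_hat, f
```

Each decoder column d_i is a feature direction in activation space; "language-specific features" are those whose activations f_i concentrate on one language.
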
Problem

Research questions and friction points this paper is trying to address.

Identifying language-specific features in multilingual LLMs using Sparse Autoencoders
Measuring monolinguality of SAE features and their impact on LLM performance
Enhancing language control in LLMs via SAE-derived steering vectors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using Sparse Autoencoders to analyze LLM activations
Introducing a metric to assess feature monolinguality
Enhancing steering vectors with language-specific SAE features (see the sketch after this list)
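
As referenced above, here is a hedged sketch of the two interventions: ablating a language-specific feature by zeroing it in the SAE reconstruction, and steering by adding that feature's decoder direction to the hidden state. It reuses the SparseAutoencoder sketch above; the scaling factor and the point at which the model is intervened on are assumptions, not the paper's reported setup.

```python
# Hypothetical ablation and steering interventions on a single SAE feature.
import torch

@torch.no_grad()
def ablate_feature(sae, h: torch.Tensor, feature: int) -> torch.Tensor:
    """Reconstruct h with one SAE feature zeroed out."""
    f = torch.relu(sae.encoder(h))
    f[..., feature] = 0.0
    return sae.decoder(f)

@torch.no_grad()
def steer_toward_feature(sae, h: torch.Tensor, feature: int, scale: float = 5.0) -> torch.Tensor:
    """Push h along one feature's decoder direction, used as a steering vector."""
    direction = sae.decoder.weight[:, feature]  # column i = feature i's direction
    return h + scale * direction
```

Per the summary, ablating a single such feature suppresses one language while leaving others largely intact, and steering jointly along several language-specific features is more stable than steering along one.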