AI Summary
This work addresses two limitations of existing sparse autoencoders (SAEs) for language steering: most are trained on English-only data, and intervention layers are selected heuristically, which makes multilingual control unreliable. To overcome these, the authors train SAEs on multilingual data and introduce an a priori layer-selection criterion based on the intersection of multilingual representation alignment and language separability. The criterion predicts effective intervention layers without an exhaustive layer-wise search. Evaluated on LLaMA-3.1-8B and Gemma-2-9B, the method stabilizes the trade-off between language identification accuracy and generation quality in machine translation and cross-lingual summarization, as measured by SpBLEU, ROUGE-L, COMET, and LaSE.
Abstract
Sparse autoencoders (SAEs) enable feature-level mechanistic interpretability and activation steering in large language models (LLMs), but SAE-based language control remains unreliable in multilingual settings: most SAEs are trained on English-only data, and steering layers are chosen heuristically. We address these limitations by advancing a principled, mechanistic account of multilingual language steering with SAEs. First, we show that training SAEs on multilingual data consistently strengthens cross-lingual representations and yields more reliable, quality-preserving language control across layers and model families. Second, we introduce an a priori steering layer-selection rule based on the intersection of multilingual alignment and language separability, which predicts effective intervention depths without exhaustive layerwise search. We evaluate our approach on LLaMA-3.1-8B and Gemma-2-9B across machine translation and cross-lingual summarization (CrossSumm), using SpBLEU, ROUGE-L, COMET, and LaSE. Our results show that multilingual SAEs combined with intersection-selected layers stabilize the trade-off between language identification accuracy and generation quality, providing a principled, predictive, representation-level account of multilingual SAE steering.
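The abstract does not spell out how the intersection of alignment and separability is computed, so the following is only an illustrative sketch of one possible reading: rank layers by each score separately and keep the layers that appear in both top-k sets. The function names, the top-k interpretation, the value of `k`, and the per-layer scores are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Set


def top_k_layers(scores: Sequence[float], k: int) -> Set[int]:
    """Return the indices of the k highest-scoring layers."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def select_steering_layers(
    alignment: Sequence[float],
    separability: Sequence[float],
    k: int,
) -> List[int]:
    """Keep layers that rank in the top k on BOTH metrics.

    Assumption: "intersection" is read here as a set intersection of
    top-k layers; the paper may define it differently (for example,
    as a crossing point of the two per-layer curves).
    """
    if len(alignment) != len(separability):
        raise ValueError("Both score lists must cover the same layers.")
    return sorted(top_k_layers(alignment, k) & top_k_layers(separability, k))


# Made-up per-layer scores for a toy 6-layer model (not real results).
alignment = [0.10, 0.35, 0.60, 0.80, 0.75, 0.40]
separability = [0.20, 0.30, 0.55, 0.70, 0.90, 0.85]

candidate_layers = select_steering_layers(alignment, separability, k=3)
print(candidate_layers)
```

With these toy scores, layers 3 and 4 are the only ones in the top three for both metrics, so they would be the candidate intervention depths under this reading. The point of the sketch is only that the selection uses precomputed representation statistics rather than running steering experiments at every layer.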