🤖 AI Summary
This work addresses the pervasive language confusion problem, i.e., unintended non-target-language generation, in English-centric large language models (LLMs), presenting the first mechanistic interpretability study of the phenomenon. Using the Language Confusion Benchmark (LCB), TunedLens-based layer-wise analysis, and comparisons with multilingual-tuned models, we localize confusion points (the positions where unintended language switches occur), trace them to transition failures in the final layers, and identify the critical neurons involved. We then propose a lightweight intervention that edits this small set of neurons, suppressing confusion significantly without degrading general capabilities. Experiments show that the approach matches full multilingual alignment across most languages: confusion rates drop substantially, outputs become cleaner and higher quality, and fluency and general-purpose functionality are preserved.
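As a concrete illustration of what such a neuron edit can look like at inference time, the sketch below suppresses a handful of MLP neurons in one late layer via PyTorch forward hooks. It is not the authors' implementation: the model name, layer index, neuron indices, scaling factor, and the LLaMA-style module path are all assumptions.

```python
# Minimal sketch of neuron-level editing via a forward hook (illustrative only).
# Assumes a LLaMA-style decoder-only model and a hypothetical set of
# "critical neuron" indices obtained from comparative analysis.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # hypothetical English-centric base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

TARGET_LAYER = 30                  # hypothetical late layer flagged by layer-wise analysis
CRITICAL_NEURONS = [118, 2047, 5531]  # hypothetical MLP neuron indices
SCALE = 0.0                        # suppress entirely; a partial rescale is another option

def edit_neurons(module, inputs, output):
    # Rescale the chosen hidden units of the MLP activation (intermediate dimension).
    output[..., CRITICAL_NEURONS] *= SCALE
    return output

# Hook the output of the MLP activation function in the chosen layer
# (module path assumes the Hugging Face LLaMA layout).
mlp_act = model.model.layers[TARGET_LAYER].mlp.act_fn
handle = mlp_act.register_forward_hook(edit_neurons)

prompt = "Répondez en français : quelle est la capitale du Japon ?"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # restore the unedited model
```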
📝 Abstract
Language confusion -- where large language models (LLMs) generate text in unintended languages, contrary to the user's request -- remains a critical challenge, especially for English-centric models. We present the first mechanistic interpretability (MI) study of language confusion, combining behavioral benchmarking with neuron-level analysis. Using the Language Confusion Benchmark (LCB), we show that confusion points (CPs) -- specific positions where language switches occur -- are central to this phenomenon. Through layer-wise analysis with TunedLens and targeted neuron attribution, we reveal that transition failures in the final layers drive confusion. We further demonstrate that editing a small set of critical neurons, identified via comparative analysis with multilingual-tuned models, substantially mitigates confusion without harming general competence or fluency. Our approach matches multilingual alignment in confusion reduction for most languages and yields cleaner, higher-quality outputs. These findings provide new insights into the internal dynamics of LLMs and highlight neuron-level interventions as a promising direction for robust, interpretable multilingual language modeling.
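The "comparative analysis with multilingual-tuned models" mentioned above could, in its simplest form, compare per-neuron activation statistics of the English-centric model and a multilingual-tuned variant of the same architecture on identical multilingual prompts. The sketch below is one plausible reading under that assumption, not the paper's actual procedure; the model names, layer choice, and divergence score are illustrative.

```python
# Sketch: rank MLP neurons by how differently they activate in an English-centric
# base model vs. a multilingual-tuned counterpart of the same architecture.
# All model names, layer choices, and the scoring rule are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_mlp_activations(model_name, prompts, layer_idx):
    """Average absolute MLP activations at one layer over a set of prompts."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
    acts = []

    def grab(module, inputs, output):
        # output shape: (batch, seq_len, intermediate_size) -> per-neuron mean
        acts.append(output.abs().mean(dim=(0, 1)).float())

    handle = model.model.layers[layer_idx].mlp.act_fn.register_forward_hook(grab)
    with torch.no_grad():
        for p in prompts:
            model(**tok(p, return_tensors="pt"))
    handle.remove()
    return torch.stack(acts).mean(dim=0)

prompts = [
    "Antworte auf Deutsch: Was ist die Hauptstadt von Japan?",
    "回答は日本語でお願いします。フランスの首都はどこですか。",
]

base = mean_mlp_activations("base-english-centric-model", prompts, layer_idx=30)
tuned = mean_mlp_activations("multilingual-tuned-model", prompts, layer_idx=30)

# Neurons whose behavior diverges most between the two models are candidates
# for the "critical neurons" that a targeted edit would modulate.
divergence = (base - tuned).abs()
print(divergence.topk(10).indices.tolist())
```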