Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the pervasive language confusion problem, i.e., unintended generation in a non-target language, in English-centric large language models (LLMs). It presents the first mechanistic interpretability study of the phenomenon, localizing the critical layers and neurons behind cross-lingual switching failures at confusion points, the positions where the output language switches. The authors propose a lightweight intervention based on targeted neuron attribution and editing that suppresses confusion significantly without degrading general capabilities. Using the Language Confusion Benchmark (LCB), TunedLens-based layer-wise analysis, and comparisons with multilingual-tuned models, they identify and modulate these key neurons. Experiments show performance comparable to full multilingual alignment across most languages: confusion rates drop substantially, outputs become cleaner and higher quality, and fluency and general-purpose functionality are preserved.
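
The paper's intervention, neuron attribution followed by editing, can be pictured as a forward hook that dampens a handful of MLP activations in a late layer. Below is a minimal PyTorch sketch under stated assumptions: GPT-2 stands in for an English-centric LLM, and the layer and neuron indices are placeholders rather than values from the paper.

```python
# Minimal sketch: dampening hypothetical "confusion neurons" with a forward
# hook. Model, layer, and neuron indices are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER = -1                 # intervene in the last transformer block
NEURONS = [11, 42, 1337]   # hypothetical critical-neuron indices
SCALE = 0.0                # 0.0 suppresses fully; values < 1.0 dampen

def edit_neurons(module, inputs, output):
    # output: post-GELU MLP activations, shape (batch, seq, d_mlp)
    output[..., NEURONS] *= SCALE
    return output

# Hook the MLP activation module; the attribute path differs per architecture
# and transformers version.
handle = model.transformer.h[LAYER].mlp.act.register_forward_hook(edit_neurons)

prompt = "Réponds en français : quelle est la capitale de la France ?"
ids = tok(prompt, return_tensors="pt")
out = model.generate(**ids, max_new_tokens=40, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # restore the unedited model
```

A hook keeps the edit reversible: removing it restores the original behavior, which is convenient for measuring confusion rates with and without the intervention.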

📝 Abstract
Language confusion, where large language models (LLMs) generate text in unintended languages contrary to the user's needs, remains a critical challenge, especially for English-centric models. We present the first mechanistic interpretability (MI) study of language confusion, combining behavioral benchmarking with neuron-level analysis. Using the Language Confusion Benchmark (LCB), we show that confusion points (CPs), the specific positions where language switches occur, are central to this phenomenon. Through layer-wise analysis with TunedLens and targeted neuron attribution, we reveal that transition failures in the final layers drive confusion. We further demonstrate that editing a small set of critical neurons, identified via comparative analysis with multilingual-tuned models, substantially mitigates confusion without harming general competence or fluency. Our approach matches multilingual alignment in confusion reduction for most languages and yields cleaner, higher-quality outputs. These findings provide new insights into the internal dynamics of LLMs and highlight neuron-level interventions as a promising direction for robust, interpretable multilingual language modeling.
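
The abstract's layer-wise analysis can be approximated by decoding each layer's hidden state into vocabulary space and watching at which depth the predicted language flips. The sketch below uses the plain logit lens rather than TunedLens (which trains per-layer affine probes), so treat it as a simplified stand-in; the model and prompt are illustrative.

```python
# Minimal sketch: logit-lens style layer-wise decoding to spot the depth at
# which the model's next-token guess changes language.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "La capitale de la France est"
ids = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**ids, output_hidden_states=True)

# out.hidden_states: tuple of (num_layers + 1) tensors, each (1, seq, d_model)
for layer, h in enumerate(out.hidden_states):
    # Apply the final layer norm and unembedding to the last position only.
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))
    guess = tok.decode(logits.argmax(dim=-1))
    print(f"layer {layer:2d} -> next-token guess: {guess!r}")
```
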
Problem

Research questions and friction points this paper is trying to address.

Understanding language confusion in English-centric LLMs (a confusion-point detection sketch follows this list)
Identifying neuron-level causes of unintended language switching
Mitigating confusion via targeted neuron editing without performance loss
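
As referenced in the first item above, confusion points can be located behaviorally by sliding a language-ID window over the output and flagging positions where the detected language leaves the target. A minimal sketch, assuming the langdetect package as a stand-in for the paper's language-identification setup; the function name and window size are my own:

```python
# Minimal sketch: flag confusion points (CPs) as window positions whose
# detected language differs from the target. `langdetect` is a stand-in
# for whatever language identifier the paper actually uses.
from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # make langdetect deterministic

def confusion_points(tokens, target_lang="fr", window=8):
    """Return (index, detected_language) pairs where a window leaves target_lang."""
    cps = []
    for i in range(len(tokens) - window + 1):
        chunk = " ".join(tokens[i : i + window])
        try:
            lang = detect(chunk)
        except Exception:
            continue  # short or ambiguous windows are skipped
        if lang != target_lang:
            cps.append((i, lang))
    return cps

# Example: a French answer that drifts into English midway.
text = "La capitale de la France est Paris. It is also the largest city."
print(confusion_points(text.split(), target_lang="fr"))
```
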
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mechanistic interpretability study of language confusion
Editing critical neurons to mitigate confusion (a neuron-ranking sketch follows this list)
Layer-wise analysis with TunedLens and neuron attribution
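
The comparative identification step behind the neuron editing can be pictured as averaging each MLP neuron's activation over shared prompts in both models and ranking neurons by the gap. In the sketch below, gpt2 and distilgpt2 are placeholders for a real base/multilingual-tuned pair; they are unrelated models that merely share the same activation width.

```python
# Minimal sketch: rank MLP neurons by how differently they activate in two
# models on the same prompts. Model names are placeholders for a real
# base / multilingual-tuned pair.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_mlp_activations(model_name, prompts, layer=-1):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    acts = []
    def grab(module, inputs, output):
        acts.append(output.mean(dim=(0, 1)))  # average over batch and positions
    handle = model.transformer.h[layer].mlp.act.register_forward_hook(grab)
    with torch.no_grad():
        for p in prompts:
            model(**tok(p, return_tensors="pt"))
    handle.remove()
    return torch.stack(acts).mean(dim=0)  # (d_mlp,) mean activation per neuron

prompts = ["Réponds en français :", "Antworte auf Deutsch:"]
base = mean_mlp_activations("gpt2", prompts)         # placeholder "base" model
tuned = mean_mlp_activations("distilgpt2", prompts)  # placeholder "tuned" model
candidates = (base - tuned).abs().topk(10).indices
print("candidate critical neurons:", candidates.tolist())
```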