Sparse Autoencoder-guided Supervised Finetuning to Mitigate Unexpected Code-Switching in LLMs

📅 2025-07-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) frequently exhibit unintended code-switching—spurious, context-inappropriate language shifts—in multilingual settings, degrading response readability and usability. To address this, we conduct the first mechanistic analysis using sparse autoencoders and identify abnormally elevated pre-activation values of language-specific features as the primary cause. Building on this insight, we propose Sparse Autoencoder-guided Supervised Finetuning (SASFT), a method that selectively modulates the pre-activations of language-identity features to enforce consistent language output. SASFT is validated across five state-of-the-art LLMs and three languages, reducing unintended code-switching by over 50% on average compared to standard supervised fine-tuning; in four evaluated cases, such switching is entirely eliminated. Crucially, the method preserves or improves performance across six multilingual benchmarks, demonstrating that language stability and multilingual capability can be enhanced simultaneously without trade-off.

📝 Abstract
Large Language Models (LLMs) have impressive multilingual capabilities, but they suffer from unexpected code-switching, also known as language mixing, which involves switching to unexpected languages in the model response. This problem leads to poor readability and degrades the usability of model responses. However, existing work on this issue lacks a mechanistic analysis and shows limited effectiveness. In this paper, we first provide an in-depth analysis of unexpected code-switching using sparse autoencoders and find that when LLMs switch to a language, the features of that language exhibit excessive pre-activation values. Based on our findings, we propose $\textbf{S}$parse $\textbf{A}$utoencoder-guided $\textbf{S}$upervised $\textbf{F}$ine$\textbf{t}$uning (SASFT), which teaches LLMs to maintain appropriate pre-activation values of specific language features during training. Experiments on five models across three languages demonstrate that SASFT consistently reduces unexpected code-switching by more than 50% compared to standard supervised fine-tuning, with complete elimination in four cases. Moreover, SASFT maintains or even improves the models' performance on six multilingual benchmarks, showing its effectiveness in addressing code-switching while preserving multilingual capabilities.
Problem

Research questions and friction points this paper is trying to address.

Mitigating unexpected code-switching in multilingual LLMs
Reducing language mixing to improve response readability
Maintaining multilingual capabilities while correcting abnormal language-feature pre-activations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses sparse autoencoders to identify and analyze language-specific features
Adds a supervised fine-tuning constraint that keeps language-feature pre-activation values in an appropriate range
Reduces unexpected code-switching by more than 50% across five multilingual LLMs
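The mechanism above can be sketched as an auxiliary training loss: compute the SAE encoder's pre-activation values (the linear projection before the ReLU) for the residual-stream hidden states, then penalize deviation of the language-identity features from a desired level. This is a minimal illustrative sketch, not the paper's exact formulation; the squared-error penalty, the `target` level, and the `weight` coefficient are assumptions for illustration.

```python
import numpy as np

def sae_pre_activations(h, W_enc, b_enc):
    """SAE encoder pre-activations z = h @ W_enc + b_enc
    (the values before the ReLU nonlinearity)."""
    return h @ W_enc + b_enc

def sasft_aux_loss(h, W_enc, b_enc, lang_feature_ids, target, weight=0.1):
    """Hypothetical SASFT-style penalty: mean squared deviation of the
    selected language-identity features' pre-activations from `target`.
    In training, this term would be added to the standard SFT loss."""
    z = sae_pre_activations(h, W_enc, b_enc)
    z_lang = z[:, lang_feature_ids]          # keep only language features
    return weight * np.mean((z_lang - target) ** 2)

# Toy example: 2 hidden states of width 4, an SAE with 8 features.
h = np.zeros((2, 4))
W_enc = np.zeros((4, 8))
b_enc = np.ones(8)                            # pre-activations are all 1.0
loss = sasft_aux_loss(h, W_enc, b_enc, [0, 1], target=1.0)
```

With the toy tensors above, the selected pre-activations already equal the target, so the penalty is zero; shifting `target` away from the observed values makes the penalty grow, which is the signal that would push fine-tuning to suppress excessive pre-activations.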
Boyi Deng
University of Science and Technology of China
LLMs · Mechanistic Interpretability
Yu Wan
Tongyi Lab, Alibaba Group Inc
Baosong Yang
Alibaba Group Inc
Machine Learning · Large Language Model · Machine Translation
Fei Huang
Tongyi Lab, Alibaba Group Inc
Wenjie Wang
National University of Singapore
Fuli Feng
Institute of Dataspace