Sparse Autoencoder-guided Supervised Finetuning to Mitigate Unexpected Code-Switching in LLMs

📅 2025-07-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) frequently exhibit unintended code-switching—spurious, context-inappropriate language shifts—in multilingual settings, degrading response readability and usability. To address this, we conduct the first mechanistic analysis using sparse autoencoders and identify abnormally elevated pre-activation values of language-specific features as the primary cause. Building on this insight, we propose Sparse Autoencoder-guided Supervised Finetuning (SASFT), a method that selectively modulates the pre-activations of language-identity features to enforce consistent language output. SASFT is validated across five state-of-the-art LLMs and three languages, reducing unintended code-switching by over 50% on average compared to standard supervised fine-tuning; in four evaluated cases, such switching is entirely eliminated. Crucially, the method preserves or improves performance across six multilingual benchmarks, demonstrating that language stability and multilingual capability can be enhanced simultaneously without trade-off.

📝 Abstract
Large Language Models (LLMs) have impressive multilingual capabilities, but they suffer from unexpected code-switching, also known as language mixing, which involves switching to unexpected languages in the model response. This problem leads to poor readability and degrades the usability of model responses. However, existing work on this issue lacks a mechanistic analysis and shows limited effectiveness. In this paper, we first provide an in-depth analysis of unexpected code-switching using sparse autoencoders and find that when LLMs switch to a language, the features of that language exhibit excessive pre-activation values. Based on our findings, we propose $\textbf{S}$parse $\textbf{A}$utoencoder-guided $\textbf{S}$upervised $\textbf{F}$ine$\textbf{t}$uning (SASFT), which teaches LLMs to maintain appropriate pre-activation values of specific language features during training. Experiments on five models across three languages demonstrate that SASFT consistently reduces unexpected code-switching by more than 50% compared to standard supervised fine-tuning, with complete elimination in four cases. Moreover, SASFT maintains or even improves the models' performance on six multilingual benchmarks, showing its effectiveness in addressing code-switching while preserving multilingual capabilities.
Problem

Research questions and friction points this paper is trying to address.

Mitigating unexpected code-switching in multilingual LLMs
Reducing language mixing to improve response readability
Maintaining multilingual capabilities while correcting abnormal language-feature pre-activations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses sparse autoencoders to identify and analyze language-specific features
Adds a supervised fine-tuning constraint that keeps language-feature pre-activation values in an appropriate range
Reduces unexpected code-switching by more than 50% across five multilingual LLMs
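The mechanism above can be sketched as an auxiliary training loss: compute the SAE encoder's pre-activation values (the linear projection before the ReLU) for the residual-stream hidden states, then penalize deviation of the language-identity features from a desired level. This is a minimal illustrative sketch, not the paper's exact formulation; the squared-error penalty, the `target` level, and the `weight` coefficient are assumptions for illustration.

```python
import numpy as np

def sae_pre_activations(h, W_enc, b_enc):
    """SAE encoder pre-activations z = h @ W_enc + b_enc
    (the values before the ReLU nonlinearity)."""
    return h @ W_enc + b_enc

def sasft_aux_loss(h, W_enc, b_enc, lang_feature_ids, target, weight=0.1):
    """Hypothetical SASFT-style penalty: mean squared deviation of the
    selected language-identity features' pre-activations from `target`.
    In training, this term would be added to the standard SFT loss."""
    z = sae_pre_activations(h, W_enc, b_enc)
    z_lang = z[:, lang_feature_ids]          # keep only language features
    return weight * np.mean((z_lang - target) ** 2)

# Toy example: 2 hidden states of width 4, an SAE with 8 features.
h = np.zeros((2, 4))
W_enc = np.zeros((4, 8))
b_enc = np.ones(8)                            # pre-activations are all 1.0
loss = sasft_aux_loss(h, W_enc, b_enc, [0, 1], target=1.0)
```

With the toy tensors above, the selected pre-activations already equal the target, so the penalty is zero; shifting `target` away from the observed values makes the penalty grow, which is the signal that would push fine-tuning to suppress excessive pre-activations.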
Boyi Deng
University of Science and Technology of China
LLMs · Mechanistic Interpretability
Yu Wan
Tongyi Lab, Alibaba Group Inc
Baosong Yang
Alibaba Group Inc
Machine Learning · Large Language Model · Machine Translation
Fei Huang
Tongyi Lab, Alibaba Group Inc
Wenjie Wang
National University of Singapore
Fuli Feng
Institute of Dataspace