Language-Switching Triggers Take a Latent Detour Through Language Models

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the limited understanding of the internal mechanisms behind language-switching backdoor attacks in language models, in particular how a trigger sequence hijacks model computations. Focusing on an 8B-parameter autoregressive language model, the authors study a three-word Latin trigger that switches the model's output from English to French. Using attention head analysis, subspace decomposition of intermediate representations, and MLP behavior tracking, they reveal, for the first time, that the backdoor signal propagates through a latent subspace orthogonal to the model's natural language-identity direction. The attack mechanism is decomposed into three stages: trigger composition, orthogonal latent propagation, and MLP-mediated transformation. The work also identifies a serial bottleneck at a single sequence position: corrupting that position fully suppresses the backdoor effect but also degrades normal model functionality. The orthogonal encoding suggests that defenses that look for language-like signals in intermediate representations would miss this trigger.

📝 Abstract

Backdoor attacks on language models pose a growing security concern, yet the internal mechanisms by which a trigger sequence hijacks model computations remain poorly understood. We identify a circuit underlying a language-switching backdoor in an 8B-parameter autoregressive language model, where a three-word Latin trigger (nine tokens) redirects English output to French. We decompose the circuit into three phases: (1) distributed attention heads at early layers compose the trigger tokens into the last sequence position; (2) the resulting signal propagates through mid-layers in a subspace orthogonal to the model's natural language-identity direction; (3) the MLP at the final layer converts this latent signal into French logits. The entire circuit flows through a serial bottleneck at a single position: corrupting that position at any layer entirely mitigates the trigger but also hinders the model's capabilities. The orthogonal latent encoding suggests that defenses that search for language-like signals in intermediate representations would miss this trigger entirely.
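
The point about orthogonal encoding can be illustrated with a toy sketch. This is not the authors' code or data: the hidden size, the random "language-identity" direction, the "trigger" direction, the signal strength of 5.0, and the `language_probe` helper are all hypothetical stand-ins. It only shows why a probe that reads the projection onto a language direction sees no change when a signal is added in an orthogonal direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # hypothetical hidden size, not the real model's

# Hypothetical unit vector standing in for the language-identity direction.
lang_dir = rng.normal(size=d_model)
lang_dir /= np.linalg.norm(lang_dir)

# Hypothetical trigger direction: a random vector with its component
# along lang_dir removed, so it is orthogonal to lang_dir by construction.
raw = rng.normal(size=d_model)
trigger_dir = raw - (raw @ lang_dir) * lang_dir
trigger_dir /= np.linalg.norm(trigger_dir)


def language_probe(hidden: np.ndarray) -> float:
    """Toy defense: score a hidden state by its projection onto lang_dir."""
    return float(hidden @ lang_dir)


# A random "clean" hidden state, and the same state with a trigger signal added.
clean = rng.normal(size=d_model)
triggered = clean + 5.0 * trigger_dir

# Change seen by the language probe versus change along the trigger direction.
probe_shift = language_probe(triggered) - language_probe(clean)
latent_shift = float((triggered - clean) @ trigger_dir)

print(f"probe shift:  {probe_shift:.2e}")
print(f"latent shift: {latent_shift:.2f}")
```

In this toy setup the probe shift is zero up to floating-point error while the shift along the trigger direction equals the injected strength, which mirrors the abstract's claim that a defense searching only for language-like signals would not register the trigger.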

Problem

Research questions and friction points this paper is trying to address.

backdoor attacks

language models

trigger mechanisms

latent representations

language switching

Innovation

Methods, ideas, or system contributions that make the work stand out.

backdoor attack

language-switching trigger

latent subspace