🤖 AI Summary
This study addresses the limited understanding of the internal mechanisms underlying language-switching backdoor attacks in language models, particularly how trigger sequences hijack model computations. Focusing on an 8B-parameter autoregressive language model, the authors employ a three-word Latin trigger to induce a switch from English to French outputs. Through attention head analysis, subspace decomposition of intermediate representations, and MLP behavior tracking, they reveal— for the first time—that the backdoor signal propagates through a latent subspace orthogonal to natural language identity directions. The attack mechanism is decomposed into three stages: trigger composition, orthogonal latent propagation, and MLP-mediated transformation. The work identifies a serial bottleneck critical to the attack; disrupting this location fully suppresses the backdoor effect but concurrently degrades normal model functionality, demonstrating the inadequacy of conventional language-feature-based defenses against such attacks.
📝 Abstract
Backdoor attacks on language models pose a growing security concern, yet the internal mechanisms by which a trigger sequence hijacks model computations remain poorly understood. We identify a circuit underlying a language-switching backdoor in an 8B-parameter autoregressive language model, where a three-word Latin trigger (nine tokens) redirects English output to French. We decompose the circuit into three phases: (1) distributed attention heads at early layers compose the trigger tokens into the last sequence position; (2) the resulting signal propagates through mid-layers in a subspace orthogonal to the model's natural language-identity direction; (3) the MLP at the final layer converts this latent signal into French logits. The entire circuit flows through a serial bottleneck at a single position: corrupting that position at any layer entirely mitigate the trigger but also hinder the model's capabilities. The orthogonal latent encoding suggests that defenses that search for language-like signals in intermediate representations would miss this trigger entirely.