Triggers Hijack Language Circuits: A Mechanistic Analysis of Backdoor Behaviors in Large Language Models

📅 2026-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates how backdoor triggers in large language models hijack the models' intrinsic language-encoding mechanisms to manipulate outputs, shedding light on the mechanistic origin of the associated security risk. Using activation patching and attention-head analysis, the work shows for the first time that backdoors do not construct independent circuits but instead repurpose pre-existing language-processing components learned during pretraining, with trigger pathways emerging as early as the initial layers. Experiments on the GAPperon model series (1B/8B/24B) demonstrate substantial overlap between trigger-associated attention heads and those naturally involved in encoding the output language, with Jaccard similarity scores ranging from 0.18 to 0.66. This high degree of functional entanglement between backdoor behavior and normal model operation offers a novel perspective and a quantitative foundation for backdoor detection and defense strategies.
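The reported head-overlap numbers are Jaccard indices over sets of attention heads. As a minimal sketch of how such a score is computed (the specific head sets below are hypothetical, not taken from the paper), heads can be represented as (layer, head) pairs:

```python
# Hedged sketch: Jaccard index between trigger-activated heads and
# language-encoding heads, each represented as a (layer, head) pair.
# The head sets below are illustrative, not results from the paper.

def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| between two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical top heads from two separate attribution analyses:
trigger_heads = {(2, 5), (3, 1), (3, 7), (4, 0)}
language_heads = {(3, 1), (3, 7), (4, 0), (5, 2)}

print(jaccard(trigger_heads, language_heads))  # 3 shared / 5 total = 0.6
```

A score of 1.0 would mean the trigger reuses exactly the natural language-encoding heads; the paper's observed range of 0.18 to 0.66 indicates substantial but partial reuse.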

📝 Abstract
Backdoor attacks pose significant security risks for Large Language Models (LLMs), yet the internal mechanisms by which triggers operate remain poorly understood. We present the first mechanistic analysis of language-switching backdoors, studying the GAPperon model family (1B, 8B, 24B parameters), which contains triggers injected during pretraining that cause output language switching. Using activation patching, we localize trigger formation to early layers (7.5-25% of model depth) and identify which attention heads process trigger information. Our central finding is that trigger-activated heads substantially overlap with heads naturally encoding output language across model scales, with Jaccard indices between 0.18 and 0.66 over the top heads identified. This suggests that backdoor triggers do not form isolated circuits but instead co-opt the model's existing language components. These findings have implications for backdoor defense: detection methods may benefit from monitoring known functional components rather than searching for hidden circuits, and mitigation strategies could potentially leverage this entanglement between injected and natural behaviors.
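Activation patching, the localization method named in the abstract, works by caching an internal activation from one run (e.g., with the trigger present) and splicing it into another run (the clean input), then measuring how the output shifts. A toy, hedged illustration of the idea, using scalar "activations" and stub layers rather than a real transformer:

```python
# Hedged toy illustration of activation patching (not the paper's code):
# run a "clean" and a "corrupted" (trigger-bearing) input through a tiny
# stub model, then splice the corrupted activation at one layer into the
# clean run and observe how far the output moves toward corrupted behavior.

def run(layers, x, patch=None):
    """Apply each layer in order; if patch=(i, act), overwrite layer i's output."""
    for i, f in enumerate(layers):
        x = f(x)
        if patch is not None and patch[0] == i:
            x = patch[1]
    return x

# Toy 3-layer "model" on scalars (stand-ins for residual-stream states).
layers = [lambda x: 2 * x, lambda x: x + 1, lambda x: 3 * x]

clean, corrupted = 1.0, 5.0
# Cache the corrupted run's activation just after layer 0.
corrupt_act0 = layers[0](corrupted)

baseline = run(layers, clean)                           # 9.0
patched = run(layers, clean, patch=(0, corrupt_act0))   # 33.0
print(baseline, patched)
```

Sweeping the patch position over layers (and, in a real transformer, over individual attention heads) and scoring the output shift at each site is what lets the paper attribute trigger formation to the early 7.5-25% of model depth.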
Problem

Research questions and friction points this paper is trying to address.

backdoor attacks
trigger mechanisms
language circuits
large language models
model interpretability
Innovation

Methods, ideas, or system contributions that make the work stand out.

backdoor attack
mechanistic analysis
language switching
activation patching
attention heads