🤖 AI Summary
This work addresses backdoor defense in large language models (LLMs), where existing defenses often incur high computational cost, degrade performance, or increase inference latency. The authors propose TIGS (Tail-risk Intrinsic Geometric Smoothing), a plug-and-play defense that operates at inference time without parameter updates or external clean data. TIGS identifies and suppresses trigger-induced attention collapse through content-aware tail-risk screening, followed by geometric smoothing of the affected attention rows. It requires no additional training and no auxiliary generation, and it adds only marginal latency. Experimental results show that TIGS substantially reduces backdoor attack success rates while preserving clean reasoning performance and semantic consistency on clean inputs. The approach is effective across diverse LLM architectures, including dense, reasoning-oriented, and sparse mixture-of-experts models.
📝 Abstract
Defending against backdoor attacks in large language models remains a critical practical challenge. Existing defenses mitigate these threats but typically incur high preparation costs and degrade utility via offline purification, or introduce severe latency via complex online interventions. To overcome this dichotomy, we present Tail-risk Intrinsic Geometric Smoothing (TIGS), a plug-and-play inference-time defense requiring no parameter updates, external clean data, or auxiliary generation. TIGS leverages the observation that successful backdoor triggers consistently induce localized attention collapse within the semantic content region. Operating entirely within the native forward pass, TIGS first performs content-aware tail-risk screening to identify suspicious attention heads and rows using sample-internal signals. It then applies intrinsic geometric smoothing: a weak content-domain correction preserves semantic anchoring, while a stronger full-row contraction disrupts trigger-dominant routing. Finally, a controlled full-row write-back reconstructs the attention matrix to ensure inference stability. Extensive evaluations demonstrate that TIGS substantially suppresses attack success rates while strictly preserving clean reasoning and open-ended semantic consistency. Crucially, this favorable security-utility-latency equilibrium persists across diverse architectures, including dense, reasoning-oriented, and sparse mixture-of-experts models. By structurally disrupting adversarial routing with marginal latency overhead, TIGS establishes a highly practical, deployment-ready defense standard for state-of-the-art LLMs.
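The three stages named in the abstract (screening, smoothing, write-back) can be illustrated with a toy sketch on a single attention head. This is not the authors' implementation: the abstract does not give the scoring rule, thresholds, or smoothing operator, so the function names, the peak-mass score, the quantile-plus-floor rule, and the `tail_q`, `floor`, `weak`, and `strong` parameters below are all hypothetical choices made only to keep the example concrete.

```python
import math


def entropy(row):
    """Shannon entropy (nats) of one attention row."""
    return -sum(p * math.log(p) for p in row if p > 0.0)


def blend_toward_mean(row, idx, strength):
    """Pull the entries at `idx` toward their mean; their total mass is unchanged."""
    mean = sum(row[i] for i in idx) / len(idx)
    out = list(row)
    for i in idx:
        out[i] = (1.0 - strength) * row[i] + strength * mean
    return out


def screen_and_smooth(attn, content_idx, tail_q=0.75, floor=0.8, weak=0.1, strong=0.5):
    """Toy single-head sketch; every parameter here is an assumption.

    attn: list of rows, each a probability distribution over key positions.
    content_idx: key positions treated as the semantic content region.
    Returns (new_attn, flagged_row_indices).
    """
    # 1. Screening (assumed rule): score each row by its peak mass inside the
    #    content region, then flag rows in the upper tail of this sample's own
    #    scores. The absolute floor keeps flat, clean samples from being flagged.
    scores = [max(row[i] for i in content_idx) for row in attn]
    ranked = sorted(scores)
    threshold = ranked[min(len(ranked) - 1, int(tail_q * len(ranked)))]
    flagged = [r for r, s in enumerate(scores) if s >= threshold and s >= floor]

    # 2. Smoothing (assumed operator): a weak correction inside the content
    #    region, then a stronger contraction of the whole row toward uniform.
    new_attn = [list(row) for row in attn]
    all_idx = list(range(len(attn[0])))
    for r in flagged:
        row = blend_toward_mean(attn[r], content_idx, weak)
        row = blend_toward_mean(row, all_idx, strong)
        # 3. Write-back: renormalise and replace the full row.
        total = sum(row)
        new_attn[r] = [p / total for p in row]
    return new_attn, flagged


# Row 2 puts almost all of its mass on one content token (a "collapsed" row).
attn = [
    [0.25, 0.25, 0.25, 0.25],
    [0.25, 0.25, 0.25, 0.25],
    [0.01, 0.97, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
]
new_attn, flagged = screen_and_smooth(attn, content_idx=[1, 2, 3])
print(flagged)
print(entropy(attn[2]), entropy(new_attn[2]))
```

In this toy input only the collapsed row is flagged and rewritten, and its entropy rises after smoothing, while the other rows pass through unchanged. A real deployment would apply such a step per head inside the model's forward pass, which the sketch does not attempt.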