From Poisoned to Aware: Fostering Backdoor Self-Awareness in LLMs

📅 2025-10-04

📈 Citations: 0

✨ Influential: 0

career value

231K/year

🤖 AI Summary

Detecting stealthy backdoor triggers in large language models (LLMs) remains challenging due to their opacity and lack of explicit interpretability. Method: This paper proposes an introspection-based post-training defense framework that introduces *inversion-based reinforcement learning*—the first approach enabling LLMs to autonomously reason about, reconstruct, and explicitly articulate their latent backdoor trigger mechanisms without external prompts. We discover that such self-awareness exhibits an emergent, phase-transition-like behavior. Contribution/Results: Leveraging this insight, we design two novel defense strategies: trigger-aware fine-tuning and self-attributed distillation. Evaluated across five representative backdoor attack scenarios, our method consistently outperforms six state-of-the-art baselines, achieving substantial gains in both backdoor detection accuracy and robustness. The implementation is publicly available.

Technology Category

Application Category

📝 Abstract

Large Language Models (LLMs) can acquire deceptive behaviors through backdoor attacks, where the model executes prohibited actions whenever secret triggers appear in the input. Existing safety training methods largely fail to address this vulnerability, due to the inherent difficulty of uncovering hidden triggers implanted in the model. Motivated by recent findings on LLMs' situational awareness, we propose a novel post-training framework that cultivates self-awareness of backdoor risks and enables models to articulate implanted triggers even when they are absent from the prompt. At its core, our approach introduces an inversion-inspired reinforcement learning framework that encourages models to introspectively reason about their own behaviors and reverse-engineer the triggers responsible for misaligned outputs. Guided by curated reward signals, this process transforms a poisoned model into one capable of precisely identifying its implanted trigger. Surprisingly, we observe that such backdoor self-awareness emerges abruptly within a short training window, resembling a phase transition in capability. Building on this emergent property, we further present two complementary defense strategies for mitigating and detecting backdoor threats. Experiments on five backdoor attacks, compared against six baseline methods, demonstrate that our approach has strong potential to improve the robustness of LLMs against backdoor risks. The code is available at LLM Backdoor Self-Awareness.

Problem

Research questions and friction points this paper is trying to address.

Addressing backdoor attacks that implant deceptive behaviors in LLMs

Developing self-awareness in models to identify hidden trigger patterns

Creating defense strategies to mitigate and detect backdoor threats

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning framework for trigger reverse-engineering

Cultivates self-awareness of backdoor risks in models

Enables models to identify implanted triggers introspectively

🔎 Similar Papers

No similar papers found.