ConfGuard: A Simple and Effective Backdoor Detection for Large Language Models

📅 2025-08-02
🤖 AI Summary
Existing backdoor detection methods for large language models (LLMs) suffer from poor compatibility with autoregressive generation, prohibitively large output spaces, and high inference latency. To address these challenges, this paper proposes a lightweight, retraining-free, real-time detection method. The key insight is the "sequence lock" phenomenon exhibited by backdoored models during target-sequence generation, characterized by abnormally high and highly consistent token-level confidence scores. Leveraging this behavioral signature, the authors design a sliding-window dynamic monitoring mechanism that tracks token confidence in real time, enabling fine-grained detection via discriminative output-space patterns. Extensive experiments demonstrate near-perfect true positive rates (~100%) across diverse attack scenarios, negligible false positive rates, and virtually zero inference overhead, substantially outperforming state-of-the-art approaches and exhibiting strong practical deployability.

📝 Abstract
Backdoor attacks pose a significant threat to Large Language Models (LLMs), where adversaries can embed hidden triggers to manipulate an LLM's outputs. Most existing defense methods, primarily designed for classification tasks, are ineffective against the autoregressive nature and vast output space of LLMs, and therefore suffer from poor performance and high latency. To address these limitations, we investigate the behavioral discrepancies between benign and backdoored LLMs in output space. We identify a critical phenomenon which we term sequence lock: a backdoored model generates the target sequence with abnormally high and consistent confidence compared to benign generation. Building on this insight, we propose ConfGuard, a lightweight and effective detection method that monitors a sliding window of token confidences to identify sequence lock. Extensive experiments demonstrate that ConfGuard achieves a near 100% true positive rate (TPR) and a negligible false positive rate (FPR) in the vast majority of cases. Crucially, ConfGuard enables real-time detection with almost no additional latency, making it a practical backdoor defense for real-world LLM deployments.
Problem

Research questions and friction points this paper is trying to address.

Detect backdoor attacks in Large Language Models (LLMs)
Address poor performance of existing defense methods for LLMs
Enable real-time backdoor detection without additional latency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Detects backdoors via token confidence monitoring
Identifies sequence lock in model outputs
Enables real-time detection with low latency
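The sliding-window confidence monitoring described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the window size, confidence threshold, and variance threshold are illustrative placeholders rather than the authors' calibrated values, and `sequence_lock_detector` is a hypothetical helper name.

```python
from collections import deque

def sequence_lock_detector(confidences, window=8,
                           conf_threshold=0.95, var_threshold=1e-3):
    """Flag a 'sequence lock': a run of generated tokens whose
    confidences are abnormally high (mean above conf_threshold)
    and abnormally consistent (variance below var_threshold).

    `confidences` is the per-token confidence of each generated
    token (e.g. the softmax probability of the sampled token),
    consumed in generation order. Returns the step index at which
    the lock is first detected, or None if generation looks benign.
    """
    buf = deque(maxlen=window)  # sliding window of recent confidences
    for step, c in enumerate(confidences):
        buf.append(c)
        if len(buf) < window:
            continue  # not enough history yet
        mean = sum(buf) / window
        var = sum((x - mean) ** 2 for x in buf) / window
        if mean >= conf_threshold and var <= var_threshold:
            return step  # abort/flag generation here in a deployment
    return None
```

In a deployment, the check would run inside the decoding loop so a flagged generation can be stopped immediately, which is consistent with the near-zero latency the paper reports: the per-step cost is a constant-size window update.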
Zihan Wang
University of Electronic Science and Technology of China

Rui Zhang
University of Electronic Science and Technology of China

Hongwei Li
University of Electronic Science and Technology of China

Wenshu Fan
University of Electronic Science and Technology of China

Wenbo Jiang
University of Electronic Science and Technology of China
AI security · Backdoor attack

Qingchuan Zhao
City University of Hong Kong
Mobile security · IoT security · Program Analysis · Reverse Engineering

Guowen Xu
Professor, SMIEEE, University of Electronic Science and Technology of China
Applied Cryptography · Computer Security · AI Security and Privacy