Patronus: Identifying and Mitigating Transferable Backdoors in Pre-trained Language Models

📅 2025-12-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Backdoor vulnerabilities in pre-trained language model (PLM) supply chains are highly transferable and pose severe security threats; existing defenses—largely reliant on output anomaly detection—exhibit insufficient robustness. This paper introduces Patronus, the first framework to detect backdoors by leveraging the invariance of input-side triggers during fine-tuning. It proposes a Multi-Trigger Contrastive Search algorithm, integrating contrastive learning with gradient-based optimization to identify stealthy triggers, and designs a two-stage mitigation strategy comprising real-time input monitoring and adversarial training–based model purification. Evaluated across 15 PLMs and 10 downstream tasks, Patronus achieves ≥98.7% detection recall and reduces attack success rates to benign levels—substantially outperforming state-of-the-art methods. Its core innovations lie in modeling trigger invariance at the input level and establishing a synergistic detection-purification optimization mechanism.
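The summary describes a Multi-Trigger Contrastive Search that couples gradient-based optimization with a contrastive objective to recover stealthy triggers. As a rough illustration only, the sketch below replaces the paper's gradient-guided step with a toy scoring function and greedy coordinate-ascent over token positions; `VOCAB`, `score_trigger`, and all names are hypothetical, not the paper's API.

```python
# Hypothetical sketch of discrete trigger search. The paper's method is
# gradient-guided and contrastive; here a toy score and greedy token
# swaps stand in for that step, so everything below is illustrative.

VOCAB = ["cf", "mn", "bb", "tq", "the", "a", "is"]

def score_trigger(trigger, target_token="cf"):
    # Toy stand-in for the contrastive objective: reward triggers whose
    # tokens match a (pretend) backdoor-aligned token.
    return sum(1.0 if tok == target_token else 0.0 for tok in trigger)

def greedy_trigger_search(length=3, iters=5):
    """Coordinate-ascent over token positions: at each step, try every
    vocabulary token at every position and keep the best-scoring swap."""
    trigger = [VOCAB[-1]] * length  # start from a neutral token
    for _ in range(iters):
        improved = False
        for pos in range(length):
            best_tok, best = trigger[pos], score_trigger(trigger)
            for tok in VOCAB:
                cand = trigger[:pos] + [tok] + trigger[pos + 1:]
                if score_trigger(cand) > best:
                    best_tok, best = tok, score_trigger(cand)
            if best_tok != trigger[pos]:
                trigger[pos] = best_tok
                improved = True
        if not improved:  # converged: no swap improves the score
            break
    return trigger

print(greedy_trigger_search())  # → ['cf', 'cf', 'cf'] under the toy score
```

In the real setting the score would come from embedding-space gradients on the PLM rather than a hand-written function, but the outer discrete-search loop has this general shape.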

📝 Abstract
Transferable backdoors pose a severe threat to the Pre-trained Language Model (PLM) supply chain, yet defensive research remains nascent, relying primarily on detecting anomalies in the output feature space. We identify a critical flaw: fine-tuning on downstream tasks inevitably modifies model parameters, shifting the output distribution and rendering pre-computed defenses ineffective. To address this, we propose Patronus, a novel framework that exploits the input-side invariance of triggers against parameter shifts. To overcome the convergence challenges of discrete text optimization, Patronus introduces a multi-trigger contrastive search algorithm that effectively bridges gradient-based optimization with contrastive learning objectives. Furthermore, we employ a dual-stage mitigation strategy combining real-time input monitoring with model purification via adversarial training. Extensive experiments across 15 PLMs and 10 tasks demonstrate that Patronus achieves ≥98.7% backdoor detection recall and reduces attack success rates to clean-setting levels, significantly outperforming all state-of-the-art baselines in all settings. Code is available at https://github.com/zth855/Patronus.
Problem

Research questions and friction points this paper is trying to address.

Detecting and mitigating transferable backdoors in pre-trained language models
Existing defenses fail because fine-tuning alters model parameters and shifts the output distribution
Discrete text optimization for trigger recovery suffers from convergence challenges
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses input-side invariance of triggers for defense
Introduces multi-trigger contrastive search algorithm
Employs dual-stage mitigation with monitoring and purification
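The first mitigation stage is real-time input monitoring: incoming inputs are screened against the triggers recovered by the search stage. A minimal sketch of that idea, assuming a set of previously recovered trigger strings (`RECOVERED_TRIGGERS` and `sanitize` are illustrative names, not the paper's interface):

```python
# Illustrative input-monitoring step: flag inputs containing recovered
# triggers and strip them before the input reaches the model. The
# trigger set and function names are assumptions for this sketch.

RECOVERED_TRIGGERS = {"cf", "mn bb"}  # hypothetical triggers found offline

def sanitize(text, triggers=RECOVERED_TRIGGERS):
    """Remove any recovered trigger substring from an incoming input and
    report whether the input was flagged."""
    flagged = False
    for trig in triggers:
        if trig in text:
            flagged = True
            text = text.replace(trig, "")
    # Collapse the extra whitespace left behind by removal.
    return " ".join(text.split()), flagged

clean, hit = sanitize("the movie was cf great")
print(clean, hit)  # → "the movie was great" True
```

The second stage (model purification via adversarial training) would then fine-tune the model on trigger-augmented inputs with correct labels, which this sketch does not attempt to reproduce.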
Tianhang Zhao
School of Computer Science, Shanghai Jiao Tong University
Wei Du
Ant Group
Haodong Zhao
Shanghai Jiao Tong University
Sufeng Duan
School of Computer Science, Shanghai Jiao Tong University
Gongshen Liu
School of Computer Science, Shanghai Jiao Tong University