Scanning Trojaned Models Using Out-of-Distribution Samples

📅 2025-01-28
📈 Citations: 2
Influential: 0
🤖 AI Summary
This paper addresses the challenging problem of backdoor (trojan) detection in deep neural networks—particularly in adversarially trained models and zero-training-data scenarios. We propose TRODO, a general-purpose, prior-free scanning method that requires neither knowledge of attack strategies nor access to training data or label mappings. Our core insight is to exploit out-of-distribution (OOD) samples: by generating adversarial perturbations on OOD inputs, we induce “blind spots”—abnormal model behaviors where OOD samples are erroneously classified as in-distribution (ID) with high confidence. TRODO models these blind spots via confidence-score analysis and statistical significance testing to identify backdoors. To our knowledge, TRODO establishes the first blind-spot detection paradigm grounded in OOD adversarial shift. Evaluated across diverse architectures, datasets, and trojan variants—including those embedded in adversarially trained models—TRODO achieves an average detection accuracy exceeding 94%, operates without any training data, and demonstrates strong generalization and deployment robustness.

📝 Abstract
Scanning for trojans (backdoors) in deep neural networks is crucial given their significant real-world applications. There has been increasing focus on developing effective, general trojan scanning methods that work across various trojan attacks. Despite advancements, there remains a shortage of methods that perform effectively without preconceived assumptions about the backdoor attack method. Additionally, we have observed that current methods struggle to identify classifiers trojaned using adversarial training. Motivated by these challenges, our study introduces a novel scanning method named TRODO (TROjan scanning by Detection of adversarial shifts in Out-of-distribution samples). TRODO leverages the concept of "blind spots" -- regions where trojaned classifiers erroneously identify out-of-distribution (OOD) samples as in-distribution (ID). We scan for these blind spots by adversarially shifting OOD samples towards the in-distribution. The increased likelihood of perturbed OOD samples being classified as ID serves as a signature for trojan detection. TRODO is both trojan and label-mapping agnostic, and is effective even against adversarially trained trojaned classifiers. It is applicable even in scenarios where training data is absent, demonstrating high accuracy and adaptability across various scenarios and datasets, highlighting its potential as a robust trojan scanning strategy.
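The adversarial-shift signature described in the abstract can be sketched in a few lines: perturb a surrogate OOD sample to maximize the classifier's ID confidence, and compare the confidence before and after the shift. This is a minimal toy illustration only, not the paper's implementation: the linear stand-in classifier, the maximum-softmax-probability score, the `adversarial_shift` routine, and all hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a trained classifier: a linear model
# with logits = W @ x + b (5 classes, 20-dimensional inputs).
W = rng.normal(scale=0.5, size=(5, 20))
b = rng.normal(scale=0.1, size=5)

def softmax(z):
    z = z - z.max()                     # numerical stability
    e = np.exp(z)
    return e / e.sum()

def msp_score(x):
    """Maximum softmax probability -- a common proxy for ID confidence."""
    return softmax(W @ x + b).max()

def adversarial_shift(x, steps=20, alpha=0.05):
    """Gradient-ascent shift of an OOD sample toward high ID confidence
    (illustrative; the paper's actual attack and budget may differ)."""
    x = x.copy()
    for _ in range(steps):
        p = softmax(W @ x + b)
        k = p.argmax()
        # Analytic gradient of p_k with respect to x for a linear model:
        # d p_k / d x = p_k * (W_k - sum_j p_j W_j)
        grad = p[k] * (W[k] - p @ W)
        x = x + alpha * grad
    return x

x_ood = rng.normal(size=20)             # surrogate OOD sample
before = msp_score(x_ood)
after = msp_score(adversarial_shift(x_ood))
print(f"ID confidence before: {before:.3f}, after shift: {after:.3f}")
```

A scanner in this spirit would compare how easily such shifts raise ID confidence across many OOD samples: an unusually large, easy confidence jump is the "blind spot" signature that flags a trojaned model.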
Problem

Research questions and friction points this paper is trying to address.

Deep Learning Security
Trojan Detection
Adversarial Training
Innovation

Methods, ideas, or system contributions that make the work stand out.

TRODO
Blind-Spot Concept
Adversarial Training Robustness
Hossein Mirzaei
PhD student @ Mackenzie Mathis Lab
Machine Learning
Ali Ansari
PhD student at Temple University
NLP, Data Mining, VLM
Bahar Dibaei Nia
Sharif University of Technology
Mojtaba Nafez
Master's Student, Department of Computer Engineering, Sharif University of Technology
Machine Learning
Moein Madadi
Sharif University of Technology
Sepehr Rezaee
Shahid Beheshti University
Z. Taghavi
Sharif University of Technology
Arad Maleki
Sharif University of Technology
Kian Shamsaie
Sharif University of Technology
Mahdi Hajialilue
Sharif University of Technology
Jafar Habibi
Professor, Department of Computer Engineering, Sharif University of Technology
Enterprise Architecture, Software Engineering, Software Architecture, Simulation and Performance
Mohammad Sabokrou
Okinawa Institute of Science and Technology
Machine Learning, Computer Vision, Trustworthy AI
M. H. Rohban
Sharif University of Technology