DISTIL: Data-Free Inversion of Suspicious Trojan Inputs via Latent Diffusion

📅 2025-07-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Deep neural networks are vulnerable to backdoor attacks, yet existing trigger inversion methods often require large amounts of clean data or impose strong assumptions on trigger morphology. This paper proposes a zero-shot, data-agnostic trigger inversion framework that leverages latent diffusion models: under gradient guidance from the target classifier, it iteratively optimizes a latent trigger pattern without accessing training data or presupposing trigger shape. By incorporating diffusion priors to constrain the search space, the method significantly enhances both inversion accuracy and interpretability. On the BackdoorBench benchmark, it achieves a 7.1% improvement in detection accuracy; for backdoor scanning in object detection models, it yields a 9.4% gain—outperforming state-of-the-art approaches. To our knowledge, this is the first work to systematically integrate diffusion generative models into backdoor trigger inversion.

📝 Abstract
Deep neural networks have demonstrated remarkable success across numerous tasks, yet they remain vulnerable to Trojan (backdoor) attacks, raising serious concerns about their safety in real-world mission-critical applications. A common countermeasure is trigger inversion -- reconstructing malicious "shortcut" patterns (triggers) inserted by an adversary during training. Current trigger-inversion methods typically search the full pixel space under specific assumptions but offer no assurances that the estimated trigger is more than an adversarial perturbation that flips the model output. Here, we propose a data-free, zero-shot trigger-inversion strategy that restricts the search space while avoiding strong assumptions on trigger appearance. Specifically, we incorporate a diffusion-based generator guided by the target classifier; through iterative generation, we produce candidate triggers that align with the internal representations the model relies on for malicious behavior. Empirical evaluations, both quantitative and qualitative, show that our approach reconstructs triggers that effectively distinguish clean versus Trojaned models. DISTIL surpasses alternative methods by high margins, achieving up to 7.1% higher accuracy on the BackdoorBench dataset and a 9.4% improvement on trojaned object detection model scanning, offering a promising new direction for reliable backdoor defense without reliance on extensive data or strong prior assumptions about triggers. The code is available at https://github.com/AdaptiveMotorControlLab/DISTIL.
Problem

Research questions and friction points this paper is trying to address.

Detect Trojan triggers in neural networks without data
Improve trigger inversion accuracy without strong assumptions
Enhance backdoor defense reliability in mission-critical applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Data-free trigger inversion via latent diffusion
Diffusion-based generator guided by classifier
Iterative generation aligns with malicious representations
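The gradient-guided, iterative inversion loop sketched in the bullets above can be illustrated in miniature. This is a hypothetical sketch, not the paper's implementation: a small frozen linear decoder stands in for the latent diffusion prior, a toy classifier stands in for the model under inspection, and all names (`decoder`, `classifier`, `target_label`) are assumptions for illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins (hypothetical): a frozen decoder acting as a generative
# prior surrogate, and a frozen classifier being scanned for a backdoor.
decoder = nn.Sequential(nn.Linear(16, 3 * 8 * 8), nn.Tanh())
classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))
for p in list(decoder.parameters()) + list(classifier.parameters()):
    p.requires_grad_(False)

target_label = torch.tensor([3])              # hypothesized backdoor target class
z = torch.zeros(1, 16, requires_grad=True)    # latent trigger code being optimized
opt = torch.optim.Adam([z], lr=0.1)

for step in range(200):
    opt.zero_grad()
    trigger = decoder(z).view(1, 3, 8, 8)     # decode latent -> image-space candidate trigger
    logits = classifier(trigger)
    # Gradient guidance from the target classifier: drive the decoded
    # trigger toward the target class, while a small norm penalty keeps
    # the search inside the region the prior models well.
    loss = nn.functional.cross_entropy(logits, target_label) + 1e-3 * z.norm()
    loss.backward()
    opt.step()
```

Because the search happens in the decoder's latent space rather than raw pixel space, the recovered pattern is constrained to the generator's output manifold, which is the core idea distinguishing this from unconstrained adversarial-perturbation search.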
Hossein Mirzaei
PhD student @ Mackenzie Mathis Lab
Machine Learning
Zeinab Taghavi
École Polytechnique Fédérale de Lausanne (EPFL)
Sepehr Rezaee
École Polytechnique Fédérale de Lausanne (EPFL)
Masoud Hadi
École Polytechnique Fédérale de Lausanne (EPFL)
Moein Madadi
École Polytechnique Fédérale de Lausanne (EPFL)
Mackenzie W. Mathis
École Polytechnique Fédérale de Lausanne (EPFL)
Systems Neuroscience · Sensorimotor Control · Computer Vision · Machine Learning