FLARE: Towards Universal Dataset Purification against Backdoor Attacks

📅 2024-11-29
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing dataset purification methods implicitly assume that backdoor patterns are easier to learn than benign features, a premise that fails under complex backdoor attacks such as all-to-all (A2A) and untargeted (UT) attacks. This work proposes FLARE, a general-purpose purification framework that models how poisoned and clean samples separate across the input-output space and multiple hidden-layer spaces, using cross-layer aggregation of abnormal activations and adaptive subspace selection. Its contributions are threefold: (i) it identifies this latent assumption and shows where it breaks down; (ii) it introduces a stability-driven two-cluster mechanism that enables a unified defense against 22 attack variants, including all-to-one (A2O), A2A, and UT attacks; and (iii) evaluations on multiple benchmark datasets demonstrate its effectiveness and its robustness to adaptive attacks.

📝 Abstract
Deep neural networks (DNNs) are susceptible to backdoor attacks, where adversaries poison datasets with adversary-specified triggers to implant hidden backdoors, enabling malicious manipulation of model predictions. Dataset purification serves as a proactive defense by removing malicious training samples to prevent backdoor injection at its source. We first reveal that the current advanced purification methods rely on a latent assumption that the backdoor connections between triggers and target labels in backdoor attacks are simpler to learn than the benign features. We demonstrate that this assumption, however, does not always hold, especially in all-to-all (A2A) and untargeted (UT) attacks. As a result, purification methods that analyze the separation between the poisoned and benign samples in the input-output space or the final hidden layer space are less effective. We observe that this separability is not confined to a single layer but varies across different hidden layers. Motivated by this understanding, we propose FLARE, a universal purification method to counter various backdoor attacks. FLARE aggregates abnormal activations from all hidden layers to construct representations for clustering. To enhance separation, FLARE develops an adaptive subspace selection algorithm to isolate the optimal space for dividing an entire dataset into two clusters. FLARE assesses the stability of each cluster and identifies the cluster with higher stability as poisoned. Extensive evaluations on benchmark datasets demonstrate the effectiveness of FLARE against 22 representative backdoor attacks, including all-to-one (A2O), all-to-all (A2A), and untargeted (UT) attacks, and its robustness to adaptive attacks.
Problem

Research questions and friction points this paper is trying to address.

Detects and removes poisoned samples in datasets to prevent backdoor attacks
Addresses limitations of current methods in handling A2A and UT attacks
Proposes FLARE for universal purification using multi-layer activation analysis
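To make the three attack settings named above concrete, the following minimal sketch shows how an attacker might relabel a triggered sample in each case. The function name `poisoned_label`, its parameters, the shift-by-one rule for A2A, and the random choice for UT are illustrative assumptions based on common conventions in the backdoor literature; they are not taken from the FLARE paper.

```python
import random


def poisoned_label(true_label, num_classes, mode, target=0, rng=None):
    """Return the label an attacker might assign to a triggered sample.

    Hypothetical helper: the mappings below are common conventions,
    not definitions quoted from the paper.
    """
    if mode == "A2O":
        # All-to-one: every poisoned sample is mapped to a single target label.
        return target
    if mode == "A2A":
        # All-to-all: a frequently used rule shifts each label by one, y -> (y + 1) mod K.
        return (true_label + 1) % num_classes
    if mode == "UT":
        # Untargeted: modeled here as any label other than the true one.
        rng = rng or random.Random(0)
        choices = [c for c in range(num_classes) if c != true_label]
        return rng.choice(choices)
    raise ValueError(f"unknown mode: {mode}")
```

Under these assumptions, A2O ties the trigger to one fixed label, whereas A2A and UT spread poisoned samples over many labels. That difference is one intuition for why the trigger-to-label connection need not be simpler to learn than benign features in the A2A and UT settings.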
Innovation

Methods, ideas, or system contributions that make the work stand out.

Aggregates abnormal activations from all hidden layers
Uses adaptive subspace selection for optimal clustering
Identifies poisoned clusters by assessing stability
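The three steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: per-feature z-scoring stands in for the paper's abnormal-activation extraction, a plain 2-means with farthest-point initialization stands in for the clustering step, adaptive subspace selection is omitted, and "stability" is approximated by cluster tightness (mean distance to the centroid). All function names are hypothetical.

```python
import math


def aggregate_layers(per_layer_acts):
    """Concatenate standardized activations from every layer, per sample.

    per_layer_acts: list over layers; each entry holds one feature list
    per sample. Z-scoring is an assumed stand-in for the paper's method.
    """
    n = len(per_layer_acts[0])
    reps = [[] for _ in range(n)]
    for layer in per_layer_acts:
        for j in range(len(layer[0])):
            col = [row[j] for row in layer]
            mean = sum(col) / n
            std = math.sqrt(sum((v - mean) ** 2 for v in col) / n) or 1.0
            for i in range(n):
                reps[i].append((col[i] - mean) / std)
    return reps


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of vectors."""
    return [sum(c) / len(points) for c in zip(*points)]


def two_means(reps, iters=20):
    """Plain 2-means; returns a 0/1 cluster assignment per sample."""
    # Farthest-point initialization: first sample and the sample farthest from it.
    centers = [reps[0], max(reps, key=lambda r: dist(r, reps[0]))]
    assign = [0] * len(reps)
    for _ in range(iters):
        assign = [0 if dist(r, centers[0]) <= dist(r, centers[1]) else 1
                  for r in reps]
        for k in (0, 1):
            members = [r for r, a in zip(reps, assign) if a == k]
            if members:
                centers[k] = centroid(members)
    return assign


def flag_suspect_cluster(reps, assign):
    """Return the id of the tighter cluster (assumed proxy for 'more stable')."""
    spread = {}
    for k in (0, 1):
        members = [r for r, a in zip(reps, assign) if a == k]
        if not members:
            spread[k] = math.inf  # an empty cluster is never flagged
            continue
        c = centroid(members)
        spread[k] = sum(dist(m, c) for m in members) / len(members)
    return min(spread, key=spread.get)
```

A caller would pass one activation matrix per hidden layer to `aggregate_layers`, cluster the result with `two_means`, and drop the samples in the cluster returned by `flag_suspect_cluster`. Treating the tighter cluster as poisoned only mirrors the summary's statement that the more stable cluster is flagged; the paper's actual stability measure may differ.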
Linshan Hou
School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
Wei Luo
School of Information Technology, Deakin University, Australia
Zhongyun Hua
Professor, Harbin Institute of Technology, Shenzhen
Applied Cryptography, Trustworthy AI, Multimedia Security, Nonlinear Systems and Applications
Songhua Chen
Independent Researcher
Leo Yu Zhang
School of Information and Communication Technology, Griffith University, Southport, Gold Coast, QLD 4215, Australia
Yiming Li
College of Computing and Data Science, Nanyang Technological University, Singapore 639798