🤖 AI Summary
This work addresses the challenge of simultaneously achieving stealthiness and fine-tuning resilience in backdoor attacks against multimodal contrastive learning. The authors propose BadCLIP++, the first framework to jointly model these dual objectives: it enhances stealth through semantically fused QR micro-triggers and target-aligned subset selection, while improving persistence via embedding stabilization (radius contraction, centroid alignment, and curvature control) combined with elastic weight consolidation. Theoretical analysis reveals gradient alignment between clean fine-tuning and backdoor objectives. Experiments demonstrate that with only a 0.3% poisoning rate, the attack achieves a 99.99% success rate in digital settings, remains effective (>99.90%) against 19 state-of-the-art defenses, incurs less than 0.8% drop in clean accuracy, and attains a 65.03% success rate under physical-world conditions.
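The gradient-alignment claim can be sketched with a standard first-order argument (an illustrative reconstruction under assumed notation, not the paper's formal statement): if the clean fine-tuning gradient and the backdoor gradient have a non-negative inner product inside the trust region, then a gradient step on the clean objective cannot increase the backdoor loss to first order.

```latex
% One fine-tuning step on the clean loss L_c with step size \eta:
\theta' = \theta - \eta \, \nabla L_c(\theta)
% First-order expansion of the backdoor loss L_b at \theta':
L_b(\theta') \approx L_b(\theta) - \eta \, \langle \nabla L_c(\theta), \nabla L_b(\theta) \rangle
% Co-directionality in the trust region, \langle \nabla L_c, \nabla L_b \rangle \ge 0, gives
L_b(\theta') \le L_b(\theta) + O(\eta^2)
```

Iterating this bound over fine-tuning steps that stay in the trust region yields a non-increasing upper bound on the backdoor loss, which is consistent with the non-increasing bound on attack-success degradation stated in the abstract.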
📝 Abstract
Research on backdoor attacks against multimodal contrastive learning models faces two key challenges: stealthiness and persistence. Existing methods often fail under strong detection or continuous fine-tuning, largely due to (1) cross-modal inconsistency that exposes trigger patterns and (2) gradient dilution at low poisoning rates that accelerates backdoor forgetting. These coupled causes remain insufficiently modeled and addressed. We propose BadCLIP++, a unified framework that tackles both challenges. For stealthiness, we introduce a semantic-fusion QR micro-trigger that embeds imperceptible patterns near task-relevant regions, preserving clean-data statistics while producing compact trigger distributions. We further apply target-aligned subset selection to strengthen signals at low injection rates. For persistence, we stabilize trigger embeddings via radius shrinkage and centroid alignment, and stabilize model parameters through curvature control and elastic weight consolidation, maintaining solutions within a low-curvature wide basin resistant to fine-tuning. We also provide the first theoretical analysis showing that, within a trust region, gradients from clean fine-tuning and backdoor objectives are co-directional, yielding a non-increasing upper bound on attack success degradation. Experiments demonstrate that with only 0.3% poisoning, BadCLIP++ achieves 99.99% attack success rate (ASR) in digital settings, surpassing baselines by 11.4 points. Across nineteen defenses, ASR remains above 99.90% with less than 0.8% drop in clean accuracy. The method further attains 65.03% success in physical attacks and shows robustness against watermark removal defenses.
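The persistence side of the framework can be illustrated with a minimal PyTorch sketch. This is an assumed reconstruction from the abstract's description, not the authors' released code: the function names (`ewc_penalty`, `embedding_stabilization`), the quadratic Fisher-weighted form of the consolidation term, and the squared-radius/cosine formulation of the stabilization terms are all our illustrative choices.

```python
import torch

def ewc_penalty(model, fisher, anchor_params, lam=1.0):
    """Elastic weight consolidation: a quadratic pull toward anchor weights,
    scaled per parameter by (an estimate of) Fisher information. Keeps the
    poisoned solution near a flat, low-curvature basin during fine-tuning."""
    loss = 0.0
    for name, p in model.named_parameters():
        loss = loss + (fisher[name] * (p - anchor_params[name]) ** 2).sum()
    return 0.5 * lam * loss

def embedding_stabilization(trigger_emb, target_emb):
    """Illustrative stabilization terms (assumed form):
    - radius contraction: shrink the spread of poisoned-image embeddings
      around their centroid, producing a compact trigger distribution;
    - centroid alignment: pull that centroid toward the attacker's
      target-text embedding via cosine distance."""
    centroid = trigger_emb.mean(dim=0)
    radius = ((trigger_emb - centroid) ** 2).sum(dim=1).mean()
    align = 1 - torch.nn.functional.cosine_similarity(centroid, target_emb, dim=0)
    return radius, align
```

In this sketch the attacker would minimize a weighted sum of the contrastive poisoning loss, `radius`, `align`, and `ewc_penalty`; the abstract's curvature-control term (e.g. penalizing sharpness of the loss landscape) is omitted here for brevity.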