🤖 AI Summary
Weak transferability in black-box adversarial attacks stems from existing methods ignoring architectural disparities between source and target models. To address this, we propose Inverse Knowledge Distillation (IKD), the first approach to reverse the knowledge distillation paradigm for gradient-based attacks. IKD adds a distillation-style loss to standard attack frameworks (e.g., PGD, MI-FGSM), which acts as a model-agnostic regularizer at the gradient level and mitigates overfitting to the source model. IKD also combines gradient diversity constraints across multiple surrogate models with ensemble-based optimization. Evaluated on ImageNet, IKD improves cross-architecture transferability, raising attack success rate by 12.7% on average, and establishes new state-of-the-art performance across 12 mainstream heterogeneous models.
📝 Abstract
In recent years, the rapid development of deep neural networks has brought increased attention to the security and robustness of these models. While existing adversarial attack algorithms have demonstrated success in improving adversarial transferability, their performance remains suboptimal due to a lack of consideration for the discrepancies between target and source models. To address this limitation, we propose a novel method, Inverse Knowledge Distillation (IKD), designed to enhance adversarial transferability effectively. IKD introduces a distillation-inspired loss function that seamlessly integrates with gradient-based attack methods, promoting diversity in attack gradients and mitigating overfitting to specific model architectures. By diversifying gradients, IKD enables the generation of adversarial samples with superior generalization capabilities across different models, significantly enhancing their effectiveness in black-box attack scenarios. Extensive experiments on the ImageNet dataset validate the effectiveness of our approach, demonstrating substantial improvements in the transferability and attack success rates of adversarial samples across a wide range of models.
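To make the idea of adding a distillation-inspired term to a gradient-based attack concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. The abstract does not give the IKD loss, so the extra term used here (a temperature-softened KL divergence between the surrogate's output on the clean input and on the current adversarial input) is an assumed stand-in. The toy linear classifiers, the weights `SURROGATE_A` and `SURROGATE_B`, and the hyperparameters `lam`, `temperature`, `eps`, `alpha`, and `mu` are all hypothetical. The update rule itself is the standard MI-FGSM step with gradients averaged over the surrogates.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def logits_of(weights, x):
    """Logits of a toy linear classifier: one weight row per class."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def cross_entropy(weights, x, label):
    return -math.log(softmax(logits_of(weights, x))[label])

def input_gradient(weights, x, x_clean, label, lam, temperature):
    """Gradient w.r.t. x of CE(x, label) + lam * T^2 * KL(q || p_T(x)).

    q and p_T are the softened outputs on the clean and current input.
    The KL term is an assumed stand-in, not the loss from the paper.
    """
    z = logits_of(weights, x)
    p = softmax(z)
    p_soft = softmax(z, temperature)
    q_soft = softmax(logits_of(weights, x_clean), temperature)
    # Gradient w.r.t. the logits: (p - onehot) for CE, T * (p_T - q) for the KL term.
    g_logits = [
        (p[k] - (1.0 if k == label else 0.0))
        + lam * temperature * (p_soft[k] - q_soft[k])
        for k in range(len(p))
    ]
    # Chain rule through the linear layer: grad_x = W^T g_logits.
    return [
        sum(weights[k][j] * g_logits[k] for k in range(len(weights)))
        for j in range(len(x))
    ]

def mi_fgsm(surrogates, x_clean, label, eps=0.1, alpha=0.02, steps=10,
            mu=1.0, lam=0.5, temperature=4.0):
    """MI-FGSM over an ensemble of surrogates with the extra loss term."""
    x = list(x_clean)
    momentum = [0.0] * len(x)
    for _ in range(steps):
        grads = [input_gradient(w, x, x_clean, label, lam, temperature)
                 for w in surrogates]
        grad = [sum(col) / len(surrogates) for col in zip(*grads)]
        l1 = sum(abs(g) for g in grad) or 1.0  # avoid division by zero
        momentum = [mu * m + g / l1 for m, g in zip(momentum, grad)]
        for j, m in enumerate(momentum):
            step = alpha * ((m > 0) - (m < 0))  # alpha * sign(momentum)
            # Project into the L-infinity ball, then into the valid range [0, 1].
            x[j] = min(max(x[j] + step, x_clean[j] - eps), x_clean[j] + eps)
            x[j] = min(max(x[j], 0.0), 1.0)
    return x

# Hypothetical 3-class, 4-feature surrogates and a clean input of class 0.
SURROGATE_A = [[1.0, -0.5, 0.3, 0.0], [-0.2, 0.8, -0.1, 0.4], [0.1, 0.1, 0.6, -0.7]]
SURROGATE_B = [[0.9, -0.3, 0.2, 0.1], [-0.1, 0.7, 0.0, 0.3], [0.0, 0.2, 0.5, -0.6]]
x0 = [0.8, 0.2, 0.5, 0.3]

x_adv = mi_fgsm([SURROGATE_A, SURROGATE_B], x0, label=0)
print("loss before:", cross_entropy(SURROGATE_A, x0, 0))
print("loss after: ", cross_entropy(SURROGATE_A, x_adv, 0))
```

In a real setting the surrogates would be deep networks and the input gradient would come from automatic differentiation rather than the closed form used here. Setting `lam=0` recovers plain ensemble MI-FGSM, which makes it easy to compare the effect of the extra term.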