🤖 AI Summary
In commercial black-box settings, the cross-model transferability of adversarial attacks is often limited by architectural and training disparities. Method: This work investigates how the transfer success rate of black-box adversarial attacks scales with the number of surrogate models, proposing a unified multi-model ensemble transfer attack framework compatible with diverse methods, including PGD and MI-FGSM, and evaluating it across heterogeneous targets: standard CNNs, robustly trained models, and closed-source multimodal large language models (e.g., GPT-4o). Contribution/Results: The work empirically discovers and validates a clear positive scaling law: increasing the ensemble size significantly improves transfer success rates, exceeding 90% on both standard models and GPT-4o, while also enhancing the semantic fidelity and interpretability of the generated adversarial perturbations. This establishes a first theoretical and practical foundation for scalable, high-transferability black-box attacks.
📝 Abstract
Adversarial examples often exhibit good cross-model transferability, enabling attacks on black-box models despite limited information about their architectures and parameters, which makes them highly threatening in commercial black-box scenarios. Model ensembling, i.e., attacking multiple surrogate models simultaneously, is an effective strategy for improving the transferability of adversarial examples. However, since prior studies typically use only a few models in the ensemble, it remains an open question whether scaling up the number of models can further improve black-box attacks. Inspired by the scaling laws of large foundation models, we investigate the scaling laws of black-box adversarial attacks in this work. Through theoretical analysis and empirical evaluation, we establish clear scaling laws showing that using more surrogate models enhances adversarial transferability. Comprehensive experiments verify these claims on standard image classifiers, diverse defended models, and multimodal large language models using various adversarial attack methods. Specifically, by exploiting the scaling law, we achieve transfer attack success rates above 90% even on proprietary models such as GPT-4o. Further visualization indicates that a scaling law also holds for the interpretability and semantics of the adversarial perturbations.
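The ensemble strategy the abstract describes, averaging the loss gradients of several surrogate models inside a momentum-based attack such as MI-FGSM, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the "surrogate models" here are hypothetical toy linear scorers with analytic gradients, standing in for real networks one would backpropagate through.

```python
import numpy as np

def ensemble_mifgsm(x, surrogate_grads, steps=10, eps=0.1, mu=1.0):
    """MI-FGSM with an ensemble-averaged gradient (illustrative sketch).

    surrogate_grads: list of callables, each returning the loss gradient
    of one surrogate model w.r.t. the current adversarial input.
    """
    alpha = eps / steps           # per-step budget so total stays within eps
    g = np.zeros_like(x)          # accumulated momentum
    x_adv = x.copy()
    for _ in range(steps):
        # Average the surrogates' gradients at the current point; more
        # surrogates -> a more transferable common ascent direction.
        avg = np.mean([grad_fn(x_adv) for grad_fn in surrogate_grads], axis=0)
        # MI-FGSM momentum update with L1 normalisation of the gradient
        g = mu * g + avg / (np.abs(avg).sum() + 1e-12)
        # Signed ascent step, projected back into the eps-ball around x
        x_adv = np.clip(x_adv + alpha * np.sign(g), x - eps, x + eps)
    return x_adv

# Usage with three toy surrogates whose gradients share a common bias,
# mimicking distinct models that agree on a transferable direction.
rng = np.random.default_rng(0)
x = rng.normal(size=8)
ws = [rng.normal(size=8) + 2.0 for _ in range(3)]
x_adv = ensemble_mifgsm(x, [lambda z, w=w: w for w in ws])
```

Scaling the ensemble, per the paper's claim, amounts to growing the `surrogate_grads` list; the update rule itself is unchanged, which is why the framework is compatible with PGD (set `mu=0`) and MI-FGSM alike.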