🤖 AI Summary
Problem: Large pre-trained models served as black-box APIs expose neither their parameters nor their architecture, and adapting them with backpropagation imposes severe GPU memory demands. Method: We propose BlackVIP, a black-box visual prompting method that requires no internal model information. It couples an input-dependent prompt generation mechanism with SPSA-GC gradient estimation, enabling memory-efficient, backpropagation-free adaptation. We further introduce BlackVIP-SE, a lightweight variant, and establish, for the first time, a theoretical connection between visual prompting and the certified robustness of randomized smoothing, formally explaining the improved adversarial robustness. Contribution/Results: Across 19 cross-domain datasets, BlackVIP substantially reduces GPU memory consumption and computational overhead while improving out-of-distribution generalization and adversarial robustness, all without access to model internals or gradients.
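The gradient-free optimization at the core of the method can be sketched as follows, on a toy quadratic loss standing in for the black-box model's objective. This is a minimal illustration, not the paper's implementation: the perturbation scale `c`, the learning rate, and the Nesterov-style look-ahead used here as the "gradient correction" are assumptions, and the paper's exact schedules and update rule may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def spsa_gradient(loss_fn, theta, c=0.01):
    """Two-point SPSA estimate: only two forward queries of the
    black-box loss are needed, and no backpropagation through the model."""
    delta = rng.choice([-1.0, 1.0], size=theta.shape)  # Rademacher perturbation
    diff = loss_fn(theta + c * delta) - loss_fn(theta - c * delta)
    return diff / (2.0 * c) * delta

# Toy "black-box" objective standing in for the model's loss on prompted inputs.
target = np.array([1.0, -2.0, 0.5])
def loss(theta):
    return float(np.sum((theta - target) ** 2))

theta = np.zeros(3)      # stands in for the prompt generator's parameters
momentum = np.zeros(3)
beta, lr = 0.9, 0.05
for _ in range(500):
    # Look-ahead (Nesterov-style) evaluation point as the gradient correction;
    # this is an assumption about SPSA-GC, not the paper's verbatim rule.
    g = spsa_gradient(loss, theta + beta * momentum)
    momentum = beta * momentum - lr * g
    theta = theta + momentum
```

Because each step needs only two loss evaluations regardless of the parameter dimension, no intermediate activations of the served model ever need to be cached, which is the source of the memory savings.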
📝 Abstract
With the surge of large-scale pre-trained models (PTMs), parameter-efficient transfer learning (PETL) has garnered significant attention. While promising, PETL methods commonly rely on two optimistic assumptions: 1) full access to the parameters of a PTM, and 2) sufficient memory to cache all intermediate activations for gradient computation. In most real-world applications, however, PTMs are served as black-box APIs or proprietary software without parameter accessibility, and the large memory requirements of modern PTMs are hard to meet. This work proposes black-box visual prompting (BlackVIP), which efficiently adapts PTMs without knowledge of their architectures or parameters. BlackVIP has two components: 1) the Coordinator and 2) simultaneous perturbation stochastic approximation with gradient correction (SPSA-GC). The Coordinator designs input-dependent visual prompts that allow the target PTM to adapt in the wild. SPSA-GC efficiently estimates the gradient of the black-box PTM's objective to update the Coordinator. We also introduce a variant, BlackVIP-SE, which significantly reduces BlackVIP's runtime and computational cost. Extensive experiments on 19 datasets demonstrate that BlackVIPs enable robust adaptation to diverse domains and tasks with minimal memory requirements. Finally, we provide a theoretical analysis of the generalization of visual prompting methods by establishing their connection to the certified robustness of randomized smoothing, along with empirical support for the improved robustness.
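The randomized-smoothing connection the abstract invokes refers to the standard construction of classifying Gaussian-perturbed copies of an input by majority vote; visual prompts are likewise additive input perturbations, which is the bridge the analysis draws. Below is a minimal sketch of that smoothing construction, not the paper's analysis: `predict_fn`, `sigma`, `n`, and `num_classes` are illustrative placeholders for a query-only classifier API and its noise parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

def smoothed_predict(predict_fn, image, sigma=0.25, n=100, num_classes=10):
    """Randomized smoothing: classify n Gaussian-perturbed copies of the
    input and return the majority-vote class. Only forward queries of
    predict_fn are used, matching the black-box setting."""
    counts = np.zeros(num_classes, dtype=int)
    for _ in range(n):
        noisy = image + sigma * rng.standard_normal(image.shape)
        counts[predict_fn(noisy)] += 1
    return int(np.argmax(counts))
```

In the certified-robustness literature, the margin of this vote yields a provable perturbation radius around the input; the abstract's claim is that prompted classifiers inherit a related robustness benefit.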