🤖 AI Summary
To address the poor generalization of fixed class labels in visual reprogramming (VR), this work proposes an attribute-driven VR framework. It replaces manually defined class names with fine-grained descriptive (DesAttrs) and distinctive (DistAttrs) textual attributes, enabling sample-adaptive semantic guidance. A k-nearest-neighbor dynamic attribute retrieval strategy reduces intra-class variance and enhances inter-class separability. Integrated into CLIP's zero-shot transfer paradigm, the method iteratively optimizes learnable visual perturbations and supports both ViT and ResNet backbones. Evaluated on 12 downstream image classification tasks, it significantly outperforms existing baselines. This is the first VR approach to explicitly model attribute-level semantics, improving cross-domain generalization across diverse CLIP architectures.
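The k-nearest attribute retrieval described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `knn_attr_scores` and the toy embeddings are hypothetical, and a frozen encoder (e.g. CLIP's image and text towers) is assumed to have produced the vectors. Each class is scored by the mean cosine similarity between the image embedding and its k most similar attribute embeddings.

```python
import numpy as np

def knn_attr_scores(img_emb, attr_embs_per_class, k=3):
    """Score each class by the mean cosine similarity of an image
    embedding to its k nearest (most similar) attribute embeddings.

    img_emb: (d,) image embedding (e.g. from a frozen image encoder)
    attr_embs_per_class: list of (n_c, d) attribute text embeddings per class
    """
    img = img_emb / np.linalg.norm(img_emb)
    scores = []
    for attrs in attr_embs_per_class:
        a = attrs / np.linalg.norm(attrs, axis=1, keepdims=True)
        sims = a @ img                 # cosine similarity to each attribute
        topk = np.sort(sims)[-k:]      # keep the k most similar attributes
        scores.append(topk.mean())     # per-image, per-class attribute score
    return np.array(scores)
```

Because the top-k set is recomputed per image, two images of the same class can be guided by different attributes, which is the sample-specific behavior the summary refers to.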
📝 Abstract
Visual reprogramming (VR) reuses pre-trained vision models for downstream image classification tasks by adding trainable noise patterns to inputs. When applied to vision-language models (e.g., CLIP), existing VR approaches follow the same pipeline used for vision-only models (e.g., ResNet, ViT), where ground-truth class labels are inserted into fixed text templates to guide the optimization of VR patterns. This label-based approach, however, overlooks the rich, diverse attribute-guided textual representations that CLIP can exploit, which may lead to misclassification. In this paper, we propose Attribute-based Visual Reprogramming (AttrVR) for CLIP, utilizing descriptive attributes (DesAttrs) and distinctive attributes (DistAttrs), which respectively represent common and unique feature descriptions for different classes. Moreover, as images of the same class may reflect different attributes after VR, AttrVR iteratively refines patterns using the $k$-nearest DesAttrs and DistAttrs for each image sample, enabling more dynamic and sample-specific optimization. Theoretically, AttrVR is shown to reduce intra-class variance and increase inter-class separation. Empirically, it achieves superior performance on 12 downstream tasks for both ViT-based and ResNet-based CLIP. The success of AttrVR facilitates more effective transfer of VR from unimodal vision models to vision-language models. Our code is available at https://github.com/tmlr-group/AttrVR.
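The iterative pattern refinement in the abstract can be illustrated with a minimal sketch. Here a frozen random linear map `W` stands in for CLIP's image encoder, and `target` stands in for the embedding of the attributes retrieved for this image; both are hypothetical simplifications (the actual method back-propagates through CLIP and re-retrieves the $k$-nearest DesAttrs/DistAttrs as the pattern evolves). Only the input-space pattern `delta` is trained; the encoder stays fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_emb = 16, 8
W = rng.normal(size=(d_emb, d_in))   # frozen stand-in for the image encoder
x = rng.normal(size=d_in)            # one fixed input image (flattened)
target = rng.normal(size=d_emb)      # embedding of the matched attributes

delta = np.zeros(d_in)               # trainable VR pattern added to the input
lr = 0.005
for _ in range(1000):
    err = W @ (x + delta) - target   # residual in embedding space
    grad = 2 * W.T @ err             # gradient of ||W(x+delta) - target||^2
    delta -= lr * grad               # gradient step on the pattern only

# the reprogrammed input x + delta now maps close to the target embedding
```

The design choice mirrors VR's premise: since the pre-trained model is never updated, all task adaptation is pushed into the additive input pattern.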