🤖 AI Summary
To address the performance degradation of parameter-efficient fine-tuning (PEFT) for multi-label image recognition (MLR) in vision-language contrastive models (e.g., CLIP), caused by modality gaps between image and text representations, this paper proposes T2I-PAL—a novel text-to-image generation–assisted PEFT paradigm. T2I-PAL leverages text-to-image diffusion models (e.g., Stable Diffusion) to synthesize class-relevant, high-fidelity images that bridge the cross-modal gap. It further incorporates class-aware heatmap modeling and learnable prototype embeddings to enhance local visual representation learning. The framework jointly optimizes prompt tuning and lightweight adapters, requiring no full semantic annotations, preserving the original CLIP architecture, and enabling plug-and-play deployment. Extensive experiments on MS-COCO, VOC2007, and NUS-WIDE demonstrate an average mAP improvement of 3.47%, significantly outperforming existing state-of-the-art methods.
📝 Abstract
Benefiting from image-text contrastive learning, pre-trained vision-language models such as CLIP make it possible to directly leverage texts as images (TaI) for parameter-efficient fine-tuning (PEFT). While CLIP encourages image features to be similar to their corresponding text features, the modality gap remains a nontrivial issue and limits the image recognition performance of TaI. Using multi-label image recognition (MLR) as an example, we present a novel method, called T2I-PAL, to tackle the modality gap issue when using only text captions for PEFT. The core design of T2I-PAL is to leverage pre-trained text-to-image generation models to generate photo-realistic and diverse images from text captions, thereby reducing the modality gap. To further enhance MLR, T2I-PAL incorporates a class-wise heatmap and learnable prototypes. This aggregates local similarities, making the representation of local visual features more robust and informative for multi-label recognition. For better PEFT, we further combine both prompt tuning and adapter learning to enhance classification performance. T2I-PAL offers significant advantages: it eliminates the need for fully semantically annotated training images, thereby reducing the manual annotation workload, and it preserves the intrinsic architecture of the CLIP model, allowing for seamless integration with any existing CLIP framework. Extensive experiments on multiple benchmarks, including MS-COCO, VOC2007, and NUS-WIDE, show that our T2I-PAL can boost recognition performance by an average of 3.47% over the top-ranked state-of-the-art methods.
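As a rough illustration of the class-wise heatmap and prototype idea described in the abstract, the sketch below aggregates local patch-prototype similarities into per-class scores via softmax-weighted pooling. This is a minimal NumPy approximation with hypothetical shapes and a hypothetical temperature parameter; the actual T2I-PAL formulation and the shape of its learnable prototypes may differ.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Cosine similarity requires unit-norm vectors.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def prototype_heatmap_logits(patch_feats, prototypes, tau=0.07):
    """Aggregate local patch-to-prototype similarities into class logits.

    patch_feats: (N, D) local visual features (e.g., CLIP patch tokens).
    prototypes:  (C, D) class prototypes (learnable in practice;
                 plain arrays in this sketch).
    Returns (heatmap, logits): the (C, N) class-wise similarity heatmap
    and (C,) per-class scores.
    """
    p = l2_normalize(patch_feats)
    q = l2_normalize(prototypes)
    heatmap = q @ p.T                              # (C, N) cosine similarities
    weights = np.exp(heatmap / tau)
    weights /= weights.sum(axis=1, keepdims=True)  # per-class attention over patches
    logits = (weights * heatmap).sum(axis=1)       # softmax-weighted pooling
    return heatmap, logits

# Toy usage: 49 patches (7x7 grid), 16-dim features, 5 classes.
rng = np.random.default_rng(0)
heatmap, logits = prototype_heatmap_logits(
    rng.normal(size=(49, 16)), rng.normal(size=(5, 16)))
```

Softmax-weighted pooling is one common way to emphasize the most class-relevant local regions while remaining differentiable; the paper's exact aggregation rule may be different.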