🤖 AI Summary
To address the performance degradation of parameter-efficient fine-tuning (PEFT) for multi-label image recognition (MLR) in vision-language contrastive models (e.g., CLIP), caused by modality gaps between image and text representations, this paper proposes T2I-PAL—a novel text-to-image generation–assisted PEFT paradigm. T2I-PAL leverages text-to-image diffusion models (e.g., Stable Diffusion) to synthesize class-relevant, high-fidelity images that bridge the cross-modal gap. It further incorporates class-aware heatmap modeling and learnable prototype embeddings to enhance local visual representation learning. The framework jointly optimizes prompt tuning and lightweight adapters, requiring no full semantic annotations, preserving the original CLIP architecture, and enabling plug-and-play deployment. Extensive experiments on MS-COCO, VOC2007, and NUS-WIDE demonstrate an average mAP improvement of 3.47%, significantly outperforming existing state-of-the-art methods.
📝 Abstract
Benefiting from image-text contrastive learning, pre-trained vision-language models such as CLIP make it possible to directly leverage texts as images (TaI) for parameter-efficient fine-tuning (PEFT). While CLIP encourages image features to be similar to their corresponding text features, the modality gap remains a nontrivial issue and limits the image recognition performance of TaI. Using multi-label image recognition (MLR) as an example, we present a novel method, called T2I-PAL, to tackle the modality gap issue when using only text captions for PEFT. The core design of T2I-PAL is to leverage pre-trained text-to-image generation models to generate photo-realistic and diverse images from text captions, thereby reducing the modality gap. To further enhance MLR, T2I-PAL incorporates a class-wise heatmap and learnable prototypes. This aggregates local similarities, making the representation of local visual features more robust and informative for multi-label recognition. For better PEFT, we further combine both prompt tuning and adapter learning to enhance classification performance. T2I-PAL offers significant advantages: it eliminates the need for fully semantically annotated training images, thereby reducing the manual annotation workload, and it preserves the intrinsic architecture of the CLIP model, allowing for seamless integration with any existing CLIP framework. Extensive experiments on multiple benchmarks, including MS-COCO, VOC2007, and NUS-WIDE, show that our T2I-PAL can boost recognition performance by an average of 3.47% over the top-ranked state-of-the-art methods.
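As a rough illustration of the class-wise heatmap and prototype idea described in the abstract, the sketch below aggregates local patch-prototype similarities into per-class scores via softmax-weighted pooling. This is a minimal NumPy approximation with hypothetical shapes and a hypothetical temperature parameter; the actual T2I-PAL formulation and the shape of its learnable prototypes may differ.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Cosine similarity requires unit-norm vectors.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def prototype_heatmap_logits(patch_feats, prototypes, tau=0.07):
    """Aggregate local patch-to-prototype similarities into class logits.

    patch_feats: (N, D) local visual features (e.g., CLIP patch tokens).
    prototypes:  (C, D) class prototypes (learnable in practice;
                 plain arrays in this sketch).
    Returns (heatmap, logits): the (C, N) class-wise similarity heatmap
    and (C,) per-class scores.
    """
    p = l2_normalize(patch_feats)
    q = l2_normalize(prototypes)
    heatmap = q @ p.T                              # (C, N) cosine similarities
    weights = np.exp(heatmap / tau)
    weights /= weights.sum(axis=1, keepdims=True)  # per-class attention over patches
    logits = (weights * heatmap).sum(axis=1)       # softmax-weighted pooling
    return heatmap, logits

# Toy usage: 49 patches (7x7 grid), 16-dim features, 5 classes.
rng = np.random.default_rng(0)
heatmap, logits = prototype_heatmap_logits(
    rng.normal(size=(49, 16)), rng.normal(size=(5, 16)))
```

Softmax-weighted pooling is one common way to emphasize the most class-relevant local regions while remaining differentiable; the paper's exact aggregation rule may be different.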