Text to Image for Multi-Label Image Recognition with Joint Prompt-Adapter Learning

📅 2025-05-26
🏛️ IEEE Transactions on Pattern Analysis and Machine Intelligence
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the performance degradation of parameter-efficient fine-tuning (PEFT) for multi-label image recognition (MLR) in vision-language contrastive models (e.g., CLIP), caused by modality gaps between image and text representations, this paper proposes T2I-PAL—a novel text-to-image generation–assisted PEFT paradigm. T2I-PAL leverages text-to-image diffusion models (e.g., Stable Diffusion) to synthesize class-relevant, high-fidelity images that bridge the cross-modal gap. It further incorporates class-aware heatmap modeling and learnable prototype embeddings to enhance local visual representation learning. The framework jointly optimizes prompt tuning and lightweight adapters, requiring no full semantic annotations, preserving the original CLIP architecture, and enabling plug-and-play deployment. Extensive experiments on MS-COCO, VOC2007, and NUS-WIDE demonstrate an average mAP improvement of 3.47%, significantly outperforming existing state-of-the-art methods.

📝 Abstract
Benefiting from image-text contrastive learning, pre-trained vision-language models, e.g., CLIP, allow texts to be directly leveraged as images (TaI) for parameter-efficient fine-tuning (PEFT). While CLIP can make image features similar to the corresponding text features, the modality gap remains a nontrivial issue and limits the image recognition performance of TaI. Using multi-label image recognition (MLR) as an example, we present a novel method, called T2I-PAL, to tackle the modality gap issue when using only text captions for PEFT. The core design of T2I-PAL is to leverage pre-trained text-to-image generation models to generate photo-realistic and diverse images from text captions, thereby reducing the modality gap. To further enhance MLR, T2I-PAL incorporates a class-wise heatmap and learnable prototypes, which aggregate local similarities and make the representation of local visual features more robust and informative for multi-label recognition. For better PEFT, we further combine prompt tuning and adapter learning to enhance classification performance. T2I-PAL offers significant advantages: it eliminates the need for fully semantically annotated training images, thereby reducing the manual annotation workload, and it preserves the intrinsic mode of the CLIP model, allowing for seamless integration with any existing CLIP framework. Extensive experiments on multiple benchmarks, including MS-COCO, VOC2007, and NUS-WIDE, show that T2I-PAL boosts recognition performance by 3.47% on average over the top-ranked state-of-the-art methods.
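As a rough illustration of the adapter half of the joint prompt-adapter design described above, lightweight adapters over frozen CLIP features are commonly implemented as a residual bottleneck (this sketch follows the CLIP-Adapter style; the function name, shapes, and blend ratio are assumptions, not T2I-PAL's exact formulation):

```python
import numpy as np

def residual_adapter(feat, W1, W2, ratio=0.2):
    """Residual bottleneck adapter over a frozen feature.

    feat:  (D,) frozen CLIP feature vector.
    W1:    (D, D // r) down-projection weights.
    W2:    (D // r, D) up-projection weights.
    ratio: how much the adapted feature contributes (assumed value).
    """
    h = np.maximum(feat @ W1, 0.0)        # down-project + ReLU bottleneck
    out = h @ W2                          # project back to dimension D
    # Residual blend keeps the frozen CLIP feature dominant, so the
    # pre-trained representation is preserved while the adapter learns
    # a small task-specific correction.
    return ratio * out + (1.0 - ratio) * feat
```

With zero-initialized adapter weights, the output is just a scaled copy of the frozen feature, which is why such adapters can be plugged in without disturbing the original CLIP behavior at the start of training.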
Problem

Research questions and friction points this paper is trying to address.

The modality gap between image and text representations limits text-as-image (TaI) training for image recognition
Global features alone under-exploit local visual cues needed for multi-label recognition
Prompt tuning or adapter learning alone yields limited parameter-efficient fine-tuning gains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates images from text to reduce modality gap
Uses class-wise heatmap and learnable prototypes
Combines prompt tuning and adapter learning
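The class-wise heatmap and learnable prototypes listed above can be sketched as softmax-weighted pooling of patch-prototype cosine similarities; the function name, temperature, and pooling rule here are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def heatmap_pool(local_feats, prototypes, tau=0.07):
    """Aggregate local similarities into per-class scores.

    local_feats: (HW, D) patch features from the image encoder.
    prototypes:  (C, D) learnable class prototype embeddings.
    tau:         softmax temperature (assumed value).
    """
    # L2-normalize so dot products are cosine similarities.
    f = local_feats / np.linalg.norm(local_feats, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    heatmap = f @ p.T                        # (HW, C) class-wise heatmap
    # Softmax over spatial positions gives per-class attention weights,
    # focusing each class score on the patches most similar to its prototype.
    w = np.exp(heatmap / tau)
    w = w / w.sum(axis=0, keepdims=True)     # (HW, C), columns sum to 1
    # Attention-weighted pooling of the similarities -> one score per class.
    return (w * heatmap).sum(axis=0)         # (C,)
```

Compared with plain average pooling, this weighting lets each class score be driven by the image regions that actually match its prototype, which is the point of using local similarities for multi-label recognition.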
Chun-Mei Feng
Assistant Professor/Ad Astra Fellow, University College Dublin, Ireland
AI for Healthcare · Multi-modal Learning · Federated Learning
Kai Yu
University of Minnesota, Minneapolis, MN 55455, USA
Xinxing Xu
Microsoft Research
Artificial Intelligence · Deep Learning · Computer Vision · Industrial AI · Digital Health
Salman Khan
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), UAE, and Australian National University, Canberra ACT, Australia
R. Goh
Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore
Wangmeng Zuo
School of Computer Science and Technology, Harbin Institute of Technology
Computer Vision · Image Processing · Generative AI · Deep Learning · Biometrics
Yong Liu
Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore