Unleashing the Power of Vision-Language Models for Long-Tailed Multi-Label Visual Recognition

πŸ“… 2025-11-25
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Long-tailed multi-label visual recognition faces two intertwined challenges: extreme label-distribution skew and a mismatch between conventional multi-label semantic modeling and zero-shot architectures. Existing methods derive label correlations from scarce tail-class samples, yielding unreliable relationships, while models like CLIP are designed for single-label image-text matching and lack explicit mechanisms for modeling global label relationships. To address this, the authors propose the end-to-end correlation adaptation prompt network (CAPNET), which explicitly models label correlations from CLIP's text encoder. CAPNET combines learnable soft prompts with graph convolutional layers for label-aware semantic propagation, and introduces a distribution-balanced focal loss with class-aware re-weighting to mitigate tail-class bias. With only lightweight parameter-efficient fine-tuning, CAPNET delivers substantial tail-class gains and sets new state-of-the-art results on VOC-LT, COCO-LT, and NUS-WIDE.

πŸ“ Abstract
Long-tailed multi-label visual recognition poses a significant challenge, as images typically contain multiple labels with highly imbalanced class distributions, leading to biased models that favor head classes while underperforming on tail classes. Recent efforts have leveraged pre-trained vision-language models, such as CLIP, alongside long-tailed learning techniques to exploit rich visual-textual priors for improved performance. However, existing methods often derive semantic inter-class relationships directly from imbalanced datasets, resulting in unreliable correlations for tail classes due to data scarcity. Moreover, CLIP's zero-shot paradigm is optimized for single-label image-text matching, making it suboptimal for multi-label tasks. To address these issues, we propose the correlation adaptation prompt network (CAPNET), a novel end-to-end framework that explicitly models label correlations from CLIP's textual encoder. The framework incorporates a graph convolutional network for label-aware propagation and learnable soft prompts for refined embeddings. It utilizes a distribution-balanced Focal loss with class-aware re-weighting for optimized training under imbalance. Moreover, it improves generalization through test-time ensembling and realigns visual-textual modalities using parameter-efficient fine-tuning to avert overfitting on tail classes without compromising head class performance. Extensive experiments and ablation studies on benchmarks including VOC-LT, COCO-LT, and NUS-WIDE demonstrate that CAPNET achieves substantial improvements over state-of-the-art methods, validating its effectiveness for real-world long-tailed multi-label visual recognition.
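The abstract's "distribution-balanced focal loss with class-aware re-weighting" can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's exact formulation: the inverse-frequency weighting, the `gamma` value, and the normalization scheme are all assumptions.

```python
import numpy as np

def focal_weight(p, y, gamma=2.0):
    """Standard focal modulation: down-weight well-classified labels."""
    pt = np.where(y == 1, p, 1.0 - p)   # probability of the true outcome
    return (1.0 - pt) ** gamma

def db_focal_loss(logits, targets, class_freq, gamma=2.0, eps=1e-8):
    """Sketch of a distribution-balanced focal loss with class-aware re-weighting.

    logits:     (batch, num_classes) raw per-label scores
    targets:    (batch, num_classes) multi-hot labels
    class_freq: (num_classes,) training-set label counts (assumed available),
                used so that rare (tail) classes receive larger weights.
    """
    p = 1.0 / (1.0 + np.exp(-logits))                # per-label sigmoid
    # Class-aware re-weighting: inverse frequency, normalized to mean 1.
    inv_freq = 1.0 / np.maximum(class_freq, 1)
    class_w = inv_freq * (len(class_freq) / inv_freq.sum())
    # Per-label binary cross-entropy.
    bce = -(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps))
    return (focal_weight(p, targets, gamma) * class_w * bce).mean()
```

Under this weighting, misclassifying a tail-class label incurs a larger penalty than the same error on a head class, which is the stated goal of the class-aware re-weighting.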
Problem

Research questions and friction points this paper is trying to address.

Addressing biased models in long-tailed multi-label visual recognition with imbalanced class distributions
Overcoming unreliable semantic relationships for tail classes derived from imbalanced datasets
Adapting CLIP's single-label paradigm for effective multi-label visual recognition tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph network models label correlations from CLIP
Learnable prompts refine embeddings for multi-label tasks
Parameter-efficient fine-tuning balances head and tail classes
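The graph-based label-correlation idea above can be sketched as a small GCN over class text embeddings. This is a minimal illustration under assumptions: the cosine-similarity adjacency, two-layer depth, and ReLU activation are choices made here, not details taken from the paper.

```python
import numpy as np

def normalize_adj(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2}, as in a standard GCN."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return d_inv_sqrt @ A_hat @ d_inv_sqrt

def label_gcn(text_emb, W1, W2):
    """Propagate label semantics over a label-correlation graph.

    text_emb: (num_labels, dim) class embeddings from a text encoder
              (e.g., CLIP prompt embeddings); W1, W2 are learnable weights.
    """
    # Build the correlation graph from embedding cosine similarity
    # (an illustrative choice; the paper may construct the graph differently).
    normed = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    A = np.clip(normed @ normed.T, 0.0, None)
    A_norm = normalize_adj(A)
    h = np.maximum(A_norm @ text_emb @ W1, 0.0)   # layer 1 + ReLU
    return A_norm @ h @ W2                         # refined label embeddings
```

Each label's refined embedding mixes in information from correlated labels, so tail classes can borrow semantics from related head classes instead of relying solely on their own scarce training samples.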
πŸ”Ž Similar Papers
No similar papers found.
Wei Tang
School of Computer Science and Engineering, Southeast University, Nanjing 210096, China; the Key Laboratory of Computer Network and Information Integration (Southeast University), MoE, China; and the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE
Zuo-Zheng Wang
School of Computer Science and Engineering, Southeast University, Nanjing 210096, China, and the Key Laboratory of Computer Network and Information Integration (Southeast University), MoE, China
Kun Zhang
Carnegie Mellon University, Pittsburgh, PA 15213, USA, and the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE
Tong Wei
Southeast University
Machine Learning
Min-Ling Zhang
Professor, School of Computer Science and Engineering, Southeast University, China
Artificial Intelligence · Machine Learning · Data Mining