Cluster-Aware Prompt Ensemble Learning for Few-Shot Vision-Language Model Adaptation

๐Ÿ“… 2025-10-10
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Traditional context prompting ensembles average textual features in the feature space, often causing class-center shifts that impair generalization in few-shot vision-language models (VLMs). To address this, we propose Cluster-Aware Prompt Ensemble Learning (CAPEL), which performs prompt ensembling in the classification logits spaceโ€”thereby avoiding distribution distortion induced by feature averaging. Our key contributions are: (1) a clustering-aware prompt assignment strategy that groups semantically similar prompts; (2) logits-space weighted ensemble integration; and (3) a cluster-preserving regularization term coupled with an adaptive prompt weighting mechanism, explicitly maintaining discriminability and robustness within each prompt cluster. Extensive experiments across multiple few-shot vision benchmarks demonstrate that CAPEL consistently outperforms state-of-the-art methods, effectively mitigating prompt collapse and degradation in cross-dataset generalization.

๐Ÿ“ Abstract
Vision-language models (VLMs) such as CLIP achieve zero-shot transfer across various tasks by pre-training on numerous image-text pairs. These models often benefit from using an ensemble of context prompts to represent a class. Despite being effective, conventional prompt ensembling that averages textual features of context prompts often yields suboptimal results. This is because feature averaging shifts the class centroids away from the true class distribution. To address this issue, we propose the Cluster-Aware Prompt Ensemble Learning (CAPEL) framework, which preserves the cluster nature of context prompts. CAPEL classifies images into one of several class clusters, each represented by a distinct prompt. Instead of ensembling prompts in the feature space, we perform ensembling in the classification logits space, aligning better with the visual feature distribution. To further optimize prompt fine-tuning while maintaining cluster-specific discriminative power, we introduce a cluster-preserving regularization term. This ensures that prompts remain distinct and specialized for different clusters, preventing collapse into a uniform direction. Additionally, we integrate an adaptive prompt weighting technique to dynamically adjust the attention weights for flawed or ambiguous prompts, ensuring robust performance across diverse datasets and tasks.
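The contrast the abstract draws between feature-space and logits-space ensembling can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the dimensions, random features, and uniform prompt weights are illustrative assumptions standing in for CLIP features and CAPEL's learned adaptive weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2norm(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Hypothetical setup: C classes, P context prompts per class, d-dim features.
C, P, d = 5, 4, 64
text_feats = l2norm(rng.normal(size=(C, P, d)))   # per-prompt text features
img_feat = l2norm(rng.normal(size=(d,)))          # one image feature

# Conventional ensembling: average the P prompt features per class first.
# The mean of unit vectors falls inside the unit sphere, shifting the class
# centroid away from the distribution the individual prompts describe.
avg_feats = l2norm(text_feats.mean(axis=1))
logits_feature_avg = avg_feats @ img_feat          # shape (C,)

# Logits-space ensembling: score the image against every prompt separately,
# then combine the per-prompt logits (here uniform weights as a placeholder
# for an adaptive weighting scheme).
per_prompt_logits = np.einsum('cpd,d->cp', text_feats, img_feat)  # (C, P)
weights = np.full(P, 1.0 / P)
logits_logit_space = per_prompt_logits @ weights   # shape (C,)
```

Note that the two ensembles generally disagree: averaging unit-norm text features before renormalizing changes the effective class direction, which is the centroid-shift effect the abstract describes.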
Problem

Research questions and friction points this paper is trying to address.

Addresses suboptimal feature averaging in prompt ensembling
Preserves cluster structure of context prompts during adaptation
Enhances discriminative power through logit-space ensemble integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cluster-aware prompt ensemble learning for few-shot adaptation
Ensembling in classification logits space instead of feature space
Cluster-preserving regularization and adaptive prompt weighting
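One way to picture the cluster-preserving regularization is as a penalty on prompts within a class becoming too similar, which would collapse them into a single direction. The function below is a hypothetical sketch of such a term (the exact form in the paper may differ); it penalizes squared pairwise cosine similarity among each class's prompt features.

```python
import numpy as np

def cluster_preserving_reg(prompt_feats):
    """Hypothetical regularizer sketch: penalize high pairwise cosine
    similarity among the P prompt features of each class, so prompts stay
    specialized instead of collapsing into one uniform direction.

    prompt_feats: array of shape (C, P, d), L2-normalized text features.
    Returns a scalar in [0, 1]: 0 for mutually orthogonal prompts,
    1 when all prompts of a class are identical.
    """
    C, P, _ = prompt_feats.shape
    # Pairwise cosine similarities within each class: (C, P, P)
    sims = np.einsum('cpd,cqd->cpq', prompt_feats, prompt_feats)
    off_diag = sims - np.eye(P)[None]  # zero out self-similarity
    return (off_diag ** 2).sum() / (C * P * (P - 1))
```

Minimizing this term alongside the classification loss keeps the per-class prompt cluster spread out, which is the property the regularization is meant to preserve.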
๐Ÿ”Ž Similar Papers
No similar papers found.
Zhi Chen
University of Southern Queensland, Toowoomba, 4350, Queensland, Australia
Xin Yu
University of Queensland, Brisbane, 4072, Queensland, Australia
Xiaohui Tao
Full Professor, University of Southern Queensland, Australia
Artificial Intelligence, data mining, machine learning, natural language processing, knowledge
Yan Li
University of Southern Queensland, Toowoomba, 4350, Queensland, Australia
Zi Huang
PhD Candidate
Deep Learning