🤖 AI Summary
Conventional few-shot learning (FSL) methods that rely solely on class-name text embeddings suffer from insufficient visual representation diversity. Method: This paper proposes BCT-CLIP, which leverages large language models (LLMs) to automatically discover discriminative dominating properties from images, enabling fine-grained, multi-granular visual representations beyond coarse category names. It introduces a multi-property generator (MPG) and a clustering-based pruning mechanism for property refinement, and incorporates property-level contrastive learning to jointly encode global category semantics and local patch-aware features. The framework integrates LLM guidance, cross-modal cross-attention, and contrastive learning to significantly enhance inter-class discrimination. Contribution/Results: BCT-CLIP achieves state-of-the-art performance across 11 mainstream FSL benchmarks, empirically validating that mining dominating properties is critical for improving few-shot generalization.
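The summary mentions a clustering-based pruning step that distills many LLM-generated property descriptions down to a few dominating ones. The paper's exact procedure is not given here; the sketch below illustrates one plausible reading under assumed details (a plain k-means over text embeddings, keeping the description nearest each centroid; the function name and parameters are hypothetical):

```python
import numpy as np

def prune_properties(text_embs, k, n_iter=20, seed=0):
    """Toy clustering-based pruning (assumed k-means variant): cluster the
    embeddings of LLM-generated property descriptions, then keep the index of
    the description closest to each centroid as a 'dominating' representative."""
    rng = np.random.default_rng(seed)
    # initialize centroids from k distinct description embeddings
    centroids = text_embs[rng.choice(len(text_embs), k, replace=False)]
    for _ in range(n_iter):
        # assign each description to its nearest centroid
        dists = np.linalg.norm(text_embs[:, None] - centroids[None], axis=-1)  # (n, k)
        labels = dists.argmin(axis=1)
        # recompute centroids as cluster means
        for j in range(k):
            members = text_embs[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    # keep the description nearest each centroid
    dists = np.linalg.norm(text_embs[:, None] - centroids[None], axis=-1)
    return sorted({int(dists[:, j].argmin()) for j in range(k)})

rng = np.random.default_rng(1)
embs = rng.standard_normal((12, 5))   # 12 candidate descriptions, 5-dim embeddings
kept = prune_properties(embs, k=3)
print(kept)
```

In this reading, pruning trades coverage for discriminativeness: redundant descriptions collapse into one representative per cluster, leaving a compact set of property prompts.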
📝 Abstract
Few-shot learning (FSL), which aims to develop the generalization ability to recognize novel classes from only a few images, faces significant challenges due to data scarcity. Recent CLIP-like methods based on contrastive language-image pre-training mitigate the issue by leveraging the textual representation of the class name for unseen image discovery. Despite this success, simply aligning visual representations to class-name embeddings compromises the visual diversity needed for novel-class discrimination. To this end, we propose a novel few-shot learning method, BCT-CLIP, that explores dominating properties via contrastive learning rather than relying on class tokens alone. By leveraging LLM-based prior knowledge, our method pushes FSL forward with comprehensive structural image representations, comprising both a global category representation and patch-aware property embeddings. In particular, we present a novel multi-property generator (MPG) with patch-aware cross-attention to generate multiple visual property tokens, a large language model (LLM)-assisted retrieval procedure with clustering-based pruning to obtain dominating property descriptions, and a new contrastive learning strategy for property-token learning. Superior performance on 11 widely used datasets demonstrates that our investigation of dominating properties advances discriminative class-specific representation learning and few-shot classification.
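The abstract describes the MPG as using patch-aware cross-attention to turn image patch features into multiple property tokens. The paper's architecture is not reproduced here; the following is a minimal numpy sketch of that mechanism under assumed shapes (learnable property queries attending to CLIP patch features; all names and projection matrices are illustrative, not the authors' implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def property_tokens(queries, patches, Wq, Wk, Wv):
    """Patch-aware cross-attention sketch: P learnable property queries attend
    over N image patch features and return P property tokens.

    queries: (P, d) hypothetical learnable property query embeddings
    patches: (N, d) patch features, e.g. from a CLIP image encoder
    """
    Q = queries @ Wq                                  # (P, d)
    K = patches @ Wk                                  # (N, d)
    V = patches @ Wv                                  # (N, d)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))    # (P, N) attention weights
    return attn @ V                                   # (P, d) property tokens

rng = np.random.default_rng(0)
d, P, N = 8, 4, 16
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
tokens = property_tokens(rng.standard_normal((P, d)),
                         rng.standard_normal((N, d)), Wq, Wk, Wv)
print(tokens.shape)  # (4, 8)
```

Each output token aggregates a different weighted view of the patches, which is what lets the property tokens capture local, part-level evidence that a single class token would average away.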