Verbalized Representation Learning for Interpretable Few-Shot Generalization

📅 2024-11-27
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenge of achieving interpretability and generalization simultaneously in few-shot object recognition, this paper proposes Verbalized Representation Learning (VRL), an end-to-end framework in which vision and language collaborate to learn interpretable representations. Leveraging a pre-trained vision-language model (VLM), the method automatically discovers inter-class discriminative and intra-class common semantic features and expresses them as human-readable descriptions in natural language; these are then mapped to numeric vectors via the VLM for downstream classification, without requiring manually annotated attributes. The key contribution is the first automatic, end-to-end pipeline for learning and embedding interpretable semantic features in the few-shot setting. Experiments show that, at comparable model scale, VRL achieves a 24% absolute accuracy improvement over prior state-of-the-art approaches while using only 5% of the training data, and its learned features outperform human-labeled attributes by 20% absolute on downstream classification.
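The verbalization step summarized above can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: `query_vlm`, `verbalize_features`, and the prompt wording are hypothetical stand-ins for whatever VLM endpoint is used.

```python
from typing import List


def query_vlm(prompt: str, image_paths: List[str]) -> str:
    """Hypothetical stand-in for any vision-language model call that takes
    a text prompt plus a set of images and returns free-form text."""
    raise NotImplementedError("wire this to a real VLM endpoint")


def verbalize_features(class_a: List[str], class_b: List[str]) -> List[str]:
    """Elicit inter-class differences and intra-class commonalities as
    natural-language feature descriptions, one per line."""
    differences = query_vlm(
        f"The first {len(class_a)} images belong to class A, the rest to "
        "class B. List visual features that distinguish A from B, one short "
        "phrase per line.",
        class_a + class_b,
    )
    commonalities = query_vlm(
        "List visual features shared by all of these images, one short "
        "phrase per line.",
        class_a,
    )
    # Each non-empty line becomes one human-readable, verbalized feature.
    lines = (differences + "\n" + commonalities).splitlines()
    return [line.strip() for line in lines if line.strip()]
```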

📝 Abstract
Humans recognize objects after observing only a few examples, a remarkable capability enabled by their inherent language understanding of the real-world environment. Developing verbalized and interpretable representation can significantly improve model generalization in low-data settings. In this work, we propose Verbalized Representation Learning (VRL), a novel approach for automatically extracting human-interpretable features for object recognition using few-shot data. Our method uniquely captures inter-class differences and intra-class commonalities in the form of natural language by employing a Vision-Language Model (VLM) to identify key discriminative features between different classes and shared characteristics within the same class. These verbalized features are then mapped to numeric vectors through the VLM. The resulting feature vectors can be further utilized to train and infer with downstream classifiers. Experimental results show that, at the same model scale, VRL achieves a 24% absolute improvement over prior state-of-the-art methods while using 95% less data and a smaller model. Furthermore, compared to human-labeled attributes, the features learned by VRL exhibit a 20% absolute gain when used for downstream classification tasks. Code is available at: https://github.com/joeyy5588/VRL/tree/main.
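As a rough sketch of the second half of this pipeline (mapping verbalized features to numeric vectors, then training a downstream classifier), assuming the VLM can score each feature per image: `score_feature` is a hypothetical stand-in, and the logistic-regression classifier is an illustrative choice, not one prescribed by the paper.

```python
from typing import List

import numpy as np
from sklearn.linear_model import LogisticRegression


def score_feature(image_path: str, feature: str) -> float:
    """Hypothetical stand-in: ask the VLM how strongly the verbalized
    feature appears in the image, as a score in [0, 1]."""
    raise NotImplementedError("wire this to a real VLM endpoint")


def embed_images(image_paths: List[str], features: List[str]) -> np.ndarray:
    """Map each image to a numeric vector with one VLM-derived score per
    verbalized feature, so every dimension stays human-interpretable."""
    return np.array([[score_feature(p, f) for f in features] for p in image_paths])


# Illustrative few-shot usage with the features from the sketch above:
#   feats = verbalize_features(class_a_paths, class_b_paths)
#   clf = LogisticRegression().fit(embed_images(train_paths, feats), labels)
#   preds = clf.predict(embed_images(test_paths, feats))
```

Because each vector dimension corresponds to a single natural-language feature, the classifier's learned weights remain directly inspectable.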
Problem

Research questions and friction points this paper is trying to address.

Develop interpretable features for few-shot object recognition
Capture class differences and commonalities via natural language
Improve model generalization with minimal data and smaller models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses a Vision-Language Model (VLM) to extract verbalized features
Maps verbalized features to numeric vectors via the VLM
Achieves a 24% absolute accuracy gain while using 95% less training data
Authors

Cheng-Fu Yang, UCLA (Multimodal Learning; Vision and Language)
Da Yin, Meta FAIR (Natural Language Processing)
Wenbo Hu, University of California, Los Angeles
Nanyun Peng, University of California, Los Angeles
Bolei Zhou, Associate Professor at UCLA (Computer Vision; Robotics; Artificial Intelligence)
Kai-Wei Chang, University of California, Los Angeles