Tree of Attributes Prompt Learning for Vision-Language Models

📅 2024-10-15
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing vision-language prompt learning methods merely concatenate learnable prompts with class names, neglecting the rich semantic context those names carry. Method: The authors propose Tree of Attributes Prompt learning (TAP), which (1) instructs large language models to generate a hierarchical "concept–attribute–description" knowledge tree that explicitly models fine-grained visual attributes for each class; (2) learns this hierarchy with dedicated text and vision prompt tokens, which effectively act as interpretable domain experts; and (3) introduces a vision-conditional pooling module that extracts instance-specific text features for image–text alignment. Contribution/Results: TAP achieves state-of-the-art performance across 11 benchmarks, with consistent gains in zero-shot base-to-novel generalization, cross-dataset transfer, and few-shot classification.
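The "concept–attribute–description" tree can be pictured as a small nested structure per class, flattened into per-attribute description lists that drive the prompts. A minimal illustrative sketch (the class, attributes, and descriptions below are made up for illustration, not taken from the paper):

```python
# Hypothetical example of an LLM-generated attribute tree for one class.
# The concept is the class name; each attribute groups several descriptions.
tree = {
    "concept": "golden retriever",
    "attributes": {
        "coat": ["long golden fur", "feathered tail"],
        "head": ["broad skull", "dark friendly eyes"],
        "build": ["medium-large sturdy body"],
    },
}

def flatten_descriptions(tree):
    """Collect (attribute, description) pairs, one per description,
    to feed attribute-specific text prompts."""
    return [
        (attr, desc)
        for attr, descs in tree["attributes"].items()
        for desc in descs
    ]

pairs = flatten_descriptions(tree)
# pairs[0] → ("coat", "long golden fur"); 5 pairs in total
```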

📝 Abstract
Prompt learning has proven effective in adapting vision-language models for downstream tasks. However, existing methods usually append learnable prompt tokens solely to the category names to obtain textual features, which fails to fully leverage the rich context indicated in the category name. To address this issue, we propose Tree of Attributes Prompt learning (TAP), which first instructs LLMs to generate a tree of attributes with a "concept - attribute - description" structure for each category, and then learns the hierarchy with vision and text prompt tokens. Unlike existing methods that merely augment category names with a set of unstructured descriptions, our approach essentially distills structured knowledge graphs associated with class names from LLMs. Furthermore, our approach introduces text and vision prompts designed to explicitly learn the corresponding visual attributes, effectively serving as domain experts. Additionally, the general and diverse descriptions generated from the class names may be wrong or absent in the specific given images. To address this misalignment, we further introduce a vision-conditional pooling module to extract instance-specific text features. Extensive experimental results demonstrate that our approach outperforms state-of-the-art methods on zero-shot base-to-novel generalization, cross-dataset transfer, as well as few-shot classification across 11 diverse datasets. Code is available at https://github.com/HHenryD/TAP.
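The vision-conditional pooling idea can be sketched as attention-style pooling: each description embedding is weighted by its similarity to the image feature, so descriptions that do not apply to the given instance are down-weighted. This is a simplified sketch under the assumption of pre-computed, fixed-dimension features; the paper's actual module may use learned projections and a different score normalization:

```python
import numpy as np

def vision_conditional_pooling(image_feat, desc_feats):
    """Pool description embeddings conditioned on the image.

    image_feat: (d,) image embedding; desc_feats: (n, d) description embeddings.
    Returns a single (d,) instance-specific text feature.
    """
    scores = desc_feats @ image_feat          # similarity of each description to the image
    scores = scores - scores.max()            # stabilize softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ desc_feats               # similarity-weighted average

rng = np.random.default_rng(0)
image_feat = rng.standard_normal(512)
desc_feats = rng.standard_normal((8, 512))    # e.g. 8 attribute descriptions
pooled = vision_conditional_pooling(image_feat, desc_feats)  # shape (512,)
```

The pooled vector replaces a plain average of description features, giving each image its own weighting over the class's descriptions.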
Problem

Research questions and friction points this paper is trying to address.

Enhances vision-language models with structured attribute trees
Improves text and vision prompt learning for attributes
Addresses misalignment with vision-conditional text feature extraction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates attribute trees with LLMs
Learns hierarchy with vision-text prompts
Uses vision-conditional pooling module