Auxiliary Descriptive Knowledge for Few-Shot Adaptation of Vision-Language Model

📅 2025-12-19
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Vision-language models (VLMs) show limited few-shot adaptation performance under distribution shifts: existing parameter-efficient fine-tuning (PEFT) methods rely on fixed, hand-crafted prompts with insufficient semantic expressivity, while image-induced prompting incurs substantial inference overhead. Method: The authors propose Auxiliary Descriptive Knowledge (ADK), a parameter-free framework that leverages large language models to generate class-level semantic descriptions offline. ADK injects these descriptions into the textual representations via two pathways, compositional and instance-specific, integrating them through non-parametric attention and feature averaging for plug-and-play enhancement. Contribution/Results: ADK integrates descriptive knowledge with zero added parameters, low inference overhead, and rich semantics. It consistently outperforms state-of-the-art PEFT methods across diverse few-shot transfer benchmarks, establishing new SOTA results on multiple datasets.

๐Ÿ“ Abstract
Despite the impressive zero-shot capabilities of Vision-Language Models (VLMs), they often struggle in downstream tasks with distribution shifts from the pre-training data. Few-Shot Adaptation (FSA-VLM) has emerged as a key solution, typically using Parameter-Efficient Fine-Tuning (PEFT) to adapt models with minimal data. However, these PEFT methods are constrained by their reliance on fixed, handcrafted prompts, which are often insufficient to understand the semantics of classes. While some studies have proposed leveraging image-induced prompts to provide additional clues for classification, they introduce prohibitive computational overhead at inference. Therefore, we introduce Auxiliary Descriptive Knowledge (ADK), a novel framework that efficiently enriches text representations without compromising efficiency. ADK first leverages a Large Language Model to generate a rich set of descriptive prompts for each class offline. These pre-computed features are then deployed in two ways: (1) as Compositional Knowledge, an averaged representation that provides rich semantics, especially beneficial when class names are ambiguous or unfamiliar to the VLM; and (2) as Instance-Specific Knowledge, where a lightweight, non-parametric attention mechanism dynamically selects the most relevant descriptions for a given image. This approach provides two additional types of knowledge alongside the handcrafted prompt, thereby facilitating category distinction across various domains. Also, ADK acts as a parameter-free, plug-and-play component that enhances existing PEFT methods. Extensive experiments demonstrate that ADK consistently boosts the performance of multiple PEFT baselines, setting a new state-of-the-art across various scenarios.
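The two knowledge pathways described in the abstract can be sketched in a few lines. The following is a minimal NumPy illustration based only on the abstract's description, not the paper's actual implementation; the equal-weight fusion of the three logit sources, the softmax temperature, and all function names are assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize vectors to unit length, as in CLIP-style cosine scoring."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def adk_logits(image_feat, prompt_feats, desc_feats, temperature=0.01):
    """Score one image against C classes using three text pathways.

    image_feat:   (d,)       image embedding
    prompt_feats: (C, d)     handcrafted-prompt embeddings, one per class
    desc_feats:   (C, M, d)  M pre-computed LLM description embeddings per class
    """
    img = l2_normalize(image_feat)
    prompt = l2_normalize(prompt_feats)
    desc = l2_normalize(desc_feats)

    # (1) Standard handcrafted-prompt pathway.
    logits_prompt = prompt @ img                      # (C,)

    # (2) Compositional knowledge: average the description features
    # per class into one richer class representation.
    comp = l2_normalize(desc.mean(axis=1))            # (C, d)
    logits_comp = comp @ img                          # (C,)

    # (3) Instance-specific knowledge: non-parametric attention weights
    # each description by its similarity to this particular image.
    sims = desc @ img                                 # (C, M)
    attn = np.exp((sims - sims.max(axis=1, keepdims=True)) / temperature)
    attn /= attn.sum(axis=1, keepdims=True)
    inst = l2_normalize((attn[..., None] * desc).sum(axis=1))  # (C, d)
    logits_inst = inst @ img                          # (C,)

    # Fuse the three pathways (equal weighting is an assumption here).
    return (logits_prompt + logits_comp + logits_inst) / 3.0
```

Because the description features are pre-computed offline and the attention is non-parametric, inference adds no learned parameters and only a few matrix products on top of the baseline PEFT method.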
Problem

Research questions and friction points this paper is trying to address.

Enhance few-shot adaptation of vision-language models with auxiliary descriptive knowledge
Address insufficient semantics from fixed prompts in parameter-efficient fine-tuning
Provide rich class descriptions without computational overhead at inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates descriptive prompts offline using LLM
Uses compositional and instance-specific knowledge
Parameter-free plug-and-play component for PEFT