🤖 AI Summary
Vision-language models (VLMs) suffer from limited few-shot adaptation performance under distribution shifts: existing parameter-efficient fine-tuning (PEFT) methods rely on fixed, hand-crafted prompts with insufficient semantic expressivity, while image-induced prompting incurs substantial inference overhead.
Method: We propose Auxiliary Descriptive Knowledge (ADK), a parameter-free framework that leverages large language models to generate class-level semantic descriptions offline. ADK dynamically injects these descriptions into textual representations via dual pathways (compositional and instance-specific) and integrates them through non-parametric attention and feature averaging for plug-and-play enhancement.
Contribution/Results: ADK is the first approach to integrate semantically rich descriptive knowledge with zero additional parameters and low inference overhead. It consistently outperforms state-of-the-art PEFT methods across diverse few-shot transfer benchmarks, establishing new SOTA results on multiple datasets.
📝 Abstract
Despite the impressive zero-shot capabilities of Vision-Language Models (VLMs), they often struggle in downstream tasks whose distributions shift from the pre-training data. Few-Shot Adaptation (FSA-VLM) has emerged as a key solution, typically using Parameter-Efficient Fine-Tuning (PEFT) to adapt models with minimal data. However, these PEFT methods are constrained by their reliance on fixed, handcrafted prompts, which are often insufficient to capture the semantics of classes. While some studies have proposed leveraging image-induced prompts to provide additional clues for classification, they introduce prohibitive computational overhead at inference. Therefore, we introduce Auxiliary Descriptive Knowledge (ADK), a novel framework that enriches text representations without compromising inference efficiency. ADK first leverages a Large Language Model to generate a rich set of descriptive prompts for each class offline. These pre-computed features are then deployed in two ways: (1) as Compositional Knowledge, an averaged representation that provides rich semantics, especially beneficial when class names are ambiguous or unfamiliar to the VLM; and (2) as Instance-Specific Knowledge, where a lightweight, non-parametric attention mechanism dynamically selects the descriptions most relevant to a given image. This approach provides two additional types of knowledge alongside the handcrafted prompt, thereby facilitating category distinction across various domains. Furthermore, ADK acts as a parameter-free, plug-and-play component that enhances existing PEFT methods. Extensive experiments demonstrate that ADK consistently boosts the performance of multiple PEFT baselines, setting a new state-of-the-art across various scenarios.
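The dual-pathway fusion described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the softmax temperature, and the equal-weight averaging of the three knowledge types are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project features onto the unit sphere, as is standard for CLIP-style embeddings."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def adk_class_feature(image_feat, desc_feats, prompt_feat, tau=0.01):
    """Build an ADK-style text representation for one class (hypothetical sketch).

    image_feat:  (d,)   embedding of the query image
    desc_feats:  (m, d) pre-computed embeddings of m LLM-generated descriptions
    prompt_feat: (d,)   embedding of the handcrafted prompt ("a photo of a <class>")
    tau:         softmax temperature for the non-parametric attention (assumed value)
    """
    image_feat = l2_normalize(image_feat)
    desc_feats = l2_normalize(desc_feats)
    prompt_feat = l2_normalize(prompt_feat)

    # (1) Compositional Knowledge: average of all class descriptions.
    comp = l2_normalize(desc_feats.mean(axis=0))

    # (2) Instance-Specific Knowledge: non-parametric attention that
    # weights each description by its similarity to the query image.
    sims = desc_feats @ image_feat          # (m,) cosine similarities
    attn = np.exp(sims / tau)
    attn /= attn.sum()
    inst = l2_normalize(attn @ desc_feats)  # similarity-weighted description feature

    # Parameter-free fusion by feature averaging with the handcrafted prompt.
    return l2_normalize((prompt_feat + comp + inst) / 3.0)
```

Classification would then proceed as in zero-shot CLIP: compute such a feature per class and pick the class whose fused text feature has the highest cosine similarity with the image embedding. Because the description features are pre-computed offline, the only per-image cost added at inference is the attention step, which is a single matrix-vector product.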