🤖 AI Summary
Generative large multimodal models (LMMs) struggle to adapt directly to discriminative vision-language tasks (e.g., image classification and multiple-choice visual question answering) without task-specific fine-tuning.
Method: We propose Sparse Attention Vectors (SAVs), a finetuning-free adaptation method that identifies the most discriminative attention head activations (fewer than 1% of heads) in LMMs and leverages them as high-quality multimodal features. SAVs employ latent-space feature distillation followed by few-shot linear classification, requiring no model parameter updates.
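The selection-then-classification pipeline above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: it assumes head activations have already been extracted as a `(n_examples, n_heads, dim)` array, scores each head by its own nearest-class-mean accuracy on the few-shot support set (a stand-in for the paper's head-selection criterion), and classifies queries by nearest class mean over the selected heads.

```python
import numpy as np

def select_heads(acts, labels, k):
    """Pick the k heads whose activations best separate the support set.

    acts:   (n_examples, n_heads, dim) attention head activations
    labels: (n_examples,) integer class labels
    Returns the indices of the k highest-scoring heads.
    """
    n, n_heads, _ = acts.shape
    classes = np.unique(labels)
    scores = np.zeros(n_heads)
    for j in range(n_heads):
        # class means for this head, shape (n_classes, dim)
        means = np.stack([acts[labels == c, j].mean(0) for c in classes])
        # nearest-class-mean prediction for every support example
        dists = np.linalg.norm(acts[:, j, None] - means[None], axis=-1)
        scores[j] = (classes[dists.argmin(1)] == labels).mean()
    return np.argsort(scores)[-k:]

def classify(query_acts, support_acts, labels, heads):
    """Nearest-class-mean over the selected heads' concatenated activations."""
    classes = np.unique(labels)
    feats = support_acts[:, heads].reshape(len(support_acts), -1)
    qfeats = query_acts[:, heads].reshape(len(query_acts), -1)
    means = np.stack([feats[labels == c].mean(0) for c in classes])
    dists = np.linalg.norm(qfeats[:, None] - means[None], axis=-1)
    return classes[dists.argmin(1)]
```

Because only head indices and class means are stored, no model weights are touched, which matches the parameter-free character of the method.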
Contribution/Results: SAVs achieve state-of-the-art performance across diverse discriminative benchmarks (e.g., ImageNet, VQAv2, OK-VQA, and MME), significantly outperforming few-shot transfer and lightweight fine-tuning baselines. The approach is architecture-agnostic (validated on LLaVA, Qwen-VL, etc.), task-transferable, robust to limited supervision (only a few labeled samples per class), and exhibits consistent performance gains with increasing sample size, demonstrating effectiveness, robustness, and scalability.
📝 Abstract
Generative Large Multimodal Models (LMMs) like LLaVA and Qwen-VL excel at a wide variety of vision-language (VL) tasks such as image captioning and visual question answering. Despite strong performance, LMMs are not directly suited for foundational discriminative vision-language tasks (i.e., tasks requiring discrete label predictions) such as image classification and multiple-choice VQA. One key challenge in utilizing LMMs for discriminative tasks is the extraction of useful features from generative models. To overcome this issue, we propose an approach for finding features in the model's latent space to more effectively leverage LMMs for discriminative tasks. Toward this end, we present Sparse Attention Vectors (SAVs) -- a finetuning-free method that leverages sparse attention head activations (fewer than 1% of the heads) in LMMs as strong features for VL tasks. With only a handful of examples per class, SAVs demonstrate state-of-the-art performance compared to a variety of few-shot and finetuned baselines on a collection of discriminative tasks. Our experiments also show that SAVs can scale in performance with additional examples and generalize to similar tasks, establishing SAVs as both effective and robust multimodal feature representations.