🤖 AI Summary
This work addresses two key limitations: the weak discriminative capability of Large Vision-Language Models (LVLMs) and the insufficient language understanding and compositional reasoning of CLIP-style models. To this end, we propose the first fine-tuning framework explicitly designed to enhance discriminative ability in LVLMs. Our method jointly optimizes contrastive and autoregressive objectives, employs a multi-granularity image-text pair training strategy, and achieves parameter-efficient adaptation via synergistic soft prompting and LoRA. Crucially, it preserves the model's generative capacity while substantially improving discriminative performance. Experiments demonstrate that our approach surpasses same-scale CLIP models on standard image-text retrieval benchmarks and achieves significant gains on compositional tasks, including VQA and visual reasoning, validating its effectiveness in strengthening deep language comprehension and structured reasoning.
📝 Abstract
Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. However, these models have limited language understanding, often exhibiting a "bag of words" behavior. At the same time, Large Vision-Language Models (LVLMs), which combine vision encoders with LLMs, have been shown to be capable of detailed vision-language reasoning, yet their autoregressive nature renders them less suitable for discriminative tasks. In this work, we propose to combine "the best of both worlds": a new training approach for discriminative fine-tuning of LVLMs that results in strong discriminative and compositional capabilities. Essentially, our approach converts a generative LVLM into a discriminative one, unlocking its capability for powerful image-text discrimination combined with enhanced language understanding. Our contributions include (1) a carefully designed training/optimization framework that utilizes image-text pairs of variable length and granularity for training the model with both contrastive and next-token prediction losses, accompanied by ablation studies that justify the necessity of the framework's components; (2) a parameter-efficient adaptation method using a combination of soft prompting and LoRA adapters; (3) significant improvements over state-of-the-art CLIP-like models of similar size on standard image-text retrieval benchmarks, together with notable gains in compositionality.
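The abstract's core training idea, jointly optimizing a CLIP-style contrastive loss with a next-token prediction loss, can be sketched as below. This is a minimal NumPy illustration, not the paper's implementation: the function names (`info_nce_loss`, `next_token_loss`, `joint_loss`), the temperature, and the mixing weight `alpha` are all illustrative assumptions.

```python
import numpy as np

def info_nce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE contrastive loss (CLIP-style) over a batch of
    image/text embeddings; matching pairs lie on the diagonal."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(logits))

    def xent(l):
        # numerically stable log-softmax, then pick the diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[labels, labels].mean()

    # average of image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))

def next_token_loss(lm_logits, targets):
    """Autoregressive cross-entropy over token logits of shape (B, T, V)
    against integer targets of shape (B, T)."""
    l = lm_logits - lm_logits.max(axis=-1, keepdims=True)
    log_p = l - np.log(np.exp(l).sum(axis=-1, keepdims=True))
    b, t = targets.shape
    return -log_p[np.arange(b)[:, None], np.arange(t)[None, :], targets].mean()

def joint_loss(img_emb, txt_emb, lm_logits, lm_targets, alpha=0.5):
    """Weighted combination of the two objectives; alpha is a
    hypothetical mixing weight, not a value from the paper."""
    return (alpha * info_nce_loss(img_emb, txt_emb)
            + (1.0 - alpha) * next_token_loss(lm_logits, lm_targets))
```

In a real fine-tuning loop the gradients of such a joint loss would flow only through the parameter-efficient components (soft prompts and LoRA adapters), leaving the base LVLM weights frozen.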