🤖 AI Summary
This work addresses two key limitations: the weak discriminative capability of Large Vision-Language Models (LVLMs) and the insufficient language understanding and compositional reasoning of CLIP-style models. To this end, we propose the first fine-tuning framework explicitly designed to enhance discriminative ability in LVLMs. Our method jointly optimizes contrastive and autoregressive objectives, employs a multi-granularity image-text pair training strategy, and achieves parameter-efficient adaptation via synergistic soft prompting and LoRA. Crucially, it preserves the model's generative capacity while substantially improving discriminative performance. Experiments demonstrate that our approach surpasses same-scale CLIP models on standard image-text retrieval benchmarks and achieves significant gains on compositional tasks, including VQA and visual reasoning, validating its effectiveness in strengthening deep language comprehension and structured reasoning.
📝 Abstract
Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. However, these models have limited language understanding, often exhibiting a "bag of words" behavior. At the same time, Large Vision-Language Models (LVLMs), which combine vision encoders with LLMs, have been shown to be capable of detailed vision-language reasoning, yet their autoregressive nature renders them less suitable for discriminative tasks. In this work, we propose to combine "the best of both worlds": a new training approach for discriminative fine-tuning of LVLMs that results in strong discriminative and compositional capabilities. Essentially, our approach converts a generative LVLM into a discriminative one, unlocking its capability for powerful image-text discrimination combined with enhanced language understanding. Our contributions include (1) a carefully designed training/optimization framework that utilizes image-text pairs of variable length and granularity for training the model with both contrastive and next-token prediction losses, accompanied by ablation studies that justify the necessity of the framework's components; (2) a parameter-efficient adaptation method using a combination of soft prompting and LoRA adapters; (3) significant improvements over state-of-the-art CLIP-like models of similar size on standard image-text retrieval benchmarks, together with notable gains in compositionality.
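The abstract's core training idea, jointly optimizing a CLIP-style contrastive loss with a next-token prediction loss, can be sketched as below. This is a minimal NumPy illustration, not the paper's implementation: the function names (`info_nce_loss`, `next_token_loss`, `joint_loss`), the temperature, and the mixing weight `alpha` are all illustrative assumptions.

```python
import numpy as np

def info_nce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE contrastive loss (CLIP-style) over a batch of
    image/text embeddings; matching pairs lie on the diagonal."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(logits))

    def xent(l):
        # numerically stable log-softmax, then pick the diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[labels, labels].mean()

    # average of image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))

def next_token_loss(lm_logits, targets):
    """Autoregressive cross-entropy over token logits of shape (B, T, V)
    against integer targets of shape (B, T)."""
    l = lm_logits - lm_logits.max(axis=-1, keepdims=True)
    log_p = l - np.log(np.exp(l).sum(axis=-1, keepdims=True))
    b, t = targets.shape
    return -log_p[np.arange(b)[:, None], np.arange(t)[None, :], targets].mean()

def joint_loss(img_emb, txt_emb, lm_logits, lm_targets, alpha=0.5):
    """Weighted combination of the two objectives; alpha is a
    hypothetical mixing weight, not a value from the paper."""
    return (alpha * info_nce_loss(img_emb, txt_emb)
            + (1.0 - alpha) * next_token_loss(lm_logits, lm_targets))
```

In a real fine-tuning loop the gradients of such a joint loss would flow only through the parameter-efficient components (soft prompts and LoRA adapters), leaving the base LVLM weights frozen.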