Discriminative Fine-tuning of LVLMs

📅 2024-12-05
🏛️ arXiv.org
📈 Citations: 2
Influential: 1
🤖 AI Summary
This work addresses two complementary limitations: the weak discriminative capability of Large Vision-Language Models (LVLMs) and the limited language understanding and compositional reasoning of CLIP-style models. The authors propose a fine-tuning framework explicitly designed to enhance the discriminative ability of LVLMs. The method jointly optimizes contrastive and autoregressive (next-token prediction) objectives, trains on image-text pairs of variable length and granularity, and achieves parameter-efficient adaptation through a combination of soft prompting and LoRA adapters, all while preserving the model's generative capacity. Experiments show that the approach surpasses CLIP-like models of similar size on standard image-text retrieval benchmarks and delivers notable gains on compositionality benchmarks, indicating stronger language comprehension and structured reasoning.
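A minimal sketch of what such a joint objective could look like is given below, assuming the LVLM exposes pooled image and text embeddings alongside its decoder logits. The pooling scheme, the temperature, and the loss weight `lambda_lm` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def joint_discriminative_generative_loss(
    image_emb: torch.Tensor,   # (B, D) pooled image-side embeddings
    text_emb: torch.Tensor,    # (B, D) pooled text-side embeddings
    lm_logits: torch.Tensor,   # (B, T, V) next-token logits from the LVLM decoder
    lm_targets: torch.Tensor,  # (B, T) target token ids, -100 for ignored positions
    temperature: float = 0.07,
    lambda_lm: float = 1.0,
):
    """Sketch: CLIP-style symmetric contrastive loss on pooled embeddings
    plus a standard next-token prediction loss, weighted by lambda_lm."""
    # Contrastive term: cosine similarities between all image-text pairs in the batch.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature          # (B, B)
    labels = torch.arange(logits.size(0), device=logits.device)
    contrastive = 0.5 * (F.cross_entropy(logits, labels) +
                         F.cross_entropy(logits.t(), labels))

    # Autoregressive term: ordinary language-modelling cross-entropy.
    lm = F.cross_entropy(
        lm_logits.reshape(-1, lm_logits.size(-1)),
        lm_targets.reshape(-1),
        ignore_index=-100,
    )
    return contrastive + lambda_lm * lm
```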

📝 Abstract
Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. However, these models have limited language understanding, often exhibiting a "bag of words" behavior. At the same time, Large Vision-Language Models (LVLMs), which combine vision encoders with LLMs, have been shown to be capable of detailed vision-language reasoning, yet their autoregressive nature renders them less suitable for discriminative tasks. In this work, we propose to combine "the best of both worlds": a new training approach for discriminative fine-tuning of LVLMs that results in strong discriminative and compositional capabilities. Essentially, our approach converts a generative LVLM into a discriminative one, unlocking its capability for powerful image-text discrimination combined with enhanced language understanding. Our contributions include (1) a carefully designed training/optimization framework that utilizes image-text pairs of variable length and granularity for training the model with both contrastive and next-token prediction losses. This is accompanied by ablation studies that justify the necessity of our framework's components; (2) a parameter-efficient adaptation method using a combination of soft prompting and LoRA adapters; (3) significant improvements over state-of-the-art CLIP-like models of similar size, including standard image-text retrieval benchmarks and notable gains in compositionality.
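Once fine-tuned in this way, the LVLM can be used like a CLIP-style retriever at inference time. The snippet below is a hypothetical sketch: `encode_image` and `encode_text` are assumed helper methods that pool the LVLM's hidden states into fixed-size embeddings, and the actual pooling and prompting scheme used in the paper may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def rank_captions(model, image, captions, tokenizer, device="cuda"):
    """Rank candidate captions for one image by cosine similarity of
    embeddings produced by a discriminatively fine-tuned LVLM."""
    # encode_image / encode_text are assumed (hypothetical) pooling helpers.
    img_emb = F.normalize(model.encode_image(image.to(device)), dim=-1)   # (1, D)
    tok = tokenizer(captions, padding=True, return_tensors="pt").to(device)
    txt_emb = F.normalize(model.encode_text(**tok), dim=-1)               # (N, D)
    scores = (img_emb @ txt_emb.t()).squeeze(0)                           # (N,)
    order = scores.argsort(descending=True)
    return [(captions[i], scores[i].item()) for i in order]
```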
Problem

Research questions and friction points this paper is trying to address.

Enhancing LVLMs for discriminative vision-language tasks
Combining contrastive and generative training for better performance
Improving language understanding and compositionality in vision models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Discriminative fine-tuning of LVLMs for enhanced capabilities
Combines contrastive and next-token prediction losses
Uses soft prompting and LoRA adapters for parameter efficiency (see the sketch after this list)
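The two parameter-efficient components can be illustrated with a short self-contained sketch. The ranks, scaling factors, and initializations below are illustrative assumptions rather than the paper's configuration, and a library such as Hugging Face PEFT could be used in place of the hand-rolled LoRA wrapper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: frozen base linear layer plus a trainable
    low-rank update (x @ A @ B) scaled by alpha / rank."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # freeze the original weights
        self.lora_a = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scale


class SoftPrompt(nn.Module):
    """A handful of learnable 'virtual token' embeddings prepended to the
    sequence of input embeddings fed to the otherwise frozen LVLM."""
    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)

    def forward(self, input_embeds):                      # (B, T, D) -> (B, P + T, D)
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)
```

Only the LoRA matrices and the soft-prompt embeddings are trainable here, so gradient and optimizer memory stay small while the underlying LVLM weights remain frozen.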
Yassine Ouali
Samsung AI Cambridge
Machine Learning, Deep Learning
Adrian Bulat
Samsung AI Cambridge
Computer Vision, Deep Learning, Machine Learning, Artificial Intelligence
Alexandros Xenos
Ph.D. student, Queen Mary University of London
Deep Learning, Multimodal Deep Learning, NLP
Anestis Zaganidis
Samsung AI Cambridge
Ioannis Maniadis Metaxas
Samsung AI Cambridge
Georgios Tzimiropoulos
Samsung AI Cambridge, Queen Mary University of London
Brais Martínez
Samsung AI Cambridge