Aligning Large Vision-Language Models by Deep Reinforcement Learning and Direct Preference Optimization

📅 2025-09-08
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses key challenges in aligning large vision-language models (LVLMs) with human values and task preferences: poor alignment fidelity, low sample efficiency, and policy instability. We propose an end-to-end alignment framework that integrates deep reinforcement learning (DRL) with direct preference optimization (DPO), eliminating explicit reward modeling and instead optimizing the policy directly from human preference data to enable fine-grained, adaptive alignment in multimodal interactions. The key innovation is incorporating DPO into the joint vision-language training paradigm, enhancing LVLMs' generalization and continual-learning capabilities on complex cross-modal tasks. Extensive experiments show significant improvements in sample efficiency and policy stability, with state-of-the-art performance across multiple vision-language understanding and generation benchmarks, validating the method's effectiveness, robustness, and scalability.
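The summary's central mechanism, optimizing the policy directly from preference pairs without a separate reward model, is the standard DPO objective. Below is a minimal PyTorch sketch; the function and argument names are illustrative rather than taken from the paper, and it assumes the summed per-pair log-probabilities have already been computed from the policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss over a batch of preference pairs.

    Each argument is a 1-D tensor of summed per-token log-probabilities,
    one entry per (chosen, rejected) pair; beta controls how far the
    policy may drift from the frozen reference model.
    """
    # Implicit rewards are the policy/reference log-ratios.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry likelihood that the chosen response beats the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Because the reward is implicit in the log-ratio, no reward network is ever trained or queried, which is what the summary means by eliminating explicit reward modeling.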

📝 Abstract
Large Vision-Language Models (LVLMs), or multimodal large language models, represent a significant advance in artificial intelligence, enabling systems to understand and generate content across both visual and textual modalities. While large-scale pretraining has driven substantial progress, fine-tuning these models to align with human values or to perform specific tasks and behaviors remains a critical challenge. Deep Reinforcement Learning (DRL) and Direct Preference Optimization (DPO) offer promising frameworks for this alignment process. DRL enables models to optimize actions using reward signals rather than relying solely on supervised preference data, while DPO aligns the policy directly with preferences, eliminating the need for an explicit reward model. This overview explores paradigms for fine-tuning LVLMs, highlighting how DRL and DPO techniques can be used to align models with human preferences and values, improve task performance, and enable adaptive multimodal interaction. We categorize key approaches, examine sources of preference data and reward signals, and discuss open challenges such as scalability, sample efficiency, continual learning, generalization, and safety. The goal is to provide a clear understanding of how DRL and DPO contribute to the evolution of robust, human-aligned LVLMs.
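For contrast with DPO, the reward-signal optimization the abstract attributes to DRL is typically a clipped policy-gradient update in RLHF-style pipelines. The sketch below shows the standard PPO clipped surrogate as a generic illustration under that assumption; it is not the paper's specific algorithm.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped PPO surrogate used in RLHF-style DRL fine-tuning.

    logp_new / logp_old are per-sample log-probs under the current and
    rollout policies; advantages come from a learned reward/value signal.
    """
    ratio = torch.exp(logp_new - logp_old)               # importance weight
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    # Negate because optimizers minimize; clipping bounds the policy update.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```

Where DPO folds the reward into a log-ratio, this objective still needs an explicit reward or learned reward model to produce the advantages, which is precisely the component DPO removes.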
Problem

Research questions and friction points this paper is trying to address.

Aligning Large Vision-Language Models with human values
Improving task performance through Deep Reinforcement Learning
Enabling adaptive multimodal interaction via Direct Preference Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deep Reinforcement Learning for reward-based alignment
Direct Preference Optimization without reward model
Fine-tuning LVLMs with human preference data (a data-format sketch follows this list)
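As a concrete picture of what "human preference data" looks like in the multimodal setting, the record below pairs one image and prompt with a preferred and a rejected response. The schema is hypothetical; the paper does not specify a data format.

```python
from dataclasses import dataclass

@dataclass
class MultimodalPreferencePair:
    """One human-preference record for LVLM alignment (illustrative schema)."""
    image_path: str  # visual context shown to the annotator
    prompt: str      # textual instruction or question about the image
    chosen: str      # response the annotator preferred
    rejected: str    # response the annotator rejected

# Example record: both responses answer the same (image, prompt) pair.
pair = MultimodalPreferencePair(
    image_path="examples/chart.png",
    prompt="Describe the trend shown in this chart.",
    chosen="Revenue rises steadily from Q1 to Q4.",
    rejected="The chart cannot be interpreted.",
)
```

Scoring the chosen and rejected responses with the policy and reference models yields exactly the per-pair log-probability tensors that the DPO loss shown earlier consumes.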