🤖 AI Summary
Video Large Multi-modal Models (VLMMs) suffer from vision-language misalignment: during iterative preference optimization, the self-judge model over-relies on linguistic knowledge while neglecting visual content, yielding verbose, hallucinated, and visually unfaithful responses.
Method: We propose Iterative Self-Retrospective Direct Preference Optimization (ISR-DPO), which augments iterative DPO for video multimodal settings with a self-retrospection step during preference modeling. Self-retrospection steers the self-judge toward informative video regions, producing more visually grounded preferences and mitigating the length bias of the self-rewarding cycle.
Contribution/Results: ISR-DPO significantly outperforms state-of-the-art methods across diverse video question-answering benchmarks. To foster reproducibility and further research, we publicly release our code, models, and datasets.
📝 Abstract
Iterative self-improvement, a concept extending beyond personal growth, has found powerful applications in machine learning, particularly in transforming weak models into strong ones. While recent advances in natural language processing have shown its efficacy through iterative preference optimization, applying this approach to Video Large Multi-modal Models (VLMMs) remains challenging due to modality misalignment. VLMMs struggle with this misalignment during iterative preference modeling, as the self-judge model often prioritizes linguistic knowledge over visual information. Additionally, iterative preference optimization can produce visually hallucinated, verbose responses due to length bias within the self-rewarding cycle. To address these issues, we propose Iterative Self-Retrospective Direct Preference Optimization (ISR-DPO), a method that uses self-retrospection to enhance preference modeling. This approach sharpens the self-judge's focus on informative video regions, resulting in more visually grounded preferences. In extensive empirical evaluations across diverse video question answering benchmarks, ISR-DPO significantly outperforms the state of the art. We are committed to open-sourcing our code, models, and datasets to encourage further investigation.
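The self-rewarding cycle described above builds on the standard Direct Preference Optimization objective (Rafailov et al., 2023): the policy is trained to prefer the self-judge's chosen response over the rejected one, regularized against a frozen reference model. Below is a minimal per-pair sketch of that base loss; the ISR-DPO-specific components (the self-retrospective judge and the video-grounded preference construction) are not shown, and the function name and signature are illustrative, not the paper's API.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for a single preference pair.

    logp_* are the summed token log-probabilities of the chosen/rejected
    response under the current policy; ref_logp_* are the same quantities
    under the frozen reference model. beta scales the implicit reward.
    """
    # Implicit rewards: log-ratio of policy to reference likelihood.
    chosen_reward = logp_chosen - ref_logp_chosen
    rejected_reward = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)): small when the policy already prefers
    # the chosen response relative to the reference model.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

In the iterative setting, each round's trained policy generates and judges new response pairs for the next round, which is where a length-biased judge can drift toward verbose outputs unless the preferences stay visually grounded.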