Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement

📅 2024-05-24
🏛️ arXiv.org
📈 Citations: 20
Influential: 3
🤖 AI Summary
Current large vision-language models (LVLMs) suffer from inadequate cross-modal alignment, and prevailing approaches rely on external models or datasets, leading to uncontrolled and unstable alignment. This work proposes SIMA, a self-improvement framework free of external dependencies that uses only existing instruction-tuning data to self-generate responses and construct preference pairs. Its core innovation is an in-context self-critique mechanism built on three novel visual criteria, which lets the LVLM itself serve as the critic without additional fine-tuning or external models. SIMA further unifies self-supervised preference learning with these multi-dimensional visual metrics. Experiments show that SIMA significantly outperforms state-of-the-art methods across 14 hallucination and comprehensive benchmarks, with substantial gains in both cross-modal alignment and generalization.

📝 Abstract
Large vision-language models (LVLMs) have achieved impressive results in visual question-answering and reasoning tasks through vision instruction tuning on specific datasets. However, there remains significant room for improvement in aligning visual and language modalities. Existing methods often depend on external models or data, leading to uncontrollable and unstable alignment results. In this paper, we propose SIMA, a self-improvement framework that enhances visual and language modality alignment without external dependencies. SIMA leverages existing vision instruction tuning datasets to self-generate responses, incorporating an in-context self-critic mechanism that constructs preference pairs for tuning. Crucially, our approach allows LVLMs to act as critics by designing effective critic prompts, eliminating the need for additional fine-tuning with external instruction data. We introduce three novel visual metrics within the self-critic process to guide judgment, significantly improving the accuracy of the self-critic. Through extensive experiments across 14 hallucination and comprehensive benchmarks, we demonstrate that SIMA significantly improves LVLM performance and outperforms previous approaches, achieving superior modality alignment.
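The loop the abstract describes (self-generate responses, self-critique them against visual criteria, keep the best/worst as a preference pair) can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: the model and critic calls are deterministic stand-ins, and the criterion names are hypothetical placeholders for the paper's three visual metrics.

```python
# Hypothetical placeholders for SIMA's three visual criteria.
CRITERIA = ("object_accuracy", "relation_accuracy", "attribute_accuracy")

def toy_generate(prompt: str, seed: int) -> str:
    # Stand-in for LVLM decoding with different sampling seeds.
    return f"response-{seed} to: {prompt}"

def toy_critique(prompt: str, response: str, criterion: str) -> int:
    # Stand-in for the in-context self-critic: the real system prompts
    # the same LVLM (with a critic prompt) to judge each response
    # against each visual criterion.
    return (len(response) + len(criterion)) % 5  # deterministic toy score

def build_preference_pair(prompt: str, n: int = 4) -> dict:
    # 1) Self-generate n candidate responses for one instruction.
    candidates = [toy_generate(prompt, s) for s in range(n)]
    # 2) Rank candidates by their total score across all criteria.
    ranked = sorted(
        candidates,
        key=lambda r: sum(toy_critique(prompt, r, c) for c in CRITERIA),
        reverse=True,
    )
    # 3) Best-scored response becomes "chosen", worst "rejected";
    #    such pairs can then feed a preference-tuning objective.
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}

pair = build_preference_pair("Describe the image.")
```

The key property the sketch preserves is that no external model or dataset is involved: the same model both generates and judges, and the instruction-tuning prompts are the only inputs.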
Problem

Research questions and friction points this paper is trying to address.

Enhance visual-language modality alignment
Eliminate external model dependencies
Improve self-critic accuracy in LVLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-improvement framework SIMA
In-context self-critic mechanism
Three novel visual metrics