Visual Instruction Bottleneck Tuning

📅 2025-05-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Multimodal large language models (MLLMs) suffer degraded robustness to out-of-distribution (OOD) queries under distribution shifts. Method: This paper proposes Visual Instruction Bottleneck Tuning (Vittle), which incorporates the information bottleneck (IB) principle into MLLM robustness optimization. Grounded in variational inference, Vittle derives a variational lower bound of the IB for MLLMs and is theoretically justified via its connection to an information-theoretic robustness metric, requiring neither additional annotations nor model scaling. Contribution/Results: Evaluated across 45 datasets spanning 30 distribution shift scenarios, including open-ended and closed-form QA and object hallucination detection, Vittle consistently improves cross-distribution robustness for three MLLM architectures by pursuing minimal sufficient representation learning, offering a lightweight, interpretable path to more robust MLLMs.

📝 Abstract
Despite widespread adoption, multimodal large language models (MLLMs) suffer performance degradation when encountering unfamiliar queries under distribution shifts. Existing methods to improve MLLM generalization typically require either more instruction data or larger advanced model architectures, both of which incur non-trivial human labor or computational costs. In this work, we take an alternative approach to enhance the robustness of MLLMs under distribution shifts, from a representation learning perspective. Inspired by the information bottleneck (IB) principle, we derive a variational lower bound of the IB for MLLMs and devise a practical implementation, Visual Instruction Bottleneck Tuning (Vittle). We then provide a theoretical justification of Vittle by revealing its connection to an information-theoretic robustness metric of MLLM. Empirical validation of three MLLMs on open-ended and closed-form question answering and object hallucination detection tasks over 45 datasets, including 30 shift scenarios, demonstrates that Vittle consistently improves the MLLM's robustness under shifts by pursuing the learning of a minimal sufficient representation.
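The IB principle mentioned above trades off compressing the input representation against keeping it predictive of the target. A standard variational formulation of this trade-off (generic IB notation, not necessarily the paper's own symbols; the bound holds up to an additive constant) is:

```latex
\min_{\theta}\; \underbrace{-\,I(Z;Y)}_{\text{sufficiency}} \;+\; \beta\, \underbrace{I(Z;X)}_{\text{compression}}
\;\le\; \mathbb{E}\big[-\log q_\theta(y \mid z)\big] \;+\; \beta\,\mathbb{E}\big[\mathrm{KL}\big(p_\theta(z \mid x)\,\|\,r(z)\big)\big]
```

Here $Z$ is the learned representation of the (visual) instruction input $X$, $Y$ the answer, $q_\theta$ a variational decoder, and $r(z)$ a variational prior; minimizing the right-hand side encourages a minimal (compressed) yet sufficient (predictive) representation.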
Problem

Research questions and friction points this paper is trying to address.

Enhancing MLLM robustness under distribution shifts
Reducing the need for extra instruction data or larger model architectures
Improving OOD performance via minimal sufficient representation learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Instruction Bottleneck Tuning (Vittle)
Minimal sufficient representation learning
Information bottleneck principle application
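In practice, the compression side of such an IB objective is often implemented as a KL penalty on a stochastic (e.g. Gaussian) latent representation added to the usual task loss. The following NumPy sketch illustrates that generic VIB-style loss; the function names, Gaussian posterior, and β value are illustrative assumptions, not Vittle's actual implementation:

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over feature dims."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

def vib_loss(task_nll, mu, logvar, beta=1e-3):
    """IB-style objective: predictive sufficiency (task NLL) plus
    beta-weighted compression (KL of the latent posterior to a unit prior)."""
    return task_nll + beta * float(np.mean(gaussian_kl(mu, logvar)))

# Illustration: a standard-normal posterior incurs zero compression cost,
# so the objective reduces to the task loss alone.
mu, logvar = np.zeros((4, 8)), np.zeros((4, 8))
print(vib_loss(2.5, mu, logvar))  # → 2.5
```

Raising β pushes the posterior toward the prior (more compression, less input-specific detail), which is the lever the IB view uses to discard shift-sensitive nuisance information.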