LLaVA Steering: Visual Instruction Tuning with 500x Fewer Parameters through Modality Linear Representation-Steering

📅 2024-12-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the modality imbalance between vision and language in multimodal large language models (MLLMs), which hampers visual instruction tuning efficiency and inflates parameter requirements, this paper proposes Modality Linear Representation-Steering (MoReS). MoReS calibrates cross-modal representations via layer-wise linear transformations within the visual subspace, enabling dynamic alignment of the vision and language modalities. The work is the first to empirically reveal and quantify this intrinsic modality imbalance in MLLMs, and MoReS exploits the finding to re-balance the modalities with minimal trainable parameters. Across three visual understanding benchmarks and eight VQA tasks, MoReS matches LoRA's performance while using only about 0.2% of its trainable parameters, a roughly 500× reduction with no accuracy degradation. To foster reproducibility and customization, the authors open-source the LLaVA Steering Factory, a platform supporting modality diagnostics and modular steering component design.

📝 Abstract
Multimodal Large Language Models (MLLMs) have significantly advanced visual tasks by integrating visual representations into large language models (LLMs). The textual modality, inherited from LLMs, equips MLLMs with abilities like instruction following and in-context learning. In contrast, the visual modality enhances performance in downstream tasks by leveraging rich semantic content, spatial information, and grounding capabilities. These intrinsic modalities work synergistically across various visual tasks. Our research initially reveals a persistent imbalance between these modalities, with text often dominating output generation during visual instruction tuning. This imbalance occurs when using both full fine-tuning and parameter-efficient fine-tuning (PEFT) methods. We then found that re-balancing these modalities can significantly reduce the number of trainable parameters required, inspiring a direction for further optimizing visual instruction tuning. We introduce Modality Linear Representation-Steering (MoReS) to achieve the goal. MoReS effectively re-balances the intrinsic modalities throughout the model, where the key idea is to steer visual representations through linear transformations in the visual subspace across each model layer. To validate our solution, we composed LLaVA Steering, a suite of models integrated with the proposed MoReS method. Evaluation results show that the composed LLaVA Steering models require, on average, 500 times fewer trainable parameters than LoRA needs while still achieving comparable performance across three visual benchmarks and eight visual question-answering tasks. Last, we present the LLaVA Steering Factory, an in-house developed platform that enables researchers to quickly customize various MLLMs with component-based architecture for seamlessly integrating state-of-the-art models, and evaluate their intrinsic modality imbalance.
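The core mechanism described above — steering visual representations through a linear transformation in a low-dimensional visual subspace at each layer, while leaving text tokens untouched — can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: the function name `steer_visual_tokens`, the subspace basis `P`, the steering map `W`, and all dimensions are assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

hidden_dim, subspace_dim = 64, 4  # illustrative sizes, not the paper's

# Fixed low-rank basis spanning a hypothetical "visual subspace".
P = rng.standard_normal((subspace_dim, hidden_dim)) / np.sqrt(hidden_dim)
# Small trainable linear map acting inside that subspace: only
# subspace_dim**2 parameters per layer, versus LoRA's 2*r*hidden_dim
# per adapted weight matrix -- the source of the parameter savings.
W = rng.standard_normal((subspace_dim, subspace_dim)) * 0.01

def steer_visual_tokens(h, visual_mask):
    """Add a linear steering update to the visual-token rows of h only."""
    out = h.copy()
    coords = h[visual_mask] @ P.T                       # project into subspace
    out[visual_mask] = h[visual_mask] + coords @ W @ P  # steer, project back
    return out

seq_len = 10
h = rng.standard_normal((seq_len, hidden_dim))
visual_mask = np.zeros(seq_len, dtype=bool)
visual_mask[:6] = True  # pretend the first six tokens are visual

h_steered = steer_visual_tokens(h, visual_mask)
# Text-token rows pass through unchanged; visual rows are re-steered.
assert np.allclose(h_steered[~visual_mask], h[~visual_mask])
```

In a real MLLM one such steering module would be applied per layer, with `W` as the only trainable parameter, which is consistent with the ~500× reduction relative to LoRA reported above.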
Problem

Research questions and friction points this paper is trying to address.

Multimodal Language Models
Visual Understanding
Parameter Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Fine-Tuning
Parameter Reduction
LLaVA Steering Factory