Do we Really Need Visual Instructions? Towards Visual Instruction-Free Fine-tuning for Large Vision-Language Models

📅 2025-02-17

📈 Citations: 0

✨ Influential: 0

career value

196K/year

🤖 AI Summary

Existing large vision-language models (LVLMs) rely on image-based visual instruction tuning, hindering seamless transfer of task-solving capabilities from their underlying large language models (LLMs) and necessitating costly construction of large-scale multimodal datasets. This work proposes ViFT, the first framework enabling **image-free visual instruction tuning**: it decouples task-solving competence—learned solely from text instructions—and visual perception—acquired via image caption modeling—through cross-modal representation disentanglement and dynamic fusion. By jointly optimizing these complementary capabilities, ViFT bridges the capability transfer gap between LVLMs and their backbone LLMs while drastically reducing reliance on high-quality multimodal data. Empirical evaluation on visual reasoning and visual instruction-following benchmarks demonstrates that ViFT achieves state-of-the-art performance with significantly less training data compared to conventional approaches.

Technology Category

Application Category

📝 Abstract

Visual instruction tuning has become the predominant technology in eliciting the multimodal task-solving capabilities of large vision-language models (LVLMs). Despite the success, as visual instructions require images as the input, it would leave the gap in inheriting the task-solving capabilities from the backbone LLMs, and make it costly to collect a large-scale dataset. To address it, we propose ViFT, a visual instruction-free fine-tuning framework for LVLMs. In ViFT, we only require the text-only instructions and image caption data during training, to separately learn the task-solving and visual perception abilities. During inference, we extract and combine the representations of the text and image inputs, for fusing the two abilities to fulfill multimodal tasks. Experimental results demonstrate that ViFT can achieve state-of-the-art performance on several visual reasoning and visual instruction following benchmarks, with rather less training data. Our code and data will be publicly released.

Problem

Research questions and friction points this paper is trying to address.

Visual instruction-free fine-tuning for LVLMs

Reducing training data requirement

Enhancing multimodal task-solving capabilities

Innovation

Methods, ideas, or system contributions that make the work stand out.

visual instruction-free fine-tuning

text-only instructions training

separate task-visual learning fusion

🔎 Similar Papers

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling