Diffusion Instruction Tuning

📅 2025-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high fine-tuning cost, large data requirements, and poor generalization of advanced vision-language models (VLMs) on in-distribution and out-of-distribution tasks, this paper proposes Lavender, a lightweight supervised fine-tuning method. Its core innovation is the direct injection of Stable Diffusion's visual priors into the cross-modal attention layers of VLMs via an attention-alignment mechanism, enabling joint text-visual representation distillation. Lavender departs from conventional multi-stage adaptation paradigms, requiring only 0.13M samples and one day of training on eight GPUs. It achieves up to a 30% average performance gain on open-source VLMs such as Llama-3.2-11B, and a remarkable 68% improvement on challenging out-of-distribution medical question answering, significantly reducing computational and data dependencies while enhancing cross-distribution visual understanding.

📝 Abstract
We introduce Lavender, a simple supervised fine-tuning (SFT) method that boosts the performance of advanced vision-language models (VLMs) by leveraging state-of-the-art image generation models such as Stable Diffusion. Specifically, Lavender aligns the text-vision attention in the VLM transformer with the equivalent used by Stable Diffusion during SFT, instead of adapting separate encoders. This alignment enriches the model's visual understanding and significantly boosts performance across in- and out-of-distribution tasks. Lavender requires just 0.13 million training examples, 2.5% of typical large-scale SFT datasets, and fine-tunes on standard hardware (8 GPUs) in a single day. It consistently improves state-of-the-art open-source multimodal LLMs (e.g., Llama-3.2-11B, MiniCPM-Llama3-v2.5), achieving up to 30% gains and a 68% boost on challenging out-of-distribution medical QA tasks. By efficiently transferring the visual expertise of image generators with minimal supervision, Lavender offers a scalable solution for more accurate vision-language systems. All code, training data, and models will be shared at https://astrazeneca.github.io/vlm/.
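The abstract describes aligning the VLM's text-vision attention with the equivalent cross-attention used by Stable Diffusion during SFT. A minimal sketch of what such an alignment loss could look like, assuming both models expose per-text-token attention maps over image locations as tensors; the function name, the bilinear resizing, and the MSE matching are illustrative assumptions, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def attention_alignment_loss(vlm_attn: torch.Tensor,
                             sd_attn: torch.Tensor) -> torch.Tensor:
    """Hypothetical alignment loss between a VLM's text-vision attention
    and a Stable Diffusion cross-attention map for the same image-text pair.

    vlm_attn: (batch, text_tokens, image_patches) attention from the VLM
    sd_attn:  (batch, text_tokens, latent_pixels) cross-attention from SD

    Both flat spatial dimensions are assumed to be square grids; the SD map
    is resampled to the VLM's grid, both are normalized to distributions
    over image locations, and the two are matched with an MSE loss.
    """
    b, t, p = vlm_attn.shape
    side = int(p ** 0.5)                    # VLM patch grid, e.g. 4x4
    sd_side = int(sd_attn.shape[-1] ** 0.5) # SD latent grid, e.g. 8x8

    # Reshape the flat SD map into a 2D grid and resample to the VLM grid
    sd_maps = sd_attn.reshape(b * t, 1, sd_side, sd_side)
    sd_maps = F.interpolate(sd_maps, size=(side, side),
                            mode="bilinear", align_corners=False)
    sd_maps = sd_maps.reshape(b, t, p)

    # Normalize each map to a probability distribution over locations
    vlm_dist = vlm_attn / vlm_attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    sd_dist = sd_maps / sd_maps.sum(dim=-1, keepdim=True).clamp_min(1e-8)

    return F.mse_loss(vlm_dist, sd_dist)
```

In a sketch like this, the loss would be added to the standard SFT objective so that the VLM's attention is nudged toward the diffusion model's visual grounding without replacing its encoders.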
Problem

Research questions and friction points this paper is trying to address.

Enhance vision-language model performance
Align text-vision attention with image generation
Improve accuracy in out-of-distribution tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages Stable Diffusion models
Aligns text-vision attention mechanisms
Requires minimal training examples
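The innovations above imply that alignment happens alongside ordinary supervised fine-tuning rather than in a separate stage. A hypothetical sketch of the combined objective, where a precomputed language-modeling loss is augmented with a precomputed attention-alignment term; the weighting `lam` is an illustrative assumption, not a value from the paper:

```python
import torch

def lavender_sft_loss(lm_loss: torch.Tensor,
                      align_loss: torch.Tensor,
                      lam: float = 0.1) -> torch.Tensor:
    """Combine the standard SFT next-token loss with an attention-alignment
    term. Both inputs are scalar losses already computed for the batch;
    lam is a hypothetical trade-off weight."""
    return lm_loss + lam * align_loss
```

Because the alignment term enters as a weighted auxiliary loss, the method can reuse a standard SFT training loop unchanged, which is consistent with the minimal data and compute budget reported above.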