🤖 AI Summary
Current multimodal large language models (MLLMs) predominantly rely on autoregressive architectures, which constrains their flexibility and scalability. This work introduces LLaDA-V, a purely diffusion-based MLLM that addresses this limitation. Methodologically, it extends the LLaDA language diffusion framework by integrating a ViT visual encoder and an MLP cross-modal projector; visual features are mapped into the language embedding space and the model is aligned via visual instruction tuning, enabling end-to-end multimodal understanding. Key findings include: (i) a language diffusion model can serve as a competitive MLLM backbone even though its base language model is weaker on purely textual tasks than counterparts such as LLaMA3-8B and Qwen2-7B; (ii) under identical instruction-tuning data, LLaDA-V is highly competitive with LLaMA3-V while exhibiting better data scalability, and it substantially narrows the gap to Qwen2-VL; and (iii) it achieves state-of-the-art multimodal understanding among existing hybrid autoregressive-diffusion and purely diffusion-based MLLMs.
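The projector described above can be illustrated with a minimal sketch. This is not the authors' code: the two-layer MLP shape, the GELU activation, and the dimensions (ViT width 1024, language embedding width 4096, 576 patches) are all illustrative assumptions.

```python
# Sketch of an MLP cross-modal projector: map ViT patch features
# into the language model's embedding space so they can be fed to
# the diffusion LM alongside text token embeddings.
import numpy as np

rng = np.random.default_rng(0)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp_projector(patch_feats, w1, b1, w2, b2):
    """Project [num_patches, vit_dim] features to [num_patches, lm_dim]."""
    return gelu(patch_feats @ w1 + b1) @ w2 + b2

vit_dim, lm_dim, n_patches = 1024, 4096, 576  # assumed sizes
w1 = rng.standard_normal((vit_dim, lm_dim)) * 0.02
b1 = np.zeros(lm_dim)
w2 = rng.standard_normal((lm_dim, lm_dim)) * 0.02
b2 = np.zeros(lm_dim)

patches = rng.standard_normal((n_patches, vit_dim))       # ViT output
visual_tokens = mlp_projector(patches, w1, b1, w2, b2)    # LM-space tokens
print(visual_tokens.shape)  # (576, 4096)
```

The projected visual tokens are then concatenated with text embeddings, giving the language diffusion model a single multimodal input sequence.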
📝 Abstract
In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models, representing a departure from the autoregressive paradigms dominant in current multimodal approaches. Built upon LLaDA, a representative large language diffusion model, LLaDA-V incorporates a vision encoder and an MLP connector that projects visual features into the language embedding space, enabling effective multimodal alignment. Our empirical investigation reveals several intriguing results. First, LLaDA-V demonstrates promising multimodal performance despite its language model being weaker on purely textual tasks than counterparts like LLaMA3-8B and Qwen2-7B. When trained on the same instruction data, LLaDA-V is highly competitive with LLaMA3-V across multimodal tasks, with better data scalability. It also narrows the performance gap to Qwen2-VL, suggesting the effectiveness of its architecture for multimodal tasks. Second, LLaDA-V achieves state-of-the-art performance in multimodal understanding compared to existing hybrid autoregressive-diffusion and purely diffusion-based MLLMs. Our findings suggest that large language diffusion models show promise in multimodal contexts and warrant further investigation in future research. Project page and code: https://ml-gsai.github.io/LLaDA-V-demo/.