LaViDa: A Large Diffusion Language Model for Multimodal Understanding

📅 2025-05-22

📈 Citations: 0

✨ Influential: 0

career value

218K/year

🤖 AI Summary

To address the slow inference speed and poor formatting controllability of autoregressive vision-language models (e.g., LLaVA), this work proposes the first discrete diffusion-based multimodal large language model. Methodologically, it integrates a ViT visual encoder with a discrete diffusion backbone and introduces three key innovations: complementary masked training, prefix KV-caching for efficient inference, and timestep-offset sampling. This constitutes the first systematic transfer of diffusion models’ parallel decoding, bidirectional context modeling, and strong structural constraint capabilities to multimodal understanding and generation tasks. Experiments demonstrate competitive or superior performance against autoregressive baselines on benchmarks including MMMU; achieve a +4.1 CIDEr gain on COCO Captioning with 1.92× inference acceleration; and improve constrained poetry completion accuracy by 59%.

Technology Category

Application Category

📝 Abstract

Modern Vision-Language Models (VLMs) can solve a wide range of tasks requiring visual reasoning. In real-world scenarios, desirable properties for VLMs include fast inference and controllable generation (e.g., constraining outputs to adhere to a desired format). However, existing autoregressive (AR) VLMs like LLaVA struggle in these aspects. Discrete diffusion models (DMs) offer a promising alternative, enabling parallel decoding for faster inference and bidirectional context for controllable generation through text-infilling. While effective in language-only settings, DMs' potential for multimodal tasks is underexplored. We introduce LaViDa, a family of VLMs built on DMs. We build LaViDa by equipping DMs with a vision encoder and jointly fine-tune the combined parts for multimodal instruction following. To address challenges encountered, LaViDa incorporates novel techniques such as complementary masking for effective training, prefix KV cache for efficient inference, and timestep shifting for high-quality sampling. Experiments show that LaViDa achieves competitive or superior performance to AR VLMs on multi-modal benchmarks such as MMMU, while offering unique advantages of DMs, including flexible speed-quality tradeoff, controllability, and bidirectional reasoning. On COCO captioning, LaViDa surpasses Open-LLaVa-Next-8B by +4.1 CIDEr with 1.92x speedup. On bidirectional tasks, it achieves +59% improvement on Constrained Poem Completion. These results demonstrate LaViDa as a strong alternative to AR VLMs. Code and models will be released in the camera-ready version.

Problem

Research questions and friction points this paper is trying to address.

Enhancing multimodal understanding with faster inference in VLMs

Improving controllable generation in vision-language models

Exploring discrete diffusion models for multimodal tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Discrete diffusion models for multimodal understanding

Complementary masking for effective training

Prefix KV cache for efficient inference

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs