🤖 AI Summary
Existing medical generative models are constrained by unimodal architectures, hindering effective integration of heterogeneous multimodal data—including medical imaging, histopathology, and clinical text—and impeding the development of general-purpose biomedical foundation models. To address this, we propose MeDiM, the first discrete diffusion model for unified multimodal medical generation. MeDiM adopts a multimodal large language model (MLLM) backbone—eliminating modality-specific components—and enables bidirectional co-generation across imaging, pathology, and text within a shared discrete probabilistic space via non-causal attention masking and continuous timestep embeddings. Evaluated on MIMIC-CXR and PathGen, MeDiM achieves FID scores of 16.60 and 24.19, respectively. In joint image-report generation, it improves BLEU-3 by 31.58% and METEOR by 4.80%, demonstrating significantly enhanced cross-modal reasoning capability.
📝 Abstract
Recent advances in generative medical models are constrained by modality-specific scenarios that hinder the integration of complementary evidence from imaging, pathology, and clinical notes. This fragmentation limits their evolution into foundation models that can learn and reason across the full spectrum of biomedical data. We propose MeDiM, the first medical discrete diffusion model that learns shared distributions across modalities without modality-specific components. MeDiM unifies multiple generative tasks: translating between images and text, and jointly producing image-report pairs across domains in response to prompts. Built on a discrete diffusion framework, MeDiM bridges vision and language representations through a shared probabilistic space. To enable unified and flexible medical generation, we employ a multimodal large language model (MLLM) as the diffusion backbone, leveraging its prior knowledge and cross-modal reasoning. Two key designs are introduced: (1) removing the causal attention mask for bidirectional context, and (2) injecting continuous timestep embeddings for diffusion awareness. Experiments demonstrate high-fidelity medical generation (FID 16.60 on MIMIC-CXR and FID 24.19 on PathGen) and accurate report generation (METEOR 0.2650 and 0.2580). Jointly generated image-report pairs further enhance downstream performance (+6.43% BLEU-1, +18.57% BLEU-2, +31.58% BLEU-3, +4.80% METEOR), showing that MeDiM supports coherent and clinically grounded multimodal outputs.
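The two key designs can be illustrated in a minimal sketch. This is a hypothetical NumPy toy, not the authors' implementation: `attention_mask` contrasts the causal mask of a standard LLM with the all-ones bidirectional mask that diffusion denoising requires (every token attends to every other), and `timestep_embedding` shows one common sinusoidal parameterization of a continuous timestep in [0, 1]; the exact embedding MeDiM injects is not specified here, so this form is an assumption.

```python
import numpy as np

def attention_mask(seq_len: int, causal: bool) -> np.ndarray:
    """Return a (seq_len, seq_len) attention mask of 0/1 floats.

    causal=True  -> lower-triangular mask (standard autoregressive LLM).
    causal=False -> all-ones mask: bidirectional context, as required
                    when the backbone denoises all positions jointly.
    """
    if causal:
        return np.tril(np.ones((seq_len, seq_len)))
    return np.ones((seq_len, seq_len))

def timestep_embedding(t: float, dim: int) -> np.ndarray:
    """Sinusoidal embedding of a continuous diffusion timestep t in [0, 1].

    Hypothetical parameterization (Transformer-style frequencies); the
    paper only states that continuous timestep embeddings are injected.
    """
    half = dim // 2
    # Geometrically spaced frequencies, as in sinusoidal positional encodings.
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])
```

In this sketch, the bidirectional mask lets an image token condition on report tokens (and vice versa) during denoising, while the timestep embedding tells the backbone how noisy the current sequence is.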