🤖 AI Summary
Existing medical generative models are constrained by unimodal architectures, hindering effective integration of heterogeneous multimodal data—including medical imaging, histopathology, and clinical text—and impeding the development of general-purpose biomedical foundation models. To address this, we propose MeDiM, the first discrete diffusion model for unified multimodal medical generation. MeDiM adopts a multimodal large language model (MLLM) backbone—eliminating modality-specific components—and enables bidirectional co-generation across imaging, pathology, and text within a shared discrete probabilistic space via non-causal attention masking and continuous timestep embeddings. Evaluated on MIMIC-CXR and PathGen, MeDiM achieves FID scores of 16.60 and 24.19, respectively. In joint image-report generation, it improves BLEU-3 by 31.58% and METEOR by 4.80%, demonstrating significantly enhanced cross-modal reasoning capability.
📝 Abstract
Recent advances in generative medical models are constrained by modality-specific scenarios that hinder the integration of complementary evidence from imaging, pathology, and clinical notes. This fragmentation limits their evolution into foundation models that can learn and reason across the full spectrum of biomedical data. We propose MeDiM, the first medical discrete diffusion model that learns shared distributions across modalities without modality-specific components. MeDiM unifies multiple generative tasks: translating between images and text, and jointly producing image-report pairs across domains in response to prompts. Built on a discrete diffusion framework, MeDiM bridges vision and language representations through a shared probabilistic space. To enable unified and flexible medical generation, we employ a multimodal large language model (MLLM) as the diffusion backbone, leveraging its prior knowledge and cross-modal reasoning. Two key designs are introduced: (1) removing the causal attention mask for bidirectional context, and (2) injecting continuous timestep embeddings for diffusion awareness. Experiments demonstrate high-fidelity medical generation (FID 16.60 on MIMIC-CXR and FID 24.19 on PathGen) and accurate report generation (METEOR 0.2650 and 0.2580). Jointly generated image-report pairs further enhance downstream performance (+6.43% BLEU-1, +18.57% BLEU-2, +31.58% BLEU-3, +4.80% METEOR), showing that MeDiM supports coherent and clinically grounded multimodal outputs.
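The two key designs can be illustrated in a minimal sketch. This is a hypothetical NumPy toy, not the authors' implementation: `attention_mask` contrasts the causal mask of a standard LLM with the all-ones bidirectional mask that diffusion denoising requires (every token attends to every other), and `timestep_embedding` shows one common sinusoidal parameterization of a continuous timestep in [0, 1]; the exact embedding MeDiM injects is not specified here, so this form is an assumption.

```python
import numpy as np

def attention_mask(seq_len: int, causal: bool) -> np.ndarray:
    """Return a (seq_len, seq_len) attention mask of 0/1 floats.

    causal=True  -> lower-triangular mask (standard autoregressive LLM).
    causal=False -> all-ones mask: bidirectional context, as required
                    when the backbone denoises all positions jointly.
    """
    if causal:
        return np.tril(np.ones((seq_len, seq_len)))
    return np.ones((seq_len, seq_len))

def timestep_embedding(t: float, dim: int) -> np.ndarray:
    """Sinusoidal embedding of a continuous diffusion timestep t in [0, 1].

    Hypothetical parameterization (Transformer-style frequencies); the
    paper only states that continuous timestep embeddings are injected.
    """
    half = dim // 2
    # Geometrically spaced frequencies, as in sinusoidal positional encodings.
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])
```

In this sketch, the bidirectional mask lets an image token condition on report tokens (and vice versa) during denoising, while the timestep embedding tells the backbone how noisy the current sequence is.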