Multimodal synthesis of MRI and tabular data with diffusion in a joint latent space via cross-attention

📅 2026-05-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes what the authors describe as the first unified latent diffusion framework for jointly generating multimodal medical data, specifically volumetric MRI and heterogeneous clinical tabular data, with consistent anatomical and clinical attributes. The approach uses a variational autoencoder to fuse the two modalities into a shared latent space, performs diffusion-based synthesis in that space with cross-attention, and employs modality-specific decoders to reconstruct MRI and tabular outputs separately. Evaluated on a cohort of over 10,000 participants from the NAKO study, the generated MRI volumes exhibit anatomical plausibility and body composition consistent with the synthesized tabular attributes, while the synthetic tabular data outperform CTGAN and are comparable to TVAE on standard evaluation metrics. Fréchet distance and precision-recall metrics further support the fidelity of the generated images, and the authors present the work as a proof of concept toward medical digital twins.
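To make the cross-attention idea concrete, here is a minimal NumPy sketch in which MRI patch embeddings (queries) attend to embedded tabular features (keys and values). Everything in it is an assumption for illustration: the token counts, the embedding width `d`, the random projection matrices, and the residual fusion step are hypothetical, and the summary does not say where in the paper's architecture cross-attention is applied or how it is parameterized.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    queries: (n_q, d) tokens that ask for information (here: MRI patches).
    context: (n_c, d) tokens that provide it (here: tabular features).
    Returns the attended values (n_q, d) and the attention weights (n_q, n_c).
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


rng = np.random.default_rng(0)
d = 16                                    # hypothetical embedding width
mri_tokens = rng.normal(size=(64, d))     # hypothetical: 64 MRI patch embeddings
tab_tokens = rng.normal(size=(5, d))      # hypothetical: 5 embedded tabular features
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

attended, weights = cross_attention(mri_tokens, tab_tokens, w_q, w_k, w_v)
fused = mri_tokens + attended             # residual fusion (an assumption)
```

Each row of `weights` is a distribution over the tabular features, so every MRI token receives a convex combination of the projected tabular values. A trained model would learn the projection matrices rather than sample them.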

📝 Abstract

We propose a multimodal latent diffusion model that jointly synthesizes volumetric magnetic resonance imaging (MRI) and tabular clinical data within a shared latent space via cross-attention. This approach enables coherent joint representation learning of MRI and tabular modalities for generative modeling. Our model utilizes a variational autoencoder to fuse the two modalities before diffusion-based synthesis, allowing modality-appropriate reconstruction with separate decoders for MRI and tabular data. We evaluated the framework on data from the German National Cohort (NAKO Gesundheitsstudie), comprising over 10,000 participants with MRI scans and clinical tabular features such as age, sex, body measurements, and ethnicity. The generated MRI volumes exhibited anatomical plausibility and body composition consistent with the synthesized tabular attributes. Quantitative evaluation using Fréchet distance and precision-recall metrics confirmed high-fidelity image generation. In the tabular modality, our model outperformed CTGAN across standard evaluation metrics and achieved results comparable to TVAE, demonstrating competitive performance relative to established unimodal baselines. This work is, to our knowledge, the first to demonstrate the feasibility of jointly modeling MRI and mixed-type tabular data in a single latent diffusion framework, offering a proof-of-concept for generating coherent synthetic multimodal patient data and aligning with the broader goal of developing digital twins in healthcare.
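For readers unfamiliar with the Fréchet distance used in the image evaluation, the sketch below computes the standard closed form between two Gaussians fitted to feature sets: the squared difference of the means plus Tr(Σ_a + Σ_b − 2(Σ_a Σ_b)^{1/2}). It is an illustration under stated assumptions, not the authors' evaluation code: the abstract does not specify the feature extractor or the exact variant used, and the random arrays here merely stand in for features of real and generated volumes.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (n_samples, n_features).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) equals the sum of square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs; clip small negative round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = ((mu_a - mu_b) ** 2).sum()
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 8))   # placeholder for real-image features
shifted_feats = real_feats + 3.0         # same covariance, mean shifted by 3 per dimension

fd_same = frechet_distance(real_feats, real_feats)
fd_shift = frechet_distance(real_feats, shifted_feats)
```

Identical feature sets give a distance of approximately zero. A pure mean shift leaves the covariance term at zero, so the distance reduces to the squared shift summed over dimensions.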

Problem

Research questions and friction points this paper is trying to address.

multimodal synthesis

MRI

tabular data

latent space

digital twins

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal diffusion

joint latent space

cross-attention