Med-Art: Diffusion Transformer for 2D Medical Text-to-Image Generation

📅 2025-06-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the core challenge of scarce training data—limited textual descriptions and small-scale image samples—in medical text-to-image generation, this paper proposes a high-fidelity 2D medical image synthesis framework tailored for low-data regimes. Methodologically, it (1) introduces a hybrid-level diffusion fine-tuning strategy jointly optimizing pixel-space and latent-space reconstruction; (2) leverages a vision-language model (VLM) to automatically generate high-quality, clinically relevant image captions, alleviating the annotation bottleneck; and (3) adapts the pre-trained PixArt-α diffusion transformer to the medical domain via architectural and semantic alignment. Evaluated on two public medical imaging benchmarks, our approach outperforms state-of-the-art methods, achieving 12.6%–28.4% improvements in FID and KID scores. Moreover, generated images significantly enhance downstream classification accuracy, demonstrating strong clinical relevance and practical utility.

📝 Abstract
Text-to-image generative models have achieved remarkable breakthroughs in recent years. However, their application to medical image generation still faces significant challenges, including small dataset sizes and the scarcity of medical textual data. To address these challenges, we propose Med-Art, a framework specifically designed for medical image generation with limited data. Med-Art leverages vision-language models to generate visual descriptions of medical images, overcoming the scarcity of applicable medical textual data. Med-Art adapts a large-scale pre-trained text-to-image model, PixArt-$α$, based on the Diffusion Transformer (DiT), achieving high performance under limited data. Furthermore, we propose an innovative Hybrid-Level Diffusion Fine-tuning (HLDF) method, which incorporates pixel-level losses, effectively addressing issues such as overly saturated colors. We achieve state-of-the-art performance on two medical image datasets, measured by FID, KID, and downstream classification performance.
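The abstract's Hybrid-Level Diffusion Fine-tuning (HLDF) idea — combining the usual latent-space noise-prediction objective with a pixel-level loss on decoded images — can be sketched as follows. This is a minimal NumPy illustration of the general pattern, not the paper's implementation: the function name, the `lam` weighting, and the `decode` callable are assumptions, and the clean-latent estimate uses the standard DDPM relation z0 ≈ (z_t − √(1−ᾱ_t)·ε̂) / √ᾱ_t.

```python
import numpy as np

def hybrid_level_loss(eps_pred, eps_true, z_noisy, alpha_bar_t,
                      decode, x_target, lam=0.1):
    """Sketch of a hybrid latent + pixel diffusion loss (names hypothetical).

    eps_pred / eps_true : predicted vs. actual noise in latent space
    z_noisy             : noisy latent z_t at timestep t
    alpha_bar_t         : cumulative noise-schedule product at t
    decode              : callable mapping a latent back to pixel space
    x_target            : ground-truth image in pixel space
    lam                 : weight on the pixel-level term (assumed)
    """
    # latent-space term: standard noise-prediction MSE used by latent diffusion
    latent_loss = np.mean((eps_pred - eps_true) ** 2)

    # estimate the clean latent z0 from z_t and the predicted noise
    z0_hat = (z_noisy - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

    # pixel-space term: decode the estimate and compare to the target image;
    # this is the kind of term that can penalize e.g. over-saturated colors
    pixel_loss = np.mean((decode(z0_hat) - x_target) ** 2)

    return latent_loss + lam * pixel_loss
```

With a perfect noise prediction and an identity decoder the loss collapses to zero, which makes the two terms easy to sanity-check in isolation.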
Problem

Research questions and friction points this paper is trying to address.

Addresses medical image generation with limited data
Overcomes scarcity of medical textual descriptions
Improves image quality and avoids color saturation issues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Diffusion Transformer for medical imaging
Generates visual descriptions with vision-language models
Implements Hybrid-Level Diffusion Fine-tuning method