Visually-Guided Controllable Medical Image Generation via Fine-Grained Semantic Disentanglement

📅 2026-03-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited controllability of medical image generation, which stems from the large modality gap between clinical text and images and from semantic entanglement in coarse-grained text embeddings. The authors propose a visually-guided text disentanglement framework whose cross-modal latent alignment mechanism uses visual priors to decompose unstructured clinical text into independent semantic representations, such as anatomical structure and imaging style. A Hybrid Feature Fusion Module (HFFM) then injects these features into a Diffusion Transformer (DiT) through separate channels, enabling fine-grained structural control during image synthesis. Experiments on three medical imaging datasets show that the method outperforms existing approaches in generation quality and significantly improves performance on downstream classification tasks.

📝 Abstract

Medical image synthesis is crucial for alleviating data scarcity and privacy constraints. However, fine-tuning general text-to-image (T2I) models remains challenging, mainly due to the significant modality gap between complex visual details and abstract clinical text. In addition, semantic entanglement persists: coarse-grained text embeddings blur the boundary between anatomical structures and imaging styles, which weakens controllability during generation. To address this, we propose a Visually-Guided Text Disentanglement framework. We introduce a cross-modal latent alignment mechanism that leverages visual priors to explicitly disentangle unstructured text into independent semantic representations. Subsequently, a Hybrid Feature Fusion Module (HFFM) injects these features into a Diffusion Transformer (DiT) through separated channels, enabling fine-grained structural control. Experimental results on three datasets demonstrate that our method outperforms existing approaches in terms of generation quality and significantly improves performance on downstream classification tasks. The source code is available at https://github.com/hx111/VG-MedGen.
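
As a rough illustration of what injecting disentangled features "through separated channels" could look like, the pure-Python sketch below conditions image tokens on two inputs by different routes: structure tokens enter via cross-attention, and a style vector enters via scale modulation of the normalized tokens. This routing, the function names (`fuse`, `cross_attention`), and the absence of learned weights are all assumptions made for illustration. The abstract does not specify the HFFM's internals, and the authors' implementation in the linked repository may differ substantially.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]  # one row per token


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Normalize one token to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def cross_attention(queries: Matrix, context: Matrix) -> Matrix:
    """Each query token attends over the context tokens.

    Hypothetical simplification: keys and values are the raw context tokens,
    with no learned projection matrices.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        out.append([sum(w * k[j] for w, k in zip(weights, context)) for j in range(dim)])
    return out


def fuse(image_tokens: Matrix, structure_tokens: Matrix, style_vector: Vector) -> Matrix:
    """Assumed two-channel fusion (not the paper's confirmed HFFM design).

    Channel 1: structure tokens are read through cross-attention.
    Channel 2: the style vector rescales normalized image tokens by (1 + s).
    The two results are summed per token.
    """
    attended = cross_attention(image_tokens, structure_tokens)
    fused = []
    for tok, att in zip(image_tokens, attended):
        normed = layer_norm(tok)
        modulated = [n * (1.0 + s) for n, s in zip(normed, style_vector)]
        fused.append([m + a for m, a in zip(modulated, att)])
    return fused


if __name__ == "__main__":
    image = [[1.0, -1.0], [0.5, 2.0]]      # two image tokens, dim 2
    structure = [[1.0, 2.0], [0.0, 1.0]]   # two "anatomy" tokens
    style = [0.3, -0.1]                    # one "imaging style" vector
    for row in fuse(image, structure, style):
        print(row)
```

Because the two inputs take separate routes, changing `style_vector` leaves the cross-attention term untouched, which is the kind of independence between anatomy and style that a disentangled conditioning scheme aims for.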

Problem

Research questions and friction points this paper is trying to address.

medical image generation

modality gap

semantic entanglement

fine-grained control

text-to-image synthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic disentanglement

visually-guided generation

medical image synthesis