Visually-Guided Controllable Medical Image Generation via Fine-Grained Semantic Disentanglement

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited controllability in medical image generation caused by the large modality gap and semantic entanglement between text and images. To this end, the authors propose a vision-guided textual disentanglement framework that introduces, for the first time, a cross-modal latent alignment mechanism to decompose unstructured clinical text into disentangled semantic representations—such as anatomical structure and imaging style. These disentangled features are then integrated into a Diffusion Transformer (DiT) architecture via a Hybrid Feature Fusion Module (HFFM), enabling fine-grained structural control during image synthesis. Experimental results on three medical imaging datasets demonstrate that the proposed method significantly outperforms existing approaches, achieving not only higher image generation quality but also improved performance on downstream classification tasks.

📝 Abstract
Medical image synthesis is crucial for alleviating data scarcity and privacy constraints. However, fine-tuning general text-to-image (T2I) models remains challenging, mainly due to the significant modality gap between complex visual details and abstract clinical text. In addition, semantic entanglement persists: coarse-grained text embeddings blur the boundary between anatomical structures and imaging styles, weakening controllability during generation. To address this, we propose a Visually-Guided Text Disentanglement framework. We introduce a cross-modal latent alignment mechanism that leverages visual priors to explicitly disentangle unstructured text into independent semantic representations. A Hybrid Feature Fusion Module (HFFM) then injects these features into a Diffusion Transformer (DiT) through separated channels, enabling fine-grained structural control. Experimental results on three datasets demonstrate that our method outperforms existing approaches in generation quality and significantly improves performance on downstream classification tasks. The source code is available at https://github.com/hx111/VG-MedGen.
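The separated-channel conditioning the abstract describes can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the projection matrices, dimensions, and the AdaLN-style scale-and-shift modulation are all assumptions standing in for the learned disentanglement and the HFFM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: text embedding, model width, image tokens
d_text, d_model, n_tokens = 32, 16, 8

text_emb = rng.normal(size=d_text)           # entangled clinical-text embedding

# Disentanglement stand-in: two independent projections, one per semantic factor
W_struct = rng.normal(size=(d_text, d_model))
W_style = rng.normal(size=(d_text, d_model))
z_struct = text_emb @ W_struct               # anatomical-structure channel
z_style = text_emb @ W_style                 # imaging-style channel

# Fusion stand-in: each factor enters through its own channel.
# Structure conditions token features additively; style modulates them
# via scale-and-shift, a common conditioning pattern in DiT-style models.
tokens = rng.normal(size=(n_tokens, d_model))  # DiT image tokens
scale, shift = np.tanh(z_style), 0.1 * z_style
fused = (tokens + z_struct) * (1 + scale) + shift
print(fused.shape)  # (8, 16)
```

The point of the sketch is the routing: because structure and style reach the backbone through separate pathways, either factor can in principle be swapped or held fixed independently, which is what gives the fine-grained control claimed above.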
Problem

Research questions and friction points this paper is trying to address.

medical image generation
modality gap
semantic entanglement
fine-grained control
text-to-image synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic disentanglement
visually-guided generation
medical image synthesis
diffusion transformer
cross-modal alignment
Xin Huang
Computer Science and Engineering, Northeastern University, Shenyang, China; Key Laboratory of Intelligent Computing in Medical Image of Ministry of Education, Northeastern University, Shenyang, China
Junjie Liang
Computer Science and Engineering, Northeastern University, Shenyang, China; Key Laboratory of Intelligent Computing in Medical Image of Ministry of Education, Northeastern University, Shenyang, China
Qingshan Hou
Northeastern University; National University of Singapore
medical image analysis; foundation model; deep learning
Peng Cao
Northeastern University
Data mining; Machine learning
Jinzhu Yang
Computer Science and Engineering, Northeastern University, Shenyang, China; Key Laboratory of Intelligent Computing in Medical Image of Ministry of Education, Northeastern University, Shenyang, China; National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Shenyang, China
Xiaoli Liu
AiShiWeiLai AI Research, China
Osmar R. Zaiane
Amii, University of Alberta, Edmonton, Alberta, Canada