Retrieval-Augmented Anatomical Guidance for Text-to-CT Generation

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-driven CT generation methods often lack anatomical guidance, leading to spatially ambiguous or anatomically inconsistent outputs. To address this limitation, this work proposes a retrieval-augmented diffusion framework that uses a 3D vision-language encoder to retrieve semantically relevant clinical cases from a database. The anatomical annotations of these retrieved cases serve as structural priors, which are injected into a latent diffusion model via ControlNet to harmonize textual semantics with anatomical fidelity. Notably, this approach achieves spatial controllability and anatomical plausibility in text-to-CT synthesis without requiring ground-truth anatomical labels at inference time. Evaluated on the CT-RATE dataset, the method demonstrates significant improvements in both image fidelity and clinical consistency over existing approaches.

📝 Abstract
Text-conditioned generative models for volumetric medical imaging provide semantic control but lack explicit anatomical guidance, often resulting in outputs that are spatially ambiguous or anatomically inconsistent. In contrast, structure-driven methods ensure strong anatomical consistency but typically assume access to ground-truth annotations, which are unavailable when the target image is to be synthesized. We propose a retrieval-augmented approach for Text-to-CT generation that integrates semantic and anatomical information under a realistic inference setting. Given a radiology report, our method retrieves a semantically related clinical case using a 3D vision-language encoder and leverages its associated anatomical annotation as a structural proxy. This proxy is injected into a text-conditioned latent diffusion model via a ControlNet branch, providing coarse anatomical guidance while maintaining semantic flexibility. Experiments on the CT-RATE dataset show that retrieval-augmented generation improves image fidelity and clinical consistency compared to text-only baselines, while additionally enabling explicit spatial controllability, a capability inherently absent in such approaches. Further analysis highlights the importance of retrieval quality, with semantically aligned proxies yielding consistent gains across all evaluation axes. This work introduces a principled and scalable mechanism to bridge semantic conditioning and anatomical plausibility in volumetric medical image synthesis. Code will be released.
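The retrieval step described above can be sketched in a few lines: embed the radiology report, rank database cases by cosine similarity, and take the top case's anatomical annotation as the structural proxy for the ControlNet branch. This is a minimal illustrative sketch, not the paper's implementation; the function name, the list-based "database", and the string annotations standing in for voxel masks are all hypothetical, and real embeddings would come from the 3D vision-language encoder.

```python
import math

def retrieve_structural_proxy(report_embedding, cases):
    """Return the annotation of the database case most similar to the report.

    Hypothetical sketch: `cases` is a list of (embedding, annotation) pairs,
    where the annotation is an identifier standing in for an anatomical mask.
    """
    def cosine(a, b):
        # Cosine similarity between two embedding vectors.
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))

    # Rank every case by semantic similarity to the report embedding;
    # the best match's annotation becomes the structural proxy that the
    # ControlNet branch would condition on during generation.
    _, best_annotation = max(cases, key=lambda c: cosine(report_embedding, c[0]))
    return best_annotation

# Toy usage: three database cases with 3-dim embeddings.
database = [
    ([1.0, 0.0, 0.0], "mask_case_A"),
    ([0.0, 1.0, 0.0], "mask_case_B"),
    ([0.7, 0.7, 0.0], "mask_case_C"),
]
query = [0.6, 0.8, 0.0]  # would be the encoded radiology report
proxy = retrieve_structural_proxy(query, database)
```

The paper's analysis of retrieval quality suggests this ranking step matters: a semantically aligned proxy improves all evaluation axes, so in practice the similarity function and encoder quality directly bound the anatomical guidance.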
Problem

Research questions and friction points this paper is trying to address.

Text-to-CT generation
anatomical guidance
semantic control
volumetric medical imaging
spatial ambiguity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-Augmented Generation
Anatomical Guidance
Text-to-CT Synthesis
Latent Diffusion Model
ControlNet