Anatomy-Grounded Weakly Supervised Prompt Tuning for Chest X-ray Latent Diffusion Models

📅 2025-06-12

📈 Citations: 0

✨ Influential: 0

career value

179K/year

🤖 AI Summary

To address the challenge of aligning clinical phrases from radiology reports with corresponding anatomical regions in chest X-ray images—where text-conditioned diffusion models struggle without pixel-level supervision—this paper proposes an anatomy-guided weakly supervised prompt tuning framework. Our method is the first to incorporate anatomical prior knowledge into prompt learning, achieved through anatomical knowledge distillation, contrastive prompt optimization, and report-image weak supervision for phrase localization. Crucially, it requires no pixel-level annotations. Extensive experiments demonstrate significant improvements in multimodal semantic alignment and generalization, achieving new state-of-the-art performance on the in-distribution MS-CXR benchmark and maintaining robust cross-domain performance on the out-of-distribution VinDr-CXR dataset. The framework thus bridges clinical semantics and anatomical structure in a scalable, annotation-efficient manner. Code will be publicly released.

Technology Category

Application Category

📝 Abstract

Latent Diffusion Models have shown remarkable results in text-guided image synthesis in recent years. In the domain of natural (RGB) images, recent works have shown that such models can be adapted to various vision-language downstream tasks with little to no supervision involved. On the contrary, text-to-image Latent Diffusion Models remain relatively underexplored in the field of medical imaging, primarily due to limited data availability (e.g., due to privacy concerns). In this work, focusing on the chest X-ray modality, we first demonstrate that a standard text-conditioned Latent Diffusion Model has not learned to align clinically relevant information in free-text radiology reports with the corresponding areas of the given scan. Then, to alleviate this issue, we propose a fine-tuning framework to improve multi-modal alignment in a pre-trained model such that it can be efficiently repurposed for downstream tasks such as phrase grounding. Our method sets a new state-of-the-art on a standard benchmark dataset (MS-CXR), while also exhibiting robust performance on out-of-distribution data (VinDr-CXR). Our code will be made publicly available.

Problem

Research questions and friction points this paper is trying to address.

Improves text-image alignment in chest X-ray diffusion models

Enables weakly supervised downstream tasks in medical imaging

Addresses data scarcity in medical text-to-image synthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Anatomy-grounded weakly supervised prompt tuning

Fine-tuning for multi-modal alignment improvement

State-of-the-art performance on MS-CXR benchmark

🔎 Similar Papers

Zero-Shot Medical Phrase Grounding with Off-the-shelf Diffusion Models