DiA-gnostic VLVAE: Disentangled Alignment-Constrained Vision Language Variational AutoEncoder for Robust Radiology Reporting with Missing Modalities

📅 2025-11-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two key challenges in radiology report generation: modality incompleteness (e.g., missing clinical information) and feature entanglement (where modality-specific and shared representations are conflated, leading to hallucination). To this end, we propose a disentangled alignment framework for vision–language generation. Methodologically, we design a Mixture-of-Experts (MoE)-structured vision–language variational autoencoder with orthogonality constraints to explicitly separate modality-specific and shared latent representations; a lightweight LLaMA-X decoder ensures generation fidelity. Our core contribution is a novel disentangled alignment constraint mechanism that effectively mitigates cross-modal confusion and medical factuality hallucinations while maintaining robustness under modality incompleteness. Experiments on IU X-Ray and MIMIC-CXR yield BLEU@4 scores of 0.266 and 0.134, respectively—substantially outperforming state-of-the-art methods.

📝 Abstract
The integration of medical images with clinical context is essential for generating accurate and clinically interpretable radiology reports. However, current automated methods often rely on resource-heavy Large Language Models (LLMs) or static knowledge graphs and struggle with two fundamental challenges in real-world clinical data: (1) missing modalities, such as incomplete clinical context, and (2) feature entanglement, where mixed modality-specific and shared information leads to suboptimal fusion and clinically unfaithful hallucinated findings. To address these challenges, we propose the DiA-gnostic VLVAE, which achieves robust radiology reporting through Disentangled Alignment. Our framework is designed to be resilient to missing modalities by disentangling shared and modality-specific features using a Mixture-of-Experts (MoE) based Vision-Language Variational Autoencoder (VLVAE). A constrained optimization objective enforces orthogonality and alignment between these latent representations to prevent suboptimal fusion. A compact LLaMA-X decoder then uses these disentangled representations to generate reports efficiently. On the IU X-Ray and MIMIC-CXR datasets, DiA achieves competitive BLEU@4 scores of 0.266 and 0.134, respectively. Experimental results show that the proposed method significantly outperforms state-of-the-art models.
Problem

Research questions and friction points this paper is trying to address.

Handles missing clinical data modalities in radiology reporting
Addresses feature entanglement between shared and modality-specific information
Generates robust radiology reports using disentangled alignment framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts VLVAE disentangles shared and modality features
Constrained optimization ensures orthogonality and alignment of representations
Compact LLaMA-X decoder generates reports using disentangled representations
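The orthogonality and alignment constraints above can be sketched with two simple penalty terms: one that drives the batch cross-covariance between shared and modality-specific latents toward zero, and one that pulls the shared latents of the two modalities together. This is a minimal NumPy illustration of the general idea, not the paper's actual objective; the function names and the use of MSE for alignment are assumptions.

```python
import numpy as np

def orthogonality_loss(z_shared, z_specific):
    # Penalize overlap between shared and modality-specific latents:
    # squared Frobenius norm of their batch cross-covariance (assumed form).
    zs = z_shared - z_shared.mean(axis=0)
    zp = z_specific - z_specific.mean(axis=0)
    cross = zs.T @ zp / zs.shape[0]          # (d, d) cross-covariance
    return float(np.sum(cross ** 2))

def alignment_loss(z_img_shared, z_txt_shared):
    # Encourage the shared latents of both modalities to agree
    # (MSE used here as a stand-in for the paper's alignment term).
    return float(np.mean((z_img_shared - z_txt_shared) ** 2))

rng = np.random.default_rng(0)
z_sh = rng.normal(size=(8, 4))   # shared latents for a batch of 8
z_sp = rng.normal(size=(8, 4))   # modality-specific latents

total = orthogonality_loss(z_sh, z_sp) + alignment_loss(z_sh, z_sh)
print(total >= 0.0)
```

In training, both penalties would be added (with weights) to the VLVAE's reconstruction and KL terms; driving the cross-covariance to zero is what keeps the decoder from conflating modality-specific cues with shared content.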
Nagur Shareef Shaik
Department of Computer Science, Georgia State University, Atlanta, GA, USA
T. Cherukuri
Department of Computer Science, Georgia State University, Atlanta, GA, USA
Adnan Masood
UST, Aliso Viejo, CA, USA
Dong Hye Ye
Assistant Professor, Georgia State University, TReNDS Center
Image Processing · Machine Learning · Computational Imaging · Medical Image Analysis