S2D-ALIGN: Shallow-to-Deep Auxiliary Learning for Anatomically-Grounded Radiology Report Generation

📅 2025-11-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing radiology report generation (RRG) methods rely on multimodal large language models (MLLMs) for image–text instance-level alignment, neglecting fine-grained anatomical correspondence—leading to templated, clinically implausible reports. To address this, we propose S2D-Align, a novel paradigm for anatomy-aware cross-modal alignment. S2D-Align introduces a multi-stage auxiliary learning framework that progressively integrates multi-granularity supervisory signals: image–report pairing, reference-report guidance, and key-phrase grounding. Furthermore, it incorporates a memory adapter to enable feature sharing and knowledge transfer from coarse- to fine-grained anatomical representations. Evaluated on MIMIC-CXR and IU X-Ray, S2D-Align achieves state-of-the-art performance in both automatic metrics and clinical validity. Ablation studies confirm that each component significantly improves report accuracy and anatomical fidelity, demonstrating the efficacy of explicit anatomical structure modeling in RRG.

📝 Abstract
Radiology Report Generation (RRG) aims to automatically generate diagnostic reports from radiology images. To achieve this, existing methods have leveraged the powerful cross-modal generation capabilities of Multimodal Large Language Models (MLLMs), primarily focusing on optimizing cross-modal alignment between radiographs and reports through Supervised Fine-Tuning (SFT). However, by only performing instance-level alignment with the image-text pairs, the standard SFT paradigm fails to establish anatomically-grounded alignment, where the templated nature of reports often leads to sub-optimal generation quality. To address this, we propose S2D-Align, a novel SFT paradigm that establishes anatomically-grounded alignment by leveraging auxiliary signals of varying granularities. S2D-Align implements a shallow-to-deep strategy, progressively enriching the alignment process: it begins with the coarse radiograph-report pairing, then introduces reference reports for instance-level guidance, and ultimately utilizes key phrases to ground the generation in specific anatomical details. To bridge the different alignment stages, we introduce a memory-based adapter that empowers feature sharing, thereby integrating coarse and fine-grained guidance. For evaluation, we conduct experiments on the public MIMIC-CXR and IU X-Ray benchmarks, where S2D-Align achieves state-of-the-art performance compared to existing methods. Ablation studies validate the effectiveness of our multi-stage, auxiliary-guided approach, highlighting a promising direction for enhancing grounding capabilities in complex, multi-modal generation tasks.
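The abstract describes a memory-based adapter that lets features from coarser alignment stages inform finer-grained ones. The paper does not publish its implementation here, but the idea can be sketched as a shared memory bank that stage features attend over, with the retrieved memory fused back residually. The slot count, dimensions, and fusion rule below are illustrative assumptions, not the authors' exact design:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MemoryAdapter:
    """Conceptual sketch (assumed design): a memory bank shared across
    alignment stages. Features from any stage attend over the same memory
    slots, so knowledge written during coarse alignment is retrievable
    during fine-grained (key-phrase) alignment."""

    def __init__(self, num_slots=8, dim=16, seed=0):
        rng = np.random.default_rng(seed)
        # Learnable in a real model; random here for illustration.
        self.memory = rng.standard_normal((num_slots, dim)) / np.sqrt(dim)

    def __call__(self, features):
        # features: (n_tokens, dim) from the current alignment stage.
        scores = features @ self.memory.T / np.sqrt(self.memory.shape[1])
        attn = softmax(scores, axis=-1)      # (n_tokens, num_slots)
        retrieved = attn @ self.memory       # read from shared memory
        return features + retrieved          # residual fusion of guidance
```

In a trained system the memory matrix would be a learnable parameter updated across all stages, which is what enables feature sharing between the shallow and deep alignment objectives.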
Problem

Research questions and friction points this paper is trying to address.

Improves anatomically-grounded alignment in radiology report generation
Addresses sub-optimal quality from template-based report generation methods
Enhances cross-modal alignment between radiographs and diagnostic reports
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shallow-to-deep strategy for anatomically-grounded alignment
Memory-based adapter enabling feature sharing across stages
Multi-stage auxiliary guidance with coarse-to-fine granularity
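The shallow-to-deep strategy above can be pictured as a staged curriculum in which auxiliary losses are switched on progressively: image-report pairing first, then reference-report guidance, and finally key-phrase grounding. The stage names, weights, and loss combination below are a hypothetical sketch of such a schedule, not the paper's published recipe:

```python
# Hypothetical shallow-to-deep schedule: each stage activates a superset
# of the previous stage's auxiliary losses. Weights are illustrative.
STAGES = [
    ("pairing",   {"image_report": 1.0}),
    ("reference", {"image_report": 1.0, "reference_report": 0.5}),
    ("grounding", {"image_report": 1.0, "reference_report": 0.5,
                   "key_phrase": 0.5}),
]

def total_loss(stage_weights, losses):
    """Weighted sum of the auxiliary losses active in the current stage."""
    return sum(w * losses[name] for name, w in stage_weights.items())
```

Training would then iterate through `STAGES` in order, so the fine-grained key-phrase objective only enters once the coarser alignments have been established.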
Jiechao Gao
Stanford University
IoT & Cloud Computing · Federated Learning · Reinforcement Learning · Energy Management · AI4Finance
Chang Liu
University of Science and Technology of China
Yuangang Li
University of California, Irvine