Cross-Modal Causal Representation Learning for Radiology Report Generation

📅 2023-03-16

🏛️ IEEE Transactions on Image Processing

📈 Citations: 6

✨ Influential: 1

🤖 AI Summary

Radiology report generation (RRG) struggles to describe lesions accurately because of spurious visual-linguistic correlations and the inherent limits of radiological imaging, such as low resolution and noise. The paper proposes a two-stage framework, Cross-Modal Causal Representation Learning (CMCRL). In the pre-training stage, RadCARE uses degradation-aware masked image restoration to reconstruct high-resolution patches from low-resolution inputs, and combines a multiway architecture with four adaptive training strategies (for example, text postfix generation from degraded images and text prefixes). In the fine-tuning stage, VLCI applies causal front-door intervention through two modules: the Visual Deconfounding Module (VDM) disentangles local and global visual features without fine-grained annotations, and the Linguistic Deconfounding Module (LDM) removes context bias without external terminology databases. On IU-Xray and MIMIC-CXR, CMCRL significantly outperforms state-of-the-art methods, and ablation studies confirm the necessity of both stages. Code and models are publicly available.

📝 Abstract

Radiology Report Generation (RRG) is essential for computer-aided diagnosis and medication guidance, which can relieve the heavy burden of radiologists by automatically generating the corresponding radiology reports according to the given radiology image. However, generating accurate lesion descriptions remains challenging due to spurious correlations from visual-linguistic biases and inherent limitations of radiological imaging, such as low resolution and noise interference. To address these issues, we propose a two-stage framework named Cross-Modal Causal Representation Learning (CMCRL), consisting of the Radiological Cross-modal Alignment and Reconstruction Enhanced (RadCARE) pre-training and the Visual-Linguistic Causal Intervention (VLCI) fine-tuning. In the pre-training stage, RadCARE introduces a degradation-aware masked image restoration strategy tailored for radiological images, which reconstructs high-resolution patches from low-resolution inputs to mitigate noise and detail loss. Combined with a multiway architecture and four adaptive training strategies (e.g., text postfix generation with degraded images and text prefixes), RadCARE establishes robust cross-modal correlations even with incomplete data. In the VLCI phase, we deploy causal front-door intervention through two modules: the Visual Deconfounding Module (VDM) disentangles local-global features without fine-grained annotations, while the Linguistic Deconfounding Module (LDM) eliminates context bias without external terminology databases. Experiments on IU-Xray and MIMIC-CXR show that our CMCRL pipeline significantly outperforms state-of-the-art methods, with ablation studies confirming the necessity of both stages. Code and models are available at https://github.com/WissingChen/CMCRL.
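
To make the pre-training idea concrete, here is a minimal sketch of how one training pair for degradation-aware masked image restoration could be built: a low-resolution, partly masked input and the high-resolution patches to reconstruct. It is an illustration only, not the RadCARE implementation. The function name make_restoration_pair, the average-pooling degradation, and the values of patch, scale, and mask_ratio are all assumptions made for this example.

```python
import numpy as np


def make_restoration_pair(image, patch=16, scale=4, mask_ratio=0.5, seed=0):
    """Return (inputs, targets, mask) for one grayscale image.

    Hypothetical helper: every hyperparameter here is an illustrative
    assumption, not a value taken from the paper.
    """
    rng = np.random.default_rng(seed)
    h, w = image.shape
    assert h % patch == 0 and w % patch == 0 and patch % scale == 0

    # Degrade: average-pool by `scale` to imitate a low-resolution scan.
    low = image.reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))

    # Targets: non-overlapping high-resolution patches of the clean image.
    gh, gw = h // patch, w // patch
    targets = (
        image.reshape(gh, patch, gw, patch)
        .transpose(0, 2, 1, 3)
        .reshape(gh * gw, patch, patch)
    )

    # Inputs: the matching low-resolution patches, copied so they can be masked.
    lp = patch // scale
    inputs = (
        low.reshape(gh, lp, gw, lp)
        .transpose(0, 2, 1, 3)
        .reshape(gh * gw, lp, lp)
        .copy()
    )

    # Mask a random subset of input patches; the model must restore them.
    n_mask = int(round(mask_ratio * gh * gw))
    mask = np.zeros(gh * gw, dtype=bool)
    mask[rng.choice(gh * gw, size=n_mask, replace=False)] = True
    inputs[mask] = 0.0
    return inputs, targets, mask


image = np.arange(64 * 64, dtype=float).reshape(64, 64)
inputs, targets, mask = make_restoration_pair(image)
print(inputs.shape, targets.shape, int(mask.sum()))
```

A restoration model trained on such pairs would take inputs and be penalized on its reconstruction of targets, so it has to recover both the masked regions and the detail lost to degradation.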

Problem

Research questions and friction points this paper is trying to address.

Generating accurate radiology reports from low-quality images

Reducing spurious visual-linguistic biases in report generation

Improving cross-modal alignment without fine-grained annotations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Degradation-aware masked image restoration strategy

Visual-Linguistic Causal Intervention framework

Local-global feature disentanglement without annotations
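
For readers unfamiliar with the causal front-door intervention that VLCI builds on, the standard front-door adjustment is reproduced below as background. Reading X as the input features, M as a mediator, and Y as the generated report is an assumption made here for illustration; the source does not say how the paper instantiates or estimates each term.

```latex
% Standard front-door adjustment (Pearl).
% X: input features, M: mediator, Y: output report (illustrative mapping, assumed).
P\bigl(Y \mid do(X = x)\bigr)
  = \sum_{m} P(m \mid x) \sum_{x'} P(Y \mid m, x')\, P(x')
```

The inner sum averages over all possible inputs x', which is what blocks the path through an unobserved confounder of X and Y without that confounder ever being measured.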
