Medical Report Generation: A Hierarchical Task Structure-Based Cross-Modal Causal Intervention Framework

📅 2025-11-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Medical report generation (MRG) faces three key challenges: insufficient domain knowledge modeling, misalignment between fine-grained visual and textual entity embeddings, and spurious cross-modal correlations. To address these, we propose a hierarchical task-decomposition framework, HTSC-CIF, which we present as the first to integrate domain knowledge understanding, fine-grained vision–language alignment, and causal debiasing within a single architecture. The approach combines prefix language modeling, masked image modeling, and spatially aware feature alignment, and adds a front-door causal intervention module to reduce the effect of cross-modal confounders. In experiments, the model outperforms state-of-the-art MRG methods and improves interpretability, while reducing its reliance on spurious, bias-driven correlations.

📝 Abstract

Medical Report Generation (MRG) is a key part of modern medical diagnostics, as it automatically generates reports from radiological images to reduce radiologists' burden. However, reliable MRG models for lesion description face three main challenges: insufficient domain knowledge understanding, poor text-visual entity embedding alignment, and spurious correlations from cross-modal biases. Previous work addresses these challenges only individually, while this paper tackles all three via a novel hierarchical task decomposition approach, proposing the HTSC-CIF framework. HTSC-CIF classifies the three challenges into low-, mid-, and high-level tasks: 1) Low-level: align medical entity features with spatial locations to enhance domain knowledge for visual encoders; 2) Mid-level: use Prefix Language Modeling (text) and Masked Image Modeling (images) to boost cross-modal alignment via mutual guidance; 3) High-level: apply a cross-modal causal intervention module (via front-door intervention) to reduce confounders and improve interpretability. Extensive experiments confirm HTSC-CIF's effectiveness, significantly outperforming state-of-the-art (SOTA) MRG methods. Code will be made public upon paper acceptance.
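
The mid-level task relies on Prefix Language Modeling. As a rough illustration of what that term usually means, the sketch below builds a generic prefix-LM attention mask: tokens in the prefix attend to each other bidirectionally, and the remaining tokens attend causally. This is not the paper's implementation. The function name `prefix_lm_mask` and the choice of what forms the prefix (for example, image tokens or the opening part of a report) are assumptions made here for illustration, since the abstract does not specify them.

```python
import numpy as np


def prefix_lm_mask(prefix_len: int, total_len: int) -> np.ndarray:
    """Hypothetical helper: build a prefix-LM attention mask.

    mask[i, j] is True when position i is allowed to attend to position j.
    """
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must be between 0 and total_len")

    # Start from a causal mask: each position sees itself and earlier positions.
    mask = np.tril(np.ones((total_len, total_len), dtype=bool))

    # Make the prefix columns visible to every position, so the prefix
    # tokens attend to one another in both directions.
    mask[:, :prefix_len] = True
    return mask


if __name__ == "__main__":
    # Two prefix tokens followed by two generated tokens.
    print(prefix_lm_mask(2, 4).astype(int))
```

With `prefix_len=0` the mask reduces to an ordinary causal mask, and with `prefix_len == total_len` it becomes fully bidirectional.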

Problem

Research questions and friction points this paper is trying to address.

Address insufficient medical domain knowledge understanding in report generation

Improve poor alignment between text and visual entity embeddings

Reduce spurious correlations from cross-modal biases in medical imaging

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical task decomposition for multi-level challenges

Prefix and masked modeling for cross-modal alignment

Front-door causal intervention to reduce spurious correlations
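
For reference, the standard front-door adjustment formula from causal inference is given below. The symbols are assumptions chosen for illustration: X stands for the input features, M for a mediator through which X influences the output, and Y for the generated report. How HTSC-CIF defines and estimates these quantities is not stated in the abstract.

```latex
P\bigl(Y \mid \mathrm{do}(X = x)\bigr)
  = \sum_{m} P(m \mid x) \sum_{x'} P(Y \mid m, x')\, P(x')
```

The inner sum averages over alternative inputs x', which is what removes the influence of an unobserved confounder that affects both X and Y.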

🔎 Similar Papers

Cross-Modal Causal Representation Learning for Radiology Report Generation