Structure Observation Driven Image-Text Contrastive Learning for Computed Tomography Report Generation

📅 2026-03-05
🏛️ Information Processing in Medical Imaging
🤖 AI Summary
This work addresses automated radiology report generation from complex, detail-rich CT images by proposing a two-stage structure-aware framework. In the first stage, learnable structural visual queries are aligned with structured textual descriptions through contrastive learning to achieve anatomical-level image-text semantic alignment. In the second stage, the visual queries are frozen to focus on clinically relevant regions, while a sequence-to-sequence decoder is trained to generate coherent reports. The approach introduces a structure-level image-text contrastive loss, a soft pseudo-target mechanism based on textual similarity to mitigate false negatives, and a dynamic, diversity-enhanced negative sample queue to improve abnormality discrimination. Evaluated on two public CT report datasets, the method achieves state-of-the-art performance on clinical metrics, and the experiments indicate that each component is effective.

📝 Abstract
Computed Tomography Report Generation (CTRG) aims to automate the clinical radiology reporting process, thereby reducing the workload of report writing and facilitating patient care. While deep learning approaches have achieved remarkable advances in X-ray report generation, their effectiveness may be limited in CTRG due to the larger data volumes of CT images and the more intricate details required to describe them. This work introduces a novel two-stage (structure- and report-learning) framework tailored for CTRG, featuring effective structure-wise image-text contrasting. In the first stage, a set of learnable structure-specific visual queries observe the corresponding structures in a CT image. The resulting observation tokens are contrasted with structure-specific textual features extracted from the accompanying radiology report using a structure-wise image-text contrastive loss. In addition, text-text similarity-based soft pseudo targets are proposed to mitigate the impact of false negatives, i.e., semantically identical image structures and texts from non-paired images and reports. Thus, the model learns structure-level semantic correspondences between CT images and reports. Further, a dynamic, diversity-enhanced negative queue is proposed to guide the network in learning to discriminate various abnormalities. In the second stage, the visual structure queries are frozen and used to select the critical image patch embeddings depicting each anatomical structure, minimizing distractions from irrelevant areas while reducing memory consumption. A text decoder is also added and trained for report generation. Our extensive experiments on two public datasets demonstrate that our framework establishes new state-of-the-art performance for CTRG in clinical efficiency, and that its components are effective.
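To make the first-stage idea concrete, the sketch below shows one plausible way to combine a structure-wise image-text contrastive loss with text-text similarity-based soft targets. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact formulation, so the function names, the blending weight `alpha`, the temperatures `tau` and `tau_soft`, and the choice to mix one-hot and soft targets linearly are all hypothetical. The negative queue and the visual-query encoder are omitted; random arrays stand in for observation tokens and text features.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale feature vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def soft_targets(txt, alpha=0.5, tau_soft=0.07):
    """Blend one-hot targets with text-text similarity (assumed form).

    txt: (B, S, D) structure-specific text features for B reports
         and S anatomical structures.
    Returns an (S, B, B) array; each row is a distribution over the batch.
    """
    txt = l2_normalize(txt)
    num_pairs, num_structs, _ = txt.shape
    # Per-structure cosine similarity between every pair of reports.
    t2t = np.einsum("isd,jsd->sij", txt, txt) / tau_soft
    soft = np.exp(log_softmax(t2t, axis=-1))
    hard = np.broadcast_to(np.eye(num_pairs), (num_structs, num_pairs, num_pairs))
    # Reports with near-identical text for a structure share target mass,
    # so they are no longer treated as pure negatives.
    return (1.0 - alpha) * hard + alpha * soft


def structure_wise_itc_loss(obs, txt, tau=0.07, alpha=0.5, tau_soft=0.07):
    """Symmetric cross-entropy between image-text logits and soft targets.

    obs: (B, S, D) observation tokens, one per structure per CT volume.
    txt: (B, S, D) text features, one per structure per report.
    """
    logits = np.einsum("isd,jsd->sij", l2_normalize(obs), l2_normalize(txt)) / tau
    targets = soft_targets(txt, alpha, tau_soft)
    loss_i2t = -(targets * log_softmax(logits, axis=-1)).sum(axis=-1).mean()
    loss_t2i = -(targets * log_softmax(logits.transpose(0, 2, 1), axis=-1)).sum(axis=-1).mean()
    return 0.5 * (loss_i2t + loss_t2i)


# Toy usage: 4 image-report pairs, 3 structures, 16-dimensional features.
rng = np.random.default_rng(0)
txt = rng.normal(size=(4, 3, 16))
aligned = structure_wise_itc_loss(txt.copy(), txt)        # images match their reports
mismatched = structure_wise_itc_loss(txt[::-1].copy(), txt)  # pairs reversed in the batch
```

In this toy setup the aligned batch yields a lower loss than the mismatched one, and two reports with identical text for a structure receive non-trivial off-diagonal target weight, which is the false-negative mitigation the abstract describes.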
Problem

Research questions and friction points this paper is trying to address.

Computed Tomography Report Generation
Image-Text Contrastive Learning
Clinical Radiology Reporting
Structure Observation
Automated Medical Report
Innovation

Methods, ideas, or system contributions that make the work stand out.

structure-aware contrastive learning
visual queries
soft pseudo targets
negative queue
CT report generation
Hong Liu
School of Informatics, Xiamen University, Xiamen, China; National Institute for Data Science in Health and Medicine, Xiamen University
Dong Wei
Jarvis Research Center, Tencent YouTu Lab, Shenzhen, China
Qiong Peng
School of Informatics, Xiamen University, Xiamen, China
Yawen Huang
Jarvis Research Center, Tencent YouTu Lab, Shenzhen, China
Xian Wu
Director of Tencent Jarvis Lab
large language model, data mining, machine learning
Yefeng Zheng
Professor, Westlake University, Hangzhou, China, IEEE Fellow, AIMBE Fellow
AI in Health, Medical Imaging, Computer Vision, Natural Language Processing, Large Language Model
Liansheng Wang
School of Informatics, Xiamen University, Xiamen, China; National Institute for Data Science in Health and Medicine, Xiamen University