SimCroP: Radiograph Representation Learning with Similarity-driven Cross-granularity Pre-training

📅 2025-09-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
In chest CT scans, pathological lesions are spatially sparse, and the fine-grained semantic associations between report sentences and image subregions are implicit and non-bijective, which hinders effective multimodal representation learning. To address this, SimCroP combines similarity-driven alignment with cross-granularity fusion. It builds joint vision-language representations via multimodal masked modeling, performs adaptive cross-granularity matching through sentence-region similarity scoring, and fuses local lesion structures with global anatomical context to enhance multi-scale pathological modeling. Pre-trained on a large-scale paired CT-report dataset, SimCroP consistently outperforms state-of-the-art self-supervised and vision-language pre-training methods on both image classification and segmentation across five public benchmarks, demonstrating substantial gains in downstream clinical understanding and localization.

📝 Abstract
Medical vision-language pre-training shows great potential in learning representative features from massive paired radiographs and reports. However, in computed tomography (CT) scans, lesions, which contain intricate structures, are spatially sparse. Moreover, the complex and implicit relationships between the pathological descriptions in each report sentence and their corresponding sub-regions in radiographs pose additional challenges. In this paper, we propose a Similarity-Driven Cross-Granularity Pre-training (SimCroP) framework on chest CTs that combines similarity-driven alignment and cross-granularity fusion to improve radiograph interpretation. We first leverage multi-modal masked modeling to optimize the encoder for understanding precise low-level semantics in radiographs. Then, similarity-driven alignment pre-trains the encoder to adaptively select and align the patches corresponding to each sentence in the report. The cross-granularity fusion module integrates multimodal information at the instance level and the word-patch level, helping the model better capture key pathological structures in sparse radiographs and improving performance on multi-scale downstream tasks. SimCroP is pre-trained on a large-scale paired CT-report dataset and validated on image classification and segmentation tasks across five public datasets. Experimental results demonstrate that SimCroP outperforms both cutting-edge medical self-supervised learning methods and medical vision-language pre-training methods. Code and models are available at https://github.com/ToniChopp/SimCroP.
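The similarity-driven alignment described above can be sketched conceptually: for each report sentence, score every image patch by cosine similarity and softly pool the best-matching patches into a sentence-level visual feature. This is a minimal illustration, not the paper's implementation; the function name, the softmax-based selection, and the temperature value are assumptions for demonstration.

```python
import numpy as np

def l2norm(x, axis=-1):
    """Normalize embeddings to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def sentence_patch_alignment(sent_emb, patch_emb, temperature=0.07):
    """Illustrative sketch: softly select, for each sentence, the image
    patches most similar to it, and pool them into a per-sentence visual
    feature. The temperature and softmax pooling are placeholder choices."""
    s = l2norm(sent_emb)                 # (num_sentences, d)
    p = l2norm(patch_emb)                # (num_patches, d)
    sim = s @ p.T                        # cosine similarity, (num_sentences, num_patches)
    w = np.exp(sim / temperature)
    w = w / w.sum(axis=1, keepdims=True) # soft patch-selection weights per sentence
    aligned = w @ patch_emb              # (num_sentences, d) pooled visual features
    return sim, aligned

rng = np.random.default_rng(0)
sim, aligned = sentence_patch_alignment(rng.normal(size=(3, 16)),
                                        rng.normal(size=(10, 16)))
```

With a low temperature the pooling approaches a hard argmax over patches, which mirrors the intuition of picking the sub-region a sentence actually describes.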
Problem

Research questions and friction points this paper is trying to address.

Addresses spatial sparsity of lesions in CT scans
Aligns radiology report sentences with corresponding image patches
Improves multi-scale pathology detection in medical images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Similarity-driven alignment for sentence-patch matching
Cross-granularity fusion integrating multi-level information
Multi-modal masked modeling for semantic understanding
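The cross-granularity idea in the bullets above can be illustrated with a toy score that mixes two levels of image-text similarity: an instance-level score between global embeddings and a fine-grained score where each word is matched to its best patch. The simple average used for fusion is a placeholder, not the paper's exact module, and all names here are hypothetical.

```python
import numpy as np

def l2norm(x):
    """Unit-normalize along the last axis so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cross_granularity_scores(img_global, txt_global, patch_emb, word_emb):
    """Sketch of combining two granularities of image-text similarity:
    - instance level: cosine similarity of global image and text embeddings
    - word-patch level: for each word, the similarity of its best-matching patch
    The 0.5/0.5 fusion is illustrative only."""
    instance = float(l2norm(img_global) @ l2norm(txt_global))
    fine = float(np.mean(np.max(l2norm(word_emb) @ l2norm(patch_emb).T, axis=1)))
    return instance, fine, 0.5 * (instance + fine)

rng = np.random.default_rng(1)
inst, fine, fused = cross_granularity_scores(
    rng.normal(size=8), rng.normal(size=8),
    rng.normal(size=(10, 8)), rng.normal(size=(5, 8)))
```

Fusing both levels lets a sparse, localized lesion (captured at the word-patch level) still influence the overall image-report score alongside the global anatomical context.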
Rongsheng Wang
The Chinese University of Hong Kong, Shenzhen
Deep Learning
Fenghe Tang
University of Science and Technology of China
Medical Image Analysis, Foundation Model
Qingsong Yao
Stanford University | ICT, CAS
Medical Image Computing, Medical Image Analysis
Rui Yan
School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China (USTC), Hefei, Anhui 230026, China; Center for Medical Imaging, Robotics, Analytic Computing & Learning (MIRACLE), Suzhou Institute for Advanced Research, USTC, 215123, China
Xu Zhang
School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China (USTC), Hefei, Anhui 230026, China; Center for Medical Imaging, Robotics, Analytic Computing & Learning (MIRACLE), Suzhou Institute for Advanced Research, USTC, 215123, China; Anhui IFLYTEK CO., Ltd.
Zhen Huang
Center for Medical Imaging, Robotics, Analytic Computing & Learning (MIRACLE), Suzhou Institute for Advanced Research, USTC, 215123, China
Haoran Lai
University of Science and Technology of China
Medical Image Processing, Deep Learning
Zhiyang He
Massachusetts Institute of Technology
Quantum Information
Xiaodong Tao
Anhui IFLYTEK CO., Ltd.
Zihang Jiang
School of Biomedical Engineering, USTC, Suzhou Institute for Advanced Research
Computer Vision, Medical Imaging, 3D
Shaohua Kevin Zhou
Professor, USTC, FAIMBE, FIAMBE, FIEEE, FMICCAI, FNAI
Medical Image Computing, Computer Vision & Pattern Recognition, Machine & Deep Learning