Boosting Vision Semantic Density with Anatomy Normality Modeling for Medical Vision-language Pre-training

📅 2025-08-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Medical vision-language pre-training suffers from cross-modal alignment bias caused by a semantic-density mismatch between low signal-to-noise-ratio medical images and high signal-to-noise-ratio clinical reports. To address this, we propose a high-semantic-density visual representation learning framework. First, disease-level contrastive learning is introduced to sharpen fine-grained discrimination between normal and abnormal samples. Second, we construct an anatomy-level normality prior: a VQ-VAE reconstructs the latent-space distribution of normal anatomical appearances, and distribution-shift amplification strengthens anomaly signals, markedly improving lesion perception. Evaluated on multi-center chest and abdominal CT datasets, our method achieves state-of-the-art zero-shot diagnostic performance, attaining a mean AUC of 84.9% across 54 diseases spanning 15 organs and outperforming existing approaches by a substantial margin. It also demonstrates strong cross-domain generalizability.

📝 Abstract
Vision-language pre-training (VLP) has great potential for developing multifunctional and general medical diagnostic capabilities. However, aligning medical images with a low signal-to-noise ratio (SNR) to reports with a high SNR presents a semantic density gap, leading to visual alignment bias. In this paper, we propose boosting vision semantic density to improve alignment effectiveness. On one hand, we enhance visual semantics through disease-level vision contrastive learning, which strengthens the model's ability to differentiate between normal and abnormal samples for each anatomical structure. On the other hand, we introduce an anatomical normality modeling method to model the distribution of normal samples for each anatomy, leveraging VQ-VAE for reconstructing normal vision embeddings in the latent space. This process amplifies abnormal signals by leveraging distribution shifts in abnormal samples, enhancing the model's perception and discrimination of abnormal attributes. The enhanced visual representation effectively captures the diagnostic-relevant semantics, facilitating more efficient and accurate alignment with the diagnostic report. We conduct extensive experiments on two chest CT datasets, CT-RATE and Rad-ChestCT, and an abdominal CT dataset, MedVL-CT69K, and comprehensively evaluate the diagnosis performance across multiple tasks in the chest and abdominal CT scenarios, achieving state-of-the-art zero-shot performance. Notably, our method achieved an average AUC of 84.9% across 54 diseases in 15 organs, significantly surpassing existing methods. Additionally, we demonstrate the superior transfer learning capabilities of our pre-trained model. Code is available at https://github.com/alibaba-damo-academy/ViSD-Boost.
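The anatomical normality modeling described in the abstract can be sketched in a few lines: a VQ-VAE trained only on normal anatomy embeddings maps any input to its nearest codebook entry, so abnormal samples (which shift away from the normal distribution) leave a large quantization residual that can be amplified into an anomaly signal. This is not the paper's released code; the codebook, shapes, the `amplify` factor, and the toy data below are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a frozen codebook, standing in for one learned by a
# VQ-VAE on NORMAL anatomy embeddings only. Sizes are arbitrary.
D, K = 16, 32                       # embedding dim, codebook size
codebook = rng.normal(size=(K, D))  # assumed pre-trained VQ codes

def quantize(z, codebook):
    """VQ step: replace each embedding with its nearest codebook entry."""
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    return codebook[d.argmin(axis=1)]

def anomaly_score(z, codebook, amplify=2.0):
    """Residual between an embedding and its normal-prior reconstruction.

    Normal samples lie near the codebook, so the residual is small;
    abnormal samples drift away from the normal distribution, and the
    (optionally amplified) residual carries the anomaly signal.
    """
    residual = z - quantize(z, codebook)
    return np.linalg.norm(amplify * residual, axis=-1)

# Normal-like samples: codebook entries plus small noise.
z_normal = codebook[rng.integers(0, K, size=8)] + 0.05 * rng.normal(size=(8, D))
# Abnormal-like samples: shifted away from the normal distribution.
z_abnormal = z_normal + 3.0

s_normal = anomaly_score(z_normal, codebook)
s_abnormal = anomaly_score(z_abnormal, codebook)
print(s_normal.mean(), s_abnormal.mean())  # abnormal residuals dominate
```

The key design point mirrored here is that the prior sees only normal data, so anomaly detection falls out of reconstruction failure rather than requiring labeled lesions.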
Problem

Research questions and friction points this paper is trying to address.

Bridging the semantic density gap between medical images and reports
Enhancing visual alignment via disease-level contrastive learning
Modeling anatomical normality to amplify abnormal signal detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disease-level vision contrastive learning enhances semantics
Anatomical normality modeling with VQ-VAE reconstructs embeddings
Amplifies abnormal signals via distribution-shift detection
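The disease-level contrastive learning listed above can be illustrated with a supervised-contrastive-style loss that pulls together embeddings sharing the same (anatomy, normal/abnormal) label and pushes apart the rest. This is a minimal numpy sketch, not the paper's implementation; the function name, temperature, and toy clusters are assumptions.

```python
import numpy as np

def disease_level_contrastive_loss(emb, labels, tau=0.1):
    """Supervised contrastive loss over (anatomy, normal/abnormal) labels.

    For each anchor, same-label samples are positives and everything else
    is a negative; lower loss means label groups form tight clusters.
    """
    z = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize
    sim = z @ z.T / tau                                   # scaled cosine sims
    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    losses = []
    for i in range(n):
        pos = (labels == labels[i]) & not_self[i]
        if not pos.any():
            continue  # anchor has no positive pair
        log_denom = np.log(np.exp(sim[i][not_self[i]]).sum())
        losses.append((log_denom - sim[i][pos]).mean())
    return float(np.mean(losses))

# Toy data: two well-separated clusters standing in for "normal" vs
# "abnormal" embeddings of one anatomical structure.
rng = np.random.default_rng(1)
centers = np.zeros((2, 8))
centers[0, 0] = 5.0
centers[1, 1] = 5.0
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
emb = centers[labels] + 0.1 * rng.normal(size=(8, 8))

loss_correct = disease_level_contrastive_loss(emb, labels)
mislabeled = labels.copy()
mislabeled[0], mislabeled[4] = 1, 0  # deliberately swap two labels
loss_mislabeled = disease_level_contrastive_loss(emb, mislabeled)
print(loss_correct, loss_mislabeled)
```

Consistent labels yield a lower loss than the deliberately mislabeled assignment, which is the gradient signal that sharpens normal-vs-abnormal discrimination per anatomy.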
Weiwei Cao
Alibaba DAMO Academy, Zhejiang University
Medical Image Analysis · Vision and Language
Jianpeng Zhang
College of Computer Science and Technology, Zhejiang University
Zhongyi Shui
Ph.D. Candidate, Westlake University & Zhejiang University
Sinuo Wang
PhD Candidate, The University of Adelaide
Vision-Language · Machine Learning
Zeli Chen
DAMO Academy, Alibaba Group
Xi Li
College of Computer Science and Technology, Zhejiang University
Le Lu
Ant Group, IEEE Fellow, MICCAI Board Member (2021-2025)
Computer Vision · Medical Image Analysis · Medical Image Computing · Biomedical Image Analysis
Xianghua Ye
The First Affiliated Hospital of College of Medicine, Zhejiang University
Tingbo Liang
The First Affiliated Hospital of College of Medicine, Zhejiang University
Qi Zhang
The First Affiliated Hospital of College of Medicine, Zhejiang University
Ling Zhang
Alibaba DAMO Academy USA
Medical Image Analysis · Medical Image Computing · Machine Learning · Image Processing