Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models

📅 2025-06-02
🏛️ IEEE Transactions on Medical Imaging
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the weak visual representations of CLIP-style methods and the poor cross-modal alignment of models pretrained with masked vision modeling (MVM), this paper proposes ALTA, a lightweight visual-encoder adaptation framework for medical vision-language alignment. ALTA adapts an MVM-pretrained visual backbone while fine-tuning only about 8% of the parameters, introduces temporal multi-view radiograph inputs to strengthen image-text semantic consistency, and is optimized end-to-end with cross-modal contrastive learning. On medical image-report alignment, ALTA improves retrieval accuracy over CLIP-style baselines by 4.2% (text-to-image) and 5.8% (image-to-text), while generalizing well to zero-shot classification. With minimal computational overhead, ALTA alleviates the cross-modal representation bottleneck, offering an efficient approach for resource-constrained medical multimodal learning.
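The cross-modal contrastive learning mentioned above is, in CLIP-style methods, typically a symmetric InfoNCE loss over matched image-report pairs in a batch. A minimal NumPy sketch of that objective (embedding shapes, the temperature value, and function names here are illustrative assumptions, not details from the paper):

```python
import numpy as np

def symmetric_info_nce(img_emb, txt_emb, temperature=0.07):
    """CLIP-style symmetric contrastive loss over a batch of image and
    text embeddings of shape [N, D]; matched pairs share a row index."""
    # L2-normalize so dot products become cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # [N, N] similarity matrix
    labels = np.arange(len(logits))         # matched pairs on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, labels)
                  + cross_entropy(logits.T, labels))
```

Minimizing this loss pulls each radiograph embedding toward its own report and pushes it away from the other reports in the batch, which is what enables retrieval and zero-shot classification by nearest-neighbor matching in the shared space.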

📝 Abstract
Medical vision-language alignment through cross-modal contrastive learning shows promising performance in image-text matching tasks, such as retrieval and zero-shot classification. However, conventional cross-modal contrastive learning (CLIP-based) methods suffer from suboptimal visual representation capabilities, which also limits their effectiveness in vision-language alignment. In contrast, although the models pretrained via multimodal masked modeling struggle with direct cross-modal matching, they excel in visual representation. To address this contradiction, we propose ALTA (ALign Through Adapting), an efficient medical vision-language alignment method that utilizes only about 8% of the trainable parameters and less than 1/5 of the computational consumption required for masked record modeling. ALTA achieves superior performance in vision-language matching tasks like retrieval and zero-shot classification by adapting the pretrained vision model from masked record modeling. Additionally, we integrate temporal-multiview radiograph inputs to enhance the information consistency between radiographs and their corresponding descriptions in reports, further improving the vision-language alignment. Experimental evaluations show that ALTA outperforms the best-performing counterpart by over 4% absolute points in text-to-image accuracy and approximately 6% absolute points in image-to-text retrieval accuracy. The adaptation of vision-language models during efficient alignment also promotes better vision and language understanding. Code is publicly available at https://github.com/DopamineLcy/ALTA.
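The "about 8% of the trainable parameters" figure in the abstract corresponds to parameter-efficient adaptation: the MVM-pretrained backbone is kept frozen and only small added modules are updated. A hypothetical back-of-the-envelope check of such a budget (all module names and parameter counts below are illustrative assumptions, not the paper's actual architecture):

```python
# Hypothetical parameter budget for adapter-style tuning: the pretrained
# backbone stays frozen; only small inserted modules receive gradients.
frozen_backbone = {"vision_encoder": 86_000_000, "text_encoder": 63_000_000}
trainable = {"adapters": 10_400_000, "projection_heads": 2_600_000}

total = sum(frozen_backbone.values()) + sum(trainable.values())
fraction = sum(trainable.values()) / total
print(f"trainable fraction: {fraction:.1%}")  # prints: trainable fraction: 8.0%
```

Training only this small fraction is what keeps the reported compute under 1/5 of that needed for full masked record modeling, since gradients and optimizer state are maintained only for the adapter parameters.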
Problem

Research questions and friction points this paper is trying to address.

Improves medical vision-language alignment efficiency
Enhances visual representation in cross-modal learning
Optimizes radiograph-report information consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts masked vision models for alignment
Integrates temporal-multiview radiograph inputs
Uses minimal trainable parameters efficiently
Chenyu Lian
School of Informatics, Xiamen University, Xiamen 361005, China, and the Center for Smart Health, School of Nursing, the Hong Kong Polytechnic University, Hong Kong, China
Hong-Yu Zhou
Assistant Professor of Biomedical Engineering, Tsinghua University. Past: Harvard Medical School.
AI for Healthcare · AI for Medicine · Biomedical AI
Dongyun Liang
Department of Radiology, Zhongshan Hospital (Xiamen), Fudan University, Xiamen Municipal Clinical Research Center for Medical Imaging, Fujian Province Key Clinical Specialty for Medical Imaging, Xiamen Key Laboratory of Clinical Transformation of Imaging Big Data and Artificial Intelligence, Xiamen 361015, China
Jing Qin
University of Southern Denmark
Mathematics · Statistics
Liansheng Wang
National Institute for Data Science in Health and Medicine, and the Department of Computer Science, School of Informatics, Xiamen University, Xiamen 361005, China