🤖 AI Summary
Medical image analysis is hindered by scarce annotated data and a substantial semantic gap between general-purpose vision-language models (VLMs) and the medical domain. Existing approaches often rely on unidirectional modality adaptation or prompt tuning, leading to misalignment between visual and textual representations. To address this, we propose a parameter-efficient framework for dynamic cross-modal interaction. Our method introduces a unified collaborative embedding Transformer coupled with orthogonal cross-attention adapters, enabling bidirectional and decoupled vision–language interaction. Additionally, we impose orthogonal regularization on modality-specific representation spaces to mitigate representation misalignment. With only 1.46M trainable parameters, our approach consistently outperforms unidirectional interaction and single-modality fine-tuning baselines across multiple medical vision–language tasks. Experimental results validate the effectiveness of dynamic cross-modal alignment and knowledge-separated learning in bridging domain-specific semantic gaps while maintaining high parameter efficiency.
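The orthogonal regularization mentioned above can be illustrated with a minimal sketch. The snippet below is not the paper's implementation; it only shows the standard soft-orthogonality penalty (Frobenius norm of the Gram matrix's deviation from identity) that such regularizers typically use to keep modality-specific representation directions from collapsing onto one another.

```python
import numpy as np

def orthogonal_penalty(W: np.ndarray) -> float:
    """Soft-orthogonality penalty ||W W^T - I||_F^2.

    Zero exactly when the rows of W form an orthonormal set;
    grows as row directions become correlated (misaligned/overlapping
    representation subspaces).
    """
    gram = W @ W.T
    identity = np.eye(W.shape[0])
    return float(np.sum((gram - identity) ** 2))

# An orthonormal basis incurs zero penalty; correlated rows do not.
print(orthogonal_penalty(np.eye(3)))        # 0.0
print(orthogonal_penalty(np.ones((2, 2))))  # > 0, rows are parallel
```

In training, a term like this would be added to the task loss with a small weight, nudging the two modality-specific subspaces toward mutual orthogonality.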
📝 Abstract
Computer-aided medical image analysis is crucial for disease diagnosis and treatment planning, yet limited annotated datasets restrict the development of medical-specific models. While vision-language models (VLMs) like CLIP offer strong generalization capabilities, their direct application to medical imaging is impeded by a significant domain gap. Existing approaches to bridging this gap, including prompt learning and one-way modality interaction techniques, typically introduce domain knowledge into a single modality. Although this may offer performance gains, it often causes modality misalignment and thereby fails to unlock the full potential of VLMs. In this paper, we propose **NEARL-CLIP** (i**N**teracted qu**E**ry **A**daptation with o**R**thogona**L** Regularization), a novel cross-modality interaction VLM-based framework with two contributions: (1) the Unified Synergy Embedding Transformer (USEformer), which dynamically generates cross-modality queries to promote interaction between modalities, thus fostering the mutual enrichment and enhancement of multi-modal medical domain knowledge; (2) the Orthogonal Cross-Attention Adapter (OCA), which introduces an orthogonality technique to decouple the new knowledge from USEformer into two distinct components: truly novel information and incremental knowledge. By isolating the learning process from interference by the incremental knowledge, OCA enables a more focused acquisition of new information, further facilitating modality interaction and unleashing the capability of VLMs. Notably, NEARL-CLIP achieves both contributions in a parameter-efficient manner, introducing only **1.46M** learnable parameters.
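The decoupling performed by OCA can be sketched with simple linear algebra. The snippet below is an illustrative assumption, not the paper's code: it splits an update vector into an "incremental" component lying in the span of existing feature directions and a "novel" component orthogonal to that span, which is the kind of separation the abstract describes.

```python
import numpy as np

def decouple(update: np.ndarray, basis: np.ndarray):
    """Split `update` into a component inside span(basis) and its
    orthogonal complement.

    `basis` is assumed to have orthonormal columns spanning the
    already-learned ("incremental") knowledge directions. The residual
    is the "truly novel" information outside that subspace.
    """
    incremental = basis @ (basis.T @ update)  # projection onto span(basis)
    novel = update - incremental              # orthogonal residual
    return novel, incremental

# Existing knowledge spans the first two axes of R^3; the third axis
# carries information no existing direction can express.
basis = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
novel, incremental = decouple(np.array([1.0, 2.0, 3.0]), basis)
print(novel)        # [0. 0. 3.]
print(incremental)  # [1. 2. 0.]
```

Focusing learning on the `novel` component, rather than re-learning the `incremental` part, is the intuition behind isolating new information from interference.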