K-MaT: Knowledge-Anchored Manifold Transport for Cross-Modal Prompt Learning in Medical Imaging

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of transferring vision-language models pretrained on high-end medical imaging modalities (e.g., CT) to low-end modalities (e.g., X-ray), where models often latch onto modality-specific shortcuts that hinder generalization. To overcome this, the authors propose K-MaT, a framework that, for the first time, integrates Fused Gromov-Wasserstein optimal transport to align cross-modal prompt manifolds, coupled with clinical text-anchored prompting to guide knowledge transfer. Notably, K-MaT adapts to low-end modalities zero-shot, without requiring any training images from the target modality. By combining prompt learning, manifold alignment, and knowledge anchoring, the method attains state-of-the-art performance across four cross-modal medical imaging benchmarks, achieving a harmonic mean accuracy of 44.1% and a macro-F1 score of 36.2%, while substantially mitigating catastrophic forgetting.
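The headline metric is a harmonic mean of accuracy, which, unlike an arithmetic average, drops sharply when performance on either modality collapses; this makes it a natural way to penalize catastrophic forgetting. A minimal illustration (the 60% high-end figure is invented for the example; 27.0% is the low-end accuracy the abstract reports for CoOp on breast imaging):

```python
def harmonic_mean(a, b):
    """Harmonic mean of two scores; punishes imbalance between them."""
    return 2 * a * b / (a + b)

# A hypothetical method keeping 60% on the high-end modality but
# collapsing to 27% on the low-end scores far below its arithmetic
# average of 43.5:
print(round(harmonic_mean(60.0, 27.0), 1))  # → 37.2
```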

📝 Abstract
Large-scale biomedical vision-language models (VLMs) adapted to high-end imaging (e.g., CT) often fail to transfer to frontline low-end modalities (e.g., radiography), collapsing into modality-specific shortcuts. We propose K-MaT (Knowledge-Anchored Manifold Transport), a prompt-learning framework that transfers decision structures to low-end modalities without requiring low-end training images. K-MaT factorizes prompts, anchors them to clinical text descriptions, and aligns the low-end prompt manifold to the visually grounded high-end space using Fused Gromov-Wasserstein optimal transport. We evaluate K-MaT on four cross-modal benchmarks, including dermoscopy, mammography-to-ultrasound, and CT-to-chest-X-ray transfer. K-MaT achieves state-of-the-art results, improving the average harmonic mean of accuracy to 44.1% (from BiomedCoOp's 42.0%) and macro-F1 to 36.2%. Notably, on the challenging breast imaging task it mitigates the catastrophic forgetting seen in standard methods such as CoOp (which drops to 27.0% accuracy on the low-end modality), preserving robust performance across modalities. Aligning prompt manifolds via optimal transport thus provides a highly effective route to zero-shot cross-modal deployment of medical VLMs.
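The core alignment step, Fused Gromov-Wasserstein (FGW) transport between the low-end and high-end prompt manifolds, couples a cross-domain feature cost (the Wasserstein part) with intra-domain distance structure (the Gromov part). The paper's implementation is not shown here, so the sketch below is an illustrative simplification under stated assumptions: an entropic solver that alternates the square-loss GW gradient with a Sinkhorn projection, with all function names, parameter values, and the uniform marginals chosen for the example.

```python
import numpy as np

def sinkhorn(cost, p, q, reg=0.05, n_iter=200):
    """Entropic OT: returns a coupling with marginals close to p and q."""
    cost = cost / max(cost.max(), 1e-12)  # normalize for numerical stability
    K = np.exp(-cost / reg)
    u = np.ones_like(p)
    for _ in range(n_iter):
        v = q / (K.T @ u)
        u = p / (K @ v)
    return u[:, None] * K * v[None, :]

def fgw_coupling(X_src, X_tgt, alpha=0.5, reg=0.05, n_outer=20):
    """Entropic Fused Gromov-Wasserstein sketch (not the paper's code).

    X_src, X_tgt: prompt/feature matrices of shape (n, d) and (m, d).
    alpha trades off the cross-domain feature cost (Wasserstein part)
    against the intra-domain structure cost (Gromov part).
    """
    n, m = len(X_src), len(X_tgt)
    p = np.full(n, 1.0 / n)  # uniform marginals, assumed for the example
    q = np.full(m, 1.0 / m)
    # Intra-domain structure matrices: pairwise distances within each manifold.
    C1 = np.linalg.norm(X_src[:, None] - X_src[None, :], axis=-1)
    C2 = np.linalg.norm(X_tgt[:, None] - X_tgt[None, :], axis=-1)
    # Cross-domain feature cost.
    M = np.linalg.norm(X_src[:, None] - X_tgt[None, :], axis=-1) ** 2
    T = np.outer(p, q)  # start from the independent coupling
    for _ in range(n_outer):
        # Gradient of the square-loss GW term at the current coupling
        # (up to additive constants that Sinkhorn absorbs).
        gw_grad = -2.0 * C1 @ T @ C2
        fused = (1 - alpha) * M + alpha * gw_grad
        T = sinkhorn(fused - fused.min(), p, q, reg=reg)  # shift keeps cost >= 0
    return T
```

Mature solvers such as `ot.gromov.fused_gromov_wasserstein` in the POT library implement this with proper convergence checks; the returned coupling `T` would then transport low-end prompts onto the visually grounded high-end manifold.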
Problem

Research questions and friction points this paper is trying to address.

cross-modal transfer
medical imaging
vision-language models
modality gap
zero-shot deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prompt Learning
Optimal Transport
Cross-Modal Transfer
Vision-Language Models
Manifold Alignment