vMFCoOp: Towards Equilibrium on a Unified Hyperspherical Manifold for Prompting Biomedical VLMs

πŸ“… 2025-11-12
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses three key bottlenecks in prompt learning for biomedical vision-language models (VLMs): semantic misalignment between LLMs and CLIP variants, poor scalability as families of foundation models evolve, and the inadequacy of Euclidean-space optimization for modeling multimodal geometric structure. To this end, the authors propose a unified prompt learning framework grounded in spherical manifold geometry. The core contributions are: (1) inverse estimation of von Mises–Fisher (vMF) distributions to enforce cross-modal semantic alignment on a shared unit hypersphere; (2) a semantic anchor mechanism with three geometric constraints that ensures few-shot stability and cross-model scalability; and (3) integration of LLM-distilled prior-guided in-context learning with multimodal prompt tuning. Evaluated across 14 biomedical datasets, 12 imaging modalities, and 13 anatomical regions, the method significantly outperforms state-of-the-art approaches in accuracy, generalization, and clinical applicability.
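To make the core building block concrete, here is a minimal sketch of the vMF density and its standard maximum-likelihood fit on the unit hypersphere, using only NumPy/SciPy. It illustrates the textbook distribution (with the Banerjee et al., 2005 closed-form approximation for the concentration), not the paper's inverse-estimation procedure.

```python
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel I_v

def vmf_log_density(x, mu, kappa):
    """Log-density of vMF(mu, kappa) on S^{d-1} for unit vectors x, mu.

    Assumes kappa > 0. Uses the exponentially scaled Bessel function for
    numerical stability: log I_v(k) = log ive(v, k) + k.
    """
    d = mu.shape[-1]
    v = d / 2.0 - 1.0
    log_norm = v * np.log(kappa) - (d / 2.0) * np.log(2.0 * np.pi) \
        - (np.log(ive(v, kappa)) + kappa)
    return log_norm + kappa * (x @ mu)

def vmf_fit(samples):
    """MLE mean direction and Banerjee et al. (2005) approximation of
    kappa from unit-norm samples of shape (n, d)."""
    n, d = samples.shape
    resultant = samples.sum(axis=0)
    r_bar = np.linalg.norm(resultant) / n        # mean resultant length
    mu = resultant / np.linalg.norm(resultant)   # normalized resultant
    kappa = r_bar * (d - r_bar ** 2) / (1.0 - r_bar ** 2)
    return mu, kappa
```

For d = 2 this reduces to the von Mises distribution on the circle, and as kappa approaches 0 the density approaches the uniform distribution on the sphere.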

πŸ“ Abstract
Recent advances in context optimization (CoOp) guided by large language model (LLM)-distilled medical semantic priors offer a scalable alternative to manual prompt engineering and full fine-tuning for adapting biomedical CLIP-based vision-language models (VLMs). However, prompt learning in this context is challenged by semantic misalignment between LLMs and CLIP variants due to divergent training corpora and model architectures; it further lacks scalability across continuously evolving families of foundation models. More critically, pairwise multimodal alignment via conventional Euclidean-space optimization lacks the capacity to model unified representations or apply localized geometric constraints, which tends to amplify modality gaps in complex biomedical imaging and destabilize few-shot adaptation. In this work, we propose vMFCoOp, a framework that inversely estimates von Mises-Fisher (vMF) distributions on a shared Hyperspherical Manifold, aligning semantic biases between arbitrary LLMs and CLIP backbones via Unified Semantic Anchors to achieve robust biomedical prompting and superior few-shot classification. Grounded in three complementary constraints, vMFCoOp demonstrates consistent improvements across 14 medical datasets, 12 medical imaging modalities, and 13 anatomical regions, outperforming state-of-the-art methods in accuracy, generalization, and clinical applicability. This work is intended to expand to further downstream applications over time, and the corresponding resources will be shared through https://github.com/VinyehShaw/UniEqui.
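A hedged sketch of the anchor idea described above: per-class embeddings from an LLM prior and a CLIP text encoder are projected onto a shared unit hypersphere, their vMF mean directions are estimated, and the two directions are fused into one unit-norm anchor per class. The fusion weight `alpha`, the helper names, and the simple convex combination are illustrative assumptions, not the paper's constraint-based construction.

```python
import numpy as np

def unit(v, axis=-1):
    """Project vectors onto the unit hypersphere (L2-normalize)."""
    return v / np.linalg.norm(v, axis=axis, keepdims=True)

def unified_anchor(llm_embs, clip_embs, alpha=0.5):
    """Fuse per-class LLM-prior and CLIP text embeddings, shapes (n, d),
    into a single unit-norm anchor.

    Each modality's vMF maximum-likelihood mean direction is the
    normalized resultant of its unit-norm samples; the two directions
    are blended with an assumed weight `alpha` and re-projected onto
    the sphere.
    """
    mu_llm = unit(unit(llm_embs).sum(axis=0))
    mu_clip = unit(unit(clip_embs).sum(axis=0))
    return unit(alpha * mu_llm + (1.0 - alpha) * mu_clip)
```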
Problem

Research questions and friction points this paper is trying to address.

Aligns semantic biases between LLMs and CLIP models for biomedical prompting
Models unified representations on a hyperspherical manifold to reduce modality gaps
Enables robust few-shot classification across diverse medical imaging modalities (a minimal sketch follows this list)
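Below is a minimal sketch of how hyperspherical few-shot classification can work once per-class anchors exist: the image embedding is scored by concentration-scaled cosine similarity, i.e., the log-density term of a vMF likelihood. A shared concentration `kappa` and a uniform class prior are simplifying assumptions.

```python
import numpy as np

def classify(image_emb, anchors, kappa=30.0):
    """Softmax over kappa-scaled cosine similarities between an image
    embedding (d,) and per-class unit-norm anchors (C, d).

    With a shared concentration kappa and a uniform class prior, this
    equals the vMF posterior p(c | x) up to the common normalizer.
    """
    x = image_emb / np.linalg.norm(image_emb)
    logits = kappa * (anchors @ x)
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    return probs / probs.sum()
```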
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses vMF distributions on hyperspherical manifold
Aligns LLM and CLIP biases via unified anchors
Applies three constraints for robust biomedical prompting
πŸ”Ž Similar Papers
No similar papers found.
Minye Shao
Department of Computer Science, Durham University, Durham, UK
Sihan Guo
Department of Computer Science, Durham University, Durham, UK
Xinrun Li
Department of Computer Science, Durham University, Durham, UK
Xingyu Miao
Department of Computer Science, Durham University, Durham, UK
Haoran Duan
Tsinghua/Newcastle/Durham University
Multimodal AI · Generative AI
Yang Long
Department of Computer Science, Durham University
Computer Vision · Machine Learning · Artificial Intelligence