Evi-Steer: Learning to Steer Biomedical Vision-Language Models through Efficient and Generalizable Evidential Tuning

📅 2026-05-25
🤖 AI Summary
This work addresses the limited uncertainty awareness of existing biomedical vision-language models under domain shift and ambiguous image-text alignment, which hinders their adaptability in clinical low-data settings. The authors propose Evi-Steer, a framework grounded in Dempster-Shafer theory that introduces a cross-modal belief fusion mechanism. It adapts BiomedCLIP through evidence-driven, parameter-efficient fine-tuning that updates only 0.11% of parameters, and it simultaneously estimates epistemic uncertainty, which is used to gate the residual updates. Evaluated across 15 datasets spanning eight organs and eight imaging modalities, the method consistently outperforms state-of-the-art approaches in both few-shot learning and domain generalization tasks, which the authors present as evidence of its potential for clinical deployment.
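For readers unfamiliar with evidential learning, the block below gives one common way to turn per-class evidence into belief masses and a single uncertainty mass (the subjective-logic view of a Dirichlet distribution). This is a general formulation assumed here for illustration; the summary does not state the paper's exact equations, and the symbols K, e_k, b_k, u, and S are not taken from the source.

```latex
% Assumed notation: K classes, non-negative evidence e_k per class,
% Dirichlet parameters \alpha_k = e_k + 1, Dirichlet strength S.
S = \sum_{k=1}^{K} (e_k + 1), \qquad
b_k = \frac{e_k}{S}, \qquad
u = \frac{K}{S}, \qquad
u + \sum_{k=1}^{K} b_k = 1 .

% Worked example with K = 3 and e = (8, 1, 1):
S = 13, \qquad
b = \left(\tfrac{8}{13}, \tfrac{1}{13}, \tfrac{1}{13}\right), \qquad
u = \tfrac{3}{13} \approx 0.23 .

% With no evidence at all, e = (0, 0, 0): S = 3, b = (0, 0, 0), u = 1.
```

Under this formulation, the uncertainty mass shrinks as total evidence grows and equals 1 when no evidence has been collected, which is the behavior an uncertainty-driven gate would rely on.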
📝 Abstract
Parameter-efficient adaptation of vision-language foundation models is crucial for precise multimodal understanding of biomedical images, yet existing methods remain deterministic and often struggle under domain shift or ambiguous image-text alignment. This limitation is particularly critical in the clinic, where models should remain robust in low-data regimes and under domain shift. We present Evi-Steer, an evidential cross-modal low-dimensional steering framework for BiomedCLIP that enables uncertainty-aware parameter-efficient fine-tuning while updating only 0.11% of total model parameters. Our approach performs lightweight low-dimensional token updates in both vision and text encoders while simultaneously estimating epistemic uncertainty. These uncertainty estimates gate the residual updates, allowing the model to adapt conservatively when evidence is weak. Furthermore, we introduce cross-modal confidence fusion based on Dempster-Shafer theory, enabling visual adaptation to be conditioned on textual confidence and suppressing conflicting or uncertain cross-modal updates. We conduct a comprehensive evaluation on 15 biomedical imaging datasets spanning 8 organs and 8 imaging modalities under few-shot learning and domain generalization settings. Evi-Steer consistently outperforms state-of-the-art methods in both settings, demonstrating a practical and robust pathway for deploying vision-language models in real-world clinical settings. Code is available at https://github.com/HealthX-Lab/Evi-Steer.
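The two mechanisms named in the abstract, Dempster-Shafer fusion and uncertainty-gated residuals, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the reduced form of Dempster's rule for opinions over singleton classes plus one uncertainty mass, and it assumes the simplest possible gate, `1 - uncertainty`. The names `Opinion`, `dempster_combine`, and `gated_residual` are hypothetical and do not come from the Evi-Steer repository.

```python
from typing import List, Tuple

# An opinion is (belief mass per class, uncertainty mass); the masses sum to 1.
Opinion = Tuple[List[float], float]


def dempster_combine(a: Opinion, b: Opinion) -> Opinion:
    """Fuse two opinions with the reduced Dempster's rule of combination."""
    (b1, u1), (b2, u2) = a, b
    # Mass on which both sources agree about the same class.
    agreement = sum(x * y for x, y in zip(b1, b2))
    # Conflict: mass assigned to different classes by the two sources.
    conflict = sum(b1) * sum(b2) - agreement
    norm = 1.0 - conflict
    if norm <= 0.0:
        raise ValueError("sources are in total conflict; fusion is undefined")
    fused = [(x * y + x * u2 + y * u1) / norm for x, y in zip(b1, b2)]
    return fused, (u1 * u2) / norm


def gated_residual(hidden: List[float], update: List[float],
                   uncertainty: float) -> List[float]:
    """Apply a residual update scaled by an assumed gate of 1 - uncertainty."""
    gate = 1.0 - uncertainty  # assumption: high uncertainty -> small update
    return [h + gate * d for h, d in zip(hidden, update)]


if __name__ == "__main__":
    vision: Opinion = ([0.6, 0.1, 0.1], 0.2)
    vacuous_text: Opinion = ([0.0, 0.0, 0.0], 1.0)  # text knows nothing
    agreeing_text: Opinion = ([0.6, 0.1, 0.1], 0.2)

    print(dempster_combine(vision, vacuous_text))
    fused_beliefs, fused_u = dempster_combine(vision, agreeing_text)
    print(fused_beliefs, fused_u)
    print(gated_residual([1.0, 2.0], [0.5, -0.5], fused_u))
```

Fusing with a fully uncertain source leaves the other opinion unchanged, while two agreeing sources yield a lower fused uncertainty than either alone. Feeding that fused uncertainty into the gate is one way to make adaptation conservative when the evidence is weak or conflicting; how Evi-Steer actually parameterizes its gate is not specified in the abstract.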
Problem

Research questions and friction points this paper is trying to address.

biomedical vision-language models
domain shift
parameter-efficient adaptation
uncertainty awareness
few-shot learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

evidential tuning
parameter-efficient adaptation
uncertainty-aware learning
cross-modal fusion
domain generalization