BiomedAP: A Vision-Informed Dual-Anchor Framework with Gated Cross-Modal Fusion for Robust Medical Vision-Language Adaptation

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the instability of medical vision–language models under prompt variations, which arise because clinical text descriptions are often noisy and heterogeneous. The authors propose a dual-anchor prompt learning framework in which a gated cross-modal fusion mechanism dynamically suppresses irrelevant textual information and lets visual and textual prompts interact layer by layer. The high anchor enforces semantic consistency via expert-designed templates, while the low anchor improves representational stability using few-shot visual prototypes. Combined with parameter-efficient fine-tuning, the model is reported to consistently surpass baselines across 11 medical benchmarks, with competitive few-shot accuracy and markedly better robustness under prompt perturbations.
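To make the dual-anchor idea concrete, here is a minimal NumPy sketch of a regularizer that pulls learnable prompt features toward a high anchor (template embeddings) and a low anchor (few-shot visual prototypes). Everything in it is an assumption for illustration: the function names, the cosine-distance form of the penalty, and the weights `w_high` and `w_low` are hypothetical and are not taken from the paper or its repository.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to (approximately) unit length along `axis`.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def visual_prototypes(image_feats, labels, num_classes):
    # Hypothetical low anchor: mean few-shot image feature per class.
    return np.stack([image_feats[labels == c].mean(axis=0)
                     for c in range(num_classes)])

def dual_anchor_loss(prompt_feats, high_anchors, low_anchors,
                     w_high=1.0, w_low=1.0):
    # All inputs have shape (num_classes, dim).
    # Assumed penalty: mean cosine distance to each anchor set.
    p = l2_normalize(prompt_feats)
    h = l2_normalize(high_anchors)
    lo = l2_normalize(low_anchors)
    high_term = np.mean(1.0 - np.sum(p * h, axis=-1))
    low_term = np.mean(1.0 - np.sum(p * lo, axis=-1))
    return w_high * high_term + w_low * low_term
```

In this sketch the loss is near zero when the prompt features already point in the same direction as both anchors, and it grows as they drift away from either one. The actual constraint used by BiomedAP may differ in form.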

📝 Abstract

Biomedical Vision-Language Models (VLMs) have shown remarkable promise in few-shot medical diagnosis but face a critical bottleneck: fragility to prompt variations. Existing adaptation frameworks typically optimize visual and textual prompts as independent streams, relying on ideal "Golden Prompts". In clinical reality, where descriptions are often noisy and heterogeneous, this modality isolation leads to unstable cross-modal alignment. To address this, we propose BiomedAP, a vision-informed dual-anchor framework with gated cross-modal fusion. BiomedAP enforces synergistic alignment through two mechanisms: (1) Gated Cross-Modal Fusion, which enables layer-wise interaction between modalities, acting as a dynamic noise regulator to suppress irrelevant textual cues; and (2) a Dual-Anchor Constraint that regularizes learnable prompts toward stable semantic centroids derived from both expert templates (High Anchors) and few-shot visual prototypes (Low Anchors). Extensive experiments across 11 benchmarks demonstrate that BiomedAP consistently surpasses baselines, achieving competitive few-shot accuracy and markedly enhanced robustness under prompt perturbations. Our code is available at: https://github.com/tongdiedie/BiomedAP. Keywords: Vision-Language Models; Prompt Learning; Parameter-Efficient Fine-Tuning; Few-shot Learning
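The gated fusion described in the abstract could look roughly like the NumPy sketch below for a single layer. It is an assumption-laden illustration, not the authors' implementation: the function name `gated_cross_modal_fusion`, the gate parameters `w_gate` and `b_gate`, the single-head attention, and the convex mixing rule are all hypothetical choices made here to show how a learned gate can down-weight textual tokens in favor of visual context.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_cross_modal_fusion(text_tokens, visual_tokens, w_gate, b_gate):
    # text_tokens: (T, d), visual_tokens: (V, d)
    # w_gate: (2d, d), b_gate: (d,)  (hypothetical gate parameters)
    d = text_tokens.shape[-1]
    # Each text token attends over the visual tokens.
    attn = softmax(text_tokens @ visual_tokens.T / np.sqrt(d), axis=-1)
    visual_context = attn @ visual_tokens                      # (T, d)
    # Gate in (0, 1) computed from both modalities.
    gate_in = np.concatenate([text_tokens, visual_context], axis=-1)
    gate = sigmoid(gate_in @ w_gate + b_gate)                  # (T, d)
    # Assumed mixing rule: gate near 0 suppresses the text feature
    # and replaces it with visual context; gate near 1 keeps the text.
    return gate * text_tokens + (1.0 - gate) * visual_context
```

Applying such a block at several encoder layers would give the layer-wise interaction the abstract mentions. Under this mixing rule, a token whose gate saturates toward zero contributes almost nothing textual, which is one plausible reading of "dynamic noise regulator".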

Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models

Prompt Learning

Few-shot Learning

Cross-Modal Alignment

Prompt Robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Gated Cross-Modal Fusion

Dual-Anchor Constraint

Vision-Language Models