🤖 AI Summary
This work addresses the performance instability of medical vision–language models when prompts vary and clinical text descriptions are noisy and heterogeneous. The authors propose BiomedAP, a dual-anchor prompt learning framework in which gated cross-modal fusion lets visual and textual prompts interact layer by layer and suppresses irrelevant textual cues. The high anchor enforces semantic consistency via expert-designed templates, while the low anchor stabilizes representations using few-shot visual prototypes. Built on parameter-efficient fine-tuning, the model consistently surpasses baselines across 11 medical benchmarks, with competitive few-shot accuracy and markedly better robustness under prompt perturbations.
📝 Abstract
Biomedical Vision--Language Models (VLMs) have shown remarkable promise in few-shot medical diagnosis but face a critical bottleneck: \textit{fragility to prompt variations}. Existing adaptation frameworks typically optimize visual and textual prompts as independent streams and rely on ideal ``Golden Prompts''. In clinical reality, where descriptions are often noisy and heterogeneous, this modality isolation leads to unstable cross-modal alignment.
To address this, we propose BiomedAP, a vision-informed dual-anchor framework with gated cross-modal fusion. BiomedAP enforces synergistic alignment through two mechanisms: (1) Gated Cross-Modal Fusion, which enables layer-wise interaction between modalities, acting as a dynamic noise regulator to suppress irrelevant textual cues; and (2) a Dual-Anchor Constraint that regularizes learnable prompts toward stable semantic centroids derived from both expert templates (High Anchors) and few-shot visual prototypes (Low Anchors).
Extensive experiments across 11 benchmarks demonstrate that BiomedAP consistently surpasses baselines, achieving competitive few-shot accuracy and markedly enhanced robustness under prompt perturbations.
Our code is available at: https://github.com/tongdiedie/BiomedAP.
Keywords: Vision-Language Models; Prompt Learning; Parameter-Efficient Fine-Tuning; Few-shot Learning