🤖 AI Summary
Medical vision-language models (MedVLMs) carry inherent probabilistic uncertainty and often generate unverified or erroneous responses, compromising clinical reliability. Existing mitigation strategies rely on costly fine-tuning and struggle to achieve deep alignment with domain-specific clinical knowledge. To address this, we propose a **fine-tuning-free expert-cooperative control framework**: first, uncertainty estimation identifies unreliable model outputs; second, external medical knowledge retrieval is combined with expert highlighting of key information; third, classifier-free guidance dynamically modulates token-level semantic representations, enabling uncertainty-driven, closed-loop expert refinement. Evaluated on three medical visual question answering benchmarks, our approach, using only a 4.2B-parameter model and minimal expert annotations, outperforms state-of-the-art 13B-parameter models, significantly enhancing clinical consistency and feasibility for resource-constrained deployment.
📝 Abstract
The rapid advancement of Vision Language Models (VLMs) has prompted the development of multi-modal medical assistant systems. Despite this progress, current models still carry inherent probabilistic uncertainty and often produce erroneous or unverified responses, an issue with serious implications in medical applications. Existing methods aim to enhance the performance of Medical Vision Language Models (MedVLMs) by adjusting the model architecture, fine-tuning on high-quality data, or applying preference fine-tuning. However, these training-dependent strategies are costly and still fall short of full alignment with clinical expertise. To address these issues, we propose an expert-in-the-loop framework named Expert-Controlled Classifier-Free Guidance (Expert-CFG) that aligns MedVLMs with clinical expertise without additional training. The framework introduces an uncertainty estimation strategy to identify unreliable outputs. It then retrieves relevant references to assist experts in highlighting key terms and applies classifier-free guidance to refine the token embeddings of the MedVLM, ensuring that the adjusted outputs are correct and consistent with the expert highlights. Evaluations across three medical visual question answering benchmarks demonstrate that the proposed Expert-CFG, with 4.2B parameters and limited expert annotations, outperforms state-of-the-art models with 13B parameters. These results demonstrate the feasibility of deploying such a system in resource-limited settings for clinical use.
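The two computational steps named in the abstract, flagging unreliable outputs via uncertainty estimation and steering decoding with classifier-free guidance, can be illustrated with a minimal sketch. This is not the paper's implementation: the entropy threshold, the guidance weight `w`, and the use of predictive entropy as the uncertainty measure are all illustrative assumptions.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over next-token logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def predictive_entropy(logits: np.ndarray) -> float:
    """Entropy of the next-token distribution.

    High entropy is one common proxy for an unreliable output;
    the actual uncertainty estimator in Expert-CFG may differ.
    """
    p = softmax(logits)
    return float(-np.sum(p * np.log(p + 1e-12)))

def cfg_logits(cond_logits: np.ndarray,
               uncond_logits: np.ndarray,
               w: float = 1.5) -> np.ndarray:
    """Classifier-free guidance on logits.

    `cond_logits` come from a forward pass conditioned on the
    expert-highlighted context, `uncond_logits` from a pass without it.
    w = 0 recovers the unconditioned model; larger w pushes decoding
    toward the expert-conditioned distribution.
    """
    return uncond_logits + w * (cond_logits - uncond_logits)

# Hypothetical decoding step: only apply guidance when the model is uncertain.
ENTROPY_THRESHOLD = 1.0  # illustrative value, not from the paper
uncond = np.array([0.2, 0.1, 0.0, -0.1])   # toy 4-token vocabulary
cond = np.array([2.0, -1.0, -1.0, -1.0])   # expert context sharpens token 0
if predictive_entropy(uncond) > ENTROPY_THRESHOLD:
    guided = cfg_logits(cond, uncond, w=1.5)
else:
    guided = uncond
```

In words: the base model's next-token distribution is nearly uniform (high entropy), so the guided logits interpolate past the conditioned pass, amplifying the tokens the expert-highlighted context supports.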