🤖 AI Summary
This work addresses the limitations of existing vision-language models in ultrasound visual question answering (VQA), which often fail to accurately localize lesions and overlook the inherent subjectivity and ambiguity in clinical annotations. To bridge this gap, the authors propose an interactive “zoom-then-diagnose” reasoning framework that emulates radiologists’ cognitive processes by integrating an active focusing mechanism with uncertainty-aware rewards—encouraging confident predictions on clear cases while promoting caution under ambiguity. The approach combines a structured vision-language model, active zooming, a confidence-aware reward based on Group Relative Policy Optimization (GRPO), and consistency constraints. Notably, it is the first to incorporate active focusing and explicit modeling of subjective uncertainty into ultrasound VQA. Evaluated on liver, breast, and thyroid ultrasound datasets, the method improves lesion localization accuracy by 39.3%, substantially enhancing both focus precision and diagnostic robustness.
📝 Abstract
Vision-Language Models (VLMs) have significantly advanced medical visual question answering, yet their performance in ultrasound remains suboptimal. In clinical practice, sonographers explicitly focus on lesion regions to formulate reports, though diagnostic interpretations sometimes vary due to inherent subjectivity. However, existing VLMs are not explicitly structured to interactively zoom into lesions prior to diagnosis; moreover, they typically treat annotations as unbiased ground truths, failing to account for their inherent subjectivity and ambiguity. In this paper, we propose a framework specifically designed to consider the sonographer's cognitive workflow. We first introduce a structured Zoom-then-Diagnose paradigm, which replicates the interactive search process to enable lesion-focused reasoning. Furthermore, within the Group Relative Policy Optimization (GRPO) framework, we introduce an uncertainty-aware reward derived from stochastic group-wise rollouts to estimate prediction consistency as a proxy for model confidence. Together, these two components encourage the model to reinforce accurate predictions on clear cases while remaining cautious under ambiguity. Experiments across liver, breast, and thyroid datasets show that our framework improves lesion localization by 39.3\%, demonstrating that our model has learned the ability to actively look closer and diagnose.