🤖 AI Summary
This work addresses abnormality localization for rare clinical diseases, where scarce annotated data makes supervised fine-tuning impractical and single-pass inference yields unstable predictions. The authors propose Dynamic Decision Learning (DDL), a framework that uses a frozen large vision-language model at test time, with no fine-tuning. DDL iteratively refines the instruction, generates predictions over multiple rounds under visual perturbations, and aggregates the predictions that agree into a final localization with a consensus-based reliability score. On a brain imaging dataset covering 281 pathology types, this test-time, multi-round refinement achieves up to a 105% relative improvement in mAP@75 and outperforms both adaptation baselines and supervised fine-tuning. DDL's reliability scores are also better calibrated against localization accuracy under distribution shift.
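The aggregation step can be illustrated with a minimal sketch. It assumes that each round returns one bounding box as `(x1, y1, x2, y2)`, that agreement is measured by IoU, and that reliability is the fraction of rounds that agree with the chosen box. The function names, the 0.5 agreement threshold, and the averaging rule are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def consensus(boxes: List[Box], iou_thresh: float = 0.5) -> Tuple[Box, float]:
    """Hypothetical consensus rule: find the prediction that the most rounds
    agree with, average its supporters, and report the supporting fraction
    of rounds as the reliability score."""
    if not boxes:
        raise ValueError("need at least one prediction")
    best_support: List[Box] = []
    for cand in boxes:
        # A candidate always supports itself, since IoU(cand, cand) == 1.
        support = [b for b in boxes if iou(cand, b) >= iou_thresh]
        if len(support) > len(best_support):
            best_support = support
    n = len(best_support)
    merged = tuple(sum(b[i] for b in best_support) / n for i in range(4))
    reliability = n / len(boxes)
    return merged, reliability


# Four rounds: three roughly agree, one is an outlier.
rounds = [(10, 10, 50, 50), (12, 12, 52, 52), (8, 8, 48, 48), (100, 100, 120, 120)]
box, score = consensus(rounds)
print(box, score)  # (10.0, 10.0, 50.0, 50.0) 0.75
```

In this sketch a low score flags cases where the perturbed rounds disagree, which is the intuition behind using consensus as a confidence signal.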
📝 Abstract
Clinical abnormality grounding for rare diseases is often hindered by data scarcity, making supervised fine-tuning impractical and single-pass inference highly unstable. We propose Dynamic Decision Learning (DDL), a framework that enables frozen large vision-language models (LVLMs) to refine their decisions across both language and visual spaces by optimizing instructions and consolidating predictions under visual perturbations. This process improves localization quality and produces a consensus-based reliability score that quantifies model confidence. Results on brain imaging benchmarks, including a rare-disease dataset with 281 pathology types across models ranging from 3B to 72B parameters, show that DDL improves mAP@75 by up to 105% on rare-disease cases and outperforms adaptation baselines and supervised fine-tuning. Furthermore, DDL demonstrates stronger calibration between reliability scores and localization accuracy under severe distribution shifts and increasing task difficulty. Code is available at: https://lijunrio.github.io/DDL/
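For scale, and assuming the 105% figure is a relative gain as the summary states, the arithmetic works out as below. The symbols $m_{\text{base}}$ and $m_{\text{DDL}}$ are notation introduced here for the baseline and DDL mAP@75 values; they are not taken from the paper. Under the usual convention, mAP@75 counts a predicted box as correct only when its IoU with the ground truth is at least 0.75.

```latex
\[
\text{relative gain} \;=\; \frac{m_{\text{DDL}} - m_{\text{base}}}{m_{\text{base}}},
\qquad
\text{relative gain} = 1.05 \;\Longrightarrow\; m_{\text{DDL}} = 2.05\, m_{\text{base}}
\]
```

A 105% relative gain therefore means the score slightly more than doubles; it does not mean an increase of 105 percentage points.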