🤖 AI Summary
This work addresses the scarcity of large-scale, high-quality image-instruction-output triplets in the medical domain, which hinders effective instruction tuning of large vision-language models (LVLMs). To overcome this limitation, the authors propose a novel fine-tuning approach that eliminates the need for human-annotated instructions by dynamically substituting explicit instructions with momentum-based proxy instructions and incorporating an answer-shuffling strategy. This method enables the preservation and transfer of instruction-following capabilities using only image-caption pairs, substantially reducing reliance on expert annotations. Evaluated across multiple medical multiple-choice visual question answering benchmarks—including SKINCON, WBCAtt, CBIS, and MIMIC-CXR—the approach achieves state-of-the-art performance, significantly enhancing both the fine-tuning efficiency and generalization capability of medical LVLMs.
📝 Abstract
Large vision-language models (LVLMs) have demonstrated impressive performance across a wide range of tasks. These capabilities largely stem from visual instruction tuning, which fine-tunes models on datasets consisting of curated image-instruction-output triplets. However, in the medical domain, constructing large-scale, high-quality instruction datasets is particularly challenging due to the need for specialized expert knowledge. To address this issue, we propose an instruction-free tuning approach that reduces reliance on handcrafted instructions, leveraging only image-description pairs for fine-tuning. Specifically, we introduce a momentum proxy instruction as a replacement for curated text instructions, which preserves the instruction-following capability of the pre-trained LVLM while promoting updates to parameters that remain valid during inference. Consequently, the fine-tuned LVLM can flexibly respond to domain-specific instructions, even though explicit instructions are absent during fine-tuning. Additionally, we incorporate a response shuffling strategy to mitigate the model's over-reliance on previous words, facilitating more effective fine-tuning. Our approach achieves state-of-the-art accuracy on multiple-choice visual question answering tasks across the SKINCON, WBCAtt, CBIS, and MIMIC-CXR datasets, significantly enhancing the fine-tuning efficiency of LVLMs in medical domains.
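The abstract does not spell out how the momentum proxy instruction or the response shuffling are implemented. As a minimal illustrative sketch only, assuming the proxy is a vector maintained by an exponential-moving-average (EMA) update and that shuffling permutes the sentences of a caption (both are our hypothetical readings, not the paper's stated implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 8     # toy embedding width (hypothetical)
MOMENTUM = 0.99   # EMA coefficient (hypothetical value)

def update_proxy(proxy, batch_embedding, momentum=MOMENTUM):
    """EMA update of the proxy instruction vector:
    proxy <- m * proxy + (1 - m) * batch_embedding."""
    return momentum * proxy + (1.0 - momentum) * batch_embedding

# Stand-in for a proxy that replaces the embedded text instruction.
proxy = np.zeros(EMBED_DIM)
for _ in range(100):
    # Stand-in for a per-batch embedding produced by the model.
    batch_embedding = rng.normal(size=EMBED_DIM)
    proxy = update_proxy(proxy, batch_embedding)

def shuffle_response(caption, rng):
    """Permute the sentences of a caption so the model cannot rely on a
    fixed word order from earlier sentences when predicting later ones."""
    sentences = [s.strip() for s in caption.split(".") if s.strip()]
    order = rng.permutation(len(sentences))
    return ". ".join(sentences[i] for i in order) + "."

caption = "The lesion is irregular. Borders are ill-defined. Color is uneven."
shuffled = shuffle_response(caption, rng)
```

In this sketch the EMA keeps the proxy slowly moving, so it stays a stable drop-in for the instruction slot, while the sentence permutation changes the target order across epochs.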