🤖 AI Summary
This work addresses the scarcity of large-scale, high-quality image-instruction-output triplets in the medical domain, which hinders effective instruction tuning of large vision-language models (LVLMs). To overcome this limitation, the authors propose a novel fine-tuning approach that eliminates the need for human-annotated instructions by dynamically substituting explicit instructions with momentum-based proxy instructions and incorporating an answer-shuffling strategy. This method enables the preservation and transfer of instruction-following capabilities using only image-caption pairs, substantially reducing reliance on expert annotations. Evaluated across multiple medical multiple-choice visual question answering benchmarks—including SKINCON, WBCAtt, CBIS, and MIMIC-CXR—the approach achieves state-of-the-art performance, significantly enhancing both the fine-tuning efficiency and generalization capability of medical LVLMs.
📝 Abstract
Large vision-language models (LVLMs) have demonstrated impressive performance across a wide range of tasks. These capabilities largely stem from visual instruction tuning, which fine-tunes models on datasets consisting of curated image-instruction-output triplets. However, in the medical domain, constructing large-scale, high-quality instruction datasets is particularly challenging due to the need for specialized expert knowledge. To address this issue, we propose an instruction-free tuning approach that reduces reliance on handcrafted instructions, leveraging only image-description pairs for fine-tuning. Specifically, we introduce a momentum proxy instruction as a replacement for curated text instructions, which preserves the instruction-following capability of the pre-trained LVLM while promoting updates to parameters that remain valid during inference. Consequently, the fine-tuned LVLM can flexibly respond to domain-specific instructions, even though explicit instructions are absent during fine-tuning. Additionally, we incorporate a response shuffling strategy to mitigate the model's over-reliance on previous words, facilitating more effective fine-tuning. Our approach achieves state-of-the-art accuracy on multiple-choice visual question answering tasks across the SKINCON, WBCAtt, CBIS, and MIMIC-CXR datasets, significantly enhancing the fine-tuning efficiency of LVLMs in medical domains.
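The abstract does not spell out how the momentum proxy instruction or the response shuffling are implemented. As a minimal illustrative sketch only, assuming the proxy is a vector maintained by an exponential-moving-average (EMA) update and that shuffling permutes the sentences of a caption (both are our hypothetical readings, not the paper's stated implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 8     # toy embedding width (hypothetical)
MOMENTUM = 0.99   # EMA coefficient (hypothetical value)

def update_proxy(proxy, batch_embedding, momentum=MOMENTUM):
    """EMA update of the proxy instruction vector:
    proxy <- m * proxy + (1 - m) * batch_embedding."""
    return momentum * proxy + (1.0 - momentum) * batch_embedding

# Stand-in for a proxy that replaces the embedded text instruction.
proxy = np.zeros(EMBED_DIM)
for _ in range(100):
    # Stand-in for a per-batch embedding produced by the model.
    batch_embedding = rng.normal(size=EMBED_DIM)
    proxy = update_proxy(proxy, batch_embedding)

def shuffle_response(caption, rng):
    """Permute the sentences of a caption so the model cannot rely on a
    fixed word order from earlier sentences when predicting later ones."""
    sentences = [s.strip() for s in caption.split(".") if s.strip()]
    order = rng.permutation(len(sentences))
    return ". ".join(sentences[i] for i in order) + "."

caption = "The lesion is irregular. Borders are ill-defined. Color is uneven."
shuffled = shuffle_response(caption, rng)
```

In this sketch the EMA keeps the proxy slowly moving, so it stays a stable drop-in for the instruction slot, while the sentence permutation changes the target order across epochs.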