Instruction-Free Tuning of Large Vision Language Models for Medical Instruction Following

📅 2026-03-19
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the scarcity of large-scale, high-quality image-instruction-output triplets in the medical domain, which limits instruction tuning of large vision-language models (LVLMs). The authors propose an instruction-free fine-tuning approach that reduces reliance on handcrafted instructions: a momentum proxy instruction stands in for curated text instructions during fine-tuning, and a response shuffling strategy mitigates the model's over-reliance on previous words. The method uses only image-description pairs, yet preserves the pre-trained model's instruction-following capability, so the fine-tuned LVLM can still respond to domain-specific instructions at inference. On multiple-choice visual question answering tasks across the SKINCON, WBCAtt, CBIS, and MIMIC-CXR datasets, the approach achieves state-of-the-art accuracy and improves the fine-tuning efficiency of medical LVLMs.

📝 Abstract
Large vision language models (LVLMs) have demonstrated impressive performance across a wide range of tasks. These capabilities largely stem from visual instruction tuning, which fine-tunes models on datasets consisting of curated image-instruction-output triplets. However, in the medical domain, constructing large-scale, high-quality instruction datasets is particularly challenging due to the need for specialized expert knowledge. To address this issue, we propose an instruction-free tuning approach that reduces reliance on handcrafted instructions, leveraging only image-description pairs for fine-tuning. Specifically, we introduce a momentum proxy instruction as a replacement for curated text instructions, which preserves the instruction-following capability of the pre-trained LVLM while promoting updates to parameters that remain valid during inference. Consequently, the fine-tuned LVLM can flexibly respond to domain-specific instructions, even though explicit instructions are absent during fine-tuning. Additionally, we incorporate a response shuffling strategy to mitigate the model's over-reliance on previous words, facilitating more effective fine-tuning. Our approach achieves state-of-the-art accuracy on multiple-choice visual question answering tasks across SKINCON, WBCAtt, CBIS, and MIMIC-CXR datasets, significantly enhancing the fine-tuning efficiency of LVLMs in medical domains.
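
The abstract does not say how the momentum proxy instruction is computed. One plausible reading, offered here only as an assumption, is that the proxy is an embedding vector updated as an exponential moving average (EMA) of some per-step instruction embedding. The minimal sketch below illustrates that generic EMA update; the function name `momentum_update`, the `momentum` parameter, and the toy vectors are hypothetical and are not taken from the paper.

```python
from typing import List


def momentum_update(proxy: List[float], current: List[float],
                    momentum: float = 0.99) -> List[float]:
    """Generic EMA step: momentum * proxy + (1 - momentum) * current.

    ASSUMPTION: the paper's momentum proxy instruction may differ; this is
    only the textbook exponential-moving-average update.
    """
    if len(proxy) != len(current):
        raise ValueError("proxy and current must have the same length")
    return [momentum * p + (1.0 - momentum) * c
            for p, c in zip(proxy, current)]


# Toy usage: the proxy drifts toward a fixed (hypothetical) target embedding.
proxy = [0.0, 0.0]
for step_embedding in ([1.0, 2.0], [1.0, 2.0], [1.0, 2.0]):
    proxy = momentum_update(proxy, step_embedding, momentum=0.5)
```

A high momentum value keeps the proxy close to its previous state, which is the usual motivation for momentum-style updates when a slowly changing target is wanted.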
Problem

Research questions and friction points this paper is trying to address.

medical instruction following
large vision language models
instruction tuning
medical domain
expert knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

instruction-free tuning
momentum proxy instruction
large vision language models
medical visual question answering
response shuffling
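
The response shuffling idea listed above can be illustrated with a small sketch. The listing does not specify the unit that gets shuffled, so this example assumes, purely for illustration, that the image description is split into sentences and their order is permuted before being used as the training target. The helper name `shuffle_response` and the sample description are hypothetical.

```python
import random
import re
from typing import Optional


def shuffle_response(description: str, seed: Optional[int] = None) -> str:
    """Return the description with its sentences in a random order.

    ASSUMPTION: sentence-level shuffling is one possible interpretation of
    "response shuffling"; the paper may shuffle at a different granularity.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", description.strip())
    sentences = [s.strip() for s in parts if s.strip()]
    rng = random.Random(seed)  # local RNG so the global state is untouched
    rng.shuffle(sentences)
    return " ".join(sentences)


# Toy usage with a made-up description.
text = "The lesion is raised. The border is irregular. The color is uneven."
shuffled = shuffle_response(text, seed=0)
```

Because the same sentences appear in a different order across epochs, the model cannot rely on a fixed preceding context to predict each one, which is consistent with the stated goal of reducing over-reliance on previous words.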