Instruction-Free Tuning of Large Vision Language Models for Medical Instruction Following

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the scarcity of large-scale, high-quality image-instruction-output triplets in the medical domain, which hinders effective instruction tuning of large vision-language models (LVLMs). To overcome this limitation, the authors propose a fine-tuning approach that eliminates the need for human-annotated instructions by dynamically substituting explicit instructions with momentum-based proxy instructions and incorporating a response-shuffling strategy. This method preserves and transfers instruction-following capabilities using only image-caption pairs, substantially reducing reliance on expert annotations. Evaluated across multiple medical multiple-choice visual question answering benchmarks—including SKINCON, WBCAtt, CBIS, and MIMIC-CXR—the approach achieves state-of-the-art performance, significantly enhancing both the fine-tuning efficiency and generalization capability of medical LVLMs.

📝 Abstract
Large vision language models (LVLMs) have demonstrated impressive performance across a wide range of tasks. These capabilities largely stem from visual instruction tuning, which fine-tunes models on datasets consisting of curated image-instruction-output triplets. However, in the medical domain, constructing large-scale, high-quality instruction datasets is particularly challenging due to the need for specialized expert knowledge. To address this issue, we propose an instruction-free tuning approach that reduces reliance on handcrafted instructions, leveraging only image-description pairs for fine-tuning. Specifically, we introduce a momentum proxy instruction as a replacement for curated text instructions, which preserves the instruction-following capability of the pre-trained LVLM while promoting updates to parameters that remain valid during inference. Consequently, the fine-tuned LVLM can flexibly respond to domain-specific instructions, even though explicit instructions are absent during fine-tuning. Additionally, we incorporate a response shuffling strategy to mitigate the model's over-reliance on previous words, facilitating more effective fine-tuning. Our approach achieves state-of-the-art accuracy on multiple-choice visual question answering tasks across SKINCON, WBCAtt, CBIS, and MIMIC-CXR datasets, significantly enhancing the fine-tuning efficiency of LVLMs in medical domains.
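The abstract names two mechanisms—a momentum proxy instruction that fills the instruction slot and a response-shuffling strategy applied to the caption—but does not spell out their mechanics. Below is a minimal, hypothetical PyTorch sketch of one plausible reading, assuming the proxy is a learnable embedding smoothed by an exponential moving average and that shuffling is done at the sentence level of the caption; the class and function names (`MomentumProxyInstruction`, `shuffle_response`) are illustrative and not taken from the paper.

```python
import random
import torch


def shuffle_response(caption: str) -> str:
    """Response shuffling (assumed sentence-level): permute the caption's
    sentences so the supervision target does not always continue from the
    same preceding words."""
    sentences = [s.strip() for s in caption.split(".") if s.strip()]
    random.shuffle(sentences)
    return ". ".join(sentences) + "."


class MomentumProxyInstruction(torch.nn.Module):
    """Learnable stand-in for the tokenized text instruction.

    Assumption (not stated in the abstract): the embedding actually fed to
    the LVLM is an exponential moving average (EMA) of the optimized
    parameter, so the instruction slot drifts slowly and stays close to what
    the pre-trained model expects at inference."""

    def __init__(self, num_tokens: int, dim: int, momentum: float = 0.99):
        super().__init__()
        self.weight = torch.nn.Parameter(0.02 * torch.randn(num_tokens, dim))
        self.register_buffer("ema", self.weight.detach().clone())
        self.momentum = momentum

    def forward(self) -> torch.Tensor:
        if self.training:
            with torch.no_grad():
                self.ema.mul_(self.momentum).add_(
                    self.weight.detach(), alpha=1.0 - self.momentum
                )
        # Straight-through trick: values come from the EMA buffer,
        # gradients still flow into self.weight.
        return self.ema + (self.weight - self.weight.detach())


if __name__ == "__main__":
    proxy = MomentumProxyInstruction(num_tokens=4, dim=16).train()
    pseudo_instruction = proxy()  # (4, 16) embeddings for the instruction slot
    print(pseudo_instruction.shape)
    print(shuffle_response("Opacity in the left lower lobe. No pleural effusion."))
```

In a full fine-tuning loop one would presumably concatenate the projected image tokens, the proxy-instruction embeddings, and the embeddings of the shuffled caption, then apply the usual next-token loss on the caption—again a guess at the overall layout, mirroring standard visual instruction tuning rather than the authors' exact pipeline.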
Problem

Research questions and friction points this paper is trying to address.

medical instruction following
large vision language models
instruction tuning
medical domain
expert knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

instruction-free tuning
momentum proxy instruction
large vision language models
medical visual question answering
response shuffling
Myeongkyun Kang
Department of Electrical and Computer Engineering, The University of British Columbia, Vancouver, BC V6T 1Z4, Canada; Department of Robotics and Mechatronics Engineering, Daegu Gyeongbuk Institute of Science and Technology (DGIST), Daegu, 42988, Republic of Korea
Soopil Kim
Division of Intelligent Robot, Daegu Gyeongbuk Institute of Science and Technology (DGIST), Daegu, 42988, Republic of Korea
Xiaoxiao Li
Assistant Professor, UBC; Vector Institute; CIFAR AI Chair; Canada Research Chair
Deep Learning · Trustworthy AI · AI for Healthcare
Sang Hyun Park
Daegu Gyeongbuk Institute of Science & Technology, South Korea
Medical Image Analysis · Computer Vision · Machine Learning