🤖 AI Summary
Multimodal large language models (MLLMs) excel on general vision tasks but suffer significant performance degradation on out-of-distribution (OOD) tasks in specialized domains such as medical imaging, where labeled data is severely scarce. Method: We propose LEAML, a low-resource, parameter-efficient adaptation framework with three components: (1) pseudo question-answer (QA) pair generation for unlabeled images via a dedicated QA generator; (2) caption-distillation regularization of that generator to enforce representation consistency; and (3) selective fine-tuning of only the neurons most relevant to the QA task. Contribution/Results: By jointly exploiting scarce labeled data and abundant unlabeled data, LEAML substantially outperforms standard full-parameter fine-tuning on two low-supervision benchmarks, gastrointestinal endoscopy QA and sports VQA, demonstrating strong generalization under extreme data scarcity alongside high computational efficiency.
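The caption-distillation regularization described above can be sketched as a combined training objective: a QA loss on generated pairs plus a consistency term tying the generator's representation to a caption teacher. This is a minimal illustrative sketch; the paper's exact loss form is not specified here, so the MSE consistency term, the function names, and the weighting `lam` are all assumptions.

```python
import numpy as np

def caption_distillation_loss(student_feat, teacher_feat):
    """Consistency term between the QA generator's features and a
    caption-model teacher's features (MSE is an assumed choice)."""
    return float(np.mean((student_feat - teacher_feat) ** 2))

def total_loss(qa_loss, student_feat, teacher_feat, lam=0.5):
    """Hypothetical combined objective: QA supervision plus a
    lam-weighted caption-distillation regularizer."""
    return qa_loss + lam * caption_distillation_loss(student_feat, teacher_feat)

# Toy example: identical features incur no distillation penalty.
student = np.array([0.2, -0.1, 0.5])
teacher = np.array([0.2, -0.1, 0.5])
loss = total_loss(qa_loss=1.0, student_feat=student, teacher_feat=teacher)
```

When the student drifts from the teacher, the second term grows, pulling the generator's representations back toward caption-consistent features on unlabeled images.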
📝 Abstract
Multimodal Large Language Models (MLLMs) have achieved strong performance on general visual benchmarks but struggle with out-of-distribution (OOD) tasks in specialized domains such as medical imaging, where labeled data is scarce and expensive to obtain. We introduce LEAML, a label-efficient adaptation framework that leverages both scarce labeled VQA samples and abundant unlabeled images. Our approach generates domain-relevant pseudo question-answer pairs for unlabeled data using a QA generator regularized by caption distillation. Importantly, we selectively update only those neurons most relevant to question-answering, enabling the QA generator to efficiently acquire domain-specific knowledge during distillation. Experiments on gastrointestinal endoscopy and sports VQA demonstrate that LEAML consistently outperforms standard fine-tuning under minimal supervision, underscoring its effectiveness for label-efficient OOD adaptation.
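The selective neuron update can be pictured as masking the gradient step so that only QA-relevant parameters move. A minimal NumPy sketch follows; the relevance criterion (accumulated gradient magnitude on QA data) and the function names are assumptions for illustration, not the paper's actual selection rule.

```python
import numpy as np

def select_relevant_neurons(grad_accumulator, k):
    """Rank neurons by accumulated gradient magnitude on QA data and
    mark the top-k as trainable (hypothetical relevance criterion)."""
    idx = np.argsort(np.abs(grad_accumulator))[::-1][:k]
    mask = np.zeros_like(grad_accumulator, dtype=bool)
    mask[idx] = True
    return mask

def masked_update(weights, grad, mask, lr=0.1):
    """Apply a gradient step only to neurons flagged as relevant;
    all other neurons stay frozen."""
    return weights - lr * grad * mask

# Toy example: of 6 "neurons", only the 2 with the largest accumulated
# QA gradients (indices 1 and 3) receive an update.
w = np.zeros(6)
acc_grad = np.array([0.1, 2.0, 0.05, 1.5, 0.2, 0.0])
mask = select_relevant_neurons(acc_grad, k=2)
w_new = masked_update(w, np.ones(6), mask)
```

Freezing the irrelevant parameters keeps the update parameter-efficient and limits drift of general-domain knowledge under extreme data scarcity.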