How Well Can General Vision-Language Models Learn Medicine By Watching Public Educational Videos?

📅 2025-04-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates whether general-purpose vision-language models can acquire specialized medical knowledge from non-standardized, highly heterogeneous, publicly available medical educational videos (e.g., biomedical videos on YouTube). To address this question, we introduce OpenBiomedVid, a large-scale medical video instruction-tuning dataset (1,031 hours) annotated through human-AI collaboration, and release two expert-curated evaluation benchmarks: MIMICEchoQA and SurgeryVideoQA. The dataset is built with a video-caption-QA construction pipeline grounded in human-in-the-loop curation, establishing an instruction-tuning and evaluation framework for medical video understanding. Fine-tuning Qwen-2-VL on OpenBiomedVid yields large improvements: the 2B model gains 98.7% on video tasks, including 99.1% on MIMICEchoQA and 98.1% on SurgeryVideoQA, and the 7B model also improves substantially on video and image tasks. These results indicate that non-standardized instructional videos serve as an effective supervisory signal, substantially enhancing both medical video comprehension and generalization to cleaner, more standardized benchmarks.
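For intuition, a single record in such a video-caption-QA instruction-tuning setup might look like the sketch below. This is illustrative only; the field names, file paths, and text are hypothetical and do not reflect the actual OpenBiomedVid schema.

```python
# Illustrative only: a hypothetical video-caption-QA record, not the actual
# OpenBiomedVid schema. All field names and values are made up.
example_record = {
    "video": "videos/echo_lecture_0421.mp4",             # clip from an educational video
    "segment": {"start_sec": 512.0, "end_sec": 547.5},   # time span the text refers to
    "caption": (
        "Apical four-chamber echocardiogram view; the narrator points out "
        "reduced left-ventricular wall motion."
    ),
    "qa": [
        {
            "question": "Which echocardiographic view is shown in this segment?",
            "answer": "An apical four-chamber view.",
        }
    ],
}

def to_chat_sample(record, qa_index=0):
    """Flatten one Q/A pair into a chat-style instruction sample for a VLM."""
    qa = record["qa"][qa_index]
    return [
        {"role": "user", "content": [
            {"type": "video", "video": record["video"]},
            {"type": "text", "text": qa["question"]},
        ]},
        {"role": "assistant", "content": [{"type": "text", "text": qa["answer"]}]},
    ]

print(to_chat_sample(example_record))
```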

📝 Abstract
Publicly available biomedical videos, such as those on YouTube, serve as valuable educational resources for medical students. Unlike standard machine learning datasets, these videos are designed for human learners, often mixing medical imagery with narration, explanatory diagrams, and contextual framing. In this work, we investigate whether such pedagogically rich, yet non-standardized and heterogeneous videos can effectively teach general-domain vision-language models biomedical knowledge. To this end, we introduce OpenBiomedVid, a biomedical video instruction-tuning dataset comprising 1,031 hours of video-caption and Q/A pairs, curated through a multi-step human-in-the-loop pipeline. Diverse biomedical video datasets are rare, and OpenBiomedVid fills an important gap by providing instruction-style supervision grounded in real-world educational content. Surprisingly, despite the informal and heterogeneous nature of these videos, the fine-tuned Qwen-2-VL models exhibit substantial performance improvements across most benchmarks. The 2B model achieves gains of 98.7% on video tasks, 71.2% on image tasks, and 0.2% on text tasks. The 7B model shows improvements of 37.09% on video and 11.2% on image tasks, with a slight degradation of 2.7% on text tasks, relative to their respective base models. To address the lack of standardized biomedical video evaluation datasets, we also introduce two new expert-curated benchmarks, MIMICEchoQA and SurgeryVideoQA. On these benchmarks, the 2B model achieves gains of 99.1% and 98.1%, while the 7B model shows gains of 22.5% and 52.1%, respectively, demonstrating the models' ability to generalize and perform biomedical video understanding on cleaner and more standardized datasets than those seen during training. These results suggest that educational videos created for human learning offer a surprisingly effective training signal for biomedical VLMs.
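As a rough illustration of the evaluation setup (not the authors' code), a single video QA item can be run through an off-the-shelf Qwen2-VL checkpoint with the Hugging Face transformers API roughly as follows. The video path, question text, and sampling settings are placeholders.

```python
# Minimal sketch of video QA with Qwen2-VL via Hugging Face transformers.
# Not the paper's training/evaluation code; paths and question are placeholders.
# Requires: pip install transformers qwen-vl-utils
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "example_echo_clip.mp4", "fps": 1.0},
        {"type": "text", "text": "Which echocardiographic view is shown?"},
    ],
}]

# Build the chat prompt and extract video frames the way Qwen2-VL expects.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)

# Strip the prompt tokens and decode only the generated answer.
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```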
Problem

Research questions and friction points this paper is trying to address.

Can general-purpose vision-language models learn medical knowledge from heterogeneous public educational videos?
Does instruction tuning on the OpenBiomedVid dataset improve biomedical video understanding?
How well do the tuned models perform on the new expert-curated benchmarks MIMICEchoQA and SurgeryVideoQA?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces OpenBiomedVid, a 1,031-hour biomedical video instruction-tuning dataset curated via a human-in-the-loop pipeline
Fine-tunes Qwen-2-VL (2B and 7B) models, yielding large gains on video and image benchmarks
Releases two expert-curated evaluation benchmarks, MIMICEchoQA and SurgeryVideoQA