🤖 AI Summary
This study addresses the absence of a multimodal evaluation benchmark tailored to the Korean Medical Licensing Examination, particularly for complex medical image understanding tasks requiring cross-image reasoning. To this end, the authors introduce KorMedMCQA-V, the first Korean multimodal visual question answering benchmark, comprising 1,534 questions and 2,043 clinical images spanning diverse modalities such as X-ray, CT, and electrocardiography. The dataset supports both single- and multi-image reasoning and, together with the textual benchmark KorMedMCQA, forms a comprehensive evaluation suite. Under a unified zero-shot protocol, over 50 general-purpose, medical-specialized, and Korean-specialized models are systematically evaluated. Results show that the best closed-source model achieves 96.9% accuracy and the best open-source model reaches 83.7%, while Korean-specialized models lag significantly at 43.2%. Performance consistently degrades on multi-image questions, and accuracy varies substantially across imaging modalities.
📝 Abstract
We introduce KorMedMCQA-V, a Korean medical licensing-exam-style multimodal multiple-choice question answering benchmark for evaluating vision-language models (VLMs). The dataset consists of 1,534 questions with 2,043 associated images from Korean Medical Licensing Examinations (2012–2023), with about 30% containing multiple images that require cross-image evidence integration. Images cover clinical modalities including X-ray, computed tomography (CT), electrocardiography (ECG), ultrasound, endoscopy, and other medical visuals. We benchmark over 50 VLMs across proprietary and open-source categories, spanning general-purpose, medical-specialized, and Korean-specialized families, under a unified zero-shot evaluation protocol. The best proprietary model (Gemini-3.0-Pro) achieves 96.9% accuracy, the best open-source model (Qwen3-VL-32B-Thinking) reaches 83.7%, and the best Korean-specialized model (VARCO-VISION-2.0-14B) only 43.2%. We further find that reasoning-oriented model variants gain up to +20 percentage points over instruction-tuned counterparts, that medical domain specialization yields inconsistent gains over strong general-purpose baselines, that all models degrade on multi-image questions, and that performance varies notably across imaging modalities. By complementing the text-only KorMedMCQA benchmark, KorMedMCQA-V forms a unified evaluation suite for Korean medical reasoning across text-only and multimodal conditions. The dataset is available via Hugging Face Datasets: https://huggingface.co/datasets/seongsubae/KorMedMCQA-V.
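The reported breakdowns (overall accuracy, per-modality accuracy, and single- vs. multi-image accuracy) all reduce to exact-match scoring of a predicted choice against the gold answer, sliced by a grouping key. A minimal sketch of that scoring logic is below; the record layout (`modality`, `n_images`, `answer`, `prediction`) is illustrative and not taken from the dataset's actual schema:

```python
from collections import defaultdict

# Illustrative predictions on a toy sample (not real benchmark data).
samples = [
    {"modality": "X-ray", "n_images": 1, "answer": "B", "prediction": "B"},
    {"modality": "X-ray", "n_images": 2, "answer": "D", "prediction": "A"},
    {"modality": "CT",    "n_images": 1, "answer": "C", "prediction": "C"},
    {"modality": "ECG",   "n_images": 1, "answer": "A", "prediction": "A"},
]

def accuracy_by(samples, key):
    """Exact-match accuracy grouped by an arbitrary key function."""
    hits, totals = defaultdict(int), defaultdict(int)
    for s in samples:
        k = key(s)
        totals[k] += 1
        hits[k] += s["prediction"] == s["answer"]
    return {k: hits[k] / totals[k] for k in totals}

overall = sum(s["prediction"] == s["answer"] for s in samples) / len(samples)
per_modality = accuracy_by(samples, lambda s: s["modality"])     # slice by modality
multi_image = accuracy_by(samples, lambda s: s["n_images"] > 1)  # single vs. multi-image
```

On this toy sample the multi-image slice scores 0% while single-image scores 100%, mirroring the paper's finding that models degrade on multi-image questions.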