MedBookVQA: A Systematic and Comprehensive Medical Benchmark Derived from Open-Access Book

📅 2025-06-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
A systematic, multimodal medical evaluation benchmark has been lacking, hindering reliable assessment and advancement of General Medical AI (GMAI) and multimodal large language models (MLLMs), while high-quality medical textbooks remain underutilized for benchmark construction. Method: We propose the first textbook-driven multimodal medical evaluation benchmark, built automatically from open-access medical textbooks via an image-text alignment pipeline, yielding 5,000 clinical visual question answering (VQA) items spanning five core tasks: imaging recognition, disease classification, anatomical localization, differential diagnosis, and treatment planning. Contribution/Results: We introduce a novel three-tier medical knowledge annotation schema, covering 42 imaging modalities, 125 anatomical structures, and 31 clinical specialties, that enables fine-grained capability attribution. Experiments expose critical weaknesses in state-of-the-art MLLMs, particularly in anatomical understanding and specialty-specific reasoning. Our benchmark provides the first interpretable, anatomy- and specialty-organized performance evaluation framework for multimodal medical AI.

📝 Abstract
The accelerating development of general medical artificial intelligence (GMAI), powered by multimodal large language models (MLLMs), offers transformative potential for addressing persistent healthcare challenges, including workforce deficits and escalating costs. The parallel development of systematic evaluation benchmarks emerges as a critical imperative to enable performance assessment and provide technological guidance. Meanwhile, as an invaluable knowledge source, the potential of medical textbooks for benchmark development remains underexploited. Here, we present MedBookVQA, a systematic and comprehensive multimodal benchmark derived from open-access medical textbooks. To curate this benchmark, we propose a standardized pipeline for automated extraction of medical figures while contextually aligning them with corresponding medical narratives. Based on this curated data, we generate 5,000 clinically relevant questions spanning modality recognition, disease classification, anatomical identification, symptom diagnosis, and surgical procedures. A multi-tier annotation system categorizes queries through hierarchical taxonomies encompassing medical imaging modalities (42 categories), body anatomies (125 structures), and clinical specialties (31 departments), enabling nuanced analysis across medical subdomains. We evaluate a wide array of MLLMs, including proprietary, open-sourced, medical, and reasoning models, revealing significant performance disparities across task types and model categories. Our findings highlight critical capability gaps in current GMAI systems while establishing textbook-derived multimodal benchmarks as essential evaluation tools. MedBookVQA establishes textbook-derived benchmarking as a critical paradigm for advancing clinical AI, exposing limitations in GMAI systems while providing anatomically structured performance metrics across specialties.
Problem

Research questions and friction points this paper is trying to address.

Developing a systematic medical benchmark from open-access textbooks
Assessing multimodal AI performance across clinical tasks
Identifying gaps in general medical AI capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated extraction of medical figures
Contextual alignment with medical narratives
Multi-tier annotation system for queries
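The multi-tier annotation system above tags each question with a modality, an anatomical structure, and a clinical specialty, which is what enables per-subdomain performance breakdowns. A minimal sketch of how such annotations support tier-wise scoring is below; the `VQAItem` record, field names, and example values are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class VQAItem:
    """Hypothetical MedBookVQA-style item with a three-tier annotation."""
    question: str
    answer: str
    modality: str   # one of the 42 imaging-modality categories
    anatomy: str    # one of the 125 anatomical structures
    specialty: str  # one of the 31 clinical specialties

def accuracy_by(items, predictions, tier):
    """Group answer correctness by one annotation tier (e.g. 'specialty')."""
    totals, correct = {}, {}
    for item, pred in zip(items, predictions):
        key = getattr(item, tier)
        totals[key] = totals.get(key, 0) + 1
        if pred == item.answer:
            correct[key] = correct.get(key, 0) + 1
    return {k: correct.get(k, 0) / totals[k] for k in totals}

# Toy data (invented for illustration only).
items = [
    VQAItem("Which modality is shown?", "CT", "CT", "thorax", "radiology"),
    VQAItem("Which structure is marked?", "liver", "MRI", "abdomen", "hepatology"),
]
preds = ["CT", "spleen"]
print(accuracy_by(items, preds, "specialty"))  # → {'radiology': 1.0, 'hepatology': 0.0}
```

Grouping by a different tier (`"modality"` or `"anatomy"`) reuses the same records, which is the practical payoff of annotating every item along all three axes at once.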
Sau Lai Yip
Department of Computer Science and Engineering, The Hong Kong University of Science and Technology
Sunan He
Hong Kong University of Science and Technology
Multi-Modal Learning
Yuxiang Nie
Hong Kong University of Science and Technology
Natural Language Processing, Multi-Modal Learning, Medical Image Analysis
Shu Pui Chan
Department of Computer Science and Engineering, The Hong Kong University of Science and Technology
Yilin Ye
Hong Kong University of Science and Technology
Visualization, Human-Computer Interaction
Sum Ying Lam
Department of Computer Science and Engineering, The Hong Kong University of Science and Technology
Hao Chen
Department of Computer Science and Engineering, The Hong Kong University of Science and Technology; Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology; Division of Life Science, The Hong Kong University of Science and Technology