🤖 AI Summary
Multimodal large language models (MLLMs) remain weak in fine-grained visual perception, medical knowledge comprehension, and clinical reasoning in high-stakes settings such as ophthalmic surgery. To address this, the paper introduces EyePCR, the first multi-level benchmark for surgical cognitive assessment. EyePCR evaluates three core competencies, visual perception, knowledge comprehension, and clinical reasoning, by combining fine-grained attribute annotations, a large-scale ophthalmic knowledge graph, and clinically grounded reasoning tasks. Building on EyePCR, the authors propose EyePCR-MLLM, a domain-adapted variant of Qwen2.5-VL-7B trained with structured knowledge modeling, vision-language question generation, knowledge-graph augmentation, and domain-adaptive training. Experiments show that EyePCR-MLLM achieves the highest perception MCQ accuracy among compared models, outperforms leading open-source MLLMs on knowledge comprehension and clinical reasoning, and rivals commercial models such as GPT-4.1, substantially improving the cognitive reliability and clinical applicability of surgical video analysis.
📝 Abstract
MLLMs (Multimodal Large Language Models) have showcased remarkable capabilities, but their performance in high-stakes, domain-specific scenarios such as surgical settings remains largely under-explored. To address this gap, we develop **EyePCR**, a large-scale benchmark for ophthalmic surgery analysis, grounded in structured clinical knowledge to evaluate cognition across *Perception*, *Comprehension* and *Reasoning*. EyePCR offers a richly annotated corpus of more than 210k VQAs, covering 1,048 fine-grained attributes for multi-view perception, a medical knowledge graph of more than 25k triplets for comprehension, and four clinically grounded reasoning tasks. The rich annotations enable in-depth cognitive analysis, simulating how surgeons perceive visual cues and combine them with domain knowledge to make decisions, and thereby greatly improve models' cognitive ability. In particular, **EyePCR-MLLM**, a domain-adapted variant of Qwen2.5-VL-7B, achieves the highest accuracy on MCQs for *Perception* among compared models and outperforms open-source models in *Comprehension* and *Reasoning*, rivalling commercial models like GPT-4.1. EyePCR reveals the limitations of existing MLLMs in surgical cognition and lays the foundation for benchmarking and enhancing the clinical reliability of surgical video understanding models.
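To make the evaluation setup concrete, here is a minimal sketch of how EyePCR-style items (MCQs tagged with a cognitive level, fine-grained attributes, and knowledge-graph triplets) might be represented and scored per competency. The schema, field names, and sample items are hypothetical illustrations, not the benchmark's actual format or data.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical item schema: the paper describes MCQs organized by
# Perception/Comprehension/Reasoning, fine-grained visual attributes,
# and (head, relation, tail) knowledge-graph triplets; the exact
# representation below is an assumption for illustration.
@dataclass
class EyePCRItem:
    level: str                      # "perception" | "comprehension" | "reasoning"
    question: str
    options: dict[str, str]         # option key -> option text
    answer: str                     # gold option key
    attributes: list[str] = field(default_factory=list)                # fine-grained visual attributes
    triplets: list[tuple[str, str, str]] = field(default_factory=list) # KG triplets grounding the item

def per_level_accuracy(items, predict):
    """Score a model's MCQ predictions separately for each cognitive level."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        total[item.level] += 1
        if predict(item) == item.answer:
            correct[item.level] += 1
    return {lvl: correct[lvl] / total[lvl] for lvl in total}

# Usage with two invented items and a trivial baseline that always picks "A":
items = [
    EyePCRItem("perception", "Which instrument is visible in the frame?",
               {"A": "phaco probe", "B": "capsulorhexis forceps"}, "A",
               attributes=["instrument:phaco_probe"]),
    EyePCRItem("comprehension", "What does capsulorhexis create?",
               {"A": "a corneal flap", "B": "an opening in the anterior lens capsule"}, "B",
               triplets=[("capsulorhexis", "creates", "anterior capsule opening")]),
]
print(per_level_accuracy(items, predict=lambda item: "A"))
# {'perception': 1.0, 'comprehension': 0.0}
```

Reporting accuracy per level rather than as a single aggregate mirrors the benchmark's goal of separating what a model *sees* from what it *knows* and how it *reasons*.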