🤖 AI Summary
This study addresses the lack of systematic evaluation of multimodal large language models (MLLMs) in the full clinical workflow of gastrointestinal endoscopy, particularly the absence of benchmarking against human physicians. The authors introduce GI-Bench, a comprehensive evaluation benchmark spanning five endoscopic procedural stages and 20 fine-grained lesion categories. Their analysis reveals a “knowledge–experience dissociation” in current MLLMs and identifies two critical challenges: a “spatial localization bottleneck” and a “fluency–accuracy paradox.” Comprehensive assessment of 12 leading MLLMs using Macro-F1, mIoU, and multidimensional Likert scales shows that top-performing models approach the diagnostic reasoning capability of junior endoscopists (Macro-F1: 0.641 vs. 0.727) but exhibit substantially weaker spatial localization (mIoU: 0.345 vs. >0.506) and are prone to factual errors and visual hallucinations.
📝 Abstract
Multimodal Large Language Models (MLLMs) show promise in gastroenterology, yet their performance against comprehensive clinical workflows and human benchmarks remains unverified. We aimed to systematically evaluate state-of-the-art MLLMs across a panoramic gastrointestinal endoscopy workflow and to determine their clinical utility compared with human endoscopists. We constructed GI-Bench, a benchmark encompassing 20 fine-grained lesion categories. Twelve MLLMs were evaluated across a five-stage clinical workflow: anatomical localization, lesion identification, diagnosis, findings description, and management. Model performance was benchmarked against three junior endoscopists and three residency trainees using Macro-F1, mean Intersection-over-Union (mIoU), and multi-dimensional Likert scales. Gemini-3-Pro achieved state-of-the-art performance. In diagnostic reasoning, top-tier models (Macro-F1 0.641) outperformed trainees (0.492) and rivaled junior endoscopists (0.727; p>0.05). However, a critical “spatial grounding bottleneck” persisted: human lesion localization (mIoU>0.506) significantly outperformed the best model (0.345; p<0.05). Furthermore, qualitative analysis revealed a “fluency–accuracy paradox”: models generated reports with superior linguistic readability compared with humans (p<0.05) but exhibited significantly lower factual correctness (p<0.05) due to “over-interpretation” and hallucination of visual features. GI-Bench maintains a dynamic leaderboard that tracks the evolving performance of MLLMs in clinical endoscopy. The current rankings and benchmark results are available at https://roterdl.github.io/GIBench/.
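
For readers unfamiliar with the two headline metrics, the minimal sketch below shows how Macro-F1 (per-class diagnostic accuracy over the 20 lesion categories) and mIoU (bounding-box overlap for lesion localization) are typically computed. The labels, boxes, and box format here are illustrative assumptions for a single-label, axis-aligned setup, not GI-Bench's actual evaluation protocol.

```python
# Illustrative sketch of Macro-F1 and mIoU; data and box format are
# hypothetical, not taken from GI-Bench's published evaluation code.
from sklearn.metrics import f1_score

def box_iou(a, b):
    """Intersection-over-Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Hypothetical single-label predictions over 20 lesion categories (0..19).
y_true = [3, 3, 7, 12, 0]
y_pred = [3, 7, 7, 12, 1]
# Macro-F1 is the unweighted mean of per-class F1, so rare lesion
# categories count as much as common ones.
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Hypothetical ground-truth vs. predicted lesion boxes, one pair per image;
# mIoU averages per-image IoU across the dataset.
gt_boxes = [(10, 10, 50, 50), (30, 40, 90, 120)]
pred_boxes = [(12, 8, 48, 55), (60, 70, 110, 150)]
miou = sum(box_iou(g, p) for g, p in zip(gt_boxes, pred_boxes)) / len(gt_boxes)

print(f"Macro-F1: {macro_f1:.3f}  mIoU: {miou:.3f}")
```

Because Macro-F1 weights all 20 lesion classes equally while mIoU rewards precise spatial overlap, a model can score well on the former and poorly on the latter, which is exactly the dissociation the paper reports.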