GI-Bench: A Panoramic Benchmark Revealing the Knowledge-Experience Dissociation of Multimodal Large Language Models in Gastrointestinal Endoscopy Against Clinical Standards

📅 2026-01-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the lack of systematic evaluation of multimodal large language models (MLLMs) in the full clinical workflow of gastrointestinal endoscopy, particularly the absence of benchmarking against human physicians. The authors introduce GI-Bench, a comprehensive evaluation benchmark spanning five endoscopic procedural stages and 20 fine-grained lesion categories. Their analysis reveals a “knowledge–experience dissociation” in current MLLMs and identifies two critical challenges: a “spatial localization bottleneck” and a “fluency–accuracy paradox.” Comprehensive assessment of 12 leading MLLMs using Macro-F1, mIoU, and multidimensional Likert scales shows that top-performing models approach the diagnostic reasoning capability of junior endoscopists (Macro-F1: 0.641 vs. 0.727) but exhibit substantially weaker spatial localization (mIoU: 0.345 vs. >0.506) and are prone to factual errors and visual hallucinations.

📝 Abstract
Multimodal Large Language Models (MLLMs) show promise in gastroenterology, yet their performance across comprehensive clinical workflows and against human benchmarks remains unverified. We aimed to systematically evaluate state-of-the-art MLLMs across a panoramic gastrointestinal endoscopy workflow and to determine their clinical utility compared with human endoscopists. We constructed GI-Bench, a benchmark encompassing 20 fine-grained lesion categories. Twelve MLLMs were evaluated across a five-stage clinical workflow: anatomical localization, lesion identification, diagnosis, findings description, and management. Model performance was benchmarked against three junior endoscopists and three residency trainees using Macro-F1, mean Intersection-over-Union (mIoU), and multi-dimensional Likert scales. Gemini-3-Pro achieved state-of-the-art performance. In diagnostic reasoning, top-tier models (Macro-F1 0.641) outperformed trainees (0.492) and rivaled junior endoscopists (0.727; p > 0.05). However, a critical "spatial grounding bottleneck" persisted: human lesion localization (mIoU > 0.506) significantly outperformed the best model (0.345; p < 0.05). Furthermore, qualitative analysis revealed a "fluency-accuracy paradox": models generated reports with superior linguistic readability compared with humans (p < 0.05) but significantly lower factual correctness (p < 0.05), driven by "over-interpretation" and hallucination of visual features. GI-Bench maintains a dynamic leaderboard that tracks the evolving performance of MLLMs in clinical endoscopy. Current rankings and benchmark results are available at https://roterdl.github.io/GIBench/.
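The two headline metrics are standard. As a minimal illustrative sketch (not the benchmark's evaluation code; the (x1, y1, x2, y2) box format is an assumption), Macro-F1 averages per-class F1 scores without class weighting, and IoU measures bounding-box overlap:

```python
# Illustrative computation of Macro-F1 and box IoU (hypothetical helper code,
# not taken from GI-Bench).
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # predicted p, but truth was t
            fn[t] += 1          # missed an instance of t
    classes = set(y_true) | set(y_pred)
    f1s = []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

mIoU is then the mean of `iou` over all image–lesion pairs; Macro-F1 treats rare lesion categories with the same weight as common ones, which matters for a benchmark with 20 fine-grained classes.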
Problem


Multimodal Large Language Models
Gastrointestinal Endoscopy
Clinical Benchmarking
Spatial Grounding
Hallucination
Innovation


GI-Bench
multimodal large language models
spatial grounding bottleneck
fluency-accuracy paradox
clinical benchmarking
Yan Zhu
Endoscopy Center and Endoscopy Research Institute, Zhongshan Hospital, Fudan University, Shanghai, 200032, China.
Te Luo
Endoscopy Center and Endoscopy Research Institute, Zhongshan Hospital, Fudan University, Shanghai, 200032, China.
Pei-yao Fu
Endoscopy Center and Endoscopy Research Institute, Zhongshan Hospital, Fudan University, Shanghai, 200032, China.
Zhen Zhang
California Research Center (Agilent Technologies), Argonne National Lab, Northwestern University
Zi-Long Wang
Microsoft Research Asia, Shanghai, 200232, China.
Yi-Fan Qu
Endoscopy Center and Endoscopy Research Institute, Zhongshan Hospital, Fudan University, Shanghai, 200032, China.
Zifan Geng
Endoscopy Center and Endoscopy Research Institute, Zhongshan Hospital, Fudan University, Shanghai, 200032, China.
Jia-qi Xu
Endoscopy Center and Endoscopy Research Institute, Zhongshan Hospital, Fudan University, Shanghai, 200032, China.
Lu Yao
Endoscopy Center and Endoscopy Research Institute, Zhongshan Hospital, Fudan University, Shanghai, 200032, China.
Li-yun Ma
Endoscopy Center and Endoscopy Research Institute, Zhongshan Hospital, Fudan University, Shanghai, 200032, China.
Wei Su
Endoscopy Center and Endoscopy Research Institute, Zhongshan Hospital, Fudan University, Shanghai, 200032, China.
Wei-Feng Chen
Endoscopy Center and Endoscopy Research Institute, Zhongshan Hospital, Fudan University, Shanghai, 200032, China.
Quan-Lin Li
Beijing University of Technology
Shuo Wang
Fudan University
Ping-Hong Zhou
Endoscopy Center and Endoscopy Research Institute, Zhongshan Hospital, Fudan University, Shanghai, 200032, China.