🤖 AI Summary
Existing multimodal large language model (MLLM) benchmarks for endoscopy suffer from narrow clinical coverage and monotonous task design, failing to reflect the real-world diversity of endoscopic scenarios and the demands of clinical workflows. To address this, we propose EndoBench, the first comprehensive, full-scenario endoscopic MLLM benchmark, spanning four endoscopic scenarios, twelve clinically grounded tasks, and five levels of visual prompting granularity, with 6,832 expert-validated VQA samples. We introduce a multi-dimensional evaluation framework aligned with the clinical workflow and release the first open-source, large-scale, multi-granularity endoscopy-specific benchmark, using expert performance as the reference standard. Through a systematic evaluation of 23 state-of-the-art models, covering VQA-based assessment, clinical knowledge enhancement, cross-dataset standardized annotation, and sensitivity analysis, we find that medical-domain supervised fine-tuning significantly improves task-specific performance, and that proprietary models outperform open-source counterparts but remain substantially below human experts, exposing critical bottlenecks in complex clinical reasoning.
📝 Abstract
Endoscopic procedures are essential for diagnosing and treating internal diseases, and multi-modal large language models (MLLMs) are increasingly applied to assist in endoscopy analysis. However, current benchmarks are limited: they typically cover specific endoscopic scenarios and a small set of clinical tasks, failing to capture the real-world diversity of endoscopic scenarios and the full range of skills needed in clinical workflows. To address these issues, we introduce EndoBench, the first comprehensive benchmark specifically designed to assess MLLMs across the full spectrum of endoscopic practice along multiple capability dimensions. EndoBench encompasses 4 distinct endoscopic scenarios, 12 specialized clinical tasks with 12 secondary subtasks, and 5 levels of visual prompting granularity, resulting in 6,832 rigorously validated VQA pairs drawn from 21 diverse datasets. Our multi-dimensional evaluation framework mirrors the clinical workflow, spanning anatomical recognition, lesion analysis, spatial localization, and surgical operations, to holistically gauge the perceptual and diagnostic abilities of MLLMs in realistic scenarios. We benchmark 23 state-of-the-art models, including general-purpose, medical-specialized, and proprietary MLLMs, and establish human clinician performance as a reference standard. Our extensive experiments reveal that: (1) proprietary MLLMs outperform open-source and medical-specialized models overall, but still trail human experts; (2) medical-domain supervised fine-tuning substantially boosts task-specific accuracy; and (3) model performance remains sensitive to prompt format and clinical task complexity. EndoBench establishes a new standard for evaluating and advancing MLLMs in endoscopy, highlighting both progress and persistent gaps between current models and expert clinical reasoning. We publicly release our benchmark and code.