🤖 AI Summary
Evaluating the diagnostic efficacy of vision-language models (VLMs) for polyp detection (CADe) and classification (CADx) in colonoscopy images remains challenging because standardized, pathology-annotated benchmarks and systematic comparisons against conventional models are lacking. Method: We establish a unified evaluation framework that benchmarks VLMs (GPT-4, Gemini-1.5-Pro, Claude-3-Opus, BiomedCLIP, and CLIP) against a CNN (ResNet50) and classic machine learning models (e.g., SVM, random forest) on a clinical dataset of 2,258 colonoscopy images with corresponding pathology reports. Contribution/Results: ResNet50 achieves the highest CADe performance (F1 = 91.35%, AUROC = 0.98), followed by BiomedCLIP (F1 = 88.68%). Among general-purpose VLMs, GPT-4 performs best in both CADe (F1 = 81.02%) and CADx (weighted F1 = 41.18%), although it remains well below ResNet50 in CADx (weighted F1 = 74.94%). Our findings suggest that CNNs remain superior for both tasks, while VLMs such as BiomedCLIP and GPT-4 may be useful for polyp detection when training a CNN is not feasible.
📝 Abstract
Introduction: This study compares the performance of vision-language models (VLMs) with established convolutional neural networks (CNNs) and classic machine learning models (CMLs) for computer-aided detection (CADe) and computer-aided diagnosis (CADx) of colonoscopy polyp images. Method: We analyzed 2,258 colonoscopy images with corresponding pathology reports from 428 patients. We preprocessed all images using standardized techniques (resizing, normalization, and augmentation) and implemented a rigorous comparative framework evaluating 11 distinct models: ResNet50, four CMLs (random forest, support vector machine, logistic regression, decision tree), two specialized contrastive vision-language encoders (CLIP, BiomedCLIP), and three general-purpose VLMs (GPT-4, Gemini-1.5-Pro, Claude-3-Opus). Our performance assessment focused on two clinical tasks: polyp detection (CADe) and classification (CADx). Result: In polyp detection, ResNet50 achieved the best performance (F1: 91.35%, AUROC: 0.98), followed by BiomedCLIP (F1: 88.68%). GPT-4 (F1: 81.02%) performed comparably to traditional machine learning approaches and outperformed the other general-purpose VLMs. For polyp classification, the performance ranking remained consistent, but overall metrics were lower. ResNet50 maintained the highest efficacy (weighted F1: 74.94%), while GPT-4 demonstrated moderate capability (weighted F1: 41.18%), significantly exceeding the other VLMs (Claude-3-Opus weighted F1: 25.54%, Gemini-1.5-Pro weighted F1: 6.17%). Conclusion: CNNs remain superior for both CADe and CADx tasks. However, VLMs such as BiomedCLIP and GPT-4 may be useful for polyp detection tasks where training CNNs is not feasible.
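For readers less familiar with the reported metrics, the following is a minimal sketch of how a weighted F1 score (used for CADx) and an AUROC (used for CADe) can be computed from predictions. It is not the study's evaluation code: the function names, class labels, and toy data are illustrative assumptions, and the abstract does not specify which polyp classes or software were used.

```python
from collections import Counter


def f1_for_class(y_true, y_pred, label):
    """F1 score for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(f1_for_class(y_true, y_pred, c) * n / total
               for c, n in support.items())


def auroc(y_true, scores):
    """AUROC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical three-class CADx example; the class names are placeholders.
true_classes = ["A", "A", "A", "B", "B", "C"]
pred_classes = ["A", "A", "B", "B", "C", "C"]
print(f"weighted F1 = {weighted_f1(true_classes, pred_classes):.4f}")

# Hypothetical binary CADe example: 1 = polyp present, 0 = no polyp.
has_polyp = [1, 1, 0, 0]
model_scores = [0.9, 0.4, 0.5, 0.1]
print(f"AUROC = {auroc(has_polyp, model_scores):.2f}")
```

Because weighted F1 weights each class by its frequency, a model that handles only the most common class well can still score moderately, which is one reason to read the CADx figures alongside per-class results where available.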