Vision Language Models versus Machine Learning Models Performance on Polyp Detection and Classification in Colonoscopy Images

📅 2025-03-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Evaluating the diagnostic efficacy of vision-language models (VLMs) for polyp detection (CADe) and classification (CADx) in colonoscopy images remains challenging due to the lack of standardized, pathology-annotated benchmarks and systematic comparisons against conventional models. Method: The study benchmarks three general-purpose VLMs (GPT-4, Gemini-1.5-Pro, Claude-3-Opus) and two contrastive vision-language encoders (CLIP, BiomedCLIP) against a CNN (ResNet50) and classic machine learning models (random forest, support vector machine, logistic regression, decision tree) on a clinical dataset of 2,258 colonoscopy images with corresponding pathology reports from 428 patients. Contribution/Results: ResNet50 achieves the highest CADe performance (F1 = 91.35%, AUROC = 0.98), followed by BiomedCLIP (F1 = 88.68%). Among general-purpose VLMs, GPT-4 performs best in both CADe (F1 = 81.02%) and CADx (weighted F1 = 41.18%), although it remains well below ResNet50 on CADx (weighted F1 = 74.94%). The findings suggest that VLMs such as BiomedCLIP and GPT-4 may be useful for polyp detection where training a CNN is not feasible.

📝 Abstract

Introduction: This study provides a comprehensive performance assessment of vision-language models (VLMs) against established convolutional neural networks (CNNs) and classic machine learning models (CMLs) for computer-aided detection (CADe) and computer-aided diagnosis (CADx) of colonoscopy polyp images. Method: We analyzed 2,258 colonoscopy images with corresponding pathology reports from 428 patients. We preprocessed all images using standardized techniques (resizing, normalization, and augmentation) and implemented a rigorous comparative framework evaluating 11 distinct models: ResNet50, four CMLs (random forest, support vector machine, logistic regression, decision tree), two specialized contrastive vision-language encoders (CLIP, BiomedCLIP), and three general-purpose VLMs (GPT-4, Gemini-1.5-Pro, Claude-3-Opus). Our performance assessment focused on two clinical tasks: polyp detection (CADe) and classification (CADx). Result: In polyp detection, ResNet50 achieved the best performance (F1: 91.35%, AUROC: 0.98), followed by BiomedCLIP (F1: 88.68%, AUROC: [AS1]). GPT-4 demonstrated comparable effectiveness to traditional machine learning approaches (F1: 81.02%, AUROC: [AS2]), outperforming other general-purpose VLMs. For polyp classification, performance rankings remained consistent but with lower overall metrics. ResNet50 maintained the highest efficacy (weighted F1: 74.94%), while GPT-4 demonstrated moderate capability (weighted F1: 41.18%), significantly exceeding other VLMs (Claude-3-Opus weighted F1: 25.54%, Gemini-1.5-Pro weighted F1: 6.17%). Conclusion: CNNs remain superior for both CADx and CADe tasks. However, VLMs like BiomedCLIP and GPT-4 may be useful for polyp detection tasks where training CNNs is not feasible.
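The abstract reports three metrics: F1 and AUROC for detection, and weighted F1 for classification. As a rough illustration of how these are conventionally defined, here is a minimal pure-Python sketch. It is not the authors' evaluation code. The function names, the toy labels ("adenoma", "hyperplastic"), and the scores are hypothetical and chosen only for the example.

```python
from collections import Counter


def f1_for_label(y_true, y_pred, label):
    """F1 for one class: 2*TP / (2*TP + FP + FN); 0.0 if the class never appears."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of true labels."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(f1_for_label(y_true, y_pred, c) * n / total
               for c, n in support.items())


def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical detection example (1 = polyp present, 0 = no polyp).
det_true = [1, 1, 0, 0]
det_pred = [1, 0, 0, 0]
det_scores = [0.9, 0.4, 0.5, 0.1]
detection_f1 = f1_for_label(det_true, det_pred, 1)
detection_auroc = auroc(det_true, det_scores)

# Hypothetical classification example with made-up class names.
cls_true = ["adenoma", "adenoma", "hyperplastic", "adenoma"]
cls_pred = ["adenoma", "hyperplastic", "hyperplastic", "adenoma"]
classification_weighted_f1 = weighted_f1(cls_true, cls_pred)
```

Because weighted F1 weights each class by its frequency, a model that does poorly on the majority class scores low overall, which may partly explain how some classification scores in the abstract fall well below the detection scores.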

Problem

Research questions and friction points this paper is trying to address.

Compare VLMs and ML models for polyp detection in colonoscopy images

Evaluate performance of 11 models on CADe and CADx tasks

Assess ResNet50, BiomedCLIP, and GPT-4 for polyp classification accuracy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Used vision-language models for polyp detection

Compared VLMs with CNNs and classic ML

Standardized image preprocessing and augmentation techniques
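The abstract names three preprocessing steps (resizing, normalization, augmentation) without giving parameters. The following toy sketch shows what such steps can look like on a small grayscale image stored as nested lists. The target size, the divide-by-255 scaling, and the horizontal flip are all assumptions made for illustration, not the settings used in the study.

```python
def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of a 2D list of pixel values."""
    in_h, in_w = len(img), len(img[0])
    return [[img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]


def normalize(img, max_value=255.0):
    """Scale pixel values into [0, 1] (assumes 8-bit input)."""
    return [[px / max_value for px in row] for row in img]


def horizontal_flip(img):
    """A simple augmentation: mirror the image left to right."""
    return [list(reversed(row)) for row in img]


# Hypothetical 2x2 grayscale image.
image = [[0, 255],
         [128, 64]]

resized = resize_nearest(image, 4, 4)   # upsample to an assumed 4x4 target
scaled = normalize(resized)             # values now in [0, 1]
augmented = horizontal_flip(scaled)     # extra training sample
```

In practice an image library would handle these steps on full-colour images. The sketch only makes the order of operations concrete.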
