🤖 AI Summary
This study addresses the challenge of classifying polyp size (≤5 mm vs. >5 mm) in monocular colonoscopy, where the absence of a reliable scale reference impedes clinical decision-making. Using multi-center data, diverse model architectures, and patient-stratified cross-validation, the authors systematically audit model behavior and find that models rely on spurious procedural cues rather than genuine scale information. The work identifies metric scale and segmentation robustness as two distinct bottlenecks and introduces reusable evaluation tools, including an oracle scale ladder and mask substitution. Experiments show that current depth estimation and global calibration strategies yield limited gains; under distribution shift, segmentation errors nearly nullify the benefit of ideal scale information, returning performance to baseline levels.
📝 Abstract
Accurate polyp size stratification guides surveillance decisions, with lesions larger than 5 mm typically requiring closer follow-up. However, monocular colonoscopy lacks a reliable metric reference. We present a diagnostic audit of binary polyp size classification (≤5 mm vs. >5 mm) across multiple public multi-center datasets, model families, and patient-stratified cross-validation. Across architectures and input modalities, including RGB appearance, relative depth, and photometry, model performance is moderately consistent, suggesting reliance on cues correlated with examination behavior rather than true metric scale. By providing ground-truth scale at varying granularities, we quantify the potential improvement from perfect scale information and show that current depth estimation and global calibration offer limited gains. We further demonstrate that segmentation errors under distribution shift eliminate most of this potential, with oracle scale under predicted masks recovering only baseline performance. These results highlight metric scale and mask robustness as two independent bottlenecks and provide reusable evaluation tools, such as oracle scale ladders, shortcut partitions, and mask substitution, for auditing future polyp sizing pipelines. Our code is publicly accessible at https://github.com/anaxqx/polyp-sizing-audit.
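The abstract mentions patient-stratified cross-validation without spelling out the protocol. As a minimal sketch of the general idea (not the authors' implementation), the snippet below builds folds so that all frames from one patient land on the same side of each split, which is the usual way to avoid leakage between training and test data. The function name `patient_grouped_folds`, the greedy balancing rule, and the toy patient IDs are assumptions made for illustration only.

```python
from collections import defaultdict


def patient_grouped_folds(patient_ids, n_folds=5):
    """Return (train_idx, test_idx) pairs in which no patient spans both sides.

    Hypothetical helper: the paper's exact fold construction is not given here.
    """
    # Group sample indices by patient.
    by_patient = defaultdict(list)
    for idx, pid in enumerate(patient_ids):
        by_patient[pid].append(idx)

    # Assign whole patients to folds, largest patient first, always into the
    # currently smallest fold, so that fold sizes stay roughly balanced.
    folds = [[] for _ in range(n_folds)]
    ordered = sorted(by_patient, key=lambda p: (-len(by_patient[p]), str(p)))
    for pid in ordered:
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_patient[pid])

    # Each fold serves once as the test set; the rest form the training set.
    splits = []
    for k in range(n_folds):
        test_idx = sorted(folds[k])
        train_idx = sorted(
            i for j, fold in enumerate(folds) if j != k for i in fold
        )
        splits.append((train_idx, test_idx))
    return splits


# Toy example: seven frames from four hypothetical patients.
patient_ids = ["a", "a", "b", "c", "c", "c", "d"]
splits = patient_grouped_folds(patient_ids, n_folds=3)
```

This sketch groups by patient only; a full protocol would likely also balance the ≤5 mm and >5 mm classes across folds, which is omitted here for brevity.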