Unlocking Generalization in Polyp Segmentation with DINO Self-Attention "keys"

📅 2025-12-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing polyp segmentation models for colorectal cancer screening generalize poorly, carry architectural redundancy, and rely heavily on large-scale annotated data. Method: We propose a lightweight, task-agnostic segmentation framework built on the “key” features of DINO-pretrained Vision Transformers (ViTs), whose robustness and cross-domain consistency are identified and exploited in this work. The approach eliminates deep token aggregation and task-specific networks, using only a lightweight convolutional decoder to map key features directly to segmentation masks. Contribution/Results: The method significantly improves generalization across multi-center, few-shot, and cross-domain settings. Evaluated under both Domain Generalization and Extreme Single Domain Generalization protocols, it surpasses state-of-the-art methods (e.g., nnU-Net, UM-Net) on major multi-center benchmarks. We further quantify systematically how the evolution of the DINO framework affects downstream segmentation performance.

📝 Abstract

Automatic polyp segmentation is crucial for improving the clinical identification of colorectal cancer (CRC). While Deep Learning (DL) techniques have been extensively researched for this problem, current methods frequently struggle with generalization, particularly in data-constrained or challenging settings. Moreover, many existing polyp segmentation methods rely on complex, task-specific architectures. To address these limitations, we present a framework that exploits the intrinsic robustness of DINO self-attention "key" features for segmentation. Unlike traditional methods that extract tokens from the deepest layers of the Vision Transformer (ViT), our approach feeds the key features of the self-attention module to a simple convolutional decoder that predicts polyp masks, resulting in improved performance and better generalizability. We validate our approach using a multi-center dataset under two rigorous protocols: Domain Generalization (DG) and Extreme Single Domain Generalization (ESDG). Our results, supported by a comprehensive statistical analysis, demonstrate that this pipeline achieves state-of-the-art (SOTA) performance and significantly enhances generalization, particularly in data-scarce and challenging scenarios. While avoiding a polyp-specific architecture, we surpass well-established models like nnU-Net and UM-Net. Additionally, we provide a systematic benchmark of the DINO framework's evolution, quantifying the specific impact of architectural advancements on downstream polyp segmentation performance.
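
The pipeline described in the abstract (take the self-attention "key" features instead of the output tokens, then decode them with a small convolutional head) can be illustrated with a minimal PyTorch sketch. Everything here is an assumption made for illustration: `TinyAttention` is a randomly initialized stand-in for one attention module of a DINO-pretrained ViT, `extract_keys` and `LightDecoder` are hypothetical names, and the decoder layout, the layer used, and all sizes are not taken from the paper. The only convention relied on is the fused `qkv` linear layer commonly used in ViT implementations, whose output can be sliced into queries, keys, and values.

```python
import torch
import torch.nn as nn


class TinyAttention(nn.Module):
    """Stand-in for one ViT self-attention module with a fused qkv projection."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        head_dim = C // self.num_heads
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4).unbind(0)  # each (B, heads, N, head_dim)
        attn = ((q @ k.transpose(-2, -1)) * head_dim ** -0.5).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


def extract_keys(attn: TinyAttention, tokens: torch.Tensor, grid: int) -> torch.Tensor:
    """Capture the qkv output with a forward hook and return keys as (B, C, grid, grid)."""
    captured = {}
    handle = attn.qkv.register_forward_hook(
        lambda module, inputs, output: captured.update(qkv=output)
    )
    try:
        with torch.no_grad():  # the backbone is treated as frozen in this sketch
            attn(tokens)
    finally:
        handle.remove()
    B, N, three_c = captured["qkv"].shape
    C = three_c // 3
    keys = captured["qkv"][:, :, C:2 * C]  # middle third of the fused output = keys
    keys = keys[:, 1:, :]                  # drop the class token (assumed at index 0)
    return keys.transpose(1, 2).reshape(B, C, grid, grid)


class LightDecoder(nn.Module):
    """Hypothetical lightweight convolutional head mapping key features to mask logits."""

    def __init__(self, in_ch: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),
        )

    def forward(self, feat: torch.Tensor, out_size: tuple) -> torch.Tensor:
        logits = self.net(feat)
        # Upsample patch-level logits back to image resolution.
        return nn.functional.interpolate(
            logits, size=out_size, mode="bilinear", align_corners=False
        )


torch.manual_seed(0)
dim, heads, grid, patch = 64, 4, 14, 16          # illustrative sizes only
attn = TinyAttention(dim, heads)
tokens = torch.randn(2, 1 + grid * grid, dim)    # class token + 14x14 patch tokens
key_map = extract_keys(attn, tokens, grid)
decoder = LightDecoder(dim)
mask_logits = decoder(key_map, (grid * patch, grid * patch))
print(tuple(key_map.shape), tuple(mask_logits.shape))
```

In a real setting the hook would presumably be attached to the `qkv` layer of a chosen block in a pretrained DINO backbone, and only the decoder would be trained; which block and which decoder the authors use is not stated in this summary.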

Problem

Research questions and friction points this paper is trying to address.

Improves polyp segmentation generalization in data-scarce scenarios

Leverages DINO self-attention keys for robust segmentation without complex architectures

Validates performance using domain generalization protocols on multi-center datasets

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages DINO self-attention key features for segmentation

Uses simple convolutional decoder for polyp mask prediction

Validated via multi-center domain generalization protocols
