From SAM to DINOv2: Towards Distilling Foundation Models to Lightweight Baselines for Generalized Polyp Segmentation

📅 2025-12-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Lightweight models for colonoscopic polyp segmentation suffer from poor generalization, while large-scale foundation models struggle to adapt to small-sample medical scenarios. Method: We propose Polyp-DiFoM, a knowledge distillation framework that introduces a dual-path distillation mechanism—integrating semantic prior guidance and frequency-domain encoding—to efficiently transfer knowledge from vision foundation models (e.g., SAM, DINOv2) to lightweight architectures (e.g., U-Net, U-Net++). It jointly leverages knowledge distillation, semantic feature alignment, and frequency-domain feature encoding to address the challenges posed by morphological variability, strong camouflage, and scarce annotated data. Contribution/Results: Evaluated on five benchmark datasets, Polyp-DiFoM achieves state-of-the-art performance, improving average Dice score by 3.2–5.7 percentage points over prior methods while accelerating inference by approximately 9×, demonstrating a superior accuracy-efficiency trade-off.

📝 Abstract
Accurate polyp segmentation during colonoscopy is critical for the early detection of colorectal cancer, yet it remains challenging due to significant variations in polyp size, shape, and color, as well as the camouflaged nature of polyps. While lightweight baseline models such as U-Net, U-Net++, and PraNet offer advantages in easy deployment and low computational cost, they struggle with these issues, leading to limited segmentation performance. In contrast, large-scale vision foundation models such as SAM, DINOv2, OneFormer, and Mask2Former have exhibited impressive generalization across natural image domains. However, their direct transfer to medical imaging tasks (e.g., colonoscopic polyp segmentation) is not straightforward, primarily due to the scarcity of large-scale datasets and the lack of domain-specific knowledge. To bridge this gap, we propose a novel distillation framework, Polyp-DiFoM, that transfers the rich representations of foundation models into lightweight segmentation baselines, enabling efficient and accurate deployment in clinical settings. In particular, we infuse semantic priors from the foundation models into canonical architectures such as U-Net and U-Net++, and further perform frequency-domain encoding for enhanced distillation, strengthening their generalization capability. Extensive experiments are performed on five benchmark datasets: Kvasir-SEG, CVC-ClinicDB, ETIS, ColonDB, and CVC-300. Notably, Polyp-DiFoM consistently and significantly outperforms the respective baseline models, as well as the state-of-the-art model, with nearly 9× lower computational overhead. The code is available at https://github.com/lostinrepo/PolypDiFoM.
Problem

Research questions and friction points this paper is trying to address.

Distilling large foundation models into lightweight baselines for polyp segmentation
Addressing domain gap between natural images and medical imaging tasks
Improving segmentation accuracy while reducing computational overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distills foundation models into lightweight segmentation baselines
Infuses semantic priors from foundation models into canonical architectures
Performs frequency domain encoding for enhanced distillation
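The dual-path idea above can be sketched as two loss terms: a spatial alignment loss between teacher (foundation model) and student (lightweight baseline) feature maps, plus a frequency-domain loss computed on their 2-D FFT magnitudes. This is a minimal illustrative sketch with randomly generated "features"; the function name, loss weighting, and use of plain MSE are assumptions, not the paper's exact formulation.

```python
import numpy as np

def distillation_losses(teacher_feat, student_feat):
    """Toy dual-path distillation: spatial feature alignment plus a
    frequency-domain term (hypothetical sketch, not the paper's method)."""
    # Path 1: semantic feature alignment (MSE between feature maps)
    spatial = float(np.mean((teacher_feat - student_feat) ** 2))
    # Path 2: frequency-domain alignment via 2-D FFT magnitude spectra,
    # which emphasizes texture/boundary cues useful for camouflaged polyps
    t_freq = np.abs(np.fft.fft2(teacher_feat, axes=(-2, -1)))
    s_freq = np.abs(np.fft.fft2(student_feat, axes=(-2, -1)))
    freq = float(np.mean((t_freq - s_freq) ** 2))
    return spatial, freq

# Example with random (channels, height, width) feature tensors
rng = np.random.default_rng(0)
teacher = rng.standard_normal((8, 16, 16))
student = rng.standard_normal((8, 16, 16))
spatial_loss, freq_loss = distillation_losses(teacher, student)
total = spatial_loss + 0.5 * freq_loss  # 0.5 is an illustrative weight
```

In a real training loop the student would be optimized on this combined objective alongside its segmentation loss; the frequency path is what distinguishes the dual-path scheme from plain feature distillation.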