🤖 AI Summary
This work addresses the scarcity of precise annotations in colonoscopy videos by proposing a temporally self-supervised representation learning method for polyp tracklets. Leveraging the sequential structure of colonoscopy procedures, the approach automatically derives associations between tracklets and introduces a noise-aware contrastive loss to account for associations that are incorrect. To the best of our knowledge, this is the first study to incorporate noisy temporal self-supervision into polyp representation learning. Despite using a lightweight encoder trained on only 27 videos, the method outperforms prior self-supervised and supervised baselines across multiple downstream tasks (polyp retrieval, re-identification, size estimation, and histology classification) and matches or exceeds recent foundation models.
📝 Abstract
Learning robust representations of polyp tracklets is key to enabling multiple AI-assisted colonoscopy applications, from polyp characterization to automated reporting and retrieval. Supervised contrastive learning is an effective approach for learning such representations, but it typically relies on correct positive and negative definitions. Collecting these labels requires linking tracklets that depict the same underlying polyp entity throughout the video, which is costly and demands specialized clinical expertise. In this work, we leverage the sequential workflow of colonoscopy procedures to derive self-supervised associations from temporal structure. Since temporally derived associations are not guaranteed to be correct, we introduce a noise-aware contrastive loss to account for noisy associations. We demonstrate the effectiveness of the learned representations across multiple downstream tasks, including polyp retrieval and re-identification, size estimation, and histology classification. Our method outperforms prior self-supervised and supervised baselines, and matches or exceeds recent foundation models across all tasks, using a lightweight encoder trained on only 27 videos. Code is available at https://github.com/lparolari/ntssl.
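The two ideas named in the abstract, temporally derived associations and a contrastive loss that tolerates wrong ones, can be illustrated with a minimal sketch. It is not taken from the paper or its repository: the gap-based pairing rule, the linear confidence weight, the function names `temporal_positives` and `weighted_contrastive_loss`, and the parameters `max_gap` and `temperature` are all assumptions chosen for illustration. Confidence weighting is only one possible way to make a contrastive loss noise-aware, and the authors' loss may differ.

```python
import math


def temporal_positives(tracklets, max_gap=50):
    """Pair tracklets that follow each other closely in time.

    tracklets: list of (start_frame, end_frame), sorted by start frame.
    Returns {(i, j): weight}. The weight is a hypothetical confidence
    that decays linearly with the temporal gap between the two tracklets.
    """
    pairs = {}
    for i, (_, end_i) in enumerate(tracklets):
        for j in range(i + 1, len(tracklets)):
            gap = tracklets[j][0] - end_i
            if 0 <= gap <= max_gap:
                pairs[(i, j)] = 1.0 - gap / (max_gap + 1)
    return pairs


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def weighted_contrastive_loss(embeddings, pairs, temperature=0.1):
    """InfoNCE-style loss in which each positive pair counts by its weight.

    Low-confidence (likely noisy) associations therefore pull the
    embeddings together less strongly than high-confidence ones.
    """
    n = len(embeddings)
    total, weight_sum = 0.0, 0.0
    for (i, j), w in pairs.items():
        # Use each member of the pair as the anchor once.
        for anchor, positive in ((i, j), (j, i)):
            logits = [
                cosine(embeddings[anchor], embeddings[k]) / temperature
                for k in range(n)
                if k != anchor
            ]
            # Numerically stable log-sum-exp over all non-anchor samples.
            top = max(logits)
            log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
            pos = cosine(embeddings[anchor], embeddings[positive]) / temperature
            total += w * (log_denom - pos)
            weight_sum += w
    return total / weight_sum if weight_sum else 0.0


if __name__ == "__main__":
    # Three tracklets: the first two are 20 frames apart, the third is far away.
    pairs = temporal_positives([(0, 100), (120, 200), (1000, 1100)])
    embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(pairs, weighted_contrastive_loss(embeddings, pairs))
```

In this sketch only the first two tracklets are paired, and the loss is small when their embeddings agree and large when they do not. A real pipeline would compute the same quantities on batched encoder outputs rather than on Python lists.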