Semantic Concentration for Self-Supervised Dense Representations Learning

📅 2025-09-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Image-level self-supervised learning (SSL) suffers from patch over-dispersion in dense representation learning: local feature embeddings of the same instance or class become inconsistent, degrading performance on downstream dense prediction tasks. To address this, we propose an explicit semantic concentration framework with three components: (i) a patch correspondence distillation mechanism that relaxes strict spatial alignment to mitigate both inter-image and intra-instance dispersion; (ii) a noise-robust ranking loss that tolerates noisy and imbalanced pseudo labels; and (iii) an object-aware filter that maps patch features into an object-centric space via cross-attention over learnable object prototypes, uncovering shared semantic patterns. Our method achieves significant improvements over state-of-the-art SSL approaches on dense prediction benchmarks, including semantic segmentation and depth estimation, demonstrating its effectiveness in enhancing representation consistency across local regions.

📝 Abstract
Recent advances in image-level self-supervised learning (SSL) have made significant progress, yet learning dense representations for patches remains challenging. Mainstream methods encounter an over-dispersion phenomenon in which patches from the same instance/category scatter, harming downstream performance on dense tasks. This work reveals that image-level SSL avoids over-dispersion through implicit semantic concentration. Specifically, non-strict spatial alignment ensures intra-instance consistency, while shared patterns, i.e., similar parts of within-class instances in the input space, ensure inter-image consistency. Unfortunately, these mechanisms are infeasible for dense SSL due to its spatial sensitivity and complicated scene-centric data. These observations motivate us to explore explicit semantic concentration for dense SSL. First, to break the strict spatial alignment, we propose to distill patch correspondences. Facing noisy and imbalanced pseudo labels, we propose a noise-tolerant ranking loss. The core idea is extending the Average Precision (AP) loss to continuous targets, such that its decision-agnostic and adaptive focusing properties prevent the student model from being misled. Second, to discriminate the shared patterns from complicated scenes, we propose the object-aware filter to map the output space to an object-based space. Specifically, patches are represented by learnable prototypes of objects via cross-attention. Last but not least, empirical studies across various tasks soundly support the effectiveness of our method. Code is available at https://github.com/KID-7391/CoTAP.
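To make the ranking-loss idea concrete, the following is a minimal sketch of a pairwise ranking objective with continuous targets. It is an illustration only: the paper's actual loss extends the AP loss and differs in detail, but the sketch captures the key property that pairs are weighted by their target gap, so noisy or ambiguous pseudo-labels (small gaps) contribute less to the gradient. The function name, `tau` temperature, and softplus penalty are all assumptions for this example.

```python
import numpy as np

def soft_ranking_loss(scores, targets, tau=0.1):
    """Toy pairwise ranking loss with continuous (non-binary) targets.

    Each pair (i, j) is weighted by max(target_i - target_j, 0), so
    only pairs the pseudo-labels actually order contribute, and pairs
    with small target gaps (likely label noise) are down-weighted.
    This mimics, very loosely, the decision-agnostic weighting the
    paper describes; it is NOT the exact CoTAP loss.
    """
    scores = np.asarray(scores, dtype=float)
    targets = np.asarray(targets, dtype=float)
    gap = targets[:, None] - targets[None, :]       # (N, N) target gaps
    weight = np.clip(gap, 0.0, None)                # keep pairs with t_i > t_j
    margin = scores[None, :] - scores[:, None]      # s_j - s_i
    penalty = np.log1p(np.exp(margin / tau))        # softplus: large when mis-ordered
    denom = weight.sum()
    return float((weight * penalty).sum() / denom) if denom > 0 else 0.0
```

Scores that agree with the target ordering incur a low loss; reversing them raises it, while pairs with near-equal targets barely matter either way.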
Problem

Research questions and friction points this paper is trying to address.

Addresses over-dispersion in self-supervised dense patch representations
Proposes explicit semantic concentration for dense SSL methods
Develops noise-tolerant ranking loss and object-aware filtering techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distill patch correspondences to break spatial alignment
Propose noise-tolerant ranking loss for continuous targets
Object-aware filter maps output to object-based space
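The object-aware filter above can be sketched as cross-attention from patch features to a small set of learnable prototypes, re-expressing each patch as a convex combination of prototype vectors. This is a hypothetical NumPy illustration of the mapping step only; in the paper the prototypes and projections are learned end to end, and the module details differ.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def object_aware_filter(patches, prototypes):
    """Sketch: map patch features into an object-based space.

    patches:    (N, d) patch embeddings (queries)
    prototypes: (K, d) object prototypes (keys and values)

    Each patch attends to the K prototypes, so its output is a convex
    combination of prototype vectors, i.e. a representation in the
    prototype-spanned "object" space. Names and shapes are assumptions
    for this example, not the paper's exact module.
    """
    d = patches.shape[1]
    attn = softmax(patches @ prototypes.T / np.sqrt(d), axis=1)  # (N, K)
    return attn @ prototypes                                      # (N, d)
```

Because attention weights are non-negative and sum to one, every output coordinate stays within the range spanned by the prototypes along that dimension, which is what concentrates dispersed patch features around shared object semantics.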
Peisong Wen
University of Chinese Academy of Sciences
machine learning, computer vision
Qianqian Xu
State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China, and also with Peng Cheng Laboratory, Shenzhen 518055, China
Siran Dai
Unknown affiliation
Runmin Cong
School of Control Science and Engineering and the Key Laboratory of Machine Intelligence and System Control, Ministry of Education, Shandong University, Jinan 250061, Shandong, China
Qingming Huang
University of the Chinese Academy of Sciences
Multimedia Analysis and Retrieval, Image and Video Processing, Pattern Recognition, Computer Vision, Video Coding