Learning to Perceive "Where": Spatial Pretext Tasks for Robust Self-Supervised Learning

📅 2026-05-11
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing self-supervised learning methods struggle to model fine-grained spatial relationships among object parts in images. To address this limitation, this work proposes a pretext task called Spatial Prediction (SP), which injects a spatial inductive bias by regressing the relative position and scale between two local views of the same image in a continuous geometric space, encouraging the model to capture the compositional structure of scenes. SP is a decoupled plug-in that can be integrated into various self-supervised frameworks. Experiments show consistent gains on image recognition, fine-grained classification, semantic segmentation, and depth estimation, along with substantial improvements in out-of-distribution robustness for object recognition. To evaluate spatial reasoning, the authors also introduce two tasks: position and scale prediction on image patch pairs, and a jigsaw task that requires reordering patches and recognizing the reconstructed image.
📝 Abstract
Existing self-supervised learning (SSL) methods primarily learn object-invariant representations but often neglect the spatial structure and relationships among object parts. To address this limitation, we introduce Spatial Prediction (SP), a spatially aware pretext regression task that predicts the relative position and scale between a pair of disentangled local views from the same image. By modeling part-to-part relationships in a continuous geometric space, SP encourages representations to capture fine-grained spatial dependencies beyond invariant categorical semantics, thereby learning the compositional structure of visual scenes. SP is implemented as a decoupled plug-in and can be seamlessly integrated into diverse SSL frameworks. Extensive experiments show consistent improvements across image recognition, fine-grained classification, semantic segmentation, and depth estimation, as well as substantial gains in out-of-distribution robustness for object recognition. To evaluate spatial reasoning, we introduce (1) a position and scale prediction task on image patch pairs and (2) a jigsaw understanding task requiring patch reordering and recognition after reconstruction. Strong performance on these tasks indicates improved spatial structure and geometric awareness. Overall, explicitly modeling spatial information provides an effective inductive bias for SSL, leading to more structured representations and better generalization. Code and models will be released.
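The abstract describes SP as a regression over the relative position and scale of two local views, but it does not give the target parameterization or the loss. The sketch below is one plausible reading, not the authors' implementation. It assumes crops are axis-aligned boxes (cx, cy, w, h) in normalized image coordinates, that offsets are expressed in units of the first view's size, that scale is a log ratio, and that a smooth L1 loss is used. The names sample_box, relative_targets, and smooth_l1 are hypothetical.

```python
import math
import random


def sample_box(rng, min_side=0.2, max_side=0.6):
    """Sample a random crop as (cx, cy, w, h) in [0, 1] image coordinates.

    The side-length range is an illustrative assumption, not from the paper.
    """
    w = rng.uniform(min_side, max_side)
    h = rng.uniform(min_side, max_side)
    # Keep the whole box inside the image.
    cx = rng.uniform(w / 2, 1 - w / 2)
    cy = rng.uniform(h / 2, 1 - h / 2)
    return (cx, cy, w, h)


def relative_targets(box_a, box_b):
    """Continuous targets describing where view B lies relative to view A.

    Assumed parameterization (borrowed from common box-regression practice):
    center offsets in units of A's size, and log ratios of widths and heights.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    dx = (bx - ax) / aw
    dy = (by - ay) / ah
    dlog_w = math.log(bw / aw)
    dlog_h = math.log(bh / ah)
    return (dx, dy, dlog_w, dlog_h)


def smooth_l1(pred, target, beta=1.0):
    """Mean smooth L1 (Huber-style) loss over the target components."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += (0.5 * d * d / beta) if d < beta else (d - 0.5 * beta)
    return total / len(target)


# Usage: two views of the same image and the regression target for the pair.
rng = random.Random(0)
view_a = sample_box(rng)
view_b = sample_box(rng)
target = relative_targets(view_a, view_b)
# A real model would predict this from the two views' embeddings;
# an all-zero prediction stands in for the head's output here.
loss = smooth_l1((0.0, 0.0, 0.0, 0.0), target)
print(target, loss)
```

Under this parameterization the scale targets are antisymmetric: swapping the two views flips the sign of the log ratios. The position offsets are not, because each is normalized by the size of the first view. Because the target depends only on the crop geometry, a head like this can be attached to an existing SSL objective as an additional loss term, which is consistent with the plug-in design described in the abstract.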
Problem

Research questions and friction points this paper is trying to address.

self-supervised learning
spatial structure
object parts
spatial relationships
visual representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial prediction
self-supervised learning
geometric awareness
compositional structure
spatial reasoning