Object Concepts Emerge from Motion

📅 2025-05-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of unsupervised visual representation learning for object-centric perception, inspired by infant cognitive development. Methodologically, it introduces the first framework that jointly leverages optical flow estimation and motion boundary clustering to generate pseudo-instance masks, which supervise end-to-end contrastive learning—without any human annotations, camera calibration, or geometric priors. Its core contribution lies in using motion boundaries as intrinsic, label-free supervisory signals to enable truly annotation-agnostic “visual instance” abstraction, thereby bridging a critical gap in foundation models’ capability for instance-level representation. Evaluated on monocular depth estimation, 3D object detection, and occupancy prediction, the approach significantly outperforms both supervised and self-supervised baselines while demonstrating strong cross-scene generalization.

📝 Abstract
Object concepts play a foundational role in human visual cognition, enabling perception, memory, and interaction in the physical world. Inspired by findings in developmental neuroscience - where infants are shown to acquire object understanding through observation of motion - we propose a biologically inspired framework for learning object-centric visual representations in an unsupervised manner. Our key insight is that motion boundaries serve as a strong signal for object-level grouping, which can be used to derive pseudo-instance supervision from raw videos. Concretely, we generate motion-based instance masks using off-the-shelf optical flow and clustering algorithms, and use them to train visual encoders via contrastive learning. Our framework is fully label-free and does not rely on camera calibration, making it scalable to large-scale unstructured video data. We evaluate our approach on three downstream tasks spanning both low-level (monocular depth estimation) and high-level (3D object detection and occupancy prediction) vision. Our models outperform previous supervised and self-supervised baselines and demonstrate strong generalization to unseen scenes. These results suggest that motion-induced object representations offer a compelling alternative to existing vision foundation models, capturing a crucial but overlooked level of abstraction: the visual instance. The corresponding code will be released upon paper acceptance.
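The pseudo-mask pipeline the abstract describes (dense optical flow → motion boundaries → grouping into instance masks) can be sketched in a few lines. This is a minimal illustration only, not the paper's implementation (its code is unreleased): the motion-boundary detector here is a simple flow-gradient threshold, the "clustering" is a toy 4-connected flood fill, and all function names and parameters (`motion_boundaries`, `pseudo_instance_masks`, `thresh`, `min_area`) are hypothetical.

```python
import numpy as np

def motion_boundaries(flow, thresh=1.0):
    """Mark pixels where the dense flow field (H, W, 2) changes sharply.

    Motion boundary strength is approximated as the magnitude of the
    spatial gradient of both flow channels; pixels above `thresh` are
    treated as object boundaries.
    """
    du = np.gradient(flow[..., 0])  # [d/dy, d/dx] of horizontal flow
    dv = np.gradient(flow[..., 1])  # [d/dy, d/dx] of vertical flow
    strength = np.sqrt(du[0] ** 2 + du[1] ** 2 + dv[0] ** 2 + dv[1] ** 2)
    return strength > thresh

def pseudo_instance_masks(flow, thresh=1.0, min_area=4):
    """Group non-boundary pixels into connected regions = pseudo-instances.

    Returns an (H, W) int label map; 0 means boundary/unassigned,
    positive integers index pseudo-instance masks.
    """
    boundary = motion_boundaries(flow, thresh)
    H, W = boundary.shape
    labels = np.zeros((H, W), dtype=int)
    next_label = 0
    for y in range(H):
        for x in range(W):
            if boundary[y, x] or labels[y, x]:
                continue
            next_label += 1
            stack = [(y, x)]          # flood fill one connected component
            labels[y, x] = next_label
            while stack:
                cy, cx = stack.pop()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < H and 0 <= nx < W
                            and not boundary[ny, nx] and not labels[ny, nx]):
                        labels[ny, nx] = next_label
                        stack.append((ny, nx))
    for lbl in range(1, next_label + 1):  # drop tiny spurious components
        if (labels == lbl).sum() < min_area:
            labels[labels == lbl] = 0
    return labels
```

In the paper's setting the flow would come from an off-the-shelf estimator (e.g., a RAFT-style network), and the resulting masks would define positive/negative pixel pairs for the contrastive objective that trains the visual encoder; neither step requires labels or camera calibration.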
Problem

Research questions and friction points this paper is trying to address.

Learning object-centric representations from motion cues without supervision
Using motion boundaries for object-level grouping without labels
Evaluating motion-based models on depth and 3D vision tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised motion-based object representation learning
Motion boundary for object-level grouping
Label-free contrastive learning from videos
Haoqian Liang
Beijing University of Posts and Telecommunications
Xiaohui Wang
Beijing University of Posts and Telecommunications
Zhichao Li
Xiaomi EV
Ya Yang
Beijing University of Posts and Telecommunications
Naiyan Wang
Xiaomi EV
Machine Learning · Computer Vision · Autonomous Driving