🤖 AI Summary
This work addresses the heavy reliance on costly manual annotations and the poor generalization of LiDAR point cloud moving object segmentation (MOS). We propose TOP, a self-supervised pre-training framework centered on a novel temporal overlap point prediction mechanism: it leverages motion consistency across adjacent frames to predict the occupancy states of temporally overlapping points, jointly optimized with current-frame occupancy reconstruction for fully unsupervised learning. To enable fair evaluation—particularly for small or distant objects—we introduce the mIoU_obj metric, which mitigates bias from point-count imbalance. On nuScenes and SemanticKITTI, TOP achieves up to a 28.77% relative improvement over supervised baselines. Moreover, it significantly enhances transferability across LiDAR configurations and generalization to downstream tasks, demonstrating robustness beyond domain-specific supervision.
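The abstract does not spell out how occupancy states of temporally overlapping points are derived, so the following is only an illustrative sketch of the general idea: adjacent-scan points (assumed already transformed into the current frame) can be labeled occupied or free depending on whether the current scan also observes their voxel. The function names, voxel size, and labeling rule here are assumptions, not the paper's actual pipeline.

```python
def voxelize(points, voxel=0.5):
    """Map 3D points to the set of voxel indices they occupy."""
    return {(int(x // voxel), int(y // voxel), int(z // voxel))
            for x, y, z in points}

def overlap_occupancy_labels(curr_pts, adj_pts, voxel=0.5):
    """Hypothetical pseudo-label generator: for each adjacent-scan point
    (already in the current frame's coordinates), label its voxel 1 if the
    current scan also occupies it (temporally overlapping / still occupied),
    else 0 (vacated, e.g. by a moving object)."""
    curr_vox = voxelize(curr_pts, voxel)
    return [1 if (int(x // voxel), int(y // voxel), int(z // voxel)) in curr_vox
            else 0
            for x, y, z in adj_pts]
```

Such pseudo-labels are free to compute from the LiDAR sequence itself, which is what makes the pre-training objective self-supervised.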
📝 Abstract
Moving object segmentation (MOS) on LiDAR point clouds is crucial for autonomous systems like self-driving vehicles. Previous supervised approaches rely heavily on costly manual annotations, while LiDAR sequences naturally capture temporal motion cues that can be leveraged for self-supervised learning. In this paper, we propose **T**emporal **O**verlapping **P**rediction (**TOP**), a self-supervised pre-training method that alleviates the labeling burden for MOS. **TOP** exploits the temporal overlapping points commonly observed by the current and adjacent scans, and learns spatiotemporal representations by predicting the occupancy states of these points. Moreover, we utilize current occupancy reconstruction as an auxiliary pre-training objective, which enhances the model's awareness of the current scene structure. We conduct extensive experiments and observe that the conventional Intersection-over-Union (IoU) metric is strongly biased toward objects with more scanned points, which may neglect small or distant objects. To compensate for this bias, we introduce an additional metric, $\text{mIoU}_{\text{obj}}$, to evaluate object-level performance. Experiments on nuScenes and SemanticKITTI show that **TOP** outperforms both the supervised training-from-scratch baseline and other self-supervised pre-training baselines by up to 28.77% relative improvement, demonstrating strong transferability across LiDAR setups and generalization to other tasks. Code and pre-trained models will be publicly available upon publication.
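The abstract does not give the exact definition of $\text{mIoU}_{\text{obj}}$, so the sketch below only illustrates the stated motivation: computing IoU per object instance and averaging over instances weights a 2-point distant object the same as a 1000-point nearby one, unlike point-level IoU. All function names and the per-instance averaging rule are assumptions for illustration.

```python
def iou(pred, gt):
    """Point-level IoU between two sets of point indices labeled 'moving'."""
    union = pred | gt
    return len(pred & gt) / len(union) if union else 1.0

def miou_obj(pred_moving, objects, gt_moving):
    """Hypothetical object-level mIoU: restrict predictions and ground truth
    to each object instance's points, compute IoU per instance, then average
    so every object counts equally regardless of its point count."""
    ious = [iou(pred_moving & obj, gt_moving & obj) for obj in objects]
    return sum(ious) / len(ious)
```

For example, if a 10-point moving object is segmented perfectly but a 2-point moving object is missed entirely, point-level IoU is 10/12 ≈ 0.83 while this object-level average is 0.5, exposing the failure on the small object.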