🤖 AI Summary
To address temporal inconsistency in 4D semantic segmentation of multi-frame LiDAR point clouds, this paper proposes a spatiotemporally consistent dual-branch network. It explicitly models foreground object cluster priors to generate temporally stable cluster-level labels; introduces a point-cluster adaptive weighting fusion mechanism to jointly optimize point-level and cluster-level features; and incorporates a cross-frame neighboring-cluster merging strategy to mitigate feature incompleteness caused by occlusion. The method achieves significant improvements in segmentation consistency for moving objects on SemanticKITTI and nuScenes, attaining state-of-the-art performance on both the multi-scan semantic segmentation and moving object segmentation benchmarks. Its core innovation lies in being the first to integrate explicit cluster prior modeling with adaptive point-cluster fusion into a 4D segmentation framework, thereby substantially enhancing spatiotemporal semantic consistency.
📝 Abstract
Semantic segmentation of LiDAR points has significant value for autonomous driving and mobile robot systems. Most approaches exploit the spatio-temporal information of multiple scans to identify the semantic class and motion state of each point. However, these methods often overlook segmentation consistency in space and time, which may result in points within the same object being predicted as different categories. To handle this issue, our core idea is to generate cluster labels across multiple frames that reflect the complete spatial structure and temporal information of objects. These labels serve as explicit guidance for our dual-branch network, 4D-CS, which integrates point-based and cluster-based branches to enable more consistent segmentation. Specifically, in the point-based branch, we leverage historical knowledge to enrich the current features through temporal fusion on multiple views. In the cluster-based branch, we propose a new strategy to produce cluster labels of foreground objects and apply them to aggregate point-wise information into cluster features. We then merge neighboring clusters across multiple scans to restore features missing due to occlusion. Finally, in the point-cluster fusion stage, we adaptively fuse the information from the two branches to refine the segmentation results. Extensive experiments confirm the effectiveness of the proposed method, and we achieve state-of-the-art results on multi-scan semantic segmentation and moving object segmentation on the SemanticKITTI and nuScenes datasets.
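To make the point-cluster fusion stage concrete, the idea of adaptively weighting the two branches can be sketched as below. This is a minimal illustrative sketch, not the paper's implementation: the sigmoid gate, the `gate` callable (standing in for a learned gating layer), and all shapes are assumptions introduced here for clarity.

```python
import math

def adaptive_point_cluster_fusion(point_feats, cluster_feats, cluster_ids, gate):
    """Fuse point-level and cluster-level features with a per-point weight.

    point_feats:   list of N feature vectors (each of length D)
    cluster_feats: list of K cluster feature vectors (each of length D)
    cluster_ids:   list of N ints mapping each point to its cluster
    gate:          callable (point_vec, cluster_vec) -> scalar logit;
                   a hypothetical stand-in for a learned gating layer
    """
    fused = []
    for p, cid in zip(point_feats, cluster_ids):
        c = cluster_feats[cid]
        # Sigmoid turns the gate logit into a weight in (0, 1),
        # so each fused feature is a convex combination of the two branches.
        alpha = 1.0 / (1.0 + math.exp(-gate(p, c)))
        fused.append([alpha * pi + (1.0 - alpha) * ci for pi, ci in zip(p, c)])
    return fused
```

A point whose gate logit is large keeps mostly its point-branch feature, while a low logit pulls it toward the shared cluster feature, which is what encourages points of one object to receive the same label.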