Zero-Shot 4D Lidar Panoptic Segmentation

📅 2025-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches to open-scene understanding for embodied navigation are held back by the scarcity of large-scale, diverse LiDAR annotations, which has so far blocked zero-shot 4D (3D + time) point cloud panoptic segmentation. Method: The paper proposes SAL-4D, the first zero-shot 4D LiDAR panoptic segmentation framework. It uses cross-modal knowledge distillation: video object segmentation (VOS) models pseudo-label object tracklets in short camera sequences, these tracklets are annotated with sequence-level CLIP tokens, and a calibrated multi-modal sensor setup lifts the labels into 4D LiDAR space as supervision for the SAL-4D model. Contribution/Results: SAL-4D requires no manual LiDAR annotations. It improves zero-shot 3D LiDAR panoptic segmentation by more than 5 PQ over prior art and, for the first time, enables temporally consistent, arbitrary-category zero-shot 4D LiDAR panoptic segmentation (4D-LPS), advancing open-vocabulary, time-aware perception for embodied agents.
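The open-vocabulary recognition step described above can be pictured as nearest-neighbor matching in CLIP embedding space. Below is a minimal sketch, assuming each predicted Lidar segment carries a distilled CLIP-aligned feature vector; the function and variable names are illustrative, not taken from the SAL-4D codebase:

```python
# Hypothetical sketch of open-vocabulary segment classification.
# Assumes each predicted Lidar segment has a distilled CLIP-aligned
# feature vector; names are assumptions, not the paper's API.
import numpy as np

def classify_segment(segment_token: np.ndarray,
                     text_embeddings: np.ndarray,
                     class_names: list[str]) -> str:
    """Assign the class whose CLIP text embedding is most similar.

    segment_token:   (D,) distilled feature of one predicted segment
    text_embeddings: (C, D) CLIP text embeddings, one per class prompt
    """
    seg = segment_token / np.linalg.norm(segment_token)
    txt = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    scores = txt @ seg  # cosine similarity against every class prompt
    return class_names[int(np.argmax(scores))]
```

Because the class vocabulary enters only through text prompts, it can be swapped at inference time without retraining, which is what makes the segmentation "arbitrary-category".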

📝 Abstract
Zero-shot 4D segmentation and recognition of arbitrary objects in Lidar is crucial for embodied navigation, with applications ranging from streaming perception to semantic mapping and localization. However, the primary challenge in advancing research and developing generalized, versatile methods for spatio-temporal scene understanding in Lidar lies in the scarcity of datasets that provide the necessary diversity and scale of annotations. To overcome these challenges, we propose SAL-4D (Segment Anything in Lidar-4D), a method that utilizes multi-modal robotic sensor setups as a bridge to distill recent developments in Video Object Segmentation (VOS) in conjunction with off-the-shelf Vision-Language foundation models to Lidar. We utilize VOS models to pseudo-label tracklets in short video sequences, annotate these tracklets with sequence-level CLIP tokens, and lift them to the 4D Lidar space using calibrated multi-modal sensory setups to distill them to our SAL-4D model. Thanks to temporally consistent predictions, we outperform prior art in 3D Zero-Shot Lidar Panoptic Segmentation (LPS) by over 5 PQ, and unlock Zero-Shot 4D-LPS.
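The lifting step in the abstract (transferring image-space tracklet masks onto Lidar points via a calibrated sensor rig) can be sketched as a standard pinhole projection. The sketch below illustrates the general idea under a simple pinhole camera assumption; it is not the paper's implementation, and all names are hypothetical:

```python
# Hypothetical sketch: lift 2D VOS mask tracklets into the Lidar frame
# using calibrated extrinsics and intrinsics. Names and the pinhole
# model are assumptions for illustration, not the paper's code.
import numpy as np

def lift_masks_to_lidar(points: np.ndarray,       # (N, 3) Lidar points
                        T_cam_lidar: np.ndarray,  # (4, 4) Lidar-to-camera extrinsics
                        K: np.ndarray,            # (3, 3) camera intrinsics
                        mask: np.ndarray          # (H, W) int tracklet IDs, 0 = background
                        ) -> np.ndarray:
    """Return a per-point tracklet ID (0 where no mask applies)."""
    # Transform Lidar points into the camera frame (homogeneous coords).
    pts_h = np.hstack([points, np.ones((len(points), 1))])
    cam = (T_cam_lidar @ pts_h.T).T[:, :3]
    in_front = cam[:, 2] > 1e-6  # keep only points in front of the camera

    # Pinhole projection to pixel coordinates.
    uv = (K @ cam.T).T
    uv = uv[:, :2] / np.where(in_front, cam[:, 2], 1.0)[:, None]
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)

    # Look up the mask ID for every point that lands inside the image.
    H, W = mask.shape
    valid = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    ids = np.zeros(len(points), dtype=mask.dtype)
    ids[valid] = mask[v[valid], u[valid]]
    return ids
```

Repeating this over a short sequence, with the same tracklet ID reused across frames, is what turns per-frame 2D masks into temporally consistent 4D pseudo-labels for distillation.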
Problem

Research questions and friction points this paper is trying to address.

Zero-shot segmentation of arbitrary Lidar objects
Lack of diverse annotated datasets for Lidar
Improving 4D Lidar panoptic segmentation accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes Video Object Segmentation models
Employs Vision-Language foundation models
Lifts annotations to 4D Lidar space
Yushan Zhang
NVIDIA, Linköping University
Aljoša Ošep
NVIDIA
Laura Leal-Taixé
Senior Research Manager at NVIDIA. Previously Professor at TU Munich.
Computer Vision · Machine Learning · Deep Learning
Tim Meinhardt
NVIDIA