🤖 AI Summary
To address the challenges of real-time, resource-constrained processing in robotic dynamic 4D (3D spatial + temporal) environment perception from streaming point cloud video, this paper proposes a lightweight 4D point cloud video backbone supporting both online streaming and offline batch inference. Methodologically, it introduces: (1) a hybrid Mamba-Transformer temporal fusion module that achieves linear computational complexity while preserving bidirectional contextual modeling; and (2) a frame-level masked autoregressive pretraining strategy, 4DMAP, integrating 4D spatiotemporal encoding with self-supervised learning. Evaluated across seven datasets and nine downstream tasks, the method consistently achieves state-of-the-art performance. Notably, it enables substantial advances in 4D diffusion-based policy learning and imitation learning systems on the RoboTwin and HandoverSim benchmarks, demonstrating improved generalization, efficiency, and scalability for real-world robotic perception and control.
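To make the hybrid design concrete, here is a minimal NumPy sketch of the idea behind a Mamba-Transformer temporal fusion block. This is an illustrative toy, not the paper's implementation: `ssm_scan` stands in for Mamba's selective scan (a causal, linear-time recurrence suitable for streaming), `self_attention` stands in for the bidirectional Transformer branch, and summing the two branches is an assumed fusion rule. All shapes and names here are hypothetical.

```python
import numpy as np

# Hypothetical shapes: T frames, N points per frame, C feature channels.
T, N, C = 8, 64, 32
rng = np.random.default_rng(0)
x = rng.standard_normal((T, N, C))  # per-frame point features

def ssm_scan(x, decay=0.9):
    """Causal linear-time recurrence over the time axis,
    a stand-in for Mamba's selective state-space scan. O(T)."""
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = decay * h + (1 - decay) * x[t]  # running state per point
        out[t] = h
    return out

def self_attention(x):
    """Bidirectional softmax attention over frames (per point),
    a stand-in for the Transformer branch. O(T^2)."""
    q = k = v = x.transpose(1, 0, 2)                 # (N, T, C)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(C)   # (N, T, T)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return (w @ v).transpose(1, 0, 2)                # back to (T, N, C)

# Hybrid fusion (toy): the causal SSM branch supports streaming;
# the attention branch adds bidirectional context for offline use.
fused = ssm_scan(x) + self_attention(x)
print(fused.shape)  # (8, 64, 32)
```

In an online deployment only the causal branch would run frame by frame, while offline batch inference can afford the quadratic bidirectional branch, which is the efficiency/context trade-off the hybrid design targets.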
📝 Abstract
Understanding dynamic 4D environments (3D space evolving over time) is critical for robotic and interactive systems. These applications demand systems that can process streaming point cloud video in real time, often under resource constraints, while also benefiting from past and present observations when available. However, current 4D backbone networks rely heavily on spatiotemporal convolutions and Transformers, which are often computationally intensive and poorly suited to real-time applications. We propose PointNet4D, a lightweight 4D backbone optimized for both online and offline settings. At its core is a hybrid Mamba-Transformer temporal fusion block, which integrates the efficient state-space modeling of Mamba with the bidirectional modeling power of Transformers. This enables PointNet4D to handle variable-length online sequences efficiently across different deployment scenarios. To enhance temporal understanding, we introduce 4DMAP, a frame-wise masked autoregressive pretraining strategy that captures motion cues across frames. Our extensive evaluations across 9 tasks on 7 datasets demonstrate consistent improvements across diverse domains. We further demonstrate PointNet4D's utility by building two robotic application systems, 4D Diffusion Policy and 4D Imitation Learning, achieving substantial gains on the RoboTwin and HandoverSim benchmarks.
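The frame-wise masked autoregressive idea can be sketched in a few lines. This is a hedged toy illustration, not 4DMAP itself: whole frames are masked (rather than individual points), and each masked frame is predicted only from earlier visible frames, here with a deliberately simple running-mean predictor in place of the learned backbone. The mask pattern, shapes, and loss are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, C = 8, 64, 32
frames = rng.standard_normal((T, N, C))  # encoded point-cloud frames

# Frame-wise masking: alternate frames are hidden, the rest stay visible.
mask = np.array([t % 2 == 1 for t in range(T)])  # mask odd frames

# Autoregressive objective (toy): reconstruct each masked frame from
# the running mean of the visible frames that precede it in time.
losses = []
visible_sum, visible_cnt = np.zeros((N, C)), 0
for t in range(T):
    if mask[t] and visible_cnt > 0:
        pred = visible_sum / visible_cnt               # causal prediction
        losses.append(np.mean((pred - frames[t]) ** 2))
    if not mask[t]:
        visible_sum += frames[t]                       # update history
        visible_cnt += 1

loss = float(np.mean(losses))
```

The causal direction of the prediction is what forces the model to pick up motion cues across frames: a masked frame can only be explained by extrapolating from its past, not by copying from its future.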