Mamba4D: Efficient Long-Sequence Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models

📅 2024-05-23

🏛️ arXiv.org

📈 Citations: 4

✨ Influential: 0

🤖 AI Summary

To address the irregular data structure, inter-frame temporal inconsistency, and high computational cost of long-sequence 4D point cloud video modeling, this paper proposes the first 4D point cloud video understanding backbone built purely on state space models (SSMs). The method decouples spatio-temporal modeling into two dedicated modules: an Intra-frame Spatial Mamba that captures local geometric structure and an Inter-frame Temporal Mamba that models long-range temporal dependencies. Replacing quadratic-complexity self-attention with linear-complexity SSMs keeps sequence modeling efficient as videos get longer. Reported gains are +10.4% accuracy on MSR-Action3D, +0.7 F1 score on HOI4D, and +0.19 mIoU on Synthia4D. For long video sequences, the method reduces GPU memory consumption by 87.5% and runs 5.36× faster than baseline methods.

📝 Abstract

Point cloud videos can faithfully capture real-world spatial geometries and temporal dynamics, which are essential for enabling intelligent agents to understand the dynamically changing world. However, designing an effective 4D backbone remains challenging, mainly due to the irregular and unordered distribution of points and temporal inconsistencies across frames. In addition, recent transformer-based 4D backbones commonly suffer from large computational costs due to their quadratic complexity, particularly for long video sequences. To address these challenges, we propose a novel point cloud video understanding backbone based purely on State Space Models (SSMs). Specifically, we first disentangle space and time in 4D video sequences and then establish the spatio-temporal correlation with our designed Mamba blocks. The Intra-frame Spatial Mamba module is developed to encode locally similar geometric structures within a certain temporal stride. Subsequently, locally correlated tokens are delivered to the Inter-frame Temporal Mamba module, which integrates long-term point features across the entire video with linear complexity. Our proposed Mamba4D achieves competitive performance on the MSR-Action3D action recognition (+10.4% accuracy), HOI4D action segmentation (+0.7 F1 score), and Synthia4D semantic segmentation (+0.19 mIoU) datasets. In particular, for long video sequences, our method achieves a significant efficiency improvement, with an 87.5% GPU memory reduction and a 5.36× speed-up. Code will be released at https://github.com/IRMVLab/Mamba4D.
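
The two-stage design described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a scalar, non-selective linear state space recurrence (h_t = a * h_{t-1} + b * x_t, y_t = c * h_t) in place of a real Mamba block, and it reduces each frame to a single summary value. The function names `ssm_scan` and `disentangled_scan` and the parameters `a`, `b`, `c` are hypothetical and chosen only for illustration.

```python
from typing import List


def ssm_scan(xs: List[float], a: float = 0.5, b: float = 1.0, c: float = 1.0) -> List[float]:
    """Run a scalar linear state space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t,   y_t = c * h_t
    The loop visits each element once, so cost grows linearly with len(xs).
    (Assumption: fixed scalar a, b, c; a real Mamba block uses learned,
    input-dependent parameters and vector-valued states.)
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def disentangled_scan(video: List[List[float]]) -> List[float]:
    """Toy version of "space first, then time".

    video[t][i] is a scalar feature of token i in frame t; every frame
    is assumed to be non-empty.
    Step 1 (spatial): scan the tokens inside each frame and keep the
    final output as that frame's summary.
    Step 2 (temporal): scan the per-frame summaries across the video.
    """
    frame_summaries = [ssm_scan(frame)[-1] for frame in video]
    return ssm_scan(frame_summaries)


if __name__ == "__main__":
    toy_video = [[1.0, 0.0], [0.0, 1.0]]  # 2 frames, 2 tokens each
    print(disentangled_scan(toy_video))
```

Both stages are single passes, so the total work in this sketch is proportional to the number of tokens plus the number of frames, which is the property the abstract relies on for long sequences.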

Problem

Research questions and friction points this paper is trying to address.

Efficient point cloud video understanding

Reducing computational costs in 4D backbones

Handling temporal inconsistencies in point clouds

Innovation

Methods, ideas, or system contributions that make the work stand out.

State Space Models backbone

Disentangled spatial-temporal correlation

Linear complexity long-term integration
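
The linear-versus-quadratic contrast can be made concrete with a rough operation count. The sketch below is a simplification under stated assumptions: it counts only token-pair interactions for full self-attention and recurrence updates for a sequential state space scan, ignoring feature dimensions, constant factors, and hardware effects. The function names are hypothetical, and the counts are not measurements from the paper.

```python
def attention_pairs(num_tokens: int) -> int:
    """Token-pair interactions in full self-attention (every token attends to every token)."""
    return num_tokens * num_tokens


def ssm_steps(num_tokens: int) -> int:
    """Recurrence updates in a sequential state space scan (one update per token)."""
    return num_tokens


if __name__ == "__main__":
    # Sequence lengths are arbitrary illustrative values.
    for n in (1_000, 2_000, 4_000):
        print(n, attention_pairs(n), ssm_steps(n))
```

Doubling the token count quadruples the attention count but only doubles the scan count, which is why the gap widens as video sequences get longer.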
