UniMamba: Unified Spatial-Channel Representation Learning with Group-Efficient Mamba for LiDAR-based 3D Object Detection

📅 2025-03-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Transformer-based LiDAR 3D detectors serialize 3D voxels into a flattened 1D sequence, which destroys spatial structure. They must also group sequences to cope with quadratic attention cost, which limits the receptive field. This paper proposes UniMamba, a unified block that combines 3D submanifold convolution with state space models (SSMs) in a multi-head design to aggregate local and global spatial context simultaneously. Each block has three parts: spatial locality modeling, which uses 3D submanifold convolution to capture a dynamic spatial position embedding before serialization; complementary Z-order serialization, applied both horizontally and vertically; and a local-global sequential aggregator, which groups channels and encodes local and global dependencies with a multi-head SSM. Stacked UniMamba blocks form an encoder-decoder architecture for hierarchical multi-scale learning. The method reaches 70.2 mAP on nuScenes and is also evaluated on Waymo and Argoverse 2.

📝 Abstract

Recent advances in LiDAR 3D detection have demonstrated the effectiveness of Transformer-based frameworks in capturing global dependencies in point cloud space. These frameworks serialize the 3D voxels into a flattened 1D sequence for iterative self-attention. However, the spatial structure of the 3D voxels is inevitably destroyed during serialization. Moreover, because of the large number of 3D voxels and the quadratic complexity of Transformers, the sequences are grouped before being fed to Transformers, which limits the receptive field. Inspired by the impressive performance of State Space Models (SSMs) on 2D vision tasks, we propose Unified Mamba (UniMamba), which integrates the merits of 3D convolution and SSMs in a concise multi-head manner to perform "local and global" spatial context aggregation efficiently and simultaneously. Specifically, we design a UniMamba block that mainly consists of spatial locality modeling, complementary Z-order serialization, and a local-global sequential aggregator. The spatial locality modeling module uses 3D submanifold convolution to capture a dynamic spatial position embedding before serialization. The efficient Z-order curve is then adopted for serialization both horizontally and vertically. The local-global sequential aggregator adopts a channel grouping strategy to efficiently encode both "local and global" spatial inter-dependencies using a multi-head SSM. Additionally, stacked UniMamba blocks form an encoder-decoder architecture that facilitates hierarchical multi-scale spatial learning. Extensive experiments are conducted on three popular datasets: nuScenes, Waymo, and Argoverse 2. In particular, UniMamba achieves 70.2 mAP on the nuScenes dataset.
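To make the serialization step concrete, here is a minimal sketch of Z-order (Morton) ordering for integer voxel coordinates. It is an illustration, not the authors' code: the names `morton3d` and `serialize`, the 10-bit coordinate range, and the reading of "horizontally and vertically" as swapping the roles of the x and y axes are all assumptions made for this example.

```python
def morton3d(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y, z into one Z-order key.

    Bit b of x lands at position 3*b, of y at 3*b + 1, of z at 3*b + 2.
    The 10-bit default is an assumption, not a value from the paper.
    """
    code = 0
    for b in range(bits):
        code |= ((x >> b) & 1) << (3 * b)
        code |= ((y >> b) & 1) << (3 * b + 1)
        code |= ((z >> b) & 1) << (3 * b + 2)
    return code


def serialize(coords, swap_xy=False):
    """Return voxel indices sorted along the Z-order curve.

    coords: list of (x, y, z) non-negative integer voxel coordinates.
    swap_xy: hypothetical switch that exchanges the x and y axes to
             produce a second, complementary ordering.
    """
    def key(i):
        x, y, z = coords[i]
        if swap_xy:
            x, y = y, x
        return morton3d(x, y, z)

    return sorted(range(len(coords)), key=key)


voxels = [(1, 1, 0), (0, 0, 0), (1, 0, 0), (0, 1, 0)]
order_a = serialize(voxels)                 # x varies fastest
order_b = serialize(voxels, swap_xy=True)   # y varies fastest
```

Voxels that are close in 3D tend to stay close in the sorted sequence, and the second ordering visits the same voxels along a different path, so a sequence model sees each voxel next to a different set of neighbors in each pass.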

Problem

Research questions and friction points this paper is trying to address.

Improves 3D object detection by preserving spatial structure during serialization.

Addresses the limited receptive field that results when Transformers group sequences to contain quadratic attention cost.

Integrates 3D convolution and State Space Models for local-global context aggregation.

Innovation

Methods, ideas, or system contributions that make the work stand out.

UniMamba integrates 3D convolution and State Space Models.

Applies complementary Z-order serialization, both horizontally and vertically, before sequence modeling.

Adopts a channel-grouped multi-head SSM to encode local and global spatial dependencies.
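The channel-grouping idea can be sketched with a toy recurrence. This is a deliberately simplified stand-in, not the UniMamba aggregator: it uses a fixed linear state update rather than Mamba's input-dependent selective scan, and giving each head its own decay (small for short memory, large for long memory) is an assumption chosen here to illustrate "local and global" heads. The names `ssm_scan` and `grouped_ssm` are hypothetical.

```python
def ssm_scan(seq, decay):
    """Run h_t = decay * h_{t-1} + (1 - decay) * x_t over a sequence.

    seq: list of time steps, each a list of channel values.
    Returns the hidden state at every step (used directly as output).
    """
    state = [0.0] * len(seq[0])
    out = []
    for step in seq:
        state = [decay * h + (1.0 - decay) * v for h, v in zip(state, step)]
        out.append(list(state))
    return out


def grouped_ssm(seq, decays):
    """Split channels into len(decays) equal groups, one scan per group.

    Assumes the channel count is divisible by the number of heads.
    Each head scans only its own channel slice, so the cost per head
    shrinks with the group width.
    """
    width = len(seq[0]) // len(decays)
    outputs = [[] for _ in seq]
    for head, decay in enumerate(decays):
        lo, hi = head * width, (head + 1) * width
        head_out = ssm_scan([step[lo:hi] for step in seq], decay)
        for t, values in enumerate(head_out):
            outputs[t].extend(values)  # concatenate heads channel-wise
    return outputs


sequence = [[1.0, 1.0, 1.0, 1.0] for _ in range(4)]  # 4 steps, 4 channels
result = grouped_ssm(sequence, decays=[0.0, 0.5])     # 2 heads, 2 channels each
```

With a decay of 0.0 the first head simply passes each input through, while the second head with decay 0.5 accumulates context from earlier steps, so the two channel groups carry short-range and longer-range information side by side.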
