🤖 AI Summary
This work addresses the inherent tension in 3D object detection from point clouds between the sparsity of distant points and the need for effective global context modeling. To this end, the authors propose 3DTMDet, a hybrid architecture that integrates a state space model (Mamba) with a local-attention Transformer. The core component, the 3D Hybrid Mamba Transformer (3DHMT) module, captures long-range dependencies among sparse, distant points while preserving fine-grained local geometric details. Furthermore, a LiDAR-aware voxel feature diffusion mechanism is introduced to enhance the representation of distant regions by propagating features along the sensor’s radial direction. Evaluated on the KITTI and ONCE benchmarks, the proposed method outperforms state-of-the-art detectors, with gains reported for distant and small objects.
📝 Abstract
A fundamental challenge in point cloud object detection lies in the conflict between the extreme sparsity of distant points and the need for long-range context understanding. Existing methods typically use 1D serialization to expand the receptive field, which inevitably discards already scarce local geometric details and degrades detection of distant and small objects. To address this issue, we propose 3DTMDet, a novel detection network that synergistically combines state space models (SSMs, here Mamba) with Transformers. The core idea is to exploit the SSM's linear complexity and strength in long-sequence modeling to capture global interactions among sparse, distant points, while using Transformer modules with local attention to encode fine-grained geometric structures in local point sets, preserving accurate shape information. We propose the 3D Hybrid Mamba Transformer (3DHMT) block, which uses an SSM-Attention-SSM pipeline to balance global context understanding and local detail preservation, alleviating the tension between receptive field enlargement and geometric preservation in long-range detection. In addition, we introduce a voxel generation block inspired by LiDAR physics, which diffuses features along the sensor observation direction to reconstruct the complete structure of objects in occluded and distant regions. Extensive experiments on the KITTI and ONCE datasets show that 3DTMDet outperforms state-of-the-art detectors. The code is available at https://github.com/QiuBingwen/3DTMDet.
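The SSM-Attention-SSM ordering described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (the linked repository is the reference for that). The function names `ssm_scan`, `local_attention`, and `hybrid_block`, the fixed scalar `decay`, the non-overlapping `window`, and the residual connections are all assumptions made for illustration, and the toy linear recurrence only stands in for Mamba's input-dependent selective scan.

```python
import numpy as np


def ssm_scan(x, decay=0.9):
    """Toy causal linear state space scan (hypothetical stand-in for Mamba).

    Recurrence: h[t] = decay * h[t-1] + (1 - decay) * x[t].
    Cost is linear in the sequence length; information from early tokens
    reaches all later tokens, however far apart they are.
    """
    h = np.zeros(x.shape[1], dtype=x.dtype)
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = decay * h + (1.0 - decay) * x[t]
        out[t] = h
    return out


def local_attention(x, window=4):
    """Softmax self-attention restricted to non-overlapping windows.

    Queries, keys, and values are the raw features (no learned projections),
    which keeps the sketch short; tokens never attend outside their window.
    """
    n, d = x.shape
    out = np.empty_like(x)
    for start in range(0, n, window):
        blk = x[start:start + window]
        scores = blk @ blk.T / np.sqrt(d)
        scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights = weights / weights.sum(axis=1, keepdims=True)
        out[start:start + window] = weights @ blk
    return out


def hybrid_block(x, window=4, decay=0.9):
    """SSM -> local attention -> SSM, each with an assumed residual connection."""
    x = x + ssm_scan(x, decay)          # global mixing along the serialized sequence
    x = x + local_attention(x, window)  # fine-grained mixing inside each window
    x = x + ssm_scan(x, decay)          # propagate the refined features globally again
    return x


# 16 serialized voxel tokens with 8 feature channels (synthetic data).
rng = np.random.default_rng(0)
feats = rng.standard_normal((16, 8))
out = hybrid_block(feats)
```

In this sketch, perturbing the first token leaves the output of `local_attention` unchanged outside its own window, whereas the output of `hybrid_block` changes at the last token too. That is the division of labor the abstract describes: the scan carries long-range context, and the windowed attention handles local structure.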