VoxelNextFusion: A Simple, Unified, and Effective Voxel Fusion Framework for Multimodal 3-D Object Detection

📅 2024-01-05
🏛️ IEEE Transactions on Geoscience and Remote Sensing
📈 Citations: 12
Influential: 0
🤖 AI Summary
In LiDAR-camera fusion for 3D detection, the sparsity of voxel features and the density of image features hinder cross-modal alignment, causing loss of semantic and continuity information, particularly for distant objects. To address this, the paper proposes VoxelNextFusion, a voxel-based fusion framework: (1) point clouds are projected onto the image to retrieve both pixel-level and patch-level image features; (2) these multi-granularity features are fused via self-attention into a combined representation in voxel space; and (3) a feature importance module distinguishes foreground from background features, suppressing background interference within patches. On the KITTI test set, the method improves AP@0.7 for hard-level car detection by +3.20% over the Voxel R-CNN baseline, and it further generalizes to nuScenes, validating its robustness across diverse driving scenarios and sensor configurations.
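The fusion steps above can be sketched as a per-voxel operation. The snippet below is a minimal, hedged illustration (not the authors' implementation): it stacks a pixel-level and a patch-level image feature as attention tokens, applies scaled dot-product self-attention with identity projections for simplicity (the real model learns Q/K/V projections), gates the result with a sigmoid foreground-importance weight, and concatenates it with the sparse voxel feature. All names (`fuse_voxel_image_features`, `w_fg`) are hypothetical, and concatenation is assumed as one plausible fusion choice.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_voxel_image_features(voxel_feat, pixel_feat, patch_feat, w_fg):
    """Illustrative per-voxel fusion (hypothetical, not the paper's code).

    voxel_feat, pixel_feat, patch_feat: (d,) feature vectors
    w_fg: foreground-importance projection vector, shape (d,)
    """
    d = voxel_feat.shape[0]
    # Stack the two image-feature granularities as attention tokens.
    tokens = np.stack([pixel_feat, patch_feat])       # (2, d)
    # Scaled dot-product self-attention; identity Q/K/V for simplicity.
    scores = tokens @ tokens.T / np.sqrt(d)           # (2, 2)
    attended = softmax(scores, axis=-1) @ tokens      # (2, d)
    image_feat = attended.mean(axis=0)                # combined representation
    # Foreground-aware weight in [0, 1] suppresses background-heavy patches.
    alpha = 1.0 / (1.0 + np.exp(-image_feat @ w_fg))
    # Fuse with the sparse voxel feature (concatenation assumed here).
    return np.concatenate([voxel_feat, alpha * image_feat])

rng = np.random.default_rng(0)
d = 16
fused = fuse_voxel_image_features(rng.normal(size=d), rng.normal(size=d),
                                  rng.normal(size=d), rng.normal(size=d))
print(fused.shape)  # (32,)
```

The key design point mirrored here is that attention lets the pixel token (precise but sparse) and the patch token (contextual but background-prone) exchange information before the importance weight decides how much of the combined image feature to inject into the voxel.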

📝 Abstract
Light detection and ranging (LiDAR)–camera fusion can enhance the performance of 3-D object detection by utilizing complementary information between depth-aware LiDAR points and semantically rich images. Existing voxel-based methods face significant challenges when fusing sparse voxel features with dense image features in a one-to-one manner, which forfeits the advantages of images, including semantic and continuity information, and leads to suboptimal detection performance, especially at long distances. In this article, we present VoxelNextFusion, a multimodal 3-D object detection framework specifically designed for voxel-based methods, which effectively bridges the gap between sparse point clouds and dense images. In particular, we propose a voxel-based image pipeline that projects point clouds onto images to obtain both pixel- and patch-level features. These features are then fused using self-attention to obtain a combined representation. Moreover, to address the issue of background features present in patches, we propose a feature importance module that effectively distinguishes between foreground and background features, thus minimizing the impact of the background features. Extensive experiments were conducted on the widely used KITTI and nuScenes 3-D object detection benchmarks. Notably, our VoxelNextFusion achieves around a +3.20% AP@0.7 improvement for hard-level car detection compared to the Voxel R-CNN baseline on the KITTI test dataset.
Problem

Research questions and friction points this paper is trying to address.

Enhance 3D object detection using LiDAR-camera fusion
Address challenges in fusing sparse voxel and dense image features
Improve detection performance, especially at long distances
Innovation

Methods, ideas, or system contributions that make the work stand out.

VoxelNextFusion bridges sparse LiDAR and dense images.
Uses self-attention for pixel- and patch-level feature fusion.
Feature importance module minimizes background feature impact.
Ziying Song
Beijing Jiaotong University
Object Detection · Computer Vision · Deep Learning
Guoxin Zhang
School of Computer Science, Beijing University of Posts and Telecommunications
Computer Vision · Pattern Recognition
Jun Xie
Lenovo Research, Beijing 100085, China
Lin Liu
School of Computer and Information Technology, Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing 100044, China
Caiyan Jia
School of Computer and Information Technology, Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing 100044, China
Shaoqing Xu
University of Macau, BUAA, Xiaomi EV
3D Computer Vision · 3D Generation · Vision and Language Model · End2End · World Model
Zhepeng Wang
Applied Scientist at Amazon Stores Foundational AI
Large Language Models · On-device AI · Self-supervised Learning · Quantum Machine Learning