🤖 AI Summary
To address feature misalignment and degraded fusion between LiDAR and camera modalities in the bird's-eye view (BEV) space, which stems mainly from geometric inaccuracies such as image depth estimation errors, this paper proposes BEVDilation, a LiDAR-centric multimodal fusion framework. The method introduces two components: a Sparse Voxel Dilation Block and a Semantic-Guided BEV Dilation Block. Both treat camera-derived BEV features as implicit spatial and semantic priors that guide the densification and diffusion of sparse LiDAR features, mitigating misalignment and compensating for the limited semantic expressiveness of point clouds. By combining BEV representation, sparse convolution, and context-aware aggregation, BEVDilation achieves state-of-the-art detection accuracy on nuScenes while maintaining low computational overhead and strong robustness to depth estimation noise.
📝 Abstract
Integrating LiDAR and camera information in the bird's-eye view (BEV) representation has demonstrated its effectiveness in 3D object detection. However, because of the fundamental disparity in geometric accuracy between these sensors, the indiscriminate fusion used in previous methods often degrades performance. In this paper, we propose BEVDilation, a novel LiDAR-centric framework that prioritizes LiDAR information in the fusion. By treating image BEV features as implicit guidance rather than naively concatenating them, our strategy effectively alleviates the spatial misalignment caused by image depth estimation errors. Furthermore, image guidance helps the LiDAR-centric paradigm address the sparsity and semantic limitations of point clouds. Specifically, we propose a Sparse Voxel Dilation Block that mitigates the inherent point sparsity by densifying foreground voxels through image priors. Moreover, we introduce a Semantic-Guided BEV Dilation Block that enhances LiDAR feature diffusion with image semantic guidance and long-range context capture. On the challenging nuScenes benchmark, BEVDilation outperforms state-of-the-art methods while maintaining competitive computational efficiency. Importantly, our LiDAR-centric strategy is more robust to depth noise than naive fusion. The source code is available at https://github.com/gwenzhang/BEVDilation.
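To make the core idea concrete, here is a minimal NumPy sketch of semantic-guided BEV dilation under stated assumptions. This is an illustration, not the paper's implementation: the function name `semantic_guided_bev_dilation`, the 3x3 max-style diffusion, and the `threshold` parameter are all hypothetical choices. The key point it demonstrates is the LiDAR-centric principle: image information acts only as a gate that decides where sparse LiDAR features may diffuse into empty cells, so noisy image depth can never overwrite measured LiDAR geometry.

```python
import numpy as np

def semantic_guided_bev_dilation(lidar_bev, img_semantic, threshold=0.5):
    """Illustrative sketch (not the paper's code): diffuse sparse LiDAR BEV
    features into empty neighbouring cells, but only where the camera-derived
    semantic prior is confident.

    lidar_bev:    (C, H, W) sparse LiDAR BEV feature map (mostly zeros).
    img_semantic: (H, W) camera-derived foreground confidence in [0, 1].
    """
    C, H, W = lidar_bev.shape
    # Cells that already hold LiDAR evidence are kept untouched.
    occupied = np.abs(lidar_bev).sum(axis=0) > 0
    out = lidar_bev.copy()

    # Simple 3x3 max-style dilation of the LiDAR features (stand-in for the
    # learned sparse-convolution diffusion in the actual method).
    padded = np.pad(lidar_bev, ((0, 0), (1, 1), (1, 1)))
    neigh = np.stack(
        [padded[:, dy:dy + H, dx:dx + W] for dy in range(3) for dx in range(3)],
        axis=0,
    )
    diffused = neigh.max(axis=0)

    # LiDAR-centric gating: fill only empty cells whose image semantic prior
    # exceeds the threshold; image features themselves are never injected.
    fill = (~occupied) & (img_semantic > threshold)
    out[:, fill] = diffused[:, fill]
    return out
```

A quick usage check: a single occupied voxel at (2, 2) spreads to its neighbour (2, 3) only because the semantic prior is confident there, while other empty neighbours stay empty.

```python
lidar = np.zeros((2, 5, 5))
lidar[:, 2, 2] = 1.0
sem = np.zeros((5, 5))
sem[2, 3] = 0.9
out = semantic_guided_bev_dilation(lidar, sem)
```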