LiteFusion: Taming 3D Object Detectors from Vision-Based to Multi-Modal with Minimal Adaptation

📅 2025-12-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal 3D detection methods suffer from poor robustness due to heavy reliance on LiDAR and face deployment challenges on heterogeneous hardware (e.g., NPUs/FPGAs) owing to the computational inefficiency of mainstream 3D sparse convolutions. To address these issues, this paper proposes a lightweight camera-LiDAR fusion framework that eliminates both the conventional 3D sparse convolution backbone and a separate LiDAR encoder. Instead, it introduces a novel LiDAR-geometry-guided, quaternion-based cross-modal fusion mechanism that leverages point cloud geometric priors as complementary enhancements to camera features. The framework incorporates geometry-aware alignment, point cloud projection enhancement, and a purely 2D backbone architecture. On nuScenes, it achieves +20.4% mAP and +19.7% NDS over a pure-vision baseline with only +1.1% parameter overhead. Notably, it maintains strong performance even without LiDAR input, significantly improving deployment flexibility and system robustness.
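For intuition, here is a minimal PyTorch-style sketch of what quaternion-space fusion can look like: a LiDAR-derived feature map is split into four components and mixed with learned quaternion weights via the Hamilton product, then added to the camera features as a residual enhancement. The module name, layer shapes, and residual form are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class QuaternionFusion(nn.Module):
    """Hedged sketch of quaternion-space cross-modal fusion.

    The LiDAR geometric feature is treated as a quaternion-valued map
    (r, i, j, k) and transformed by a learned quaternion weight realized
    as four shared real-valued 1x1 convolutions combined per the
    Hamilton product, which preserves the cross-component coupling.
    """

    def __init__(self, channels: int):
        super().__init__()
        assert channels % 4 == 0, "quaternion features need 4 components"
        c = channels // 4
        self.w_r = nn.Conv2d(c, c, 1, bias=False)
        self.w_i = nn.Conv2d(c, c, 1, bias=False)
        self.w_j = nn.Conv2d(c, c, 1, bias=False)
        self.w_k = nn.Conv2d(c, c, 1, bias=False)

    def forward(self, cam_feat: torch.Tensor, lidar_feat: torch.Tensor):
        # Split the LiDAR feature map into four quaternion components.
        r, i, j, k = torch.chunk(lidar_feat, 4, dim=1)
        # Hamilton product W (x) x with the learned quaternion weights.
        out_r = self.w_r(r) - self.w_i(i) - self.w_j(j) - self.w_k(k)
        out_i = self.w_r(i) + self.w_i(r) + self.w_j(k) - self.w_k(j)
        out_j = self.w_r(j) - self.w_i(k) + self.w_j(r) + self.w_k(i)
        out_k = self.w_r(k) + self.w_i(j) - self.w_j(i) + self.w_k(r)
        fused = torch.cat([out_r, out_i, out_j, out_k], dim=1)
        # Residual enhancement of the camera features (assumed design).
        return cam_feat + fused
```

Note the design choice this illustrates: the four convolutions are shared across all four components, so the layer adds roughly a quarter of the parameters of an ordinary convolution of the same width, consistent with the paper's emphasis on minimal parameter overhead.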

📝 Abstract
3D object detection is fundamental for safe and robust intelligent transportation systems. Current multi-modal 3D object detectors often rely on complex architectures and training strategies to achieve higher detection accuracy. However, these methods rely heavily on the LiDAR sensor and thus suffer large performance drops when LiDAR is absent, which compromises the robustness and safety of autonomous systems in practical scenarios. Moreover, existing multi-modal detectors are difficult to deploy on diverse hardware platforms, such as NPUs and FPGAs, due to their reliance on 3D sparse convolution operators, which are primarily optimized for NVIDIA GPUs. To address these challenges, we reconsider the role of LiDAR in the camera-LiDAR fusion paradigm and introduce a novel multi-modal 3D detector, LiteFusion. Instead of treating LiDAR point clouds as an independent modality with a separate feature extraction backbone, LiteFusion utilizes LiDAR data as a complementary source of geometric information to enhance camera-based detection. This straightforward approach completely eliminates the reliance on a 3D backbone, making the method highly deployment-friendly. Specifically, LiteFusion integrates complementary features from LiDAR points into image features within a quaternion space, where the orthogonal constraints are well-preserved during network training. This helps model domain-specific relations across modalities, yielding a compact cross-modal embedding. Experiments on the nuScenes dataset show that LiteFusion improves the baseline vision-based detector by +20.4% mAP and +19.7% NDS with a minimal increase in parameters (1.1%) without using dedicated LiDAR encoders. Notably, even in the absence of LiDAR input, LiteFusion maintains strong results, highlighting its favorable robustness and effectiveness across diverse fusion paradigms and deployment scenarios.
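The point cloud projection enhancement described above presupposes a standard LiDAR-to-image projection. Below is a hedged sketch under nuScenes-style assumptions (a given 4x4 lidar-to-image matrix, i.e., intrinsics composed with extrinsics); the function name and the sparse-depth-map output format are illustrative, not taken from the paper.

```python
import torch

def project_points_to_image(points: torch.Tensor,
                            lidar2img: torch.Tensor,
                            H: int, W: int) -> torch.Tensor:
    """Rasterize LiDAR points into a sparse per-pixel depth map.

    points:    (N, 3) xyz coordinates in the LiDAR frame
    lidar2img: (4, 4) projection matrix (intrinsics @ extrinsics),
               assumed provided by a nuScenes-style data pipeline
    returns:   (1, H, W) depth map, 0 where no point projects
    """
    N = points.shape[0]
    # Homogeneous coordinates, then project into the image plane.
    homo = torch.cat([points, points.new_ones(N, 1)], dim=1)  # (N, 4)
    cam = homo @ lidar2img.T                                  # (N, 4)
    depth = cam[:, 2]
    valid = depth > 1e-3  # keep only points in front of the camera
    # Perspective divide; clamp avoids division by zero for culled points.
    uv = cam[:, :2] / depth.unsqueeze(1).clamp(min=1e-3)
    u, v = uv[:, 0].long(), uv[:, 1].long()
    valid &= (u >= 0) & (u < W) & (v >= 0) & (v < H)
    depth_map = points.new_zeros(1, H, W)
    depth_map[0, v[valid], u[valid]] = depth[valid]
    return depth_map
```

Such a sparse depth map (or features derived from it) is the kind of geometric prior a purely 2D backbone can consume directly, which is what lets the design drop 3D sparse convolutions altogether.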
Problem

Research questions and friction points this paper is trying to address.

Enhances camera-based 3D detection using LiDAR as geometric complement
Eliminates reliance on 3D backbone for better hardware deployment flexibility
Maintains robustness when LiDAR is absent to improve system safety (a minimal fallback sketch follows this list)
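As a rough illustration of the LiDAR-optional behavior the paper reports: because LiDAR only enhances camera features rather than feeding a dedicated encoder, fusion can simply be skipped when no point cloud arrives. The helper below reuses the hypothetical QuaternionFusion module from the earlier sketch and is an assumption about the control flow, not the paper's code.

```python
import torch

def forward_with_optional_lidar(cam_feat: torch.Tensor,
                                lidar_feat: torch.Tensor | None,
                                fusion) -> torch.Tensor:
    """Run detection features with or without the LiDAR branch.

    fusion: a cross-modal module such as the hypothetical
            QuaternionFusion sketched earlier.
    """
    if lidar_feat is None:
        # Pure-vision path: no separate LiDAR encoder exists to stall,
        # so the detector degrades gracefully instead of failing.
        return cam_feat
    return fusion(cam_feat, lidar_feat)
```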
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses LiDAR as geometric enhancement for camera detection
Eliminates 3D backbone for deployment-friendly design
Integrates features in quaternion space for cross-modal embedding
Authors

Xiangxuan Ren
Shanghai Jiao Tong University
Computer Vision

Zhongdao Wang
Noah's Ark Lab, Huawei
Computer Vision, Autonomous Driving

Pin Tang
Shanghai Jiao Tong University
Computer Vision, Autonomous Driving, Medical Image Analysis

Guoqing Wang
China Ministry of Education (MOE) Key Laboratory of Artificial Intelligence, Artificial Intelligence Institute, Shanghai Jiao Tong University, Shanghai 200240, China

Jilai Zheng
China Ministry of Education (MOE) Key Laboratory of Artificial Intelligence, Artificial Intelligence Institute, Shanghai Jiao Tong University, Shanghai 200240, China

Chao Ma
China Ministry of Education (MOE) Key Laboratory of Artificial Intelligence, Artificial Intelligence Institute, Shanghai Jiao Tong University, Shanghai 200240, China