LiteFusion: Taming 3D Object Detectors from Vision-Based to Multi-Modal with Minimal Adaptation

📅 2025-12-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal 3D detection methods suffer from poor robustness due to heavy reliance on LiDAR and face deployment challenges on heterogeneous hardware (e.g., NPUs/FPGAs) owing to the computational inefficiency of mainstream 3D sparse convolutions. To address these issues, this paper proposes a lightweight camera-LiDAR fusion framework that eliminates both the conventional 3D sparse convolution backbone and a separate LiDAR encoder. Instead, it introduces a novel LiDAR-geometry-guided, quaternion-based cross-modal fusion mechanism that leverages point cloud geometric priors as complementary enhancements to camera features. The framework incorporates geometry-aware alignment, point cloud projection enhancement, and a purely 2D backbone architecture. On nuScenes, it achieves +20.4% mAP and +19.7% NDS over a pure-vision baseline with only +1.1% parameter overhead. Notably, it maintains strong performance even without LiDAR input, significantly improving deployment flexibility and system robustness.
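For intuition, here is a minimal PyTorch-style sketch of what quaternion-space fusion can look like: a LiDAR-derived feature map is split into four components and mixed with learned quaternion weights via the Hamilton product, then added to the camera features as a residual enhancement. The module name, layer shapes, and residual form are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class QuaternionFusion(nn.Module):
    """Hedged sketch of quaternion-space cross-modal fusion.

    The LiDAR geometric feature is treated as a quaternion-valued map
    (r, i, j, k) and transformed by a learned quaternion weight realized
    as four shared real-valued 1x1 convolutions combined per the
    Hamilton product, which preserves the cross-component coupling.
    """

    def __init__(self, channels: int):
        super().__init__()
        assert channels % 4 == 0, "quaternion features need 4 components"
        c = channels // 4
        self.w_r = nn.Conv2d(c, c, 1, bias=False)
        self.w_i = nn.Conv2d(c, c, 1, bias=False)
        self.w_j = nn.Conv2d(c, c, 1, bias=False)
        self.w_k = nn.Conv2d(c, c, 1, bias=False)

    def forward(self, cam_feat: torch.Tensor, lidar_feat: torch.Tensor):
        # Split the LiDAR feature map into four quaternion components.
        r, i, j, k = torch.chunk(lidar_feat, 4, dim=1)
        # Hamilton product W (x) x with the learned quaternion weights.
        out_r = self.w_r(r) - self.w_i(i) - self.w_j(j) - self.w_k(k)
        out_i = self.w_r(i) + self.w_i(r) + self.w_j(k) - self.w_k(j)
        out_j = self.w_r(j) - self.w_i(k) + self.w_j(r) + self.w_k(i)
        out_k = self.w_r(k) + self.w_i(j) - self.w_j(i) + self.w_k(r)
        fused = torch.cat([out_r, out_i, out_j, out_k], dim=1)
        # Residual enhancement of the camera features (assumed design).
        return cam_feat + fused
```

Note the design choice this illustrates: the four convolutions are shared across all four components, so the layer adds roughly a quarter of the parameters of an ordinary convolution of the same width, consistent with the paper's emphasis on minimal parameter overhead.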

📝 Abstract
3D object detection is fundamental for safe and robust intelligent transportation systems. Current multi-modal 3D object detectors often rely on complex architectures and training strategies to achieve higher detection accuracy. However, these methods rely heavily on the LiDAR sensor and thus suffer large performance drops when LiDAR is absent, which compromises the robustness and safety of autonomous systems in practical scenarios. Moreover, existing multi-modal detectors are difficult to deploy on diverse hardware platforms, such as NPUs and FPGAs, due to their reliance on 3D sparse convolution operators, which are primarily optimized for NVIDIA GPUs. To address these challenges, we reconsider the role of LiDAR in the camera-LiDAR fusion paradigm and introduce a novel multi-modal 3D detector, LiteFusion. Instead of treating LiDAR point clouds as an independent modality with a separate feature extraction backbone, LiteFusion utilizes LiDAR data as a complementary source of geometric information to enhance camera-based detection. This straightforward approach completely eliminates the reliance on a 3D backbone, making the method highly deployment-friendly. Specifically, LiteFusion integrates complementary features from LiDAR points into image features within a quaternion space, where the orthogonal constraints are well-preserved during network training. This helps model domain-specific relations across modalities, yielding a compact cross-modal embedding. Experiments on the nuScenes dataset show that LiteFusion improves the baseline vision-based detector by +20.4% mAP and +19.7% NDS with a minimal increase in parameters (1.1%) without using dedicated LiDAR encoders. Notably, even in the absence of LiDAR input, LiteFusion maintains strong results, highlighting its favorable robustness and effectiveness across diverse fusion paradigms and deployment scenarios.
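The point cloud projection enhancement described above presupposes a standard LiDAR-to-image projection. Below is a hedged sketch under nuScenes-style assumptions (a given 4x4 lidar-to-image matrix, i.e., intrinsics composed with extrinsics); the function name and the sparse-depth-map output format are illustrative, not taken from the paper.

```python
import torch

def project_points_to_image(points: torch.Tensor,
                            lidar2img: torch.Tensor,
                            H: int, W: int) -> torch.Tensor:
    """Rasterize LiDAR points into a sparse per-pixel depth map.

    points:    (N, 3) xyz coordinates in the LiDAR frame
    lidar2img: (4, 4) projection matrix (intrinsics @ extrinsics),
               assumed provided by a nuScenes-style data pipeline
    returns:   (1, H, W) depth map, 0 where no point projects
    """
    N = points.shape[0]
    # Homogeneous coordinates, then project into the image plane.
    homo = torch.cat([points, points.new_ones(N, 1)], dim=1)  # (N, 4)
    cam = homo @ lidar2img.T                                  # (N, 4)
    depth = cam[:, 2]
    valid = depth > 1e-3  # keep only points in front of the camera
    # Perspective divide; clamp avoids division by zero for culled points.
    uv = cam[:, :2] / depth.unsqueeze(1).clamp(min=1e-3)
    u, v = uv[:, 0].long(), uv[:, 1].long()
    valid &= (u >= 0) & (u < W) & (v >= 0) & (v < H)
    depth_map = points.new_zeros(1, H, W)
    depth_map[0, v[valid], u[valid]] = depth[valid]
    return depth_map
```

Such a sparse depth map (or features derived from it) is the kind of geometric prior a purely 2D backbone can consume directly, which is what lets the design drop 3D sparse convolutions altogether.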
Problem

Research questions and friction points this paper is trying to address.

Enhances camera-based 3D detection using LiDAR as geometric complement
Eliminates reliance on 3D backbone for better hardware deployment flexibility
Maintains robustness when LiDAR is absent to improve system safety (a minimal fallback sketch follows this list)
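As a rough illustration of the LiDAR-optional behavior the paper reports: because LiDAR only enhances camera features rather than feeding a dedicated encoder, fusion can simply be skipped when no point cloud arrives. The helper below reuses the hypothetical QuaternionFusion module from the earlier sketch and is an assumption about the control flow, not the paper's code.

```python
import torch

def forward_with_optional_lidar(cam_feat: torch.Tensor,
                                lidar_feat: torch.Tensor | None,
                                fusion) -> torch.Tensor:
    """Run detection features with or without the LiDAR branch.

    fusion: a cross-modal module such as the hypothetical
            QuaternionFusion sketched earlier.
    """
    if lidar_feat is None:
        # Pure-vision path: no separate LiDAR encoder exists to stall,
        # so the detector degrades gracefully instead of failing.
        return cam_feat
    return fusion(cam_feat, lidar_feat)
```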
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses LiDAR as geometric enhancement for camera detection
Eliminates 3D backbone for deployment-friendly design
Integrates features in quaternion space for cross-modal embedding
Authors

Xiangxuan Ren
Shanghai Jiao Tong University
Computer Vision

Zhongdao Wang
Noah's Ark Lab, Huawei
Computer Vision, Autonomous Driving

Pin Tang
Shanghai Jiao Tong University
Computer Vision, Autonomous Driving, Medical Image Analysis

Guoqing Wang
China Ministry of Education (MOE) Key Laboratory of Artificial Intelligence, Artificial Intelligence Institute, Shanghai Jiao Tong University, Shanghai 200240, China

Jilai Zheng
China Ministry of Education (MOE) Key Laboratory of Artificial Intelligence, Artificial Intelligence Institute, Shanghai Jiao Tong University, Shanghai 200240, China

Chao Ma
China Ministry of Education (MOE) Key Laboratory of Artificial Intelligence, Artificial Intelligence Institute, Shanghai Jiao Tong University, Shanghai 200240, China