🤖 AI Summary
This work addresses two problems: the inherent sparsity of 4D radar point clouds limits 3D object detection performance, and existing radar-camera fusion methods are too computationally expensive to deploy in real-time systems. Rather than designing an elaborate multimodal fusion module, the authors simplify fusion and concentrate on improving radar feature encoding. Specifically, they introduce a Ray-centric Point Gaussian Encoder (R-PGE), which predicts Gaussian attributes in ray-aligned coordinates and then unifies them in bird's-eye-view (BEV) space, and a Semantic Injection (SI) module, which injects image-derived semantic cues into the radar features. Together these improve both geometric consistency and semantic richness. On the View-of-Delft (VoD) and TJ4DRadSet datasets, the proposed method, RCGDet3D, outperforms state-of-the-art methods in both accuracy and inference speed, setting a new benchmark for real-time 4D radar-camera 3D object detection.
📝 Abstract
4D automotive radar is indispensable for autonomous driving due to its low cost and robustness, yet its point cloud sparsity challenges 3D object detection. Existing 4D radar-camera fusion methods focus on complex fusion strategies, trading inference speed for marginal gains. This trade-off hinders real-time deployment due to heavy computation on dense feature maps. In contrast, feature extraction from sparse radar points is less time-consuming but remains under-explored. This work uncovers that simply enhancing radar feature extraction can achieve comparable or even higher performance than elaborate fusion modules, while maintaining real-time performance. Based on this finding, we propose RCGDet3D, which centers on radar feature encoding and simplifies multi-modal fusion. Its encoder inherits from the efficient Gaussian Splatting-based Point Gaussian Encoder (PGE) in RadarGaussianDet3D with two key improvements. First, the Ray-centric PGE (R-PGE) predicts Gaussian attributes in ray-aligned coordinate systems before unifying them to Bird's-Eye View (BEV) space, significantly improving geometric consistency and reducing learning difficulty by decoupling the coordinate transformation from representation learning. Second, a Semantic Injection (SI) module incorporates visual cues from images, producing more geometrically accurate and semantically enriched radar features. Experiments on View-of-Delft (VoD) and TJ4DRadSet show that RCGDet3D outperforms state-of-the-art methods in both accuracy and speed, setting a new benchmark for real-time deployment.
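To make the "ray-aligned first, BEV second" idea concrete, the sketch below maps a 2D Gaussian expressed in a ray-aligned frame (one axis along the sensor-to-point ray, one tangential to it) into BEV coordinates with a fixed rotation by the point's azimuth. It is an illustration of the general coordinate change only, not the authors' implementation: the function name `ray_to_bev` and the parameterization (a mean offset, two standard deviations, and a yaw relative to the ray) are assumptions, since the abstract does not specify how R-PGE parameterizes its Gaussians.

```python
import math

def ray_to_bev(point_xy, offset_ray, scales_ray, yaw_ray):
    """Map a 2D Gaussian from a ray-aligned frame to the BEV frame.

    Hypothetical parameterization (not taken from the paper):
    point_xy   : (x, y) of the radar point in BEV, sensor at the origin.
    offset_ray : (radial, tangential) offset of the Gaussian mean.
    scales_ray : (radial, tangential) standard deviations.
    yaw_ray    : rotation of the Gaussian relative to the ray, in radians.
    Returns (mean_bev, cov_bev), with cov_bev as a 2x2 nested list.
    """
    x, y = point_xy
    theta = math.atan2(y, x)  # azimuth of the ray through the point
    c, s = math.cos(theta), math.sin(theta)

    # Mean: rotate the ray-frame offset into BEV and add it to the point.
    d_rad, d_tan = offset_ray
    mean = (x + c * d_rad - s * d_tan, y + s * d_rad + c * d_tan)

    # Covariance: Sigma = R(phi) diag(s_r^2, s_t^2) R(phi)^T,
    # where phi combines the ray azimuth and the predicted yaw.
    phi = theta + yaw_ray
    cp, sp = math.cos(phi), math.sin(phi)
    a, b = scales_ray[0] ** 2, scales_ray[1] ** 2
    off_diag = cp * sp * (a - b)
    cov = [
        [cp * cp * a + sp * sp * b, off_diag],
        [off_diag, sp * sp * a + cp * cp * b],
    ]
    return mean, cov
```

Because the rotation depends only on the point's known position, a network under this assumed setup would only have to predict quantities relative to the ray, while the geometry-dependent part of the transform is applied analytically. That is one way to read the abstract's claim about decoupling the coordinate transformation from representation learning.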