RaGS: Unleashing 3D Gaussian Splatting from 4D Radar and Monocular Cues for 3D Object Detection

📅 2025-07-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address incomplete scene understanding and the rigid grid-based representations of existing 4D mmWave radar–monocular fusion methods for 3D object detection, this paper proposes RaGS, a framework that adopts 3D Gaussian splats as a unified, differentiable scene representation, replacing conventional BEV grids and instance-proposal paradigms. RaGS constructs the Gaussian field coarse-to-fine and dynamically focuses on foreground objects via three core components: Frustum-based Localization Initiation (FLI), Iterative Multimodal Aggregation (IMA), and Multi-level Gaussian Fusion (MGF). This design enables sparse, adaptive resource allocation and joint optimization of cross-modal features. Evaluated on View-of-Delft, TJ4DRadSet, and OmniHD-Scenes, RaGS achieves state-of-the-art performance, notably improving detection accuracy and robustness for small and long-range objects.

📝 Abstract
4D millimeter-wave radar has emerged as a promising sensor for autonomous driving, but effective 3D object detection from both 4D radar and monocular images remains a challenge. Existing fusion approaches typically rely on either instance-based proposals or dense BEV grids, which either lack holistic scene understanding or are limited by rigid grid structures. To address these limitations, we propose RaGS, the first framework to leverage 3D Gaussian Splatting (GS) as the representation for fusing 4D radar and monocular cues in 3D object detection. 3D GS naturally suits 3D object detection by modeling the scene as a field of Gaussians, dynamically allocating resources to foreground objects and providing a flexible, resource-efficient solution. RaGS uses a cascaded pipeline to construct and refine the Gaussian field. It starts with Frustum-based Localization Initiation (FLI), which unprojects foreground pixels to initialize coarse 3D Gaussian positions. Then, Iterative Multimodal Aggregation (IMA) fuses semantics and geometry, refining the limited set of Gaussians toward regions of interest. Finally, Multi-level Gaussian Fusion (MGF) renders the Gaussians into multi-level BEV features for 3D object detection. By dynamically focusing on sparse objects within scenes, RaGS concentrates on objects while retaining comprehensive scene perception. Extensive experiments on the View-of-Delft, TJ4DRadSet, and OmniHD-Scenes benchmarks demonstrate its state-of-the-art performance. Code will be released.
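The FLI step described above amounts to back-projecting foreground pixels along camera rays and scaling by an estimated depth. A minimal sketch of that unprojection, assuming a pinhole camera with intrinsics `K`; the function name, intrinsics, and depth values are illustrative, not the paper's implementation:

```python
import numpy as np

def frustum_init(fg_pixels, depths, K):
    """Hypothetical sketch of frustum-based initialization: unproject
    foreground pixels with estimated depths into coarse 3D Gaussian
    centers, given the 3x3 camera intrinsics K."""
    # Homogeneous pixel coordinates (N, 3): [u, v, 1]
    uv1 = np.concatenate([fg_pixels, np.ones((len(fg_pixels), 1))], axis=1)
    # Camera-frame ray directions for each pixel
    rays = uv1 @ np.linalg.inv(K).T
    # Scale each ray by its depth to get a coarse 3D position
    return rays * depths[:, None]

# Example with assumed intrinsics and depths
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
pixels = np.array([[320.0, 240.0], [420.0, 240.0]])
centers = frustum_init(pixels, np.array([10.0, 20.0]), K)
```

A pixel at the principal point unprojects straight down the optical axis, so the first center lies at (0, 0, 10); off-center pixels spread laterally with depth.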
Problem

Research questions and friction points this paper is trying to address.

Fusing 4D radar and monocular images for 3D detection
Overcoming rigid grid limitations in existing fusion methods
Dynamic resource allocation for efficient scene modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses 3D Gaussian Splatting for radar-image fusion
Dynamic Gaussian allocation for foreground objects
Cascaded pipeline refines multi-level BEV features
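The last bullet refers to rendering the Gaussian field into BEV features for the detection head. A toy sketch of one such projection, accumulating Gaussian centers and per-Gaussian weights into a single BEV grid; the grid ranges, resolution, and simple additive splatting are assumptions for illustration, not the paper's MGF module:

```python
import numpy as np

def splat_to_bev(centers, weights, x_range=(-10.0, 10.0),
                 z_range=(0.0, 40.0), res=1.0):
    """Illustrative BEV rasterization: drop each 3D Gaussian center
    (x, y, z) onto an (z, x) bird's-eye-view grid, summing weights."""
    nx = int((x_range[1] - x_range[0]) / res)
    nz = int((z_range[1] - z_range[0]) / res)
    bev = np.zeros((nz, nx))
    for (x, _, z), w in zip(centers, weights):
        ix = int((x - x_range[0]) // res)
        iz = int((z - z_range[0]) // res)
        if 0 <= ix < nx and 0 <= iz < nz:
            bev[iz, ix] += w  # accumulate Gaussian weight into its cell
    return bev

bev = splat_to_bev(np.array([[0.0, 0.0, 10.0], [4.0, 0.0, 20.0]]),
                   np.array([1.0, 0.5]))
```

Because only cells containing Gaussians receive mass, the resulting feature map stays sparse, which is the resource-allocation advantage the summary attributes to the Gaussian representation over dense BEV grids.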