🤖 AI Summary
Dense depth estimation from millimeter-wave (mmWave) radar typically relies on costly, densely annotated LiDAR supervision. This paper proposes the first learning framework that requires only ~1% sparse LiDAR point supervision. Methodologically, it introduces a novel radar-point recalibration and fine-grained modeling pipeline that integrates radar geometric calibration, RGB-guided prior-driven monocular depth estimation, and cross-modal alignment to construct high-fidelity metric-scale depth priors as anchors for monocular prediction. To the authors' knowledge, this is the first work to achieve metric-scale depth estimation under purely sparse LiDAR supervision. On the ZJU-4DRadarCam dataset and a real-world vehicle dataset, the method reduces RMSE by 35.30% and 34.89%, respectively, significantly outperforming existing densely supervised approaches while producing sharper object boundaries and finer texture detail.
📝 Abstract
Dense metric depth estimation using millimeter-wave radar typically requires dense LiDAR supervision, generated via multi-frame projection and interpolation, to guide the learning of accurate depth from sparse radar measurements and RGB images. However, this paradigm is both costly and data-intensive. To address this, we propose RaCalNet, a novel framework that eliminates the need for dense supervision by using sparse LiDAR to supervise the learning of refined radar measurements, resulting in a supervision density of only about 1% of that used by densely supervised methods. Unlike previous approaches that associate radar points with broad image regions and rely heavily on dense labels, RaCalNet first recalibrates and refines sparse radar points to construct accurate depth priors. These priors then serve as reliable anchors to guide monocular depth prediction, enabling metric-scale estimation without resorting to dense supervision. This design improves structural consistency and preserves fine detail. Despite relying solely on sparse supervision, RaCalNet surpasses state-of-the-art densely supervised methods, producing depth maps with clear object contours and fine-grained textures. Extensive experiments on the ZJU-4DRadarCam dataset and in real-world deployment scenarios demonstrate its effectiveness, reducing RMSE by 35.30% and 34.89%, respectively.
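To illustrate the anchoring idea, here is a minimal sketch of how a handful of metric-scale depth anchors could rescale a relative (scale-ambiguous) monocular depth map via a least-squares scale-and-shift fit. The function name `align_to_anchors` and the closed-form fit are illustrative assumptions for exposition only; they are not the actual RaCalNet architecture, which learns the refinement and prediction end to end.

```python
import numpy as np

def align_to_anchors(rel_depth, anchor_uv, anchor_depth):
    """Map a relative depth map to metric scale using sparse anchors.

    rel_depth:    (H, W) relative (scale-ambiguous) monocular depth.
    anchor_uv:    (N, 2) integer pixel coordinates (u=col, v=row) of
                  sparse metric anchors (e.g. recalibrated radar points).
    anchor_depth: (N,) metric depths at those pixels.
    Returns the (H, W) depth map after a least-squares scale/shift fit.
    """
    # Sample the relative prediction at the anchor pixels.
    d_rel = rel_depth[anchor_uv[:, 1], anchor_uv[:, 0]]
    # Solve min_{s,t} || s * d_rel + t - anchor_depth ||^2.
    A = np.stack([d_rel, np.ones_like(d_rel)], axis=1)
    s, t = np.linalg.lstsq(A, anchor_depth, rcond=None)[0]
    return s * rel_depth + t
```

In practice a learned model can replace the global scale/shift with spatially varying corrections, but this captures why even ~1%-density metric anchors suffice to resolve the scale ambiguity of a monocular prediction.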