🤖 AI Summary
Monocular depth estimation (MDE) yields scale-ambiguous depth maps, which hinders direct deployment in downstream tasks. To address this, we propose a low-cost, radar-guided nonlinear calibration method. Unlike conventional linear scaling or complex multimodal architectures, our approach uses radar–vision cross-modal fusion to regress piecewise polynomial coefficients. Crucially, we introduce first-order derivative regularization and monotonicity constraints, new to MDE, enabling local, adaptive depth correction with inflection points. This design alleviates multi-region depth misalignment that linear scale-and-shift transformations cannot correct. Evaluated on three major benchmarks, including nuScenes, our method achieves state-of-the-art performance, improving mean absolute error (MAE) by 30.3% and root mean square error (RMSE) by 37.2% over prior methods.
📝 Abstract
We propose PolyRad, a novel radar-guided depth estimation method that introduces polynomial fitting to transform scaleless depth predictions from pretrained monocular depth estimation (MDE) models into metric depth maps. Unlike existing approaches that rely on complex architectures or expensive sensors, our method is grounded in a simple yet fundamental insight: using polynomial coefficients predicted from cheap, ubiquitous radar data to adaptively adjust depth predictions non-uniformly across depth ranges. Although MDE models often infer reasonably accurate local depth structure within each object or local region, they may misalign these regions relative to one another, making a linear scale-and-shift transformation insufficient once three or more such regions are misaligned. In contrast, PolyRad generalizes beyond linear transformations and can correct such misalignments by introducing inflection points. Importantly, our polynomial fitting framework preserves structural consistency through a novel training objective that enforces monotonicity via first-derivative regularization. PolyRad achieves state-of-the-art performance on the nuScenes, ZJU-4DRadarCam, and View-of-Delft datasets, outperforming existing methods by 30.3% in MAE and 37.2% in RMSE.
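To make the core idea concrete, the sketch below illustrates polynomial depth calibration with a first-derivative monotonicity check in plain NumPy. Note this is a simplification under stated assumptions: in PolyRad the polynomial coefficients are regressed by a radar–vision fusion network and the monotonicity term is part of the training loss, whereas here we fit coefficients directly to sparse radar returns by least squares; the function names and the cubic degree are illustrative choices, not the paper's.

```python
import numpy as np

def fit_poly_calibration(d_rel, d_metric, degree=3):
    """Fit a polynomial mapping relative (scaleless) depth to metric depth.

    d_rel: relative depths at pixels with radar returns
    d_metric: corresponding metric depths from radar
    Returns coefficients, highest degree first (np.polyfit convention).
    """
    return np.polyfit(d_rel, d_metric, degree)

def monotonicity_penalty(coeffs, lo, hi, n=256):
    """First-derivative regularization: mean magnitude of negative slope
    of the calibration polynomial over [lo, hi] (0 if monotone increasing)."""
    xs = np.linspace(lo, hi, n)
    deriv = np.polyval(np.polyder(coeffs), xs)
    return float(np.mean(np.clip(-deriv, 0.0, None)))

# Toy usage: a nonlinear but monotone ground-truth depth mapping.
d_rel = np.linspace(0.1, 1.0, 50)            # scaleless MDE output
d_metric = 5.0 * d_rel + 20.0 * d_rel ** 2   # simulated radar metric depth
coeffs = fit_poly_calibration(d_rel, d_metric)
d_calibrated = np.polyval(coeffs, d_rel)     # metric depth map values
penalty = monotonicity_penalty(coeffs, 0.1, 1.0)
```

A nonzero penalty signals that the fitted polynomial reverses depth ordering somewhere in the range, which would break the structural consistency the method aims to preserve; penalizing it during training keeps inflection points while forbidding non-monotone warps.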