🤖 AI Summary
To address model redundancy, slow inference, and poor cross-domain generalization in high-resolution depth estimation for real-world scenarios, this paper proposes PatchRefiner V2, a lightweight architecture. Methodologically, it (1) replaces the heavy refiner with a compact encoder and adds a Coarse-to-Fine (C2F) module with a Guided Denoising Unit that refines and denoises the resulting noisy features; (2) introduces a Noisy Pretraining strategy that pretrains the lightweight refiner branch so its potential is fully exploited; and (3) proposes a Scale-and-Shift Invariant Gradient Matching (SSIGM) loss to preserve boundary details and improve synthetic-to-real transfer. On UnrealStereo4K, PatchRefiner V2 achieves state-of-the-art accuracy and inference speed while reducing parameter count by up to 60% and accelerating runtime by over 2×. On Cityscapes, ScanNet++, and KITTI, it consistently improves depth boundary fidelity and cross-dataset generalization, showing that model efficiency and domain adaptability can be improved together.
📝 Abstract
While current high-resolution depth estimation methods achieve strong results, they are often computationally inefficient because they rely on heavyweight models and multiple inference steps. To address this, we introduce PatchRefiner V2 (PRV2), which replaces heavy refiner models with lightweight encoders. This reduces model size and inference time but introduces noisy features. To overcome this, we propose a Coarse-to-Fine (C2F) module with a Guided Denoising Unit that refines and denoises the refiner features, together with a Noisy Pretraining strategy that pretrains the refiner branch to fully exploit the potential of the lightweight refiner. Additionally, we introduce a Scale-and-Shift Invariant Gradient Matching (SSIGM) loss to enhance synthetic-to-real domain transfer. On UnrealStereo4K, PRV2 outperforms state-of-the-art depth estimation methods in both accuracy and speed while using fewer parameters. It also delineates depth boundaries better on real-world datasets such as Cityscapes, ScanNet++, and KITTI, demonstrating its versatility across domains.
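The abstract does not state the exact SSIGM formulation. The sketch below is a minimal NumPy illustration under an explicit assumption: that the loss follows the common recipe of first aligning the prediction to the target with a least-squares scale and shift (as popularized by MiDaS), then penalizing multi-scale gradient differences of the residual. The function names and the `num_scales` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np


def align_scale_shift(pred, target, mask):
    """Solve min_{s,t} ||s * pred + t - target||^2 over valid pixels (assumed alignment step)."""
    p = pred[mask]
    g = target[mask]
    A = np.stack([p, np.ones_like(p)], axis=1)  # design matrix [pred, 1]
    (scale, shift), *_ = np.linalg.lstsq(A, g, rcond=None)
    return scale * pred + shift


def ssi_gradient_matching_loss(pred, target, mask, num_scales=4):
    """Hypothetical SSIGM-style loss: gradient matching on the scale/shift-aligned residual."""
    residual = align_scale_shift(pred, target, mask) - target
    total = 0.0
    for k in range(num_scales):
        step = 2 ** k  # compare pixels that are `step` apart (coarser scales)
        r = residual[::step, ::step]
        m = mask[::step, ::step]
        # Horizontal and vertical residual gradients, kept only where both pixels are valid.
        gx = np.abs(r[:, 1:] - r[:, :-1])
        mx = m[:, 1:] & m[:, :-1]
        gy = np.abs(r[1:, :] - r[:-1, :])
        my = m[1:, :] & m[:-1, :]
        n = mx.sum() + my.sum()
        if n > 0:
            total += (gx[mx].sum() + gy[my].sum()) / n
    return total / num_scales
```

Because the residual is computed after alignment, multiplying the prediction by a positive constant or adding an offset leaves the loss unchanged, while blurred or misplaced depth edges still produce nonzero residual gradients. Under this assumption, that is what would make such a loss usable on real-world data whose depth is only known up to scale and shift.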