🤖 AI Summary
This paper addresses domain shift in monocular 3D object detection across heterogeneous sensors, environments, and camera configurations. To tackle this challenge, the authors propose MonoCT, an unsupervised domain adaptation framework with three key components: (1) a Generalized Depth Enhancement (GDE) module that uses an ensemble concept to make depth estimation more accurate and robust under domain shift; (2) a Pseudo-Label Scoring (PLS) module based on inner-model consistency, paired with a Diversity Maximization (DM) strategy, which together improve the reliability and coverage of pseudo labels; and (3) a consistency-based teacher-student architecture that uses these pseudo labels for self-training. Across six benchmarks, including KITTI and Waymo, MonoCT outperforms state-of-the-art domain adaptation methods by at least roughly 21% in AP<sub>Mod</sub>, and it generalizes well across automotive, traffic-surveillance, and UAV (drone) camera views.
📝 Abstract
We tackle the problem of monocular 3D object detection across different sensors, environments, and camera setups. In this paper, we introduce a novel unsupervised domain adaptation approach, MonoCT, that generates highly accurate pseudo labels for self-supervision. Inspired by our observation that accurate depth estimation is critical to mitigating domain shifts, MonoCT introduces a novel Generalized Depth Enhancement (GDE) module with an ensemble concept to improve depth estimation accuracy. Moreover, we introduce a novel Pseudo Label Scoring (PLS) module by exploring inner-model consistency measurement and a Diversity Maximization (DM) strategy to further generate high-quality pseudo labels for self-training. Extensive experiments on six benchmarks show that MonoCT outperforms existing SOTA domain adaptation methods by large margins (~21% minimum for AP Mod.) and generalizes well to car, traffic camera and drone views.
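For intuition, below is a minimal, hypothetical Python sketch of the pseudo-label self-training idea described above: depth estimates from an ensemble are averaged (in the spirit of GDE), candidate pseudo labels are scored by how strongly the ensemble members agree (a stand-in for the inner-model consistency used by PLS), and a depth-diverse subset is kept for training the student (a stand-in for DM). All function names, thresholds, and the binning scheme are illustrative assumptions, not MonoCT's actual implementation.

```python
"""Hypothetical sketch of consistency-scored, diversity-aware pseudo-label
selection for self-training; names and numbers are illustrative only."""
import numpy as np

def score_pseudo_labels(ensemble_depths, confidences):
    """PLS-style scoring sketch: boxes whose per-member depth estimates
    disagree receive a lower score than confidently consistent ones."""
    depth_std = np.std(ensemble_depths, axis=0)   # per-box disagreement
    agreement = np.exp(-depth_std)                # high when members agree
    return confidences * agreement

def select_diverse(scores, depths, keep=16, num_bins=8):
    """DM-style selection sketch: pick the best-scoring boxes from each
    depth bin so both near and far objects survive into self-training."""
    edges = np.linspace(depths.min(), depths.max(), num_bins)
    bins = np.digitize(depths, edges)
    selected = []
    for b in np.unique(bins):
        idx = np.where(bins == b)[0]
        top = idx[np.argsort(scores[idx])[::-1][: max(1, keep // num_bins)]]
        selected.extend(top.tolist())
    return sorted(selected)

# Toy example: 3 ensemble members, 5 candidate boxes from the teacher.
ensemble_depths = np.array([[10.2, 25.1, 40.3, 41.0, 8.9],
                            [10.4, 24.8, 44.9, 40.7, 9.1],
                            [10.1, 25.3, 35.6, 41.2, 9.0]])
confidences = np.array([0.9, 0.8, 0.7, 0.85, 0.6])
mean_depths = ensemble_depths.mean(axis=0)        # GDE-style ensemble depth

scores = score_pseudo_labels(ensemble_depths, confidences)
kept = select_diverse(scores, mean_depths, keep=4, num_bins=2)
print("pseudo-label scores:", np.round(scores, 3))
print("boxes kept for student self-training:", kept)
```

In the paper, these steps operate on full 3D boxes inside the teacher-student self-training loop summarized above; the toy example only mirrors the overall flow of scoring, filtering, and diverse selection.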