🤖 AI Summary
Existing foundation monocular depth estimators (FMDEs) suffer significant performance degradation on fisheye images due to covariate shift induced by camera calibration discrepancies. To address this, we propose a zero-shot transfer method that requires no retraining or fine-tuning: a lightweight calibration token is introduced to conditionally modulate latent-space features, enabling implicit alignment between perspective and fisheye feature distributions—without resorting to explicit reprojection or image-domain mapping, thereby avoiding associated artifacts and information loss. Our approach operates within a self-supervised framework, leveraging perspective images and their synthetically fisheye-reprojected counterparts to learn the calibration token via depth consistency constraints. Experiments demonstrate that a single learned token consistently outperforms state-of-the-art methods across diverse indoor and outdoor fisheye scenes, while maintaining compatibility with various FMDE backbones and preserving real-time inference efficiency.
📝 Abstract
We propose a method to extend foundational monocular depth estimators (FMDEs), trained on perspective images, to fisheye images. Despite being trained on tens of millions of images, FMDEs are susceptible to the covariate shift introduced by changes in camera calibration (intrinsic, distortion) parameters, leading to erroneous depth estimates. Our method aligns the distribution of latent embeddings encoding fisheye images to those of perspective images, enabling the reuse of FMDEs for fisheye cameras without retraining or finetuning. To this end, we introduce a set of Calibration Tokens as a light-weight adaptation mechanism that modulates the latent embeddings for alignment. By exploiting the already expressive latent space of FMDEs, we posit that modulating their embeddings avoids the negative impact of artifacts and loss introduced in conventional recalibration or map projection to a canonical reference frame in the image space. Our method is self-supervised and does not require fisheye images but leverages publicly available large-scale perspective image datasets. This is done by recalibrating perspective images to fisheye images, and enforcing consistency between their estimates during training. We evaluate our approach with several FMDEs, on both indoors and outdoors, where we consistently improve over state-of-the-art methods using a single set of tokens for both. Code available at: https://github.com/JungHeeKim29/calibration-token.