AI Summary
This work addresses three key challenges in fisheye monocular depth estimation: the absence of ground-truth depth annotations, severe distortion-induced scale ambiguity, and training instability. We propose a real-scale self-supervised monocular depth estimation method specifically designed for fisheye cameras. Our approach explicitly embeds a differentiable fisheye camera model into the reprojection pipeline, the first such integration. To resolve scale ambiguity, we replace network-predicted poses with geometrically calibrated, metric-scale poses derived from intrinsic and extrinsic calibration. Furthermore, we introduce a multi-scale adaptive feature fusion module to suppress pose estimation noise. By unifying fisheye geometric modeling, real-scale geometric constraints, and a self-supervised learning framework, our method achieves significant improvements in depth accuracy and robustness on public benchmarks and real-world fisheye sequences. It produces physically interpretable, metric-scale depth maps while simplifying both training and inference pipelines.
Abstract
Accurate depth estimation is crucial for 3D scene comprehension in robotics and autonomous vehicles. Fisheye cameras, known for their wide field of view, offer inherent geometric benefits. However, their use in depth estimation is restricted by a scarcity of ground-truth data and by severe image distortions. We present FisheyeDepth, a self-supervised depth estimation model tailored for fisheye cameras. We incorporate a fisheye camera model into the projection and reprojection stages during training to handle image distortions, thereby improving depth estimation accuracy and training stability. Furthermore, we incorporate real-scale pose information into the geometric projection between consecutive frames, replacing the poses estimated by a conventional pose network. Crucially, this provides the metric depth required for robotic tasks while also streamlining the training and inference procedures. Additionally, we devise a multi-channel output strategy that adaptively fuses features at multiple scales, improving robustness to noise in the real pose data. We demonstrate the superior performance and robustness of our model in fisheye image depth estimation through evaluations on public datasets and real-world scenarios. The project website is available at: https://github.com/guoyangzhao/FisheyeDepth.
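The core idea above, projecting and reprojecting through an explicit fisheye camera model with a calibrated metric-scale pose instead of a learned one, can be sketched as follows. This is a minimal NumPy illustration, not the repository's implementation: it assumes a Kannala-Brandt-style fisheye model with hypothetical intrinsics `K` and a hand-picked pose `(R, t)`, and in the actual training pipeline these operations would be written with differentiable tensor ops so gradients flow to the depth network.

```python
import numpy as np

# Hypothetical Kannala-Brandt fisheye intrinsics (fx, fy, cx, cy, k1..k4); illustrative values only.
K = dict(fx=300.0, fy=300.0, cx=320.0, cy=240.0, k=np.array([0.05, -0.01, 0.002, -0.0005]))

def fisheye_project(P, K):
    """Project 3D camera-frame points (N, 3) to pixels under the Kannala-Brandt model."""
    x, y, z = P[:, 0], P[:, 1], P[:, 2]
    r = np.sqrt(x**2 + y**2)
    theta = np.arctan2(r, z)                      # angle from the optical axis
    k1, k2, k3, k4 = K["k"]
    d = theta * (1 + k1*theta**2 + k2*theta**4 + k3*theta**6 + k4*theta**8)
    scale = np.where(r > 1e-9, d / np.maximum(r, 1e-9), 1.0)  # guard the image center
    u = K["fx"] * scale * x + K["cx"]
    v = K["fy"] * scale * y + K["cy"]
    return np.stack([u, v], axis=1)

def fisheye_unproject(uv, depth, K, iters=8):
    """Lift pixels (N, 2) with metric depth (N,) to 3D points (Newton-inverted distortion)."""
    mx = (uv[:, 0] - K["cx"]) / K["fx"]
    my = (uv[:, 1] - K["cy"]) / K["fy"]
    d = np.sqrt(mx**2 + my**2)                    # observed distorted radius d(theta)
    theta = d.copy()                              # initial guess for Newton's method
    k1, k2, k3, k4 = K["k"]
    for _ in range(iters):                        # solve d(theta) - d = 0
        f = theta*(1 + k1*theta**2 + k2*theta**4 + k3*theta**6 + k4*theta**8) - d
        fp = 1 + 3*k1*theta**2 + 5*k2*theta**4 + 7*k3*theta**6 + 9*k4*theta**8
        theta = theta - f / fp
    s = np.where(d > 1e-9, np.sin(theta) / np.maximum(d, 1e-9), 1.0)
    ray = np.stack([s*mx, s*my, np.cos(theta)], axis=1)  # unit-norm viewing ray
    return ray * depth[:, None]                   # metric depth taken as range along the ray

# Warp target pixels into the source frame using a *calibrated* metric pose (R, t),
# rather than a pose network's scale-ambiguous output. Values here are made up.
R = np.eye(3)
t = np.array([0.0, 0.0, 0.1])                     # e.g. 10 cm forward motion
pix = np.array([[400.0, 260.0]])
depth = np.array([5.0])                           # metric depth predicted by the network
P_tgt = fisheye_unproject(pix, depth, K)
P_src = P_tgt @ R.T + t                           # rigid transform into the source frame
uv_src = fisheye_project(P_src, K)                # sampling locations for the photometric loss
```

Because both projection directions go through the same distortion polynomial, the round trip is self-consistent, which is what makes the photometric reprojection loss geometrically meaningful on raw (unrectified) fisheye images.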