🤖 AI Summary
Existing camera calibration methods either depend on controlled acquisition setups or, in the case of recent learning-based approaches, operate on a single view and therefore ignore geometric consistency across views. This work proposes CalibAnyView, a unified framework that jointly estimates camera intrinsics and gravity direction from an arbitrary number of input views ($N \geq 1$). A multi-view Transformer predicts dense perspective fields, which feed a geometric optimization stage that recovers the calibration parameters. To support this, the authors construct a large-scale multi-view video dataset spanning multiple camera models, dynamic scenes, realistic motion trajectories, and heterogeneous lens distortions. Experiments show that CalibAnyView outperforms state-of-the-art methods, remains robust with a single view, and improves further as views are added, making it a reliable basis for downstream tasks such as 3D reconstruction and robotic perception.
📝 Abstract
Camera calibration is a fundamental prerequisite for reliable geometric perception, yet classical approaches rely on controlled acquisition setups that are impractical for in-the-wild imagery. Recent learning-based methods have shown promising results for single-view calibration, but inherently neglect geometric consistency across multiple views. We introduce CalibAnyView, a unified formulation that supports an arbitrary number of input views ($N \geq 1$) by explicitly modeling cross-view geometric consistency. To facilitate this, we construct a large-scale multi-view video dataset covering diverse real-world scenarios, including multiple camera models, dynamic scenes, realistic motion trajectories, and heterogeneous lens distortions. Building on this dataset, we develop a multi-view transformer that predicts dense perspective fields, which are further integrated into a geometric optimization framework to jointly estimate camera intrinsics and gravity direction. Extensive experiments demonstrate that CalibAnyView consistently outperforms state-of-the-art methods, achieves strong robustness under single-view settings, and further improves with multi-view inference, providing a reliable foundation for downstream tasks such as 3D reconstruction and robotic perception in the wild.
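To make the link between a dense perspective field and the calibration parameters concrete, the following is a minimal sketch and not the authors' implementation. It assumes an ideal pinhole camera without distortion, uses only the latitude component of a perspective field, treats gravity as known per view, and replaces the paper's geometric optimization with a plain search over candidate focal lengths. All function names and parameters (`latitude`, `latitude_field`, `fit_shared_focal`, `step`) are hypothetical.

```python
import math


def latitude(u, v, f, cx, cy, gravity):
    """Latitude (radians) of the viewing ray through pixel (u, v).

    Assumes an ideal pinhole camera (x right, y down, z forward) and a
    gravity vector expressed in the camera frame.
    """
    gx, gy, gz = gravity
    g_norm = math.sqrt(gx * gx + gy * gy + gz * gz)
    x, y, z = (u - cx) / f, (v - cy) / f, 1.0
    r_norm = math.sqrt(x * x + y * y + z * z)
    # "Up" is the opposite of gravity, so sin(latitude) = -(ray . gravity).
    s = -(x * gx + y * gy + z * gz) / (r_norm * g_norm)
    return math.asin(max(-1.0, min(1.0, s)))


def latitude_field(f, cx, cy, gravity, width, height, step=8):
    """Latitude sampled on a coarse pixel grid, returned as a flat list."""
    return [
        latitude(u, v, f, cx, cy, gravity)
        for v in range(0, height, step)
        for u in range(0, width, step)
    ]


def fit_shared_focal(fields, gravities, cx, cy, width, height, candidates, step=8):
    """Pick the candidate focal length that best explains all views at once.

    Every view contributes a squared-error term against the same focal
    length, which is the simplest form of a cross-view consistency constraint.
    """
    def cost(f):
        total = 0.0
        for field, g in zip(fields, gravities):
            pred = latitude_field(f, cx, cy, g, width, height, step)
            total += sum((p - t) ** 2 for p, t in zip(pred, field))
        return total

    return min(candidates, key=cost)


# Two synthetic views share one focal length but have different tilts.
g_level = (0.0, 1.0, 0.0)
g_tilted = (0.0, math.cos(0.2), math.sin(0.2))
views = [latitude_field(500.0, 32.0, 24.0, g, 64, 48) for g in (g_level, g_tilted)]
estimate = fit_shared_focal(
    views, [g_level, g_tilted], 32.0, 24.0, 64, 48,
    [300.0, 400.0, 500.0, 600.0, 700.0],
)
```

In this toy setting the search selects the focal length used to synthesize the fields, because that candidate reproduces both views exactly. A realistic pipeline would also have to estimate gravity, handle lens distortion, and cope with noisy network predictions, none of which this sketch attempts.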