🤖 AI Summary
In setups with multiple rigid cameras, as used in animal behavior analysis and forensic video authentication, strong radial distortion severely degrades the robustness of multi-view camera calibration. To address this, we propose a joint intrinsic-extrinsic calibration method tailored to dense feature matching. Our approach integrates (1) a structure-from-motion (SfM) framework enhanced with VGGT-based feature matching, adaptive subsampling of correspondences, and incremental view selection; and (2) a distortion-aware pose initialization and global optimization pipeline. On strongly distorted datasets, our method achieves a calibration success rate of 79.9%, compared with 40.4% for the VGGT baseline. It supports diverse camera configurations, including fisheye, wide-angle, and catadioptric systems, and is aimed at practical deployment in real-world applications.
📝 Abstract
Estimating camera intrinsics and extrinsics is a fundamental problem in computer vision, and while advances in structure-from-motion (SfM) have improved accuracy and robustness, open challenges remain. In this paper, we introduce a robust method for pose estimation and calibration. We consider a set of rigid cameras, each observing the scene from a different perspective, which is a typical camera setup in animal behavior studies and forensic analysis of surveillance footage. Specifically, we analyse the individual components of an SfM pipeline and identify design choices that improve accuracy. Our main contributions are: (1) we investigate how best to subsample the correspondences predicted by a dense matcher so that they can be leveraged in the estimation process; and (2) we investigate selection criteria for adding views incrementally. In a rigorous quantitative evaluation, we show the effectiveness of our changes, especially for cameras with strong radial distortion (79.9% for ours vs. 40.4% for vanilla VGGT). Finally, we demonstrate our correspondence subsampling in a global SfM setting where we initialize the poses using VGGT. The proposed pipeline generalizes across a wide range of camera setups, and could thus become a useful tool for animal behavior and forensic analysis.
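To make the idea of subsampling dense correspondences concrete, the following is a minimal sketch of one plausible strategy: keep only the most confident matches in each cell of a regular grid over the first image, so that the retained matches stay spatially spread out. This is an illustrative assumption, not the subsampling scheme evaluated in the paper; the function name `subsample_correspondences` and the parameters `grid` and `per_cell` are hypothetical.

```python
import numpy as np


def subsample_correspondences(pts_a, pts_b, conf, image_size, grid=8, per_cell=4):
    """Keep up to `per_cell` highest-confidence matches per grid cell of image A.

    Hypothetical helper for illustration only (not the paper's method).
    pts_a, pts_b: (N, 2) arrays of matched pixel coordinates (x, y).
    conf:         (N,) array of per-match confidence scores.
    image_size:   (width, height) of image A in pixels.
    """
    width, height = image_size
    # Assign every point in image A to a cell of a grid x grid partition.
    cx = np.clip((pts_a[:, 0] / width * grid).astype(int), 0, grid - 1)
    cy = np.clip((pts_a[:, 1] / height * grid).astype(int), 0, grid - 1)
    cell = cy * grid + cx

    keep = []
    for c in np.unique(cell):
        idx = np.flatnonzero(cell == c)
        # Sort the matches of this cell by descending confidence.
        best = idx[np.argsort(-conf[idx])[:per_cell]]
        keep.extend(best.tolist())

    keep = np.array(sorted(keep), dtype=int)
    return pts_a[keep], pts_b[keep], conf[keep], keep
```

The grid bounds the number of retained matches at `grid * grid * per_cell` and prevents densely textured regions from dominating the estimation. Both values would need tuning for a given matcher and image resolution.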