🤖 AI Summary
This paper addresses the spatiotemporal calibration challenge for multi-view video in dynamic, multi-person scenes. We propose an end-to-end, markerless calibration method leveraging freely moving pedestrians. By modeling cross-view human motion as probabilistic point-set registration on the unit sphere, our approach jointly estimates camera rotation, translation, inter-camera time offsets, and person-level cross-view correspondences. Our key contribution is a novel temporal-geometric joint optimization framework that integrates monocular 3D pose estimation, unit-sphere projection, soft assignment matching, coplanarity constraints, and multi-view consistency regularization. Evaluated on both synthetic and real-world datasets, the method achieves sub-degree rotational accuracy (<1°) and sub-hundred-millisecond temporal synchronization precision, significantly outperforming existing calibration-free approaches. The framework is robust to unconstrained human motion and requires no specialized calibration objects, making it suitable for flexible deployment in practical surveillance and human motion analysis applications.
📝 Abstract
We propose a novel method for spatiotemporal multi-camera calibration using freely moving people in multiview videos. Since calibrating multiple cameras and finding matches across their views are inherently interdependent, performing both in a unified framework poses a significant challenge. We cast both tasks as a single registration problem between two sets of 3D points, leveraging human motion in dynamic multi-person scenes. To this end, we take 3D human poses from an off-the-shelf monocular 3D human pose estimator and transform them into 3D points on a unit sphere, then solve for the rotation, the time offset, and the association in alternation. We employ a probabilistic approach that jointly aligns the spatiotemporal data and establishes correspondences through soft assignment between two views. The translation is determined by applying coplanarity constraints. The pairwise registration results are integrated into a multiview setup, and a nonlinear optimization method is then used to improve the accuracy of the camera poses, temporal offsets, and multi-person associations. Extensive experiments on synthetic and real data demonstrate the effectiveness and flexibility of the proposed method as a practical marker-free calibration tool.
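To make the alternation between soft assignment and rotation estimation concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names, the Gaussian kernel, the bandwidth `sigma`, and the SVD-based (Kabsch-style) rotation update are all assumptions chosen for illustration, and the time-offset search, translation from coplanarity, and multiview refinement described above are omitted.

```python
import numpy as np

def to_unit_sphere(points):
    """Scale each 3D vector (one per row) to unit length."""
    norms = np.linalg.norm(points, axis=1, keepdims=True)
    return points / np.clip(norms, 1e-12, None)

def soft_assignment(src, dst, rotation, sigma=0.2):
    """Row-normalized Gaussian weights between rotated src and dst points.

    The Gaussian kernel and sigma are illustrative assumptions.
    """
    rotated = src @ rotation.T
    sq_dist = ((rotated[:, None, :] - dst[None, :, :]) ** 2).sum(axis=2)
    logits = -sq_dist / (2.0 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def weighted_rotation(src, dst, weights):
    """Rotation R minimizing sum_ij w_ij * ||R src_i - dst_j||^2 (Kabsch-style)."""
    H = src.T @ weights @ dst  # 3x3 weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

def register(src, dst, iters=30, sigma=0.2):
    """Alternate soft assignment and rotation updates, starting from identity."""
    R = np.eye(3)
    W = soft_assignment(src, dst, R, sigma)
    for _ in range(iters):
        R = weighted_rotation(src, dst, W)
        W = soft_assignment(src, dst, R, sigma)
    return R, W
```

Here `src` and `dst` would hold unit-sphere points derived from the 3D poses seen by two cameras. Like any alternating scheme of this kind, the sketch can settle in a local optimum when the initial rotation is far from the true one.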