🤖 AI Summary
This work addresses the challenging problem of joint 3D human mesh reconstruction and camera pose estimation from uncalibrated multi-view videos—particularly under severe occlusion and textureless conditions. We propose an end-to-end differentiable framework that jointly optimizes human body meshes and camera parameters by introducing two novel components: (i) a pose-geometry consistency constraint enforcing geometric coherence across views and poses, and (ii) an implicit motion prior encoding natural human dynamics. Initialization leverages 2D human detections and keypoint heatmaps to estimate intrinsic and extrinsic camera parameters; subsequent optimization integrates cross-view person association, differentiable rendering, and motion modeling—requiring neither calibration targets nor background features. Evaluated on multiple public benchmarks, our method achieves single-stage reconstruction with 37% lower camera parameter error and an MPJPE of 58.3 mm, significantly outperforming conventional sequential approaches. The code is publicly available.
📝 Abstract
Dynamic multi-person mesh recovery has broad applications in sports broadcasting, virtual reality, and video games. However, current multi-view frameworks rely on a time-consuming camera calibration procedure. In this work, we focus on multi-person motion capture with uncalibrated cameras, which mainly faces two challenges: one is that inter-person interactions and occlusions introduce inherent ambiguities for both camera calibration and motion capture; the other is that a lack of dense correspondences can be used to constrain sparse camera geometries in a dynamic multi-person scene. Our key idea is to incorporate motion prior knowledge to simultaneously estimate camera parameters and human meshes from noisy human semantics. We first utilize human information from 2D images to initialize intrinsic and extrinsic parameters. Thus, the approach does not rely on any other calibration tools or background features. Then, a pose-geometry consistency is introduced to associate the detected humans from different views. Finally, a latent motion prior is proposed to refine the camera parameters and human motions. Experimental results show that accurate camera parameters and human motions can be obtained through a one-step reconstruction. The code are publicly available at https://github.com/boycehbz/DMMR.