🤖 AI Summary
Monocular absolute 3D human pose estimation faces two fundamental bottlenecks: heavy reliance on scarce and costly 3D ground-truth annotations, and inherent ambiguity in recovering metric scale and global translation from single-view observations. This paper introduces a two-stage diffusion framework (2D pre-training followed by 3D fine-tuning) that substantially reduces dependence on 3D supervision, enabling high-fidelity 3D motion reconstruction driven primarily by large-scale 2D pose data. Key technical contributions include: (1) decoupled modeling of local joint kinematics and global root translation; (2) ground-plane geometric priors that encourage physically plausible absolute positioning; and (3) cross-view consistency constraints combined with geometry-aware motion encoding. Evaluated on real-world benchmarks, the method achieves a 21% reduction in absolute position error and a 17% improvement in action-level accuracy, along with notably better generalization and scalability.
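The ground-plane prior in contribution (2) can be pictured as a soft constraint that discourages the estimated body from sinking below the floor. The snippet below is a hedged illustration, not the paper's actual loss: it assumes a known ground plane at height 0 and penalizes the squared depth of any foot joint that penetrates it.

```python
# Illustrative sketch (assumed formulation, not from the paper): a simple
# ground-penetration penalty for an assumed ground plane at height y = 0.
def ground_penetration_loss(foot_heights):
    # foot_heights: per-frame vertical coordinates of a foot joint (metres).
    # Only negative heights (below the plane) contribute to the penalty.
    return sum(max(0.0, -h) ** 2 for h in foot_heights) / len(foot_heights)

# Two of the four frames dip below the plane and are penalized.
loss = ground_penetration_loss([0.02, -0.01, 0.0, -0.03])
```

In practice such a term would be one of several objectives balanced during training or inference-time optimization, alongside data and consistency losses.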
📝 Abstract
Recovering absolute poses in the world coordinate system from monocular views presents significant challenges, two in particular. First, existing methods rely on 3D motion data for training, which must be collected in constrained environments; acquiring 3D labels for new actions in a timely manner is impractical, severely restricting generalization. In contrast, 2D poses are far more accessible and easier to obtain. Second, estimating a person's absolute position in metric space from a single viewpoint is inherently ambiguous. To address these challenges, we introduce Mocap-2-to-3, a novel framework that decomposes intricate 3D motions into 2D poses, leveraging 2D data to enhance 3D motion reconstruction in diverse scenarios and to accurately predict absolute positions in the world coordinate system. We first pretrain a single-view diffusion model on extensive 2D data, then fine-tune a multi-view diffusion model for view consistency using publicly available 3D data. This strategy makes effective use of large-scale 2D data. In addition, we propose a human motion representation that decouples local actions from global movements and encodes geometric priors of the ground, ensuring the generative model learns accurate motion priors from 2D data. During inference, this allows global movements to be recovered gradually, yielding more plausible positioning. We evaluate our model on real-world datasets, demonstrating superior accuracy in motion and absolute human positioning compared to state-of-the-art methods, along with enhanced generalization and scalability. Our code will be made publicly available.
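The decoupling of local actions from global movement described above can be sketched as splitting a motion sequence into a global root trajectory plus root-relative joint poses, which can later be recomposed. This is a minimal illustration under assumed conventions (joint 0 as the root, positions as (x, y, z) tuples), not the paper's actual representation.

```python
# Illustrative sketch (assumed layout, not the paper's code): split absolute
# joint positions into a global root trajectory and local, root-relative poses.

def decouple(motion):
    # motion: list of frames; each frame is a list of (x, y, z) joint
    # positions, with joint 0 assumed to be the root.
    root_traj = [frame[0] for frame in motion]
    local_pose = [
        [tuple(c - r for c, r in zip(joint, frame[0])) for joint in frame]
        for frame in motion
    ]
    return root_traj, local_pose

def recompose(root_traj, local_pose):
    # Inverse of decouple: add the root position back to every joint.
    return [
        [tuple(c + r for c, r in zip(joint, root)) for joint in frame]
        for root, frame in zip(root_traj, local_pose)
    ]

# Round-trip check on a tiny 2-frame, 3-joint sequence
# (dyadic coordinates, so float arithmetic is exact).
motion = [
    [(1.0, 0.0, 2.0), (1.5, 1.0, 2.0), (0.5, 1.0, 2.0)],
    [(1.25, 0.0, 2.5), (1.75, 1.0, 2.5), (0.75, 1.0, 2.5)],
]
root, local = decouple(motion)
assert recompose(root, local) == motion
```

Decomposing this way lets a generative model learn local pose dynamics from abundant 2D data while the global trajectory is constrained separately, e.g. by the ground priors mentioned above.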