AI Summary
This work addresses the challenge of achieving both temporally coherent whole-body motion and high-fidelity hand detail when reconstructing human meshes from monocular video. The authors propose a temporally consistent whole-body reconstruction framework that, for the first time, jointly optimizes SMPL-X body and hand motion within a unified architecture. A residual body-hand feature fusion mechanism lets the method preserve stable body dynamics while recovering fine-grained hand poses. A close-up-aware data augmentation strategy further improves robustness under upper-body-centric framing. Experiments on whole-body and body-only benchmarks show improved hand accuracy with competitive body pose accuracy, and the method produces temporally stable SMPL-X sequences that align well with 2D observations in complex real-world scenes.
Abstract
Monocular video human mesh recovery (HMR) is essential for digital humans, avatar animation, and embodied simulation, where both temporal stability and expressive whole-body motion are required. Existing video HMR methods produce coherent body motion but often overlook detailed hand articulation, while image-based whole-body methods recover SMPL-X meshes independently per frame, often leading to jittery and inaccurate hand motion. We present a temporally coherent whole-body HMR framework for challenging in-the-wild monocular videos. Our model unifies body context and part-specific hand observations through residual body-hand fusion, enabling stable body motion and detailed hand recovery within a single temporal architecture. We further introduce close-up-aware augmentation to improve robustness under upper-body framing. Experiments on whole-body and body-only benchmarks demonstrate improved hand reconstruction and competitive body accuracy. Our method also produces temporally stable and 2D-consistent SMPL-X motion in challenging real-world videos.
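The abstract does not spell out how residual body-hand fusion is computed. One plausible reading is that hand observations contribute a learned correction that is added to the body features, so the body representation remains the default when the hand branch contributes nothing. The sketch below illustrates only that general idea in plain Python; the function names, the single bias-free linear layer, and the feature sizes are all assumptions made for illustration, not the authors' architecture.

```python
from typing import List, Sequence

Vector = List[float]


def linear(x: Sequence[float], weight: Sequence[Sequence[float]]) -> Vector:
    """Bias-free linear map: one output value per row of `weight`."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def residual_fuse(
    body_feat: Sequence[float],
    hand_feat: Sequence[float],
    weight: Sequence[Sequence[float]],
) -> Vector:
    """Hypothetical residual fusion: body_feat + W @ concat(body_feat, hand_feat).

    `weight` stands in for a learned projection (an assumption); it must have
    len(body_feat) rows and len(body_feat) + len(hand_feat) columns.
    """
    joint = list(body_feat) + list(hand_feat)
    if any(len(row) != len(joint) for row in weight):
        raise ValueError("each weight row must match the concatenated feature size")
    delta = linear(joint, weight)
    if len(delta) != len(body_feat):
        raise ValueError("weight must produce one correction per body feature")
    # The skip connection keeps the body features intact; the hand branch
    # only adds a correction on top of them.
    return [b + d for b, d in zip(body_feat, delta)]


if __name__ == "__main__":
    body = [1.0, 2.0]
    hand = [3.0]
    zero_weight = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    print(residual_fuse(body, hand, zero_weight))
```

With an all-zero projection the fused output equals the body features, which is the property that makes a residual design attractive here: the hand branch can refine the estimate without destabilizing the body motion it builds on.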