🤖 AI Summary
Existing video-based human mesh recovery (HMR) methods often suffer from motion jitter and temporal inconsistency. This paper presents DiffMesh, a motion-aware diffusion-like framework for video-based HMR that incorporates human motion into both the forward and reverse processes of the diffusion model, targeting accurate geometry and smooth motion jointly. Experiments on the Human3.6M and 3DPW benchmarks demonstrate the method's effectiveness and efficiency, and visual comparisons in real-world scenarios support its suitability for practical applications.
📝 Abstract
Human mesh recovery (HMR) provides rich human body information for various real-world applications. While image-based HMR methods have achieved impressive results, they often struggle to recover humans in dynamic scenarios, producing temporally inconsistent and non-smooth 3D motion predictions because they do not model human motion. Video-based approaches, in contrast, leverage temporal information to mitigate this issue. In this paper, we present DiffMesh, an innovative motion-aware diffusion-like framework for video-based HMR. DiffMesh builds a bridge between diffusion models and human motion, efficiently generating accurate and smooth output mesh sequences by incorporating human motion into both the forward and reverse processes of the diffusion model. Extensive experiments on the widely used Human3.6M [h36m_pami] and 3DPW [pw3d2018] datasets demonstrate the effectiveness and efficiency of DiffMesh. Visual comparisons in real-world scenarios further highlight DiffMesh's suitability for practical applications.
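The abstract does not give DiffMesh's exact formulation, but the core idea of "incorporating human motion within the forward and reverse processes" can be illustrated with a toy DDPM-style sketch. In the snippet below, a motion-prior vector biases both the noise injected in the forward step and the mean of the reverse step; the noise schedule, blending weights, the 85-dimensional SMPL-style parameter layout, and all function names are illustrative assumptions, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear noise schedule (hypothetical values; the paper's schedule is not given)
T = 10
betas = np.linspace(1e-4, 0.02, T)

def forward_diffuse(x0, motion_prior, t):
    """Forward step: noise the SMPL parameter vector, with the injected noise
    blended toward a motion prior (illustrative weighting, not DiffMesh's)."""
    alpha_bar = np.prod(1.0 - betas[: t + 1])
    noise = rng.standard_normal(x0.shape)
    blended = 0.5 * noise + 0.5 * motion_prior  # hypothetical motion-aware noise
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * blended

def reverse_step(x_t, pred_noise, motion_prior, t):
    """One reverse (denoising) step, nudged toward the motion prior so the
    recovered sequence stays smooth (0.1 is an illustrative weight)."""
    alpha = 1.0 - betas[t]
    alpha_bar = np.prod(1.0 - betas[: t + 1])
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar) * pred_noise) / np.sqrt(alpha)
    return mean + 0.1 * (motion_prior - mean)

# Toy usage: 85-D SMPL-style vector (72 pose + 10 shape + 3 translation)
x0 = rng.standard_normal(85)
motion_prior = rng.standard_normal(85)
x_t = forward_diffuse(x0, motion_prior, t=T - 1)
# A real system would predict the noise with a learned network; here it is random.
x_prev = reverse_step(x_t, pred_noise=rng.standard_normal(85),
                      motion_prior=motion_prior, t=T - 1)
print(x_prev.shape)
```

In an actual pipeline the `pred_noise` argument would come from a learned denoiser conditioned on the video frames; the sketch only shows where a motion prior could enter each process.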