HumanSplatHMR: Closing the Loop Between Human Mesh Recovery and Gaussian Splatting Avatar

📅 2026-05-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing methods for recovering 3D human geometry from monocular video face two main problems: Vision Transformer (ViT)-based estimators are not consistently reliable and can overfit to 2D views, while NeRF- and Gaussian Splatting-based avatars treat pose and appearance separately, which limits generalization to novel poses. This work proposes HumanSplatHMR, a joint optimization framework that tightly couples human mesh recovery with Gaussian Splatting avatar modeling. Photometric, segmentation, and depth losses are backpropagated end-to-end through a differentiable renderer to the pose parameters and global position, so pose and appearance are optimized in a closed loop. The method requires neither motion capture nor offline pose refinement; it starts from the mesh estimates of an off-the-shelf human pose estimator. It consistently improves 3D pose accuracy in real-world scenes and rendering quality under novel poses and viewpoints, outperforming decoupled baselines.

📝 Abstract

Accurately recovering human pose and appearance from video is an essential component of scene reconstruction, with applications to motion capture, motion prediction, virtual reality, and digital twinning. Despite significant interest in building realistic human avatars from video, this paper demonstrates that existing methods do not accurately recover the 3D geometry of humans. ViT-based approaches are not consistently reliable and can overfit to 2D views, while NeRF- and Gaussian Splatting-based avatars treat pose and appearance separately, limiting rendering generalization to new poses. To resolve these shortcomings, this paper proposes HumanSplatHMR, a joint optimization framework that refines 3D human poses while simultaneously learning a high-fidelity avatar for novel-view and novel-pose synthesis. Our key insight is to close the loop between geometric pose estimation and differentiable rendering. Prior human avatar methods rely on accurate human poses obtained through motion capture systems or offline refinement, which are impractical in the wild. Our approach instead uses only human mesh estimates from a state-of-the-art human pose estimator, which better reflects real-world conditions. Rather than using the human pose only as a deformation prior, HumanSplatHMR backpropagates photometric, segmentation, and depth losses through a differentiable renderer to the pose parameters and global position. This coupling refines the global 3D pose over time, improving accuracy and alignment while producing better renderings from novel views. Experiments show consistent improvements over pose recovery baselines that omit image-level refinement and over avatar baselines that decouple pose estimation from avatar reconstruction.
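
To make the closed-loop idea concrete, the following is a minimal toy sketch, not the paper's implementation. It assumes a 1D "image", a single Gaussian whose center stands in for the global position, and a photometric loss only (no segmentation or depth terms, no body model). The names render, loss_and_grad, and refine are hypothetical. A deliberately wrong initial position, playing the role of a noisy pose estimate, is refined by following the gradient of the image loss through the renderer.

```python
import math

def render(t, xs, sigma=2.0):
    """Toy differentiable renderer: splat one 1D Gaussian centered at t."""
    return [math.exp(-((x - t) ** 2) / (2 * sigma ** 2)) for x in xs]

def loss_and_grad(t, target, xs, sigma=2.0):
    """Mean squared photometric error and its analytic derivative w.r.t. t."""
    img = render(t, xs, sigma)
    n = len(xs)
    loss = sum((i - g) ** 2 for i, g in zip(img, target)) / n
    # Chain rule through the renderer: d img / d t = img * (x - t) / sigma^2
    grad = sum(2 * (i - g) * i * (x - t) / sigma ** 2
               for i, g, x in zip(img, target, xs)) / n
    return loss, grad

def refine(t_init, target, xs, lr=5.0, steps=200):
    """Gradient descent on the position, driven only by the image loss."""
    t = t_init
    for _ in range(steps):
        _, g = loss_and_grad(t, target, xs)
        t -= lr * g
    return t

xs = list(range(32))                  # pixel coordinates of the toy image
t_true = 16.0                         # position that produced the observed image
target = render(t_true, xs)           # "observed" frame
t_refined = refine(13.0, target, xs)  # 13.0 plays the role of a noisy estimate
print(abs(t_refined - t_true) < 1e-3)
```

In a full system the scalar t would be replaced by body pose parameters and a global translation, the renderer by a Gaussian Splatting rasterizer, and the analytic derivative by automatic differentiation; the learning rate and step count here are arbitrary choices for the toy problem.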

Problem

Research questions and friction points this paper is trying to address.

human mesh recovery

Gaussian Splatting

3D human pose

avatar reconstruction

novel-pose synthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Human Mesh Recovery

Gaussian Splatting

Differentiable Rendering