🤖 AI Summary
This paper addresses the joint problem of estimating human pose and reconstructing a photorealistic 3D avatar from multi-view video, and proposes an end-to-end joint optimization framework for it. The method introduces a novel animatable avatar with 3D Gaussians rigged on a personalized mesh and optimizes the motion sequence with time-dependent MLPs, which yields accurate, temporally consistent pose estimates while geometry and appearance are optimized jointly. On challenging yoga poses, the approach reduces error by 35% on body joints and 45% on hand joints compared to keypoint-based methods, and it improves novel-view synthesis by 2 dB PSNR on diverse challenging subjects. The core contributions are: (1) a drivable, personalized 3D Gaussian human representation; and (2) joint optimization of skeletal motion with a renderable 3D body model.
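To put the 2 dB PSNR figure in perspective, the short sketch below converts a PSNR gain into the corresponding reduction in mean squared error, using the standard definition PSNR = 10 · log10(MAX² / MSE). The baseline MSE value and the function names are illustrative assumptions, not numbers or code from the paper.

```python
import math


def psnr(mse: float, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which the MSE is multiplied when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)


# Hypothetical baseline error, chosen only for illustration.
baseline_mse = 1e-3
improved_mse = baseline_mse * mse_ratio_for_gain(2.0)

gain = psnr(improved_mse) - psnr(baseline_mse)
mse_reduction = 1.0 - mse_ratio_for_gain(2.0)
print(f"PSNR gain: {gain:.2f} dB, MSE reduction: {mse_reduction:.1%}")
```

Under this definition, a 2 dB gain corresponds to roughly a 37% lower mean squared error, independent of the baseline chosen.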
📝 Abstract
We present Better Together, a method that simultaneously solves the human pose estimation problem while reconstructing a photorealistic 3D human avatar from multi-view videos. While prior art usually solves these problems separately, we argue that joint optimization of skeletal motion with a 3D renderable body model brings synergistic effects, i.e., it yields more precise motion capture and improved visual quality of real-time rendering of avatars. To achieve this, we introduce a novel animatable avatar with 3D Gaussians rigged on a personalized mesh and propose to optimize the motion sequence with time-dependent MLPs that provide accurate and temporally consistent pose estimates. We first evaluate our method on highly challenging yoga poses and demonstrate state-of-the-art accuracy on multi-view human pose estimation, reducing error by 35% on body joints and 45% on hand joints compared to keypoint-based methods. At the same time, our method significantly boosts the visual quality of animatable avatars (+2 dB PSNR on novel view synthesis) on diverse challenging subjects.
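The idea of representing a motion sequence with a time-dependent MLP can be illustrated with a minimal forward-pass sketch: a small network maps a normalized timestamp to per-joint rotation parameters, so poses at neighboring timestamps come from one smooth function instead of being fitted independently per frame. Everything here is an assumption made for illustration (the sinusoidal time encoding, the layer sizes, 24 joints with 3 parameters each, and the names `encode_time` and `TimePoseMLP`); the abstract does not specify the architecture, and the random weights stand in for parameters that would be optimized against the rendering and pose objectives.

```python
import math
import random


def encode_time(t, num_freqs=4):
    """Sinusoidal encoding of a timestamp t in [0, 1] (assumed, not from the paper)."""
    feats = []
    for k in range(num_freqs):
        w = (2.0 ** k) * math.pi
        feats.extend([math.sin(w * t), math.cos(w * t)])
    return feats


def make_layer(rng, n_in, n_out, scale=0.1):
    """Create a dense layer with small random weights and zero biases."""
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def apply_layer(layer, x, activation=None):
    """Compute activation(W x + b) for one dense layer."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [activation(v) for v in out] if activation else out


class TimePoseMLP:
    """Hypothetical time-conditioned MLP: timestamp -> (num_joints x 3) pose parameters."""

    def __init__(self, num_joints=24, hidden=32, num_freqs=4, seed=0):
        rng = random.Random(seed)
        self.num_joints = num_joints
        self.num_freqs = num_freqs
        self.hidden_layer = make_layer(rng, 2 * num_freqs, hidden)
        self.output_layer = make_layer(rng, hidden, 3 * num_joints)

    def __call__(self, t):
        h = apply_layer(self.hidden_layer, encode_time(t, self.num_freqs), math.tanh)
        flat = apply_layer(self.output_layer, h)
        # Group the flat output into one 3-vector per joint.
        return [flat[3 * j: 3 * j + 3] for j in range(self.num_joints)]


model = TimePoseMLP()
pose = model(0.5)  # list of 24 joints, each with 3 rotation parameters
```

Because the pose is a continuous function of `t`, nearby timestamps produce nearby poses, which is one plausible way such a parameterization could encourage temporal consistency.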