AI Summary
This work addresses the challenges of joint human-background reconstruction and real-time interactive rendering from monocular video. We propose a decoupled dynamic scene modeling framework: (1) a texture-guided SMPL surface point cloud growth mechanism generates high-fidelity, position- and texture-driven human point clouds; (2) LBS weights enable hyperparameter-free densification and real-time deformation, supporting pose and viewpoint generalization as well as cross-species transfer; (3) human and background are jointly optimized via Gaussian splatting, with geometry and appearance features predicted by a CNN. Experiments demonstrate superior reconstruction quality over HUGS, a 50% reduction in training GPU memory consumption, and real-time rendering at over 100 FPS, approximately six times faster than HUGS. To our knowledge, this is the first method to enable high-quality, real-time interactive editing and novel-view synthesis directly from monocular video.
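The surface point cloud growth in step (1) can be pictured as densifying the SMPL mesh surface into points. The following is only a simplified sketch of that idea, using uniform barycentric sampling on mesh triangles in place of the paper's actual position-texture mechanism; the function name and parameters are illustrative, not from the paper.

```python
import numpy as np

def grow_surface_points(vertices, faces, samples_per_face=4, seed=0):
    """Densify a triangle mesh surface into a point cloud.

    A simplified stand-in for texture-guided point growth on the SMPL
    surface: each triangle is covered with uniformly distributed samples.

    vertices: (V, 3) mesh vertex positions
    faces:    (F, 3) integer vertex indices per triangle
    """
    rng = np.random.default_rng(seed)
    tris = vertices[faces]                                  # (F, 3, 3)
    # Uniform barycentric coordinates via the square-root trick
    r1 = np.sqrt(rng.random((len(faces), samples_per_face, 1)))
    r2 = rng.random((len(faces), samples_per_face, 1))
    a, b, c = 1.0 - r1, r1 * (1.0 - r2), r1 * r2
    # Blend the three triangle corners with the barycentric weights
    pts = a * tris[:, None, 0] + b * tris[:, None, 1] + c * tris[:, None, 2]
    return pts.reshape(-1, 3)                               # (F * S, 3)
```

Raising `samples_per_face` (or, in the paper's formulation, the position-texture resolution) trades memory for surface detail without any per-point densification heuristics.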
Abstract
Reconstructing an interactive human avatar and the background from a monocular video of a dynamic human scene is highly challenging. In this work, we adopt a strategy of point cloud decoupling and joint optimization to reconstruct backgrounds and human bodies separately while preserving the interactivity of human motion. We introduce a position texture to subdivide the Skinned Multi-Person Linear (SMPL) body model's surface and grow the human point cloud. To capture fine details of human dynamics and deformations, we incorporate a convolutional neural network to predict human body point cloud features based on texture. This strategy frees our approach from hyperparameter tuning for densification and represents the human with half as many points as HUGS. This approach ensures high-quality human reconstruction and reduces GPU resource consumption during training. As a result, our method surpasses the previous state-of-the-art HUGS in reconstruction metrics while retaining the ability to generalize to novel poses and views. Furthermore, our technique achieves real-time rendering at over 100 FPS ($\sim$6$\times$ the speed of HUGS), using only Linear Blend Skinning (LBS) weights for human transformation. Additionally, this work demonstrates that the framework can be extended to animal scene reconstruction when an accurately posed animal model is available.
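The LBS transformation used for real-time deformation is the standard one: each canonical point is moved by a weight-blended combination of per-bone rigid transforms. The sketch below shows that standard formulation only, not the paper's implementation; the function name and array shapes are illustrative.

```python
import numpy as np

def lbs_transform(points, weights, rotations, translations):
    """Deform canonical points with Linear Blend Skinning (LBS).

    points:       (N, 3) canonical point positions
    weights:      (N, K) per-point skinning weights (rows sum to 1)
    rotations:    (K, 3, 3) per-bone rotation matrices
    translations: (K, 3) per-bone translations
    """
    # Apply every bone transform to every point: posed[k, n] = R_k p_n + t_k
    posed = np.einsum('kij,nj->kni', rotations, points) + translations[:, None, :]
    # Blend the K candidate positions with the skinning weights
    return np.einsum('nk,kni->ni', weights, posed)
```

Because this is a fixed linear blend, posing the grown point cloud needs no network evaluation at render time, which is what makes frame rates above 100 FPS attainable.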