AI Summary
Existing methods for single- and multi-view video-driven animatable 3D human modeling struggle to accurately reconstruct the non-rigid dynamics of loose clothing, often yielding geometric distortions, temporal inconsistencies, and oversmoothing. This work proposes an efficient 3D Gaussian Splatting (3DGS)-based framework with two key innovations, a motion trend module and a latent bone encoder, which explicitly disentangle pose-dependent deformation from temporal evolution and thereby overcome the modeling bottleneck imposed by global pose conditioning. Leveraging multi-view video supervision, the approach achieves geometrically consistent, temporally stable, high-fidelity reconstruction of clothing deformation. Evaluated on standard benchmarks, the method significantly improves structural fidelity and perceptual quality in non-rigid regions: clothing details appear more realistic, and inter-frame consistency is substantially enhanced. A rigid-skinning baseline sketch follows below for context.
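For context, the animation backbone that 3DGS avatar pipelines typically build on is linear blend skinning (LBS) of canonical Gaussian centers; the loose-clothing dynamics discussed above are exactly what this rigid mapping alone cannot express. The sketch below is illustrative only and not taken from the paper; all function names and tensor shapes are assumptions.

```python
# Hedged sketch: posing canonical 3D Gaussian centers with linear blend
# skinning (LBS). Loose-clothing offsets are the residual motion that this
# rigid blend cannot capture. Names and shapes are illustrative assumptions.
import torch

def lbs_pose_gaussians(xyz_canonical, skin_weights, bone_transforms):
    """
    xyz_canonical:   (N, 3)    canonical Gaussian centers
    skin_weights:    (N, J)    per-Gaussian skinning weights (rows sum to 1)
    bone_transforms: (J, 4, 4) canonical-to-posed bone transforms
    returns:         (N, 3)    posed Gaussian centers
    """
    # Blend each Gaussian's bone transforms with its skinning weights.
    blended = torch.einsum('nj,jab->nab', skin_weights, bone_transforms)  # (N, 4, 4)
    # Apply the blended transform to homogeneous coordinates.
    ones = torch.ones(xyz_canonical.shape[0], 1)
    xyz_h = torch.cat([xyz_canonical, ones], dim=-1)                      # (N, 4)
    posed = torch.einsum('nab,nb->na', blended, xyz_h)
    return posed[:, :3]
```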
Abstract
Modeling animatable human avatars from monocular or multi-view videos has been widely studied, with recent approaches leveraging neural radiance fields (NeRFs) or 3D Gaussian Splatting (3DGS) achieving impressive results in novel-view and novel-pose synthesis. However, existing methods often struggle to accurately capture the dynamics of loose clothing, as they primarily rely on global pose conditioning or static per-frame representations, leading to oversmoothing and temporal inconsistencies in non-rigid regions. To address this, we propose RealityAvatar, an efficient framework for high-fidelity digital human modeling, specifically targeting loosely dressed avatars. Our method leverages 3D Gaussian Splatting to capture complex clothing deformations and motion dynamics while ensuring geometric consistency. By incorporating a motion trend module and a latent bone encoder, we explicitly model pose-dependent deformations and temporal variations in clothing behavior. Extensive experiments on benchmark datasets demonstrate the effectiveness of our approach in capturing fine-grained clothing deformations and motion-driven shape variations. Our method significantly enhances structural fidelity and perceptual quality in dynamic human reconstruction, particularly in non-rigid regions, while achieving improved consistency across temporal frames.
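To make the two components concrete, the sketch below shows one plausible form of the disentanglement the abstract describes: a pose path (a hypothetical latent bone encoder over joint rotations) and a temporal path (a motion-trend summary of recent pose latents) jointly drive per-Gaussian offsets. This is not the authors' implementation; every module name, dimension, the GRU-based trend summarizer, and the additive fusion are assumptions for illustration.

```python
# Minimal sketch (assumed architecture, not the paper's code) of conditioning
# per-Gaussian deformation on both the current pose and its recent temporal
# trend, rather than on a single global pose vector.
import torch
import torch.nn as nn

class LatentBoneEncoder(nn.Module):
    """Compresses per-joint 6D rotations into a low-dimensional bone latent."""
    def __init__(self, num_joints=24, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_joints * 6, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, joint_rot6d):            # (B, num_joints, 6)
        return self.net(joint_rot6d.flatten(1))  # (B, latent_dim)

class MotionTrendModule(nn.Module):
    """Summarizes a window of past bone latents so clothing offsets can
    depend on motion history, not just the instantaneous pose."""
    def __init__(self, latent_dim=64, trend_dim=64):
        super().__init__()
        self.gru = nn.GRU(latent_dim, trend_dim, batch_first=True)

    def forward(self, bone_latent_seq):        # (B, T, latent_dim)
        _, h = self.gru(bone_latent_seq)
        return h[-1]                           # (B, trend_dim)

class GaussianDeformer(nn.Module):
    """Predicts a per-Gaussian position offset from pose and trend latents."""
    def __init__(self, latent_dim=64, trend_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + latent_dim + trend_dim, 128), nn.ReLU(),
            nn.Linear(128, 3),
        )

    def forward(self, xyz, bone_latent, trend_latent):  # xyz: (B, N, 3)
        B, N, _ = xyz.shape
        cond = torch.cat([bone_latent, trend_latent], dim=-1)   # (B, D)
        cond = cond[:, None, :].expand(B, N, cond.shape[-1])    # (B, N, D)
        return xyz + self.mlp(torch.cat([xyz, cond], dim=-1))

# Toy usage: a 4-frame pose window driving 1000 canonical Gaussians.
enc, trend, deform = LatentBoneEncoder(), MotionTrendModule(), GaussianDeformer()
poses = torch.randn(1, 4, 24, 6)
latents = torch.stack([enc(poses[:, t]) for t in range(4)], dim=1)  # (1, 4, 64)
xyz = torch.randn(1, 1000, 3)
deformed = deform(xyz, latents[:, -1], trend(latents))              # (1, 1000, 3)
```

Splitting the conditioning this way means two frames with the same instantaneous pose but different motion histories can produce different clothing offsets, which is precisely what a single global pose vector cannot represent.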