🤖 AI Summary
This work addresses the limitations of existing single-image 3D human reconstruction methods, which rely on rigid joint transformations and struggle to capture realistic clothing dynamics. To overcome this, we propose DynaAvatar, a novel framework that, for the first time, enables zero-shot generation of animatable 3D human avatars from a single image while accurately recovering motion-induced clothing deformations. Built upon a Transformer-based feed-forward architecture, DynaAvatar directly predicts dynamic 3D Gaussian deformations without requiring test-time optimization. The method leverages static-to-dynamic knowledge transfer, lightweight LoRA fine-tuning, a DynaFlow optical flow-guided loss, and an SMPL-X re-annotation strategy to significantly enhance the fidelity of dynamic clothing modeling. Experiments demonstrate that DynaAvatar substantially outperforms current approaches in both visual realism and generalization capability.
📝 Abstract
Existing single-image 3D human avatar methods primarily rely on rigid joint transformations, limiting their ability to model realistic cloth dynamics. We present DynaAvatar, a zero-shot framework that reconstructs animatable 3D human avatars with motion-dependent cloth dynamics from a single image. Trained on large-scale multi-person motion datasets, DynaAvatar employs a Transformer-based feed-forward architecture that directly predicts dynamic 3D Gaussian deformations without subject-specific optimization. To overcome the scarcity of dynamic captures, we introduce a static-to-dynamic knowledge transfer strategy: a Transformer pretrained on large-scale static captures provides strong geometric and appearance priors, which are efficiently adapted to motion-dependent deformations through lightweight LoRA fine-tuning on dynamic captures. We further propose the DynaFlow loss, an optical flow-guided objective that provides reliable geometric cues along motion directions for cloth dynamics in rendered space. Finally, we re-annotate missing or noisy SMPL-X fittings in existing dynamic capture datasets, most of which contain fittings too incomplete or unreliable for training high-quality 3D avatar reconstruction models. Experiments demonstrate that DynaAvatar produces visually rich and generalizable animations, outperforming prior methods.
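The abstract does not detail the fine-tuning setup. As background, the low-rank adaptation idea behind LoRA can be sketched as follows; the matrix `W` stands in for a frozen projection of the Transformer pretrained on static captures, and the rank and scaling values are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 64, 4

# Frozen pretrained weight (placeholder for a Transformer projection
# trained on static captures; DynaAvatar's actual weights are not public).
W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)

# Trainable low-rank factors. B starts at zero, so fine-tuning on
# dynamic captures begins exactly at the pretrained function.
A = rng.standard_normal((rank, d_in)) / np.sqrt(d_in)
B = np.zeros((d_out, rank))
alpha = 8.0  # LoRA scaling hyperparameter (illustrative value)

def lora_forward(x):
    """Adapted layer: y = x W^T + (alpha / rank) * x A^T B^T."""
    return x @ W.T + (alpha / rank) * (x @ A.T @ B.T)

x = rng.standard_normal((2, d_in))
# With B = 0, the adapted layer matches the frozen base exactly.
assert np.allclose(lora_forward(x), x @ W.T)
```

Only `A` and `B` receive gradients during adaptation, which is why the fine-tuning on scarce dynamic captures stays lightweight relative to retraining the full Transformer.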