🤖 AI Summary
This survey addresses three key challenges in speech-driven virtual human motion generation: insufficient personality expression, weak cross-modal alignment, and poor dynamic coherence. Methodologically, it reviews generative approaches spanning variational autoencoders (VAEs), generative adversarial networks (GANs), and diffusion models; covers multimodal alignment, from speech-to-motion to text-to-pose; and surveys diverse motion representations, including keypoints, neural radiance fields (NeRF), and skeletal dynamics, under a unified evaluation protocol. Its contributions include: (1) the first comprehensive review jointly covering facial and body motion generation; (2) an evaluation framework organized around realism, coherence, and expressiveness, tailored to dyadic interaction; and (3) an open-source, standardized benchmark resource cataloging 100+ methods and 30+ datasets. The survey establishes reproducible baselines, identifies six concrete future research directions, and aims to advance the practical deployment of high-fidelity, personalized, low-latency virtual human motion synthesis.
📝 Abstract
Body and face motion play an integral role in communication, conveying crucial information about the participants. Advances in generative modeling and multi-modal learning have enabled motion generation from signals such as speech, conversational context and visual cues. However, generating expressive and coherent face and body dynamics remains challenging due to the complex interplay of verbal and non-verbal cues and individual personality traits. This survey reviews body and face motion generation, covering core concepts, representation techniques, generative approaches, datasets and evaluation metrics. We highlight future directions to enhance the realism, coherence and expressiveness of avatars in dyadic settings. To the best of our knowledge, this work is the first comprehensive review to cover both body and face motion. Detailed resources are listed at https://lownish23csz0010.github.io/mogen/.