🤖 AI Summary
This work addresses three key challenges in speech-driven high-fidelity human motion generation: poor audio–motion synchronization, limited cross-domain generalization, and error accumulation in autoregressive inference. To this end, we propose the Recurrent Embedded Transformer (RET), which jointly models spatiotemporal dependencies via recurrent embedding propagation. We further introduce Dynamic Embedding Regularization (DER) to enhance robustness across diverse speaker and motion domains. Additionally, we design Iterative Reconstruction Inference (IRI), integrating classifier-free guidance with temporal smoothing to ensure kinematic continuity and natural motion dynamics. Evaluated on standard benchmarks, our approach achieves state-of-the-art performance: the Fréchet Gesture Distance (FGD) drops from 18.70 to 2.48, an 86.7% reduction. Moreover, zero-shot speech-driven motion generation exhibits substantial gains in naturalness, coherence, and generalization capability.
📝 Abstract
We present ReCoM, an efficient framework for generating high-fidelity and generalizable human body motions synchronized with speech. The core innovation lies in the Recurrent Embedded Transformer (RET), which integrates Dynamic Embedding Regularization (DER) into a Vision Transformer (ViT) core architecture to explicitly model co-speech motion dynamics. This architecture enables joint spatial-temporal dependency modeling, thereby enhancing gesture naturalness and fidelity through coherent motion synthesis. To enhance model robustness, we incorporate the proposed DER strategy, which equips the model with dual capabilities of noise resistance and cross-domain generalization, thereby improving the naturalness and fluency of zero-shot motion generation for unseen speech inputs. To mitigate inherent limitations of autoregressive inference, including error accumulation and limited self-correction, we propose an iterative reconstruction inference (IRI) strategy. IRI refines motion sequences via cyclic pose reconstruction, driven by two key components: (1) classifier-free guidance improves distribution alignment between generated and real gestures without auxiliary supervision, and (2) a temporal smoothing process eliminates abrupt inter-frame transitions while ensuring kinematic continuity. Extensive experiments on benchmark datasets validate ReCoM's effectiveness, achieving state-of-the-art performance across metrics. Notably, it reduces the Fréchet Gesture Distance (FGD) from 18.70 to 2.48, demonstrating an 86.7% improvement in motion realism. Our project page is https://yong-xie-xy.github.io/ReCoM/.
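The abstract does not give formulas for the two IRI components, so the following is only an illustrative sketch under stated assumptions. It uses the commonly cited classifier-free guidance combination (unconditional prediction plus a scaled difference toward the conditional one) and a plain centered moving average as a stand-in for the temporal smoothing process. The function names, the `scale` and `window` parameters, and the flat per-frame pose layout are all hypothetical and are not taken from ReCoM. A small helper also reproduces the reported 86.7% FGD reduction from the two stated values.

```python
from typing import List

# One frame of motion as a flat list of joint parameters.
# This layout is an assumption for illustration only.
Pose = List[float]


def cfg_combine(cond: Pose, uncond: Pose, scale: float) -> Pose:
    """Standard classifier-free guidance mix: start from the unconditional
    prediction and move `scale` times the gap toward the conditional one.
    scale = 1.0 returns the conditional prediction unchanged."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def temporal_smooth(frames: List[Pose], window: int = 3) -> List[Pose]:
    """Centered moving average over the time axis (assumed smoother, not
    the paper's). The window is truncated at the sequence boundaries."""
    half = window // 2
    dims = len(frames[0]) if frames else 0
    smoothed = []
    for t in range(len(frames)):
        lo = max(0, t - half)
        hi = min(len(frames), t + half + 1)
        span = frames[lo:hi]
        smoothed.append([sum(f[d] for f in span) / len(span) for d in range(dims)])
    return smoothed


def relative_reduction(before: float, after: float) -> float:
    """Percentage reduction of a lower-is-better metric such as FGD."""
    return (before - after) / before * 100.0


if __name__ == "__main__":
    guided = cfg_combine(cond=[1.0, 2.0], uncond=[0.0, 0.0], scale=2.0)
    print(guided)
    print(temporal_smooth([[0.0], [3.0], [0.0]], window=3))
    print(round(relative_reduction(18.70, 2.48), 1))
```

In an iterative scheme like the one described, a guided prediction would be produced per reconstruction pass and the smoother applied to the resulting sequence; how ReCoM actually orders and parameterizes these steps is not specified in the abstract.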