ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer

📅 2025-03-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses key challenges in speech-driven high-fidelity human motion generation—namely, poor audio–motion synchronization, limited cross-domain generalization, and error accumulation in autoregressive inference. To this end, we propose the Recurrent Embedded Transformer (RET), which jointly models spatiotemporal dependencies via recurrent embedding propagation. We further introduce Dynamic Embedding Regularization (DER) to enhance robustness across diverse speaker and motion domains. Additionally, we design Iterative Reconstruction Inference (IRI), integrating classifier-free guidance with temporal smoothing to ensure kinematic continuity and natural motion dynamics. Evaluated on standard benchmarks, our approach achieves state-of-the-art performance: the Fréchet Gesture Distance improves by 86.7% (from 18.70 to 2.48). Moreover, zero-shot speech-driven motion generation exhibits substantial gains in naturalness, coherence, and generalization capability.
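To make the headline metric concrete, here is a minimal sketch of a Fréchet distance between two sets of feature vectors, plus the arithmetic behind the reported 86.7% figure. It assumes the usual definition of a Fréchet distance between Gaussians fitted to each feature set. The feature extractor used by the benchmark is not described in this summary, so random vectors stand in for gesture features, and the names `frechet_distance` and `relative_reduction` are illustrative, not taken from the paper.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets (rows = samples)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr(sqrt(cov_a @ cov_b)) equals the sum of square roots of the product's
    # eigenvalues, which are real and non-negative for PSD covariances.
    # Clip tiny negative values caused by round-off before taking the root.
    eigvals = np.linalg.eigvals(cov_a @ cov_b).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

def relative_reduction(before, after):
    """Fraction by which a lower-is-better score dropped."""
    return (before - after) / before

# Placeholder features (assumption): stand-ins for real and generated gesture embeddings.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 8))
shifted = real + 1.0  # same covariance, mean moved by 1 in each of 8 dimensions

same_score = frechet_distance(real, real)
shift_score = frechet_distance(real, shifted)
fgd_gain = relative_reduction(18.70, 2.48)
```

Identical feature sets score close to zero, and a pure mean shift contributes only the squared distance between the means, which is why a lower value indicates generated motion that is statistically closer to real motion.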

📝 Abstract

We present ReCoM, an efficient framework for generating high-fidelity and generalizable human body motions synchronized with speech. The core innovation lies in the Recurrent Embedded Transformer (RET), which integrates Dynamic Embedding Regularization (DER) into a Vision Transformer (ViT) core architecture to explicitly model co-speech motion dynamics. This architecture enables joint spatial-temporal dependency modeling, thereby enhancing gesture naturalness and fidelity through coherent motion synthesis. To enhance model robustness, we incorporate the proposed DER strategy, which equips the model with dual capabilities of noise resistance and cross-domain generalization, thereby improving the naturalness and fluency of zero-shot motion generation for unseen speech inputs. To mitigate inherent limitations of autoregressive inference, including error accumulation and limited self-correction, we propose an iterative reconstruction inference (IRI) strategy. IRI refines motion sequences via cyclic pose reconstruction, driven by two key components: (1) classifier-free guidance improves distribution alignment between generated and real gestures without auxiliary supervision, and (2) a temporal smoothing process eliminates abrupt inter-frame transitions while ensuring kinematic continuity. Extensive experiments on benchmark datasets validate ReCoM's effectiveness, achieving state-of-the-art performance across metrics. Notably, it reduces the Fréchet Gesture Distance (FGD) from 18.70 to 2.48, demonstrating an 86.7% improvement in motion realism. Our project page is https://yong-xie-xy.github.io/ReCoM/.
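The two components named for IRI can be illustrated with a small sketch. It uses the standard classifier-free guidance combination and a plain moving average as the smoothing step. Both are assumptions: the abstract does not give the guidance scale, the exact quantity the guidance is applied to, or the smoothing filter, and the names `cfg_combine` and `temporal_smooth` are hypothetical.

```python
import numpy as np

def cfg_combine(cond_pred, uncond_pred, scale):
    """Standard classifier-free guidance: move the unconditional
    prediction toward (or past) the speech-conditioned one."""
    return uncond_pred + scale * (cond_pred - uncond_pred)

def temporal_smooth(motion, window=5):
    """Moving average along the time axis of a (frames, dims) array.
    Edge padding keeps the number of frames unchanged."""
    if window % 2 == 0:
        raise ValueError("window must be odd")
    pad = window // 2
    padded = np.pad(motion, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    columns = [np.convolve(padded[:, d], kernel, mode="valid")
               for d in range(motion.shape[1])]
    return np.stack(columns, axis=1)

# Toy stand-ins (assumption): random arrays in place of model predictions.
rng = np.random.default_rng(0)
cond = rng.normal(size=(30, 6))    # prediction given speech
uncond = rng.normal(size=(30, 6))  # prediction with speech dropped
guided = cfg_combine(cond, uncond, scale=2.0)
smoothed = temporal_smooth(guided, window=5)

def mean_frame_jump(motion):
    """Average absolute change between consecutive frames."""
    return float(np.abs(np.diff(motion, axis=0)).mean())
```

With a scale of 1 the guided output equals the conditional prediction, and larger scales extrapolate further from the unconditional one. The smoothing step trades some sharpness for smaller frame-to-frame jumps, which is the effect the abstract attributes to its temporal smoothing process.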

Problem

Research questions and friction points this paper is trying to address.

Generating realistic human motions synchronized with speech

Enhancing gesture naturalness and fidelity via coherent motion synthesis

Improving zero-shot motion generation for unseen speech inputs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Recurrent Embedded Transformer for motion dynamics

Dynamic Embedding Regularization enhances robustness

Iterative reconstruction inference improves motion quality

🔎 Similar Papers

LaMP: Language-Motion Pretraining for Motion Generation, Retrieval, and Captioning