LiveGesture: Streamable Co-Speech Gesture Generation Model

📅 2026-04-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing co-speech gesture generation methods are predominantly offline: they cannot synthesize arbitrary-length full-body motion in real time with zero look-ahead, and they do not effectively model the dynamic interdependencies across body regions. This work proposes LiveGesture, presented as the first fully streamable framework for co-speech full-body gesture generation. It combines a Streamable Vector-Quantized Motion Tokenizer (SVQ) with a Hierarchical Autoregressive Transformer (HAR), whose region-expert autoregressive (xAR) transformers and causal spatio-temporal fusion module (xAR Fusion) provide fine-grained, cross-region motion coordination. Uncertainty-guided token masking and random region masking during training improve robustness to imperfect histories. On the BEAT2 dataset under zero look-ahead conditions, the method generates temporally coherent, diverse, beat-synchronous full-body gestures in real time, matching or surpassing state-of-the-art offline approaches.

📝 Abstract

We propose LiveGesture, the first fully streamable, speech-driven full-body gesture generation framework that operates with zero look-ahead and supports arbitrary sequence length. Unlike existing co-speech gesture methods, which are designed for offline generation and either treat body regions independently or entangle all joints within a single model, LiveGesture is built from the ground up for causal, region-coordinated motion generation. LiveGesture consists of two main modules: the Streamable Vector Quantized Motion Tokenizer (SVQ) and the Hierarchical Autoregressive Transformer (HAR). The SVQ tokenizer converts the motion sequence of each body region into causal, discrete motion tokens, enabling real-time, streamable token decoding. On top of SVQ, HAR employs region-expert autoregressive (xAR) transformers to model expressive, fine-grained motion dynamics for each body region. A causal spatio-temporal fusion module (xAR Fusion) then captures and integrates correlated motion dynamics across regions. Both xAR and xAR Fusion are conditioned on live, continuously arriving audio signals encoded by a streamable causal audio encoder. To enhance robustness under streaming noise and prediction errors, we introduce autoregressive masking training, which leverages uncertainty-guided token masking and random region masking to expose the model to imperfect, partially erroneous histories during training. Experiments on the BEAT2 dataset demonstrate that LiveGesture produces coherent, diverse, and beat-synchronous full-body gestures in real time, matching or surpassing state-of-the-art offline methods under true zero look-ahead conditions.
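The autoregressive masking training described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how uncertainty is measured, so predictive entropy is assumed here, and the `MASK` id, the function names, the masking ratio, and the region names in the test data are all hypothetical.

```python
import math
import random

MASK = -1  # hypothetical id standing in for a learned mask token


def entropy(probs):
    """Shannon entropy of one predictive distribution (assumed uncertainty measure)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def uncertainty_guided_mask(tokens, token_probs, ratio):
    """Mask the `ratio` fraction of history tokens with the highest entropy."""
    n_mask = int(len(tokens) * ratio)
    ranked = sorted(range(len(tokens)),
                    key=lambda i: entropy(token_probs[i]),
                    reverse=True)
    to_mask = set(ranked[:n_mask])
    return [MASK if i in to_mask else t for i, t in enumerate(tokens)]


def random_region_mask(region_tokens, drop_prob, rng):
    """With probability `drop_prob`, hide the whole token history of a body region."""
    return {
        name: [MASK] * len(toks) if rng.random() < drop_prob else list(toks)
        for name, toks in region_tokens.items()
    }
```

Both functions corrupt the history the model conditions on, so that during training it sees the kind of partially wrong context it would produce for itself when streaming.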

Problem

Research questions and friction points this paper is trying to address.

co-speech gesture generation

streamable model

zero look-ahead

full-body motion

real-time generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

streamable gesture generation

causal motion modeling

region-coordinated autoregression
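To make the zero look-ahead and causal modeling tags above concrete, here is a small sketch of a causal filter run in streaming mode. It is not the SVQ tokenizer or any component from the paper; the function and class names are hypothetical, and a plain 1D convolution stands in for a causal layer. The point it illustrates is that a causal operator gives the same output whether frames arrive one at a time or all at once, and that later frames never change earlier outputs.

```python
from collections import deque


def causal_conv_offline(signal, kernel):
    """Causal 1D filter over a full sequence: output[t] uses signal[t-k+1..t] only."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(signal)  # left padding only, no future frames
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(signal))]


class StreamingCausalConv:
    """Same filter, fed one frame at a time with a fixed-size history buffer."""

    def __init__(self, kernel):
        self.kernel = list(kernel)
        # The buffer starts as zeros, matching the left padding used offline.
        self.history = deque([0.0] * len(kernel), maxlen=len(kernel))

    def step(self, frame):
        self.history.append(frame)  # the oldest frame is dropped automatically
        return sum(w * x for w, x in zip(self.kernel, self.history))


kernel = [0.5, 0.3, 0.2]
signal = [1.0, 2.0, 3.0, 4.0]
stream = StreamingCausalConv(kernel)
streamed = [stream.step(x) for x in signal]

# Streaming output equals offline output, frame for frame.
assert streamed == causal_conv_offline(signal, kernel)
```

Because the buffer has a fixed length, memory use stays constant however long the input runs, which is the property that lets a causal design support arbitrary sequence length.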