UMo: Unified Sparse Motion Modeling for Real-Time Co-Speech Avatars

📅 2026-05-14

🤖 AI Summary
Existing methods for speech-driven virtual avatars struggle to achieve effective multimodal fusion, high-quality generation, and real-time performance at the same time. This work proposes a unified sparse motion modeling framework that represents text, audio, and motion sequences as a single token stream for joint modeling. It combines a spatially sparse mixture-of-experts mechanism with a temporally sparse, keyframe-centric strategy to make temporal modeling efficient. Using the unified token representation, keyframe-driven sparse attention, and a multi-stage training pipeline with audio augmentation, the method improves animation fidelity and temporal coherence under strict latency constraints. The authors report real-time, high-fidelity co-speech gesture generation that outperforms state-of-the-art methods in both quantitative metrics and qualitative evaluations.
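One common way to place several modalities in a single token stream is to give each modality its own slice of a shared ID space. The sketch below illustrates that idea only; the codebook sizes and the helper names `build_offsets`, `to_unified`, and `from_unified` are hypothetical and are not taken from the paper, which may tokenize quite differently.

```python
from typing import Dict, List, Tuple

# Hypothetical per-modality codebook sizes (assumed for illustration only).
VOCAB_SIZES: Dict[str, int] = {"text": 1000, "audio": 512, "motion": 256}


def build_offsets(vocab_sizes: Dict[str, int]) -> Dict[str, int]:
    """Assign each modality a contiguous, non-overlapping range of shared IDs."""
    offsets: Dict[str, int] = {}
    start = 0
    for name, size in vocab_sizes.items():
        offsets[name] = start
        start += size
    return offsets


OFFSETS = build_offsets(VOCAB_SIZES)


def to_unified(segments: List[Tuple[str, List[int]]]) -> List[int]:
    """Flatten (modality, local_ids) segments into one stream of shared IDs."""
    stream: List[int] = []
    for modality, ids in segments:
        size = VOCAB_SIZES[modality]
        for local_id in ids:
            if not 0 <= local_id < size:
                raise ValueError(f"{modality} id {local_id} out of range")
            stream.append(OFFSETS[modality] + local_id)
    return stream


def from_unified(token: int) -> Tuple[str, int]:
    """Map a shared ID back to its (modality, local_id) pair."""
    for modality, start in OFFSETS.items():
        if start <= token < start + VOCAB_SIZES[modality]:
            return modality, token - start
    raise ValueError(f"token {token} is outside the shared vocabulary")
```

With this layout a single sequence model can read and emit text, audio, and motion tokens through one embedding table, and `from_unified` recovers which decoder each predicted token belongs to.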
📝 Abstract
Speech-driven gestures and facial animations are fundamental to expressive digital avatars in games, virtual production, and interactive media. However, existing methods either rely on a single modality for audio-motion alignment, and so fail to exploit large-scale human motion data, or are constrained by the representational capacity and throughput of multimodal models, which makes it difficult to achieve high-quality motion generation in real time. We present UMo, a unified sparse motion modeling architecture for real-time co-speech avatars that processes text, audio, and motion tokens within a unified formulation. Leveraging a spatially sparse Mixture-of-Experts framework and a temporally sparse, keyframe-centric design, UMo efficiently performs real-time dense reconstruction, enabling temporally coherent, high-fidelity animation of both facial expressions and gestures. Furthermore, we adopt a multi-stage training strategy with targeted audio augmentation to enhance acoustic diversity and semantic consistency. Consequently, UMo preserves fine-grained speech-motion alignment even under strict latency constraints. Extensive quantitative and qualitative evaluations show that UMo achieves higher output quality than prior methods under low-latency, real-time constraints, offering a practical solution for high-fidelity real-time co-speech avatars.
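The sparsity in a Mixture-of-Experts layer comes from routing each token to only a few experts. The following is a minimal sketch of generic top-k routing in plain Python, not UMo's actual layer: the function names, the choice of `k=2`, and the renormalization of gate weights over the selected experts are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalize their weights."""
    order = sorted(range(len(gate_logits)),
                   key=lambda i: gate_logits[i], reverse=True)
    chosen = order[:k]
    weights = softmax([gate_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_forward(token: Vector, gate_logits: Sequence[float],
                experts: Sequence[Expert], k: int = 2) -> Vector:
    """Run only the selected experts and mix their outputs by gate weight."""
    out = [0.0] * len(token)
    for idx, weight in top_k_route(gate_logits, k):
        expert_out = experts[idx](token)  # the other experts are never evaluated
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out
```

Because only `k` of the experts run per token, compute per token stays roughly constant as the number of experts grows, which is the usual argument for pairing MoE layers with tight latency budgets.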
Problem

Research questions and friction points this paper is trying to address.

speech-driven gestures
real-time avatars
multimodal motion modeling
high-fidelity animation
low-latency generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Sparse Motion Modeling
Mixture-of-Experts
Keyframe-Centric Design
Real-Time Co-Speech Avatars
Multimodal Tokenization