🤖 AI Summary
Existing approaches struggle to generate high-quality, coherent full-body motion from streaming audio, covering both speech and music, with low latency. This work proposes a unified streaming architecture for audio-to-motion generation that operates without domain labels or mode switching, continuously producing temporally coherent 3D character animation from incremental audio input. The method uses reinforcement learning to refine the quality of online generation, exposes a tool-call interface through which upstream large language models can inject explicit semantic control, and relies on a training strategy that enforces strong audio dependency so that the model generalizes across conversational speech and rhythmic music. Experiments show that the system outperforms state-of-the-art real-time baselines in motion quality and audio-motion synchronization while remaining flexible enough for live deployment as an interactive avatar.
📝 Abstract
Real-time synthesis of high-fidelity 3D character motion from audio is a pivotal component for next-generation interactive avatars and virtual assistants. However, most existing approaches are limited to offline processing of complete audio sequences or are constrained to specific domains, rarely handling both speech and music effectively. In this paper, we introduce a novel framework designed to generate continuous, coherent full-body motion from streaming speech and music with low latency. Central to our approach is a unified streaming architecture capable of synthesizing continuous motion from incremental audio inputs. We employ a robust training strategy that enforces strong audio dependency, allowing the model to generalize seamlessly across conversational speech and rhythmic music without requiring explicit domain labels or mode switching. Additionally, we explore reinforcement learning to refine the quality of online generation. Furthermore, we bridge reactive animation with intent-driven behavior via a tool-call interface that allows upstream Large Language Models to inject explicit semantic control. By combining this controllability with streaming audio-driven synthesis, our framework serves as a plug-and-play solution for transforming voice agents into interactive humanoid avatars. Extensive experiments demonstrate that our method outperforms state-of-the-art real-time baselines in motion quality and synchronization while maintaining the flexibility required for live deployment. Our code, pre-trained models, and videos are available at https://robinwitch.github.io/EchoAvatar-Page.