Do You Have Freestyle? Expressive Humanoid Locomotion via Audio Control

📅 2025-12-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-driven humanoid motion methods rely on explicit motion reconstruction and retargeting, which introduces cascaded errors, high latency, and misalignment between the audio input and the generated motion, hindering expressive, improvisational real-time responses. This paper proposes RoboPerform, the first end-to-end framework that maps raw audio directly to full-body joint trajectories of humanoid robots, eliminating the conventional reconstruction pipeline. It introduces a "motion = content + style" paradigm that treats audio as an implicit stylistic signal, designs the first retargeting-free audio-to-motion architecture, and combines a Residual Mixture-of-Experts (ResMoE) teacher policy with a diffusion-based student model to achieve disentangled style representation and cross-audio generalization. Evaluated on a physical humanoid platform, the method significantly improves physical plausibility and audio-motion alignment, enabling low-latency, high-fidelity, expressive improvisational dance and synchronized gesture generation.
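The summary names a Residual Mixture-of-Experts (ResMoE) teacher policy but does not detail its internals. The sketch below shows one plausible reading: a softmax-gated mixture of expert MLPs whose combined output is added residually to the policy features. Everything here (module names, dimensions, expert count, gating scheme) is an illustrative assumption, not the paper's implementation.

```python
# Illustrative sketch of a Residual Mixture-of-Experts (ResMoE) block.
# Assumes a soft, softmax-gated MoE whose output is added residually to
# the input features; dimensions and expert count are placeholders.
import torch
import torch.nn as nn


class ResMoEBlock(nn.Module):
    def __init__(self, dim: int = 256, num_experts: int = 4, hidden: int = 512):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)  # routing scores per expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.ELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) proprioceptive / command features.
        weights = torch.softmax(self.gate(x), dim=-1)                   # (batch, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, E, dim)
        mixed = (weights.unsqueeze(-1) * expert_out).sum(dim=1)         # soft mixture
        return x + mixed  # residual: experts learn a correction, not the full mapping


if __name__ == "__main__":
    block = ResMoEBlock()
    obs = torch.randn(8, 256)
    print(block(obs).shape)  # torch.Size([8, 256])
```

Soft gating keeps the block differentiable end to end; a sparse top-k gate would be the usual alternative when compute per step matters.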

📝 Abstract
Humans intuitively move to sound, but current humanoid robots lack expressive improvisational capabilities, remaining confined to predefined motions or sparse commands. Generating motion from audio and then retargeting it to robots relies on explicit motion reconstruction, leading to cascaded errors, high latency, and a disjointed acoustic-to-actuation mapping. We propose RoboPerform, the first unified audio-to-locomotion framework that directly generates music-driven dance and speech-driven co-speech gestures from raw audio. Guided by the core principle of "motion = content + style", the framework treats audio as an implicit style signal and eliminates the need for explicit motion reconstruction. RoboPerform integrates a ResMoE teacher policy for adapting to diverse motion patterns and a diffusion-based student policy for audio style injection. This retargeting-free design ensures low latency and high fidelity. Experimental validation shows that RoboPerform achieves promising results in physical plausibility and audio alignment, successfully transforming robots into responsive performers capable of reacting to audio.
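As a concrete reading of "motion = content + style", the toy training step below denoises a teacher joint trajectory with a diffusion-style student: the robot's state supplies the content conditioning while an audio embedding is injected as the style signal. The conditioning scheme (plain concatenation), the noise schedule, and all shapes are assumptions for illustration, not the paper's actual design.

```python
# Toy diffusion-student training step with audio-as-style conditioning.
# Shapes, network, and noise schedule are illustrative assumptions.
import torch
import torch.nn as nn


class AudioStyleDenoiser(nn.Module):
    """Predicts the noise added to a joint-trajectory chunk, conditioned on
    robot state (content) and an audio feature vector (style)."""

    def __init__(self, traj_dim: int = 29, horizon: int = 16,
                 state_dim: int = 64, audio_dim: int = 128):
        super().__init__()
        in_dim = traj_dim * horizon + state_dim + audio_dim + 1  # +1 for timestep
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.SiLU(),
            nn.Linear(512, 512), nn.SiLU(),
            nn.Linear(512, traj_dim * horizon),
        )
        self.traj_dim, self.horizon = traj_dim, horizon

    def forward(self, noisy_traj, t, state, audio):
        flat = noisy_traj.flatten(1)
        cond = torch.cat([flat, t[:, None].float(), state, audio], dim=-1)
        return self.net(cond).view(-1, self.horizon, self.traj_dim)


model = AudioStyleDenoiser()
clean = torch.randn(4, 16, 29)   # teacher joint trajectories (distillation target)
state = torch.randn(4, 64)       # robot state: the "content" condition
audio = torch.randn(4, 128)      # audio embedding: the implicit "style" condition
t = torch.randint(0, 1000, (4,))
noise = torch.randn_like(clean)
alpha = 1.0 - t.view(-1, 1, 1) / 1000.0                 # toy linear schedule
noisy = alpha.sqrt() * clean + (1 - alpha).sqrt() * noise
loss = nn.functional.mse_loss(model(noisy, t, state, audio), noise)
loss.backward()
```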
Problem

Research questions and friction points this paper is trying to address.

How to generate expressive humanoid locomotion directly from audio signals.
How to eliminate explicit motion reconstruction, the source of cascaded errors and latency.
How to enable real-time, audio-driven dance and co-speech gestures on physical robots.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified audio-to-locomotion framework eliminates motion reconstruction (see the control-loop sketch below this list).
ResMoE teacher policy adapts to diverse motion patterns.
Diffusion-based student policy injects audio style for low latency.
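To make the retargeting-free claim concrete, here is a minimal control-loop sketch: each tick, a raw audio chunk is embedded and fed, together with the robot state, straight to the distilled policy, whose output is already a joint-space target, so no human-motion reconstruction or retargeting stage sits in the loop. All function bodies, rates, and dimensions below are hypothetical placeholders.

```python
# Hedged sketch of a retargeting-free audio-to-joint control loop.
# Rates, dimensions, and both model stubs are illustrative assumptions.
import numpy as np

CONTROL_HZ = 50            # assumed joint-target rate
AUDIO_CHUNK = 1.0 / CONTROL_HZ  # seconds of audio consumed per control tick


def embed_audio(chunk: np.ndarray) -> np.ndarray:
    """Placeholder audio encoder (stands in for a learned feature extractor)."""
    return np.tanh(chunk[:128]) if chunk.size >= 128 else np.zeros(128)


def policy(state: np.ndarray, style: np.ndarray) -> np.ndarray:
    """Placeholder for the distilled student policy: state + style -> joints."""
    return 0.1 * np.tanh(state[:29] + style[:29])


state = np.zeros(64)
for tick in range(3):                                   # a few control ticks
    chunk = np.random.randn(int(16000 * AUDIO_CHUNK))   # 16 kHz mic input
    style = embed_audio(chunk)       # audio -> implicit style signal
    targets = policy(state, style)   # joint targets directly, no retargeting
    # send `targets` to the robot's joint controllers here
```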
🔎 Similar Papers
No similar papers found.
👥 Authors
Zhe Li (BAAI)
Cheng Chi (Columbia University, Stanford University) · robotics
Yangyang Wei (Harbin Institute of Technology)
Boan Zhu (Hong Kong University of Science and Technology)
Tao Huang (Shanghai Jiao Tong University)
Zhenguo Sun (BAAI)
Yibo Peng (Carnegie Mellon University) · Code Generation, Multimodal NLP, AI Agents
Pengwei Wang (University of Calgary) · Computer Science, Security
Zhongyuan Wang (BAAI)
Fangzhou Liu (Harbin Institute of Technology)
Chang Xu (University of Sydney)
Shanghang Zhang (Peking University) · Embodied AI, Foundation Models