DiscoForcing: A Unified Framework for Real-Time Audio-Driven Character Control with Diffusion Forcing

📅 2026-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work targets real-time, audio-driven full-body character animation, where abrupt rhythmic transitions and non-stationary inputs make streaming generation difficult. It proposes a streaming framework built on diffusion forcing that combines a causal music encoder, a sequence diffusion model trained with heterogeneous noise levels across the temporal horizon, and a history-guided sampling strategy, so that responsiveness can be traded off explicitly against long-horizon motion consistency at low latency. Under matched causality and latency constraints, the system is reported to produce more stable long-horizon rollouts and sharper audio-motion alignment than prior baselines, while sustaining real-time, end-to-end interaction.
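To make the "causal" constraint on the audio encoder concrete, here is a minimal sketch of a causal 1-D convolution over a per-frame audio feature. It is an illustration only, not the paper's encoder: the function name `causal_conv1d`, the kernel weights, and the toy `energy` signal are all assumptions. The property it shows is that the output at frame t depends only on frames up to t, so editing future audio cannot change features that were already emitted.

```python
from typing import List, Sequence


def causal_conv1d(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Causal 1-D convolution (hypothetical helper).

    kernel[0] weights the current sample and kernel[k] the sample k frames
    in the past. Samples before the start of the stream count as zero.
    """
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if t - k >= 0:  # never look at frames after t
                acc += w * signal[t - k]
        out.append(acc)
    return out


# Toy per-frame audio energy with an abrupt onset at frame 4 (assumed data).
energy = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
kernel = [0.5, 0.3, 0.2]
features = causal_conv1d(energy, kernel)

# Simulate a user edit to a future frame and recompute.
edited = list(energy)
edited[6] = 5.0
features_edited = causal_conv1d(edited, kernel)

# Frames 0..5 are unaffected by the edit at frame 6.
print(features[:6] == features_edited[:6])
```

The onset at frame 4 shows up in the feature at frame 4 itself, without waiting for later audio, which is the behavior a bounded-latency system needs.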
📝 Abstract
We study real-time audio-responsive character control as a deployment-faithful problem: strictly causal, bounded-latency streaming that must generate coherent full-body motion at interactive frame rates while the audio condition can change abruptly, including tempo shifts, drops, or user edits. Prior music-to-motion systems are largely optimized for offline generation with global context, and degrade in streaming rollouts where conditioning history becomes stale or unreliable. We introduce DiscoForcing, a streaming audio-driven diffusion framework that combines a causal music encoder that captures rhythmic structure and phase dynamics with a diffusion-forcing sequence model trained under heterogeneous noise levels across the temporal horizon. Building on this, we design a hybrid temporal schedule and a history-guided streaming sampler to explicitly trade off responsiveness against long-horizon consistency under non-stationary audio. Implemented in an end-to-end real-time interactive system with online avatar playback and humanoid deployment workflows, DiscoForcing delivers more stable long-horizon rollouts and sharper audio-motion alignment than prior baselines under matched causality and latency constraints while maintaining real-time throughput.
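The idea of training "under heterogeneous noise levels across the temporal horizon" can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the variance-preserving mixing rule, the helper names (`noise_frames`, `sample_training_levels`, `streaming_levels`), and the linear ramp used at inference are all hypothetical, and the paper's hybrid temporal schedule may differ. The point is only that each frame carries its own noise level, so past frames can be kept nearly clean while future frames remain noisy.

```python
import math
import random
from typing import List


def noise_frames(frames: List[List[float]], levels: List[float],
                 rng: random.Random) -> List[List[float]]:
    """Noise each frame with its own level s in [0, 1].

    Assumed mixing rule: x_noisy = sqrt(1 - s) * x + sqrt(s) * eps,
    with eps drawn from a standard normal distribution.
    """
    noisy = []
    for frame, s in zip(frames, levels):
        a, b = math.sqrt(1.0 - s), math.sqrt(s)
        noisy.append([a * v + b * rng.gauss(0.0, 1.0) for v in frame])
    return noisy


def sample_training_levels(num_frames: int, rng: random.Random) -> List[float]:
    """Training-time sketch: every frame draws an independent noise level."""
    return [rng.random() for _ in range(num_frames)]


def streaming_levels(num_history: int, num_future: int) -> List[float]:
    """Inference-time sketch: clean history, linearly rising noise ahead."""
    ramp = [(i + 1) / num_future for i in range(num_future)]
    return [0.0] * num_history + ramp


rng = random.Random(0)
motion = [[1.0, 2.0] for _ in range(5)]  # 5 toy frames, 2 pose values each

train_levels = sample_training_levels(len(motion), rng)
train_input = noise_frames(motion, train_levels, rng)

stream_levels = streaming_levels(num_history=2, num_future=3)
stream_input = noise_frames(motion, stream_levels, rng)
print(stream_levels)
```

In this sketch the two history frames pass through unchanged, while the last future frame is pure noise. Shifting where the ramp begins is one simple way to picture the trade-off between responsiveness and long-horizon consistency that the abstract describes.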
Problem

Research questions and friction points this paper is trying to address.

real-time audio-driven character control
causal streaming
non-stationary audio
bounded-latency motion generation
audio-motion alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Forcing
Causal Audio Encoding
Streaming Motion Generation
Real-Time Character Control
Non-stationary Audio Conditioning