DiscoForcing: A Unified Framework for Real-Time Audio-Driven Character Control with Diffusion Forcing

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses abrupt rhythmic transitions and non-stationary inputs in real-time audio-driven full-body character animation by proposing a streaming generation framework built on diffusion forcing. The approach integrates a causal music encoder, a sequence diffusion model trained with heterogeneous noise levels across the temporal horizon, and a history-guided sampling strategy that explicitly balances responsiveness against long-horizon motion consistency while keeping latency low. Under matched causality and latency constraints, the authors report more stable long-horizon rollouts and sharper audio-motion alignment than prior baselines, delivered in an end-to-end real-time interactive system.

📝 Abstract

We study real-time audio-responsive character control as a deployment-faithful problem: strictly causal, bounded-latency streaming that must generate coherent full-body motion at interactive frame rates while the audio condition can change abruptly, including tempo shifts, drops, or user edits. Prior music-to-motion systems are largely optimized for offline generation with global context, and degrade in streaming rollouts where conditioning history becomes stale or unreliable. We introduce DiscoForcing, a streaming audio-driven diffusion framework that combines a causal music encoder that captures rhythmic structure and phase dynamics with a diffusion-forcing sequence model trained under heterogeneous noise levels across the temporal horizon. Building on this, we design a hybrid temporal schedule and a history-guided streaming sampler to explicitly trade off responsiveness against long-horizon consistency under non-stationary audio. Implemented in an end-to-end real-time interactive system with online avatar playback and humanoid deployment workflows, DiscoForcing delivers more stable long-horizon rollouts and sharper audio-motion alignment than prior baselines under matched causality and latency constraints while maintaining real-time throughput.
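As a rough illustration of what "heterogeneous noise levels across the temporal horizon" can mean during training, the sketch below draws an independent noise level for each frame of a motion clip, which is the core idea of diffusion forcing. It is not taken from the paper: the cosine schedule, the function names `alpha_bar` and `noise_per_frame`, and the uniform sampling of levels are all assumptions chosen for illustration.

```python
import math
import random

def alpha_bar(step, num_steps, s=0.008):
    """Cosine schedule (an assumed choice): fraction of signal kept at a noise step."""
    def f(u):
        return math.cos((u + s) / (1 + s) * math.pi / 2) ** 2
    # Clamp to [0, 1] to guard against floating-point drift.
    return min(1.0, max(0.0, f(step / num_steps) / f(0.0)))

def noise_per_frame(motion, num_steps, rng):
    """Corrupt each frame with its own, independently sampled noise level.

    motion: list of frames, each a list of floats (hypothetical pose features).
    Returns (noisy_frames, levels), where levels[i] is the step used for frame i.
    """
    noisy, levels = [], []
    for frame in motion:
        k = rng.randint(0, num_steps)  # inclusive on both ends
        a = alpha_bar(k, num_steps)
        noisy.append([
            math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for x in frame
        ])
        levels.append(k)
    return noisy, levels

if __name__ == "__main__":
    rng = random.Random(0)
    clip = [[0.1 * t, -0.1 * t] for t in range(8)]  # toy 8-frame, 2-D "motion"
    noisy_clip, levels = noise_per_frame(clip, num_steps=10, rng=rng)
    print(levels)
```

Because every frame carries its own level, a model trained this way can, at sampling time, treat past frames as nearly clean and future frames as noisy, which is the kind of per-frame schedule a streaming sampler would exploit.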

Problem

Research questions and friction points this paper is trying to address.

real-time audio-driven character control

causal streaming

non-stationary audio

bounded-latency motion generation

audio-motion alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Forcing

Causal Audio Encoding

Streaming Motion Generation
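The strict-causality requirement behind the causal audio encoding can be illustrated with a left-padded 1-D convolution, in which each output depends only on the current and earlier input samples. This is a generic sketch, not the encoder from the paper, whose architecture is not described on this page; `causal_conv1d` and the toy kernel are hypothetical.

```python
def causal_conv1d(x, kernel):
    """Causal 1-D convolution: output[t] uses only x[t-K+1 .. t].

    x: list of floats (e.g. one audio feature over time).
    kernel: list of K weights; kernel[-1] multiplies the current sample.
    """
    k = len(kernel)
    # Left-pad with zeros so no future sample is ever read.
    padded = [0.0] * (k - 1) + list(x)
    return [
        sum(w * v for w, v in zip(kernel, padded[t:t + k]))
        for t in range(len(x))
    ]

if __name__ == "__main__":
    signal = [float(t) for t in range(8)]
    edited = signal[:5] + [v + 100.0 for v in signal[5:]]  # abrupt change at t = 5
    kernel = [0.25, 0.25, 0.5]
    before = causal_conv1d(signal, kernel)
    after = causal_conv1d(edited, kernel)
    # Outputs before the change are unaffected by what happens later.
    print(before[:5] == after[:5])
```

An encoder built from layers with this property can run on a live audio stream, since an abrupt change such as a tempo shift or a user edit affects only outputs from that moment onward.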