FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation

📅 2025-12-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address poor motion coherence, inaccurate temporal alignment, and insufficient modeling of the true motion distribution in text-driven real-time human motion generation, this paper proposes FloodDiffusion, a diffusion-based framework. It applies diffusion forcing to streaming motion generation, trains with bidirectional attention to capture long-range temporal dependencies, enforces causality via a lower-triangular time scheduling scheme, and injects text conditioning in a continuous, time-varying manner. Together, these changes enable low-latency, high-fidelity motion generation that stays aligned with the changing text prompts. On the HumanML3D benchmark, FloodDiffusion achieves an FID of 0.057, which the authors report as state of the art for streaming motion generation, while supporting real-time inference.

📝 Abstract

We present FloodDiffusion, a new framework for text-driven, streaming human motion generation. Given time-varying text prompts, FloodDiffusion generates text-aligned, seamless motion sequences with real-time latency. Unlike existing methods that rely on chunk-by-chunk generation or an auto-regressive model with a diffusion head, we adopt a diffusion forcing framework to model this time-series generation task under time-varying control events. We find that a straightforward implementation of vanilla diffusion forcing (as proposed for video models) fails to model real motion distributions. We demonstrate that, to guarantee modeling of the output distribution, vanilla diffusion forcing must be tailored to: (i) train with bi-directional attention instead of causal attention; (ii) implement a lower-triangular time scheduler instead of a random one; (iii) introduce text conditioning in a continuous, time-varying way. With these improvements, we demonstrate for the first time that a diffusion forcing-based framework achieves state-of-the-art performance on the streaming motion generation task, reaching an FID of 0.057 on the HumanML3D benchmark. Models, code, and weights are available: https://shandaai.github.io/FloodDiffusion/
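
The abstract does not spell out the scheduler, so the following is only a minimal sketch of one way a "lower-triangular" per-frame noise schedule could look, not the authors' implementation. The function name `lower_triangular_schedule`, the `window` parameter, and the linear ramp are all assumptions made for illustration. Each frame gets its own noise level (as in diffusion forcing), frame `i` starts denoising at streaming step `i`, and it becomes clean `window` steps later.

```python
from typing import List


def lower_triangular_schedule(num_frames: int, window: int) -> List[List[float]]:
    """Per-frame noise levels over streaming steps (hypothetical illustration).

    Row s is streaming step s, column i is frame i.
    1.0 means pure noise, 0.0 means fully denoised.
    """
    if num_frames < 1 or window < 1:
        raise ValueError("num_frames and window must be positive")

    # The last frame starts at step num_frames - 1 and needs `window` more steps.
    num_steps = num_frames + window
    schedule = []
    for step in range(num_steps):
        row = []
        for frame in range(num_frames):
            # Assumed linear ramp: noise drops by 1/window per step once
            # the frame has entered the active window.
            level = 1.0 - (step - frame) / window
            row.append(min(1.0, max(0.0, level)))
        schedule.append(row)
    return schedule


if __name__ == "__main__":
    for step, row in enumerate(lower_triangular_schedule(num_frames=4, window=2)):
        print(step, row)
```

In this sketch, every entry on or above the diagonal of the step-by-frame matrix is pure noise, so all denoising progress sits in the lower triangle, and within any single step earlier frames are never noisier than later ones. A random scheduler, by contrast, would draw each frame's noise level independently.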

Problem

Research questions and friction points this paper is trying to address.

Generates seamless human motion from time-varying text prompts

Improves diffusion forcing for accurate motion distribution modeling

Achieves state-of-the-art streaming motion generation with real-time latency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses diffusion forcing for time-varying text prompts

Tailors attention, scheduler, and conditioning for motion generation

Achieves state-of-the-art performance with real-time latency

🔎 Similar Papers

SMCD: High Realism Motion Style Transfer via Mamba-based Diffusion