SmoothSync: Dual-Stream Diffusion Transformers for Jitter-Robust Beat-Synchronized Gesture Generation from Quantized Audio

📅 2026-01-04
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a dual-stream diffusion Transformer architecture based on quantized audio tokens to address key limitations in existing speech-driven gesture generation methods, including temporal misalignment, motion jitter, foot sliding, and insufficient diversity. By fusing audio and motion features, the model achieves high temporal synchronization, while a dedicated jitter suppression loss enhances motion smoothness. Diversity is further improved through probabilistic audio quantization. To better evaluate generation quality under real-world conditions, the authors introduce Smooth-BC, a noise-robust evaluation metric. Experiments on the BEAT2 dataset demonstrate significant improvements over state-of-the-art methods: a 30.6% reduction in FGD, a 10.3% gain in Smooth-BC, an 8.4% increase in diversity, and substantial reductions of 62.9% and 17.1% in jitter and foot sliding, respectively.

📝 Abstract
Co-speech gesture generation is a critical area of research aimed at synthesizing speech-synchronized, human-like gestures. Existing methods often suffer from issues such as rhythmic inconsistency, motion jitter, foot sliding, and limited multi-sampling diversity. In this paper, we present SmoothSync, a novel framework that leverages quantized audio tokens in a dual-stream Diffusion Transformer (DiT) architecture to synthesize holistic gestures and enhance sampling variation. Specifically, we (1) fuse audio-motion features via complementary transformer streams to achieve superior synchronization, (2) introduce a jitter-suppression loss to improve temporal smoothness, and (3) implement probabilistic audio quantization to generate distinct gesture sequences from identical inputs. To reliably evaluate beat synchronization under jitter, we introduce Smooth-BC, a robust variant of the beat consistency metric that is less sensitive to motion noise. Comprehensive experiments on the BEAT2 and SHOW datasets demonstrate SmoothSync's superiority over state-of-the-art methods: on BEAT2 it reduces FGD by 30.6% and improves Smooth-BC by 10.3% and Diversity by 8.4%, while cutting jitter and foot sliding by 62.9% and 17.1%, respectively. The code will be released to facilitate future research.
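The paper does not spell out the form of its jitter-suppression loss, but a common way to penalize frame-level jitter in motion sequences is to minimize second-order temporal differences (a finite-difference acceleration penalty). The sketch below is an illustrative assumption of that general idea, not the paper's actual loss; the function name `jitter_loss` is hypothetical.

```python
import numpy as np

def jitter_loss(motion: np.ndarray) -> float:
    """Penalize high-frequency jitter via second-order temporal differences.

    motion: array of shape (T, D) -- T frames of D-dimensional pose features.
    Returns the mean squared finite-difference acceleration.
    """
    # Second-order difference approximates per-frame acceleration.
    accel = motion[2:] - 2 * motion[1:-1] + motion[:-2]
    return float(np.mean(accel ** 2))

# A constant-velocity trajectory has (near-)zero acceleration, so the loss
# is near zero; adding per-frame noise raises it.
smooth = np.linspace(0.0, 1.0, 10)[:, None]  # (10, 1), constant velocity
noisy = smooth + 0.1 * np.random.default_rng(0).standard_normal(smooth.shape)
```

During training, a term like this would typically be added to the diffusion objective with a small weight, trading a little expressiveness for temporal smoothness.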
Problem

Research questions and friction points this paper is trying to address.

co-speech gesture generation
motion jitter
rhythmic inconsistency
foot sliding
sampling diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-Stream Diffusion Transformer
Jitter Suppression
Probabilistic Audio Quantization
Beat-Synchronized Gesture Generation
Smooth-BC
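The abstract describes Smooth-BC only as a beat-consistency variant that is less sensitive to motion noise. One plausible realization, sketched below purely as an assumption (the paper's exact definition may differ), is to low-pass filter the velocity curve before detecting motion beats as velocity minima, then apply the standard Gaussian-weighted beat-consistency score; the function name `smooth_bc` and all parameters are hypothetical.

```python
import numpy as np

def smooth_bc(audio_beats, motion, fps=30.0, window=5, sigma=0.1):
    """Beat consistency scored on a low-pass-filtered velocity curve."""
    # Per-frame velocity magnitude of the (T, D) pose sequence.
    vel = np.linalg.norm(np.diff(motion, axis=0), axis=1)
    # Moving-average filter suppresses frame-level jitter before beat detection.
    vel_smooth = np.convolve(vel, np.ones(window) / window, mode="same")
    # Motion beats = strict local minima of the smoothed velocity.
    minima = np.where((vel_smooth[1:-1] < vel_smooth[:-2]) &
                      (vel_smooth[1:-1] < vel_smooth[2:]))[0] + 1
    if minima.size == 0:
        return 0.0
    beat_times = minima / fps
    # Gaussian-weighted distance from each audio beat to its nearest motion beat.
    dists = np.abs(np.asarray(audio_beats)[:, None] - beat_times[None, :]).min(axis=1)
    return float(np.mean(np.exp(-dists ** 2 / (2 * sigma ** 2))))

# Toy example: a 1-D trajectory that pauses (velocity dips) near 1.0 s and 2.0 s,
# with audio beats placed at those pauses -- alignment should score near 1.
speeds = np.ones(90)
speeds[28:33] = 0.0
speeds[58:63] = 0.0
motion = np.cumsum(speeds)[:, None]
score = smooth_bc([29 / 30, 59 / 30], motion)
```

Filtering before beat detection is what makes the metric robust: spurious velocity minima caused by jitter are smoothed away, so only genuine gesture pauses count as motion beats.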
Yujiao Jiang
Shandong University, Weihai
Qingmin Liao
Shenzhen International Graduate School, Tsinghua University, Shenzhen, 518055, China
Zongqing Lu
Peking University | BeingBeyond