DuoGesture: Neuro-Inspired and Biomechanically Informed Dual-Stream Co-Speech Gesture Generation

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of achieving both semantic expressiveness and biomechanically plausible rhythm in co-speech gesture generation. It proposes a dual-stream architecture that decomposes gestures into a semantic stream and a beat stream. The semantic stream uses motion-language conditioning, rather than purely linguistic word embeddings, to provide motion-aligned gesture priors for long-tail lexical triggers. The beat stream is regularised by an anthropometry-weighted inertial beat prior that reduces jitter and improves rhythmic consistency. The two streams are coordinated by a Semantic Variational Information Bottleneck, a stochastic frame-level gate that learns when semantic gestures should override beat motion. The method outperforms strong holistic baselines in both objective metrics and subjective evaluations, and ablation studies confirm the complementary roles of semantic grounding, stochastic stream selection, and biomechanical regularisation.
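To make the idea of frame-level stochastic stream selection concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the gate is a per-frame Bernoulli variable whose logits come from some upstream network, and it uses a KL term against a fixed sparse prior as a stand-in for the bottleneck penalty. The names `gate_streams`, `bernoulli_kl`, `gate_logits`, and the prior value `0.1` are all hypothetical.

```python
import numpy as np


def gate_streams(semantic, beat, gate_logits, rng, temperature=1.0):
    """Pick one stream per frame with a stochastic binary gate.

    semantic, beat: arrays of shape (T, D), one pose vector per frame.
    gate_logits:    array of shape (T,), assumed to come from a learned model.
    """
    # Probability that the semantic stream overrides the beat stream.
    p_sem = 1.0 / (1.0 + np.exp(-gate_logits / temperature))
    # Sample a hard per-frame decision from Bernoulli(p_sem).
    use_sem = rng.random(p_sem.shape) < p_sem
    # Broadcast the (T,) decision over the pose dimension.
    mixed = np.where(use_sem[:, None], semantic, beat)
    return mixed, use_sem, p_sem


def bernoulli_kl(p, prior=0.1, eps=1e-8):
    """Per-frame KL(Bernoulli(p) || Bernoulli(prior)).

    Assumption: a sparse prior keeps semantic frames rare, which is one
    plausible reading of an information-bottleneck penalty on the gate.
    """
    p = np.clip(p, eps, 1.0 - eps)
    return p * np.log(p / prior) + (1.0 - p) * np.log((1.0 - p) / (1.0 - prior))
```

In training, a hard sample like this would need a differentiable relaxation (for example a Gumbel-softmax or straight-through estimator); the sketch only illustrates the selection step at inference time.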

📝 Abstract

Co-speech gesture generation requires both semantic expressivity and biomechanically plausible rhythmic motion. Existing holistic gesture models mix lexically grounded semantic gestures with frequent prosody-aligned beat gestures. This limits semantic grounding, speech-motion alignment, and kinematic smoothness. We propose DuoGesture, a neuro-inspired and biomechanically informed dual-stream approach that decomposes co-speech gesture synthesis into coupled semantic and beat streams. The two streams are coordinated by a Semantic Variational Information Bottleneck, a stochastic frame-level gate that learns when semantic gestures should override rhythmic beat motion. The semantic stream is controlled by Motion-Grounded Semantic Conditioning, which replaces purely linguistic word embeddings with motion-language representations to provide motion-aligned semantic priors for long-tailed lexical triggers of gestures. The beat stream is further regularised by an Inertial Beat Prior, an anthropometry-weighted arm-chain module that reduces jitter and improves rhythmic consistency without constraining semantic frames. Objective evaluations and subjective experiments show that DuoGesture outperforms strong holistic baselines, while component ablations confirm the complementary roles of semantic grounding, stochastic stream selection, and biomechanical regularisation.
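One way to picture a jitter-reducing prior that leaves semantic frames unconstrained is a weighted acceleration penalty applied only where the beat stream is active. The sketch below is an illustration under stated assumptions, not the paper's Inertial Beat Prior: the function name `inertial_beat_penalty`, the three-joint arm chain, and the weights in `ARM_WEIGHTS` are hypothetical placeholders rather than anthropometric values from the work.

```python
import numpy as np

# Hypothetical relative weights for one arm chain (shoulder, elbow, wrist).
# Placeholder numbers for illustration only, not taken from the paper.
ARM_WEIGHTS = np.array([0.5, 0.3, 0.2])


def inertial_beat_penalty(joints, beat_mask, weights=ARM_WEIGHTS):
    """Weighted squared-acceleration penalty restricted to beat frames.

    joints:    array of shape (T, J, 3), joint positions over time.
    beat_mask: boolean array of shape (T,), True where the beat stream
               (not the semantic stream) drives the frame.
    """
    # Second-order finite difference approximates acceleration: (T-2, J, 3).
    accel = joints[2:] - 2.0 * joints[1:-1] + joints[:-2]
    # Squared acceleration magnitude per joint: (T-2, J).
    sq = np.sum(accel ** 2, axis=-1)
    # Weight each joint of the chain and sum: (T-2,).
    per_frame = sq @ weights
    # Keep only windows whose three frames are all beat frames, so that
    # semantic frames never contribute to the penalty.
    window_is_beat = beat_mask[2:] & beat_mask[1:-1] & beat_mask[:-2]
    if not window_is_beat.any():
        return 0.0
    return float(per_frame[window_is_beat].mean())
```

Under this formulation, constant-velocity motion incurs no penalty while frame-to-frame jitter does, and any window that touches a semantic frame is excluded from the average.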

Problem

Research questions and friction points this paper is trying to address.

co-speech gesture generation

semantic grounding

beat gestures

biomechanical plausibility

speech-motion alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

dual-stream architecture

semantic-beat decomposition

motion-grounded conditioning