🤖 AI Summary
This work addresses the insufficient semantic modeling of hand gestures in speech-driven gesture synthesis by explicitly generating hand motions with clear, instructive semantics. To this end, we introduce the first high-quality 3D semantic gesture dataset tailored to anchor-style scenarios and propose a hybrid-modality diffusion Transformer architecture. Our approach incorporates novel motion-style injection Transformer layers and a cascaded retrieval-augmented generation (RAG) mechanism. Coupled with a per-subject semantic gesture library and an adaptive audio–gesture synchronization module, it enables precise semantic gesture activation and natural temporal alignment. Experiments demonstrate that our method significantly outperforms existing approaches in semantic accuracy, generation efficiency, and synchronization quality, thereby enhancing both information conveyance fidelity and perceptual naturalness in human–computer interaction.
📝 Abstract
While increasing attention has been paid to co-speech gesture synthesis, most previous works overlook hand gestures that carry explicit and essential semantics. In this paper, we study co-speech gesture generation with an emphasis on activating specific hand gestures, which can convey more instructional information than common body movements. To achieve this, we first build a high-quality dataset of 3D human body movements that includes a set of semantically explicit hand gestures commonly used by live streamers. We then present GestureHYDRA, a hybrid-modality gesture generation system built upon a hybrid-modality diffusion transformer architecture with novel motion-style injective transformer layers, which enables advanced gesture modeling and versatile gesture operations. To guarantee that these specific hand gestures are activated, we introduce a cascaded retrieval-augmented generation strategy built upon a semantic gesture repository annotated for each subject, together with an adaptive audio–gesture synchronization mechanism, which substantially improves semantic gesture activation and production efficiency. Quantitative and qualitative experiments demonstrate that our proposed approach achieves superior performance over all counterparts. The project page can be found at https://mumuwei.github.io/GestureHYDRA/.