🤖 AI Summary
This work addresses the insufficient semantic modeling of hand gestures in speech-driven gesture synthesis by explicitly generating hand motions with clear, instructive semantics. To this end, we introduce the first high-quality 3D semantic gesture dataset tailored to anchor-style scenarios and propose a hybrid-modality diffusion Transformer architecture. Our approach incorporates novel motion-style injection Transformer layers and a cascaded retrieval-augmented generation (RAG) mechanism. Coupled with a per-subject semantic gesture library and an adaptive audio–gesture synchronization module, it enables precise semantic gesture activation and natural temporal alignment. Experiments demonstrate that our method significantly outperforms existing approaches in semantic accuracy, generation efficiency, and synchronization quality, thereby enhancing both information conveyance fidelity and perceptual naturalness in human–computer interaction.
📝 Abstract
While increasing attention has been paid to co-speech gesture synthesis, most previous works overlook hand gestures that carry explicit and essential semantics. In this paper, we study co-speech gesture generation with an emphasis on activating specific hand gestures, which can convey more instructional information than common body movements. To achieve this, we first build a high-quality dataset of 3D human body movements that includes a set of semantically explicit hand gestures commonly used by live streamers. We then present GestureHYDRA, a hybrid-modality gesture generation system built upon a hybrid-modality diffusion transformer architecture with novel motion-style injective transformer layers, which enables advanced gesture modeling and versatile gesture operations. To guarantee that these specific hand gestures are activated, we introduce a cascaded retrieval-augmented generation strategy built upon a semantic gesture repository annotated for each subject, together with an adaptive audio–gesture synchronization mechanism, which substantially improves semantic gesture activation and production efficiency. Quantitative and qualitative experiments demonstrate that our proposed approach achieves superior performance over all counterparts. The project page can be found at https://mumuwei.github.io/GestureHYDRA/.