CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation

πŸ“… 2025-11-27
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Current co-speech gesture generation methods suffer from two key challenges: (1) a lack of semantic priors due to the absence of textual annotations in gesture datasets, and (2) difficulty in achieving fine-grained multimodal coordination. To address these, we propose the first unified framework for joint gesture understanding and caption generation, establishing a bidirectional gesture–text mapping that bridges the semantic gap. Our method introduces a motion-language model that produces descriptive captions at multiple granularities, coupled with a conditional latent diffusion model whose hierarchical denoiser enables layered, cross-modal controllable synthesis. The framework supports fine-grained control over rhythm, semantics, and stylistic attributes within a unified cross-dataset representation. Experiments demonstrate significant improvements over state-of-the-art methods in rhythm synchronization, semantic consistency, and generation quality, while also achieving higher inference efficiency.
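
To make the captioning idea concrete, here is a minimal sketch of one way such a motion-language model could be wired up: a VQ-style tokenizer turns pose sequences into discrete motion tokens, and a small autoregressive decoder attends to them to emit caption tokens. All module names, dimensions, and the tokenizer design are illustrative assumptions, not CoordSpeaker's actual implementation.

```python
# Sketch of gesture captioning: motion frames -> discrete motion tokens ->
# autoregressive caption decoding. Hypothetical names and sizes throughout.
import torch
import torch.nn as nn

class MotionTokenizer(nn.Module):
    """Quantizes per-frame pose features against a codebook (VQ-VAE style)."""
    def __init__(self, pose_dim=63, codebook_size=512, dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(pose_dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, poses):                          # poses: (B, T, pose_dim)
        z = self.enc(poses)                            # (B, T, dim)
        dists = torch.cdist(z, self.codebook.weight)   # (B, T, K) distances
        return dists.argmin(-1)                        # one motion token id per frame

class CaptionDecoder(nn.Module):
    """Tiny decoder that attends to motion tokens and predicts caption tokens."""
    def __init__(self, motion_vocab=512, text_vocab=1000, dim=256):
        super().__init__()
        self.motion_emb = nn.Embedding(motion_vocab, dim)
        self.text_emb = nn.Embedding(text_vocab, dim)
        layer = nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(dim, text_vocab)

    def forward(self, motion_ids, caption_ids):
        mem = self.motion_emb(motion_ids)              # motion tokens as memory
        tgt = self.text_emb(caption_ids)
        mask = nn.Transformer.generate_square_subsequent_mask(caption_ids.size(1))
        out = self.decoder(tgt, mem, tgt_mask=mask)    # causal text, full motion context
        return self.head(out)                          # next-token logits

# Dummy forward pass: 2 clips of 30 frames, captions of length 8.
poses = torch.randn(2, 30, 63)
motion_ids = MotionTokenizer()(poses)
logits = CaptionDecoder()(motion_ids, torch.randint(0, 1000, (2, 8)))
print(logits.shape)                                    # torch.Size([2, 8, 1000])
```

Captions at different granularities could then be obtained by decoding against token sequences pooled at different temporal scales, though the paper's actual mechanism may differ.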

πŸ“ Abstract
Co-speech gesture generation has significantly advanced human-computer interaction, yet speaker movements remain constrained due to the omission of text-driven non-spontaneous gestures (e.g., bowing while talking). Existing methods face two key challenges: 1) the semantic prior gap due to the lack of descriptive text annotations in gesture datasets, and 2) the difficulty of achieving coordinated multimodal control over gesture generation. To address these challenges, this paper introduces CoordSpeaker, a comprehensive framework that enables coordinated caption-empowered co-speech gesture synthesis. Our approach first bridges the semantic prior gap through a novel gesture captioning framework, leveraging a motion-language model to generate descriptive captions at multiple granularities. Building upon this, we propose a conditional latent diffusion model with a unified cross-dataset motion representation and a hierarchically controlled denoiser to achieve highly controlled, coordinated gesture generation. CoordSpeaker pioneers the exploration of gesture understanding and captioning to tackle the semantic gap in gesture generation while offering a novel perspective on bidirectional gesture-text mapping. Extensive experiments demonstrate that our method produces high-quality gestures that are both rhythmically synchronized with speech and semantically coherent with arbitrary captions, achieving superior performance with higher efficiency compared to existing approaches.
Problem

Research questions and friction points this paper is trying to address.

Bridging the semantic gap in gesture datasets lacking descriptive text annotations
Enabling coordinated multimodal control for co-speech gesture generation
Generating gestures synchronized with speech and semantically coherent with captions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gesture captioning bridges semantic gap in datasets
Conditional latent diffusion model enables coordinated gesture control
Hierarchically controlled denoiser enables layered, coordinated multimodal control (see the sketch after this list)
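
A minimal sketch of how such layered control might look, assuming a transformer denoiser in which audio (rhythm) and caption (semantics) conditions enter through separate cross-attention layers and are weighted independently via two-condition classifier-free guidance. Every name, shape, and guidance weight below is a hypothetical stand-in, not the paper's code.

```python
# Layered conditioning in a latent-diffusion denoiser: separate cross-attention
# for rhythm (audio) and semantics (caption), combined with two-condition
# classifier-free guidance at sampling time. Illustrative assumptions only.
import torch
import torch.nn as nn

class HierarchicalDenoiserBlock(nn.Module):
    def __init__(self, dim=256, nhead=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(dim, nhead, batch_first=True)  # rhythm layer
        self.text_attn = nn.MultiheadAttention(dim, nhead, batch_first=True)   # semantic layer
        self.ff = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, x, audio, text):                 # all inputs: (B, T*, dim)
        x = x + self.self_attn(x, x, x)[0]
        x = x + self.audio_attn(x, audio, audio)[0]    # align gesture latents to beat
        x = x + self.text_attn(x, text, text)[0]       # inject caption semantics
        return x + self.ff(x)

def guided_noise(denoiser, x_t, t_emb, audio, text, w_a=2.0, w_t=3.0):
    """Two-condition classifier-free guidance: null out each stream in turn.

    t_emb is a (broadcastable) timestep embedding added to the noisy latents,
    an assumed conditioning scheme for this sketch.
    """
    null_a, null_t = torch.zeros_like(audio), torch.zeros_like(text)
    e_none = denoiser(x_t + t_emb, null_a, null_t)     # unconditional
    e_audio = denoiser(x_t + t_emb, audio, null_t)     # rhythm only
    e_full = denoiser(x_t + t_emb, audio, text)        # rhythm + semantics
    return e_none + w_a * (e_audio - e_none) + w_t * (e_full - e_audio)

# Dummy call: 2 samples, 30 latent frames, 50 audio frames, 8 caption tokens.
block = HierarchicalDenoiserBlock()
eps = guided_noise(block, torch.randn(2, 30, 256), torch.randn(2, 1, 256),
                   torch.randn(2, 50, 256), torch.randn(2, 8, 256))
print(eps.shape)                                       # torch.Size([2, 30, 256])
```

Keeping the two cross-attention streams separate is what makes independent guidance weights (w_a for rhythm, w_t for semantics) meaningful: each condition can be strengthened or nulled without disturbing the other.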
πŸ”Ž Similar Papers
No similar papers found.