Semantic Co-Speech Gesture Synthesis and Real-Time Control for Humanoid Robots

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the need for natural nonverbal interaction in humanoid robots by proposing an end-to-end, speech-driven framework for semantically consistent gesture generation and real-time physical execution. Methodologically: (1) a semantics-aware gesture synthesis module integrates LLM-powered generative retrieval with Motion-GPT’s autoregressive modeling to achieve precise speech-to-gesture semantic mapping; (2) a high-fidelity imitation controller, MotionTracker, coupled with General Motion Retargeting (GMR), ensures robust, dynamically balanced execution on hardware. Evaluated on the Unitree G1 robot, the system generates rhythmically coherent and semantically accurate gestures in real time, significantly improving the naturalness and deployment stability of human–robot nonverbal interaction. Key contributions include a unified semantic–motion joint modeling framework and a lightweight, physics-aware motion retargeting strategy tailored for real-world robotic execution.
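To make the described pipeline concrete, here is a minimal Python sketch of how the four stages could compose. The component interfaces (retriever, motion_gpt, retargeter, tracker) are hypothetical placeholders standing in for the paper's modules, not its published API.

```python
# Hypothetical glue code for the described pipeline. All component
# interfaces below are invented for this sketch; they stand in for the
# paper's modules and are not the authors' actual classes.

def speak_with_gestures(speech_text, speech_audio,
                        retriever, motion_gpt, retargeter, tracker):
    """Run one utterance through retrieval, synthesis, retargeting, tracking."""
    # 1. LLM-powered generative retrieval: select reference gestures
    #    whose semantics match the utterance.
    references = retriever.retrieve(speech_text)

    # 2. Autoregressive Motion-GPT: synthesize a human motion sequence
    #    conditioned on the references and the speech rhythm.
    human_motion = motion_gpt.generate(speech_text, speech_audio, references)

    # 3. General Motion Retargeting (GMR): map the human motion onto the
    #    robot's kinematics to bridge the embodiment gap.
    robot_trajectory = retargeter.retarget(human_motion)

    # 4. MotionTracker: a learned imitation controller tracks the
    #    reference trajectory frame by frame while keeping balance.
    for reference_frame in robot_trajectory:
        tracker.step(reference_frame)
```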

📝 Abstract
We present an innovative end-to-end framework for synthesizing semantically meaningful co-speech gestures and deploying them in real time on a humanoid robot. This system addresses the challenge of creating natural, expressive nonverbal communication for robots by integrating advanced gesture generation techniques with robust physical control. Our core innovation lies in the tight integration of a semantics-aware gesture synthesis module, which derives expressive reference motions from speech input by leveraging a generative retrieval mechanism based on large language models (LLMs) and an autoregressive Motion-GPT model. This is coupled with a high-fidelity imitation learning control policy, MotionTracker, which enables the Unitree G1 humanoid robot to execute these complex motions dynamically while maintaining balance. To ensure feasibility, we employ a robust General Motion Retargeting (GMR) method to bridge the embodiment gap between human motion data and the robot platform. Through comprehensive evaluation, we demonstrate that our combined system produces semantically appropriate and rhythmically coherent gestures that are accurately tracked and executed by the physical robot. To our knowledge, this work represents a significant step toward general real-world use, providing a complete pipeline for automatic, semantics-aware co-speech gesture generation and synchronized real-time physical deployment on a humanoid robot.
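One plausible reading of the LLM-based generative retrieval step is that the language model generates an identifier into a fixed gesture library rather than producing free-form motion. The sketch below illustrates that reading; the library entries, prompt wording, and llm_complete callable are all assumptions for illustration, not the paper's implementation.

```python
# Illustration of LLM-based generative retrieval over a gesture library.
# The labels, clip paths, prompt, and `llm_complete` callable are assumed
# for this sketch; the paper's actual retrieval mechanism may differ.

GESTURE_LIBRARY = {
    "wave": "clips/wave.npz",            # placeholder clip paths
    "point_forward": "clips/point.npz",
    "shrug": "clips/shrug.npz",
    "beat": "clips/beat.npz",
}

def retrieve_gesture(speech_text, llm_complete):
    """Ask an LLM to generate the library key that best fits the utterance."""
    prompt = (
        "Pick the one gesture label that best matches the utterance.\n"
        f"Labels: {', '.join(GESTURE_LIBRARY)}\n"
        f"Utterance: {speech_text!r}\n"
        "Reply with the label only."
    )
    label = llm_complete(prompt).strip().lower()
    # Fall back to a beat gesture if the reply is not in the library.
    return GESTURE_LIBRARY.get(label, GESTURE_LIBRARY["beat"])
```

The fallback here defaults to a beat gesture because beat gestures carry rhythm rather than meaning, making them a semantically safe choice when the model's reply is off-library.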
Problem

Research questions and friction points this paper is trying to address.

Synthesizing semantically meaningful co-speech gestures for robots.
Enabling real-time physical execution of gestures on humanoid robots.
Bridging the embodiment gap between human motion and robot platforms.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates LLM-based generative retrieval for semantic gesture synthesis.
Uses MotionTracker imitation learning for dynamic robot motion control.
Applies General Motion Retargeting to adapt human motions to robots (see the sketch after this list).
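As a toy instance of the retargeting idea above, the snippet below linearly maps human joint ranges onto narrower robot joint ranges and clamps the result. The joint names and limits are invented for illustration and do not reflect the Unitree G1's actual specification; GMR itself solves a much richer whole-body problem.

```python
import numpy as np

# Toy per-frame retargeting: linearly map a human joint range onto a
# robot joint range, then clamp. Joint names and ranges are invented;
# real GMR handles whole-body kinematics, not independent joints.

HUMAN_RANGE = {"shoulder_pitch": (-3.1, 3.1), "elbow": (0.0, 2.6)}
ROBOT_RANGE = {"shoulder_pitch": (-2.0, 2.0), "elbow": (0.0, 2.4)}

def retarget_frame(human_angles):
    """Map one frame of human joint angles (radians) onto the robot."""
    robot_angles = {}
    for joint, angle in human_angles.items():
        h_lo, h_hi = HUMAN_RANGE[joint]
        r_lo, r_hi = ROBOT_RANGE[joint]
        t = (angle - h_lo) / (h_hi - h_lo)          # normalize to [0, 1]
        robot_angles[joint] = float(np.clip(r_lo + t * (r_hi - r_lo),
                                            r_lo, r_hi))
    return robot_angles

# Example: a raised-arm frame maps into the robot's narrower limits.
print(retarget_frame({"shoulder_pitch": 2.5, "elbow": 1.0}))
```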