🤖 AI Summary
This work addresses a limitation of existing robot co-speech gesture generation methods: they are largely confined to rhythmic, beat-like motion and struggle to combine semantic emphasis with emotional expression. The authors propose a lightweight Transformer that jointly models semantic gesture placement as a classification task and gesture intensity as a regression task. The model takes only text and emotion as input, so it needs no audio at inference time, which supports real-time generation. To the best of the authors' knowledge, this is the first approach to emotion-aware semantic gesture synthesis without audio cues. On the BEAT2 dataset, the method outperforms GPT-4o in both placement classification and intensity regression while remaining compact, which makes it practical for resource-constrained embodied agents that require real-time responsiveness.
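The joint classification and regression formulation described above can be illustrated with a minimal sketch of a combined per-token loss. The summary does not state the paper's actual loss functions, label scheme, or weighting, so everything here is an assumption for illustration: the function name `joint_gesture_loss`, the binary placement labels, the use of binary cross-entropy and mean squared error, the masking of the regression term to gesture tokens, and the `reg_weight` parameter.

```python
import math


def joint_gesture_loss(place_logits, intensity_preds,
                       place_labels, intensity_targets, reg_weight=1.0):
    """Hypothetical joint loss over one token sequence.

    place_logits      -- raw per-token scores for "a semantic gesture starts here"
    intensity_preds   -- per-token predicted gesture intensity
    place_labels      -- per-token 0/1 ground-truth placement (assumed binary)
    intensity_targets -- per-token ground-truth intensity
    reg_weight        -- assumed weight balancing the two tasks
    Returns (total, placement_loss, intensity_loss).
    """
    n = len(place_logits)
    if n == 0:
        raise ValueError("empty sequence")
    if not (n == len(intensity_preds) == len(place_labels) == len(intensity_targets)):
        raise ValueError("all inputs must have the same length")

    # Classification term: numerically stable binary cross-entropy on logits,
    # max(z, 0) - z*y + log(1 + exp(-|z|)), averaged over all tokens.
    bce = 0.0
    for z, y in zip(place_logits, place_labels):
        bce += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    bce /= n

    # Regression term: mean squared error, computed only on tokens that
    # actually carry a gesture (an assumed masking choice).
    squared_errors = [
        (p - t) ** 2
        for p, t, y in zip(intensity_preds, intensity_targets, place_labels)
        if y == 1
    ]
    mse = sum(squared_errors) / len(squared_errors) if squared_errors else 0.0

    return bce + reg_weight * mse, bce, mse


# Two tokens: the first carries a gesture with target intensity 1.0.
total, placement_loss, intensity_loss = joint_gesture_loss(
    place_logits=[0.0, 0.0],
    intensity_preds=[0.5, 0.9],
    place_labels=[1, 0],
    intensity_targets=[1.0, 0.0],
)
```

Masking the regression term keeps intensity errors on non-gesture tokens from influencing training, which is one common way to couple the two tasks; the authors may have chosen differently.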
📝 Abstract
Co-speech gestures increase engagement and improve speech understanding. Most data-driven robot systems generate rhythmic beat-like motion, yet few integrate semantic emphasis. To address this, we propose a lightweight transformer that derives iconic gesture placement and intensity from text and emotion alone, requiring no audio input at inference time. The model outperforms GPT-4o in both semantic gesture placement classification and intensity regression on the BEAT2 dataset, while remaining computationally compact and suitable for real-time deployment on embodied agents.