Efficient Emotion-Aware Iconic Gesture Prediction for Robot Co-Speech

📅 2026-04-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing robot co-speech gesture generation methods: they are largely confined to rhythmic, beat-like motion and rarely integrate semantic emphasis with emotional expression. The authors propose a lightweight Transformer that jointly predicts semantic gesture placement (a classification task) and gesture intensity (a regression task) from text and emotion inputs alone. Because no audio is required at inference time, the model is suited to real-time generation. To the best of the authors' knowledge, this is the first approach to emotion-aware semantic gesture synthesis without audio cues. On the BEAT2 dataset, the method outperforms GPT-4o in both placement and intensity prediction while keeping a compact architecture, which makes it practical for resource-constrained embodied agents that need real-time responsiveness.
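The summary describes placement as classification and intensity as regression, trained jointly. How the two objectives are combined is not stated here, so the following is only a minimal sketch under assumptions: per-token binary placement labels, a binary cross-entropy term, a mean squared error term counted only on tokens that carry a gesture, and a hypothetical `intensity_weight` to balance them. The function name `joint_gesture_loss` and all parameter names are illustrative and do not come from the paper.

```python
import math
from typing import Sequence


def joint_gesture_loss(
    placement_logits: Sequence[float],
    intensity_preds: Sequence[float],
    placement_targets: Sequence[int],
    intensity_targets: Sequence[float],
    intensity_weight: float = 1.0,  # hypothetical balancing factor
) -> float:
    """Per-token binary cross-entropy plus masked mean squared error.

    Assumed formulation for illustration; not the paper's stated loss.
    """
    n = len(placement_logits)
    if not (n == len(intensity_preds) == len(placement_targets) == len(intensity_targets)):
        raise ValueError("all sequences must have the same length")
    if n == 0:
        raise ValueError("need at least one token")

    # Classification term: does token i carry an iconic gesture (1) or not (0)?
    bce = 0.0
    for z, y in zip(placement_logits, placement_targets):
        # Numerically stable BCE on logits: max(z, 0) - z*y + log(1 + exp(-|z|))
        bce += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    bce /= n

    # Regression term: squared intensity error, counted only on gesture tokens.
    sq_errors = [
        (p - t) ** 2
        for p, t, y in zip(intensity_preds, intensity_targets, placement_targets)
        if y == 1
    ]
    mse = sum(sq_errors) / len(sq_errors) if sq_errors else 0.0

    return bce + intensity_weight * mse
```

For example, `joint_gesture_loss([0.0, 0.0], [0.5, 0.9], [1, 0], [1.0, 0.0])` gives about 0.943: an uninformative logit of 0 costs ln 2 per token, and the single gesture token contributes a squared intensity error of 0.25. Masking the regression term this way keeps non-gesture tokens from pulling intensity predictions toward zero, which is one common design choice for this kind of two-headed model.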

📝 Abstract

Co-speech gestures increase engagement and improve speech understanding. Most data-driven robot systems generate rhythmic beat-like motion, yet few integrate semantic emphasis. To address this, we propose a lightweight transformer that derives iconic gesture placement and intensity from text and emotion alone, requiring no audio input at inference time. The model outperforms GPT-4o in both semantic gesture placement classification and intensity regression on the BEAT2 dataset, while remaining computationally compact and suitable for real-time deployment on embodied agents.

Problem

Research questions and friction points this paper is trying to address.

co-speech gesture

iconic gesture

emotion-aware

semantic emphasis

robot animation

Innovation

Methods, ideas, or system contributions that make the work stand out.

emotion-aware gesture prediction

lightweight transformer

iconic gestures