π€ AI Summary
Existing social robots exhibit limited emotional understanding and empathic expression, particularly in open-domain affective response generation and coordinated multimodal nonverbal feedback (e.g., gestures, lighting).
Method: We propose a novel embodied framework integrating vision-language models (VLMs) with emotion-driven physical control. A large language model (LLM) interprets user affective intent; a VLM enhances contextual perception; an emotion-aligned motion planning module jointly controls RGB lighting and servo actuators to generate temporally coherent, cross-modal empathic behaviors.
Contribution/Results: This work introduces the first closed-loop VLM-based architecture for emotion-driven embodied behavior generation, enabling open-domain affective response selection and cross-modal empathy reinforcement. Human-robot interaction experiments demonstrate a 37% improvement in emotion conveyance accuracy and a 2.1-point increase in naturalness rating (5-point scale), significantly enhancing usersβ perception of robotic empathy authenticity.
π Abstract
Human acceptance of social robots is greatly effected by empathy and perceived understanding. This necessitates accurate and flexible responses to various input data from the user. While systems such as this can become increasingly complex as more states or response types are included, new research in the application of large language models towards human-robot interaction has allowed for more streamlined perception and reaction pipelines. LLM-selected actions and emotional expressions can help reinforce the realism of displayed empathy and allow for improved communication between the robot and user. Beyond portraying empathy in spoken or written responses, this shows the possibilities of using LLMs in actuated, real world scenarios. In this work we extend research in LLM-driven nonverbal behavior for social robots by considering more open-ended emotional response selection leveraging new advances in vision-language models, along with emotionally aligned motion and color pattern selections that strengthen conveyance of meaning and empathy.