🤖 AI Summary
This study addresses the limitations of existing emotion recognition approaches in human-robot collaboration, which often fall short because they rely on acted datasets and single-modality inputs such as facial expressions. To overcome this, the work proposes a novel vision-language model (VLM)-based emotion recognition system that leverages contextual understanding to interpret users’ emotional states. The system is first evaluated on an existing human-robot collaboration dataset by measuring its semantic and sentiment similarity with human annotations, and then in a user study in which a service robot modulates its behavior during a collaborative delivery task according to the inferred emotional state. The proposed method achieves higher semantic similarity and positive sentiment alignment with human annotations than a CNN-based baseline, and participants in the user study preferred the robot’s emotion-adaptive behavior.
📝 Abstract
Human-robot collaboration (HRC) can benefit from robots' abilities to interpret human emotional states. However, current emotion recognition (ER) models in HRC often fall short, particularly due to their reliance on acted datasets and single-modality inputs like facial expressions. We propose a novel vision language model (VLM)-based ER system that leverages contextual understanding to improve emotion interpretation in HRC. We first evaluate the VLM-ER system by assessing its semantic and sentiment similarity with human annotations on an existing HRC dataset. Then, in a user study with a service robot in a collaborative delivery task, we evaluate the effects of modulating the robot's behaviour based on the user's emotional state inferred by the VLM-ER system. The results show that the proposed VLM-ER system achieves higher semantic similarity and positive sentiment alignment with human annotations compared to a baseline convolutional neural network-based system. Further, participants in the user study preferred emotion-adaptive robot behaviour facilitated by the VLM-ER system.