🤖 AI Summary
Existing methods for generating robotic facial expressions rely on hand-crafted animations, which limits dynamism, cross-context adaptability, and platform scalability, resulting in emotionally monotonous interactions and weak user resonance over extended engagement. This paper proposes Xpress, a language-model-driven framework with three phases: encoding temporal flow, conditioning expressions on context, and generating facial expression code, enabling natural, real-time, and situationally adaptive expressions. By integrating language models into expression generation, the approach supports semantic understanding, dynamic temporal modeling, and contextual perception, moving beyond the limitations of predefined animation libraries. Two user studies (n=15 each) and a case study with children and parents (n=13), set in storytelling and conversational scenarios, demonstrate that Xpress dynamically produces expressive, contextually appropriate facial expressions and show its potential for human-robot interaction.
📝 Abstract
Facial expressions are vital in human communication and significantly influence outcomes in human-robot interaction (HRI), such as likeability, trust, and companionship. However, current methods for generating robotic facial expressions are often labor-intensive, lack adaptability across contexts and platforms, and have limited expressive ranges, leading to repetitive behaviors that reduce interaction quality, particularly in long-term scenarios. We introduce Xpress, a system that leverages language models (LMs) to dynamically generate context-aware facial expressions for robots through a three-phase process: encoding temporal flow, conditioning expressions on context, and generating facial expression code. We demonstrated Xpress as a proof of concept through two user studies (n=15x2) and a case study with children and parents (n=13), in storytelling and conversational scenarios, to assess the system's context-awareness, expressiveness, and dynamism. Results demonstrate Xpress's ability to dynamically produce expressive and contextually appropriate facial expressions, highlighting its versatility and potential in HRI applications.
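The three-phase process named in the abstract (encode temporal flow, condition on context, generate expression code) can be pictured as a simple pipeline. The sketch below is purely illustrative: the function names, the `FacialFrame` token format, and the keyword-based stand-in for the language-model call are all assumptions, since the abstract does not describe Xpress's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class FacialFrame:
    # Hypothetical output token: an expression label plus a hold duration.
    expression: str
    duration_s: float

def encode_temporal_flow(transcript: str) -> list[str]:
    """Phase 1 (sketch): split the dialogue into temporal segments."""
    return [s.strip() for s in transcript.split(".") if s.strip()]

def condition_on_context(segments: list[str], context: str) -> list[dict]:
    """Phase 2 (sketch): pair each segment with the interaction context."""
    return [{"text": seg, "context": context} for seg in segments]

def generate_expression_code(conditioned: list[dict]) -> list[FacialFrame]:
    """Phase 3 (sketch): map each conditioned segment to a facial token.
    A real system would query a language model here; this stub uses a
    keyword rule purely so the pipeline runs end to end."""
    frames = []
    for item in conditioned:
        text = item["text"].lower()
        if "wow" in text:
            expr = "surprised"
        elif "sad" in text:
            expr = "sad"
        else:
            expr = "neutral"
        frames.append(FacialFrame(expression=expr, duration_s=1.5))
    return frames

# Usage: run a short storytelling transcript through all three phases.
segments = encode_temporal_flow("Once upon a time. Wow, a dragon. The end.")
frames = generate_expression_code(
    condition_on_context(segments, context="storytelling")
)
```

Here the pipeline yields one `FacialFrame` per sentence, with the middle segment flagged as "surprised"; in the real system, the third phase would emit platform-agnostic expression code rather than fixed labels.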