🤖 AI Summary
Current UX evaluation in human-robot interaction (HRI) is fragmented and largely static, failing to capture the temporal dynamics of user experience. To address this, we propose a dynamic UX estimation method grounded in multimodal social signals, specifically facial expressions and speech, integrated within an end-to-end framework that combines multiple-instance learning with a Transformer architecture. Unlike conventional single-time-point assessments, the approach explicitly models UX fluctuations at both short-term (e.g., second-scale emotional shifts) and long-term (e.g., session-level adaptation) temporal scales. Empirical evaluation shows statistically significant improvements in UX estimation accuracy over third-party human evaluators (p < 0.01). The method offers a deployable technical foundation for social robots to perceive fine-grained user states in real time and adapt their behavior accordingly.
📝 Abstract
In recent years, the demand for social robots has grown, requiring them to adapt their behaviors based on users' states. Accurately assessing user experience (UX) in human-robot interaction (HRI) is crucial for achieving this adaptability. UX is a multi-faceted measure encompassing aspects such as sentiment and engagement, yet existing methods often focus on these individually. This study proposes a UX estimation method for HRI by leveraging multimodal social signals. We construct a UX dataset and develop a Transformer-based model that utilizes facial expressions and voice for estimation. Unlike conventional models that rely on momentary observations, our approach captures both short- and long-term interaction patterns using a multi-instance learning framework. This enables the model to capture temporal dynamics in UX, providing a more holistic representation. Experimental results demonstrate that our method outperforms third-party human evaluators in UX estimation.
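To make the modeling idea concrete, here is a minimal sketch of how a Transformer encoder combined with attention-based multiple-instance learning (MIL) pooling could map per-segment facial and vocal features to an interaction-level UX estimate. All module names, feature dimensions, and the specific pooling mechanism are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal illustrative sketch (PyTorch). Feature dimensions, layer sizes, and
# the attention-based MIL pooling are assumptions for illustration only.
import torch
import torch.nn as nn

class MultimodalUXEstimator(nn.Module):
    def __init__(self, face_dim=128, voice_dim=64, d_model=256,
                 n_heads=4, n_layers=2, n_ux_levels=5):
        super().__init__()
        # Project per-segment facial and vocal features into a shared space.
        self.face_proj = nn.Linear(face_dim, d_model)
        self.voice_proj = nn.Linear(voice_dim, d_model)
        # Transformer encoder models short-term dynamics across segments ("instances").
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # Attention-based MIL pooling aggregates instance embeddings into one
        # bag-level (whole-interaction) representation.
        self.mil_attn = nn.Sequential(nn.Linear(d_model, 64), nn.Tanh(), nn.Linear(64, 1))
        self.classifier = nn.Linear(d_model, n_ux_levels)

    def forward(self, face_feats, voice_feats):
        # face_feats, voice_feats: (batch, n_segments, feature_dim)
        x = self.face_proj(face_feats) + self.voice_proj(voice_feats)  # fuse modalities
        h = self.encoder(x)                                 # (batch, n_segments, d_model)
        w = torch.softmax(self.mil_attn(h), dim=1)          # instance attention weights
        bag = (w * h).sum(dim=1)                            # bag-level embedding
        return self.classifier(bag)                         # UX-level logits

# Usage: 8 interactions, each split into 30 short segments.
model = MultimodalUXEstimator()
logits = model(torch.randn(8, 30, 128), torch.randn(8, 30, 64))
print(logits.shape)  # torch.Size([8, 5])
```

In this sketch the MIL view treats each interaction as a bag of short segments, so the model can weight the segments that matter most while still producing a single holistic UX estimate for the session.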