🤖 AI Summary
This work addresses the challenge that existing mobile manipulators struggle to express their intentions in real time during human-robot collaboration and lack support for user-initiated interruptions, modifications, or redirections. To bridge this gap, we propose ExpressMM, a framework that, to the authors' knowledge, is the first to enable interruptible, expressive mobile manipulation behaviors. ExpressMM integrates a high-level language-guided planner with a low-level vision-language-action policy, leveraging vision-language models for joint perception and dialog-based reasoning. This integration yields socially appropriate and interpretable robot behaviors while allowing real-time user intervention during task execution. User studies demonstrate that ExpressMM significantly improves the interpretability of robot intent, along with perceived safety, predictability, and practical utility, leading to consistently positive interaction experiences.
📝 Abstract
Mobile manipulators are increasingly deployed in human-centered environments to perform tasks. While completing such tasks, they should also be able to communicate their intent to the people around them using expressive robot behaviors. Prior work on expressive robot behaviors has used preprogrammed or learning-from-demonstration-based expressive motions and high-level interactions generated by large language models. Most of these existing approaches have not considered human-robot interaction (HRI) settings where users may interrupt, modify, or redirect a robot's actions during task execution. In this paper, we develop the novel ExpressMM framework, which integrates a high-level language-guided planner, based on a vision-language model for perception and conversational reasoning, with a low-level vision-language-action policy to generate expressive robot behaviors during collaborative HRI tasks. Furthermore, ExpressMM supports interruptible interactions to accommodate updated or redirecting instructions from users. We demonstrate ExpressMM on a mobile manipulator assisting a human in a collaborative assembly scenario and conduct an audience-based evaluation of live HRI demonstrations. Questionnaire results show that the ExpressMM-enabled expressive behaviors helped observers clearly interpret the robot's actions and intentions while supporting socially appropriate and understandable interactions. Participants also reported that the robot was useful for the collaborative tasks and behaved in a predictable and safe manner during the demonstrations, fostering positive perceptions of its usefulness, safety, and predictability.
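The abstract outlines a two-level architecture: a VLM-based high-level planner that turns dialog and perception into expressive subgoals, and a low-level vision-language-action policy that executes them, with user interruptions able to redirect the robot mid-task. As a rough illustration of how such an interruptible planner/policy loop can be wired together, here is a minimal Python sketch; the paper does not publish this interface, so every class and method name below (`MockPlanner`, `MockPolicy`, `run`, ...) is a hypothetical stand-in, not the authors' implementation.

```python
# Minimal sketch of an interruptible planner/policy loop in the spirit of
# ExpressMM. All names here are illustrative assumptions, not the paper's API.
import queue
from dataclasses import dataclass


@dataclass
class Subgoal:
    """One expressive step emitted by the high-level planner."""
    description: str


class MockPlanner:
    """Stand-in for the VLM-based language-guided planner (assumed interface)."""

    def plan(self, instruction: str, observation: str) -> list[Subgoal]:
        # A real planner would query a vision-language model with the current
        # camera observation and dialog history; here we fabricate fixed steps.
        return [Subgoal(f"{instruction}: step {i}") for i in range(3)]


class MockPolicy:
    """Stand-in for the low-level vision-language-action policy (assumed)."""

    def execute(self, subgoal: Subgoal) -> None:
        print(f"executing -> {subgoal.description}")


def run(planner: MockPlanner, policy: MockPolicy,
        instruction: str, interrupts: "queue.Queue[str]") -> None:
    """Execute subgoals one by one, replanning whenever the user interrupts."""
    subgoals = planner.plan(instruction, observation="<camera frame>")
    while subgoals:
        # Poll for a user interruption *between* low-level steps so the
        # robot can be redirected mid-task, as the abstract describes.
        try:
            new_instruction = interrupts.get_nowait()
            print(f"interrupted: {new_instruction!r} -- replanning")
            subgoals = planner.plan(new_instruction, observation="<camera frame>")
            continue
        except queue.Empty:
            pass
        policy.execute(subgoals.pop(0))


if __name__ == "__main__":
    interrupts: "queue.Queue[str]" = queue.Queue()
    interrupts.put("hand me the wrench instead")  # simulated mid-task redirect
    run(MockPlanner(), MockPolicy(), "fetch the screwdriver", interrupts)
```

In a deployed system the interrupt channel would presumably be fed by speech or dialog input rather than a pre-filled queue, and each policy step would be a short control horizon, so interruptions are picked up with bounded latency rather than only at task boundaries.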