🤖 AI Summary
This work addresses the challenge of enabling household service robots to make context-aware decisions, aligned with human values, from real-time multimodal inputs. We present the first integration of the GPT-4o multimodal large language model into the TurtleBot 4 platform, developing an intelligent vacuuming robot system that perceives domestic environments visually, reasons about human values such as cleanliness, comfort, and safety, and autonomously decides whether to initiate cleaning tasks accordingly. Deployed in real-world home settings, the system demonstrates a notable ability to infer contextual cues and user preferences from limited visual input, significantly enhancing robotic autonomy and situational awareness. Our findings also highlight critical challenges in value alignment concerning consistency, bias, and real-time responsiveness.
📝 Abstract
In this work, we explore how multimodal large language models can support real-time context- and value-aware decision-making. To do so, we combine the GPT-4o language model with a TurtleBot 4 platform simulating a smart vacuum cleaning robot in a home. The model evaluates the environment through vision input and determines whether it is appropriate to initiate cleaning. The system highlights the ability of these models to reason about domestic activities, social norms, and user preferences, and to make nuanced decisions aligned with the values of the people involved, such as cleanliness, comfort, and safety. We demonstrate the system in a realistic home environment, showing its ability to infer context and values from limited visual input. Our results highlight the promise of multimodal large language models in enhancing robotic autonomy and situational awareness, while also underscoring challenges related to consistency, bias, and real-time performance.
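The decision loop described above (camera frame in, clean/no-clean decision out) can be sketched minimally as follows. This is an illustrative assumption, not the authors' implementation: the prompt wording, the `query_gpt4o` stub (which here returns a canned answer instead of calling the real GPT-4o vision API), and the `YES:`/`NO:` response format are all hypothetical.

```python
import base64
from dataclasses import dataclass


@dataclass
class CleaningDecision:
    should_clean: bool
    reason: str


# Hypothetical prompt asking the model to weigh cleanliness, comfort, and
# safety -- the values named in the abstract -- before starting to vacuum.
PROMPT = (
    "You are the decision module of a home vacuum robot. Given the attached "
    "camera image, decide whether starting to vacuum now respects the "
    "occupants' cleanliness, comfort, and safety. "
    "Answer 'YES: <reason>' or 'NO: <reason>'."
)


def query_gpt4o(prompt: str, image_b64: str) -> str:
    # Placeholder for a real multimodal model call; a canned refusal keeps
    # this sketch self-contained and runnable offline.
    return "NO: a person appears to be resting in the room; vacuuming would disturb them."


def decide(image_bytes: bytes) -> CleaningDecision:
    # Encode the camera frame as base64, as vision APIs commonly expect.
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    answer = query_gpt4o(PROMPT, image_b64)
    # Parse the assumed 'YES: ...' / 'NO: ...' format into a structured decision.
    verdict, _, reason = answer.partition(":")
    return CleaningDecision(verdict.strip().upper() == "YES", reason.strip())
```

In a deployed system, `query_gpt4o` would issue the actual API request with the live TurtleBot 4 camera frame, and the parsed decision would gate the robot's cleaning behavior; the paper's noted challenges (consistency, bias, latency) would surface precisely at this call boundary.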