Robi Butler: Multimodal Remote Interaction with a Household Robot Assistant

📅 2024-09-30
📈 Citations: 3
Influential: 0
🤖 AI Summary
This work addresses the challenge of multimodal remote interaction with home-service robots. We propose a framework that grounds remote instructions in a zero-shot manner by unifying Zoom video streaming, first-person visual feedback, speech and text commands, and hand-pointing gestures into a single interaction paradigm. At its core, an LLM-based behavior module, coordinated with vision-language models, interprets the multimodal instructions and generates open-vocabulary, multi-step navigation and manipulation plans that the robot executes while processing the live video stream. Key contributions include: (i) a demonstration of zero-shot understanding and execution of remote multimodal instructions in real-world home environments; and (ii) a user study indicating that multimodal interaction improves the naturalness of remote operation and users' confidence in task completion. Experiments across diverse household tasks support the framework's effectiveness and robustness.
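
The summary above describes an LLM planning over open-vocabulary primitives that are grounded by vision-language models at execution time, but it does not show what that loop looks like in practice. The following is a minimal sketch of such a plan-and-dispatch loop, assuming a JSON plan format, a `query_llm` placeholder, and hypothetical primitive names (`goto`, `pick`, `place`); none of these reflect the paper's actual interface.

```python
import json
from typing import Callable

# Hypothetical primitive skills; in the real system each would be an
# open-vocabulary navigation or manipulation skill grounded by a VLM.
def goto(target: str) -> None:
    print(f"[nav] moving to '{target}'")

def pick(obj: str) -> None:
    print(f"[manip] picking up '{obj}'")

def place(obj: str, location: str) -> None:
    print(f"[manip] placing '{obj}' on '{location}'")

PRIMITIVES: dict[str, Callable] = {"goto": goto, "pick": pick, "place": place}

def query_llm(instruction: str, gesture_target: str | None) -> str:
    """Placeholder for the LLM call that turns a multimodal instruction into a
    multi-step plan; here it returns a canned plan purely for illustration."""
    obj = gesture_target or "the red mug"
    return json.dumps([
        {"skill": "goto", "args": ["kitchen counter"]},
        {"skill": "pick", "args": [obj]},
        {"skill": "goto", "args": ["dining table"]},
        {"skill": "place", "args": [obj, "dining table"]},
    ])

def execute(instruction: str, gesture_target: str | None = None) -> None:
    # Parse the LLM's plan and dispatch each step to the matching primitive.
    plan = json.loads(query_llm(instruction, gesture_target))
    for step in plan:
        PRIMITIVES[step["skill"]](*step["args"])

if __name__ == "__main__":
    # A voice/text command plus an object resolved from a pointing gesture.
    execute("Bring that to the dining table.", gesture_target="blue water bottle")
```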

📝 Abstract
Imagine a future when we can Zoom-call a robot to manage household chores remotely. This work takes one step in this direction. Robi Butler is a new household robot assistant that enables seamless multimodal remote interaction. It allows the human user to monitor its environment from a first-person view, issue voice or text commands, and specify target objects through hand-pointing gestures. At its core, a high-level behavior module, powered by Large Language Models (LLMs), interprets multimodal instructions to generate multistep action plans. Each plan consists of open-vocabulary primitives supported by vision-language models, enabling the robot to process both textual and gestural inputs. Zoom provides a convenient interface to implement remote interactions between the human and the robot. The integration of these components allows Robi Butler to ground remote multimodal instructions in real-world home environments in a zero-shot manner. We evaluated the system on various household tasks, demonstrating its ability to execute complex user commands with multimodal inputs. We also conducted a user study to examine how multimodal interaction influences user experiences in remote human-robot interaction. These results suggest that with the advances in robot foundation models, we are moving closer to the reality of remote household robot assistants.
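
The abstract notes that users can specify target objects through hand-pointing gestures, which vision-language models then ground in the robot's first-person view. One simple way such grounding could work is to run an open-vocabulary detector on the current frame and select the detection closest to the pointed pixel; the sketch below assumes exactly that heuristic (and an upstream detector that already produced labeled boxes), as an illustration rather than the paper's actual grounding procedure.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str   # open-vocabulary label from an assumed upstream VLM detector
    box: tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    score: float

def resolve_pointing(point_xy: tuple[float, float],
                     detections: list[Detection]) -> Detection | None:
    """Pick the detection whose box center is closest to the pointed pixel,
    giving a small bonus to higher-confidence detections."""
    px, py = point_xy
    best, best_cost = None, float("inf")
    for det in detections:
        x1, y1, x2, y2 = det.box
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        dist = ((cx - px) ** 2 + (cy - py) ** 2) ** 0.5
        cost = dist - 10.0 * det.score
        if cost < best_cost:
            best, best_cost = det, cost
    return best

# Example: the user points at roughly pixel (420, 310) in the first-person frame.
dets = [
    Detection("red mug", (380, 280, 450, 360), 0.91),
    Detection("cereal box", (120, 200, 220, 380), 0.87),
]
target = resolve_pointing((420, 310), dets)
print(target.label if target else "no target found")
```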
Problem

Research questions and friction points this paper is trying to address.

How to enable remote management of household chores through natural multimodal interaction.
How to interpret voice, text, and pointing-gesture inputs using Large Language Models.
How to execute complex user commands reliably in real-world home environments.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal remote interaction via Zoom (first-person video, voice/text commands, pointing gestures)
LLMs interpret multimodal instructions to generate multi-step action plans
Vision-language models ground open-vocabulary primitives from textual and gestural inputs