🤖 AI Summary
This work addresses the challenge of enabling multiple home robots to collaboratively interpret open-vocabulary natural language instructions and manipulate everyday objects alongside people in a home kitchen. The proposed system, MOSAIC, is modular and hierarchical: an upper layer uses large pre-trained language and vision models for semantic understanding and online, interpretable task planning, while a lower layer combines lightweight control modules, visuomotor picking policies, and human motion forecasting models for real-time human–robot collaboration. This pairing of general-purpose large models with lightweight task-specific modules enables end-to-end human–robot collaborative cooking. In 60 real-world end-to-end trials in which two robots cooked with a human user, the system completed 68.3% (41/60) of trials with a subtask completion rate of 91.6%. Individual modules were further evaluated with 180 episodes of visuomotor picking, 60 episodes of human motion forecasting, and 46 online user evaluations of the task planner.
📝 Abstract
We present MOSAIC, a modular architecture for home robots to perform complex collaborative tasks, such as cooking with everyday users. MOSAIC collaborates closely with humans, interacts with users in natural language, coordinates multiple robots, and manages an open vocabulary of everyday objects. At its core, MOSAIC employs modularity: it leverages multiple large-scale pre-trained models for general tasks like language and image recognition, while using streamlined modules designed for task-specific control. We evaluate MOSAIC on 60 end-to-end trials in which two robots collaborate with a human user to cook a combination of 6 recipes. We also test individual modules with 180 episodes of visuomotor picking, 60 episodes of human motion forecasting, and 46 online user evaluations of the task planner. Running the overall system end-to-end with a real human user, MOSAIC collaborates efficiently, completing 68.3% (41/60) of the collaborative cooking trials across the 6 recipes with a subtask completion rate of 91.6%. Finally, we discuss the limitations of the current system and exciting open challenges in this domain. The project's website is at https://portal-cornell.github.io/MOSAIC/