🤖 AI Summary
This work addresses the challenge of accurate state estimation in partially observable environments, where robotic task and motion planning is often hindered by task-irrelevant unexpected objects. The authors propose CoCo-TAMP, a novel framework that integrates the commonsense reasoning capabilities of large language models (LLMs) into hierarchical state estimation for the first time. By modeling object location priors and co-occurrence relationships, CoCo-TAMP refines belief states over task-relevant objects without requiring manually constructed knowledge bases. This approach significantly enhances the efficiency of long-horizon task and motion planning. Experimental results demonstrate that CoCo-TAMP reduces planning and execution time by 62.7% on average in simulation and by 72.6% in real-robot trials, substantially outperforming baseline methods that lack commonsense reasoning.
📝 Abstract
Robot planning in partially observable environments, where not all objects are known or visible, is a challenging problem, as it requires reasoning under uncertainty through partially observable Markov decision processes. During the execution of a computed plan, a robot may unexpectedly observe task-irrelevant objects, which are typically ignored by naive planners. In this work, we propose incorporating two types of common-sense knowledge: (1) certain objects are more likely to be found in specific locations; and (2) similar objects are likely to be co-located, while dissimilar objects are less likely to be found together. Manually engineering such knowledge is complex, so we explore leveraging the powerful common-sense reasoning capabilities of large language models (LLMs). Our planning and execution framework, CoCo-TAMP, introduces a hierarchical state estimation that uses LLM-guided information to shape the belief over task-relevant objects, enabling efficient solutions to long-horizon task and motion planning problems. In experiments, CoCo-TAMP achieves an average reduction of 62.7% in planning and execution time in simulation, and of 72.6% in real-world demonstrations, compared to a baseline that does not incorporate either type of common-sense knowledge.
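To make the two kinds of common-sense knowledge concrete, here is a minimal toy sketch of how a belief over candidate locations for a target object could be reweighted using (1) an LLM-derived location prior and (2) a co-occurrence cue triggered by an unexpectedly observed, task-irrelevant object. This is not the authors' implementation; all object names, locations, numeric priors, and the `co_occurrence` likelihood-ratio table are illustrative assumptions.

```python
def normalize(belief):
    """Rescale a dict of location probabilities so they sum to 1."""
    total = sum(belief.values())
    return {loc: p / total for loc, p in belief.items()}

# (1) Location prior: e.g. an LLM might judge a mug as most likely
# to be in the cabinet. (Hypothetical numbers.)
location_prior = {"cabinet": 0.6, "counter": 0.3, "floor": 0.1}

# (2) Co-occurrence: seeing a plate on the counter raises the chance
# that a similar object (a mug) is co-located there. Values are
# likelihood ratios: > 1 for similar objects, < 1 for dissimilar ones.
co_occurrence = {("plate", "mug"): 2.0, ("shoe", "mug"): 0.5}

def update_belief(belief, observed_obj, observed_loc, target_obj):
    """Reweight the belief after observing a task-irrelevant object."""
    ratio = co_occurrence.get((observed_obj, target_obj), 1.0)
    belief = dict(belief)
    belief[observed_loc] *= ratio
    return normalize(belief)

belief = normalize(dict(location_prior))
# Robot unexpectedly sees a plate on the counter while looking for a mug:
belief = update_belief(belief, "plate", "counter", "mug")
# The counter's probability rises above its prior of 0.3.
```

A planner could use such a reshaped belief to visit high-probability locations first, which is the intuition behind the reported reduction in planning and execution time.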