🤖 AI Summary
In task and motion planning (TAMP), high-level task plans often fail to refine into feasible continuous motions because of abstraction mismatches in the symbolic model, leading to repeated replanning and inefficiency. To address this, we propose a risk-aware pre-planning mechanism that leverages pretrained vision-language models (VLMs): early in the planning process, the VLM performs cross-modal (image + text) commonsense reasoning over task actions to identify plan structures that are inherently unrefinable into feasible motion, and automatically generates kinematic and geometric constraints that prune the infeasible portions of the search space. This is the first work to integrate VLMs into the *pre-planning* phase of TAMP, enabling early alignment between abstract task semantics and embodied feasibility. Evaluated on two complex TAMP benchmarks, our method reduces average planning time by 42%, eliminates refinement failures entirely in several cases, and generalizes well across unseen objects and configurations.
📝 Abstract
In task and motion planning, high-level task planning is done over an abstraction of the world to enable efficient search in long-horizon robotics problems. However, the feasibility of these task-level plans relies on the downward refinability of the abstraction into continuous motion. When a domain's refinability is poor, task-level plans that appear valid may ultimately fail during motion planning, requiring replanning and resulting in slower overall performance. Prior works mitigate this by encoding refinement issues as constraints to prune infeasible task plans. However, these approaches only add constraints upon refinement failure, expending significant search effort on infeasible branches. We propose VIZ-COAST, a method that leverages the common-sense spatial reasoning of large pretrained Vision-Language Models to identify issues with downward refinement a priori, bypassing the need to fix these failures during planning. Experiments on two challenging TAMP domains show that our approach extracts plausible constraints from images and domain descriptions, drastically reducing planning times and, in some cases, eliminating downward refinement failures altogether, while generalizing to a diverse range of instances from the broader domain.
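To make the pre-planning idea concrete, here is a minimal, hypothetical sketch of the loop the abstract describes: query a VLM once before task planning for kinematic/geometric infeasibilities, then use the returned constraints to prune task plans that could never refine into motion. All names (`query_vlm_for_constraints`, `Constraint`, the example domain) are illustrative stand-ins, not the paper's actual API; the real system would call a pretrained VLM on the scene image and domain description.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Constraint:
    """A known-infeasible (action, objects) pairing, e.g. placing a mug
    inside a drawer that is closed. frozen=True makes instances hashable
    so they can live in a set."""
    action: str
    objects: tuple

def query_vlm_for_constraints(image_desc: str, domain_desc: str) -> set:
    """Stand-in for the pre-planning VLM query. In the real method, a
    pretrained VLM reasons over the image and domain text; here we
    hard-code one plausible constraint for illustration."""
    return {Constraint("place-inside", ("mug", "closed-drawer"))}

def violates(plan, constraints) -> bool:
    """A task plan is pruned if any step matches an a-priori constraint."""
    return any(Constraint(action, objects) in constraints
               for action, objects in plan)

def prune_task_plans(plans, constraints):
    """Discard task plans before refinement is ever attempted."""
    return [p for p in plans if not violates(p, constraints)]

constraints = query_vlm_for_constraints("tabletop scene", "kitchen domain")
plans = [
    # Symbolically valid, but geometrically unrefinable: drawer is closed.
    [("pick", ("mug",)), ("place-inside", ("mug", "closed-drawer"))],
    # Feasible: open the drawer first.
    [("pick", ("mug",)),
     ("open", ("drawer",)),
     ("place-inside", ("mug", "open-drawer"))],
]
feasible = prune_task_plans(plans, constraints)
```

The contrast with prior work is in *when* `constraints` is populated: failure-driven approaches would only learn the first plan is infeasible after motion planning fails on it, whereas here the set exists before search begins.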