Using VLM Reasoning to Constrain Task and Motion Planning

📅 2025-10-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
In task and motion planning (TAMP), high-level task plans often fail to refine into feasible continuous motions because of abstraction mismatches in the symbolic model, leading to repeated replanning and inefficiency. To address this, we propose a pre-planning, risk-aware mechanism that leverages pretrained vision-language models (VLMs): early in the planning process, the VLM performs cross-modal (image + text) commonsense reasoning over task actions to identify plan structures that cannot be refined into feasible motion, and automatically generates kinematic and geometric constraints that prune the infeasible portions of the search space. This is the first work to integrate VLMs into the *pre-planning* phase of TAMP, enabling early alignment between abstract task semantics and embodied feasibility. Evaluated on two complex TAMP benchmarks, our method reduces average planning time by 42%, eliminates refinement failures entirely in several cases, and generalizes strongly to unseen objects and configurations.

📝 Abstract
In task and motion planning, high-level task planning is done over an abstraction of the world to enable efficient search in long-horizon robotics problems. However, the feasibility of these task-level plans relies on the downward refinability of the abstraction into continuous motion. When a domain's refinability is poor, task-level plans that appear valid may ultimately fail during motion planning, requiring replanning and resulting in slower overall performance. Prior works mitigate this by encoding refinement issues as constraints to prune infeasible task plans. However, these approaches only add constraints upon refinement failure, expending significant search effort on infeasible branches. We propose VIZ-COAST, a method of leveraging the common-sense spatial reasoning of large pretrained Vision-Language Models to identify issues with downward refinement a priori, bypassing the need to fix these failures during planning. Experiments on two challenging TAMP domains show that our approach is able to extract plausible constraints from images and domain descriptions, drastically reducing planning times and, in some cases, eliminating downward refinement failures altogether, generalizing to a diverse range of instances from the broader domain.
Problem

Research questions and friction points this paper is trying to address.

Identifying infeasible task plans before motion planning execution
Reducing replanning delays caused by poor downward refinability
Leveraging Vision-Language Models for spatial constraint prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Vision-Language Models for spatial reasoning
Identifies refinement issues before planning starts
Extracts constraints from images and descriptions
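The pipeline implied by these points — query a VLM about the scene before planning, parse its answer into symbolic constraints, and hand those constraints to the task planner — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the constraint syntax, the `query_vlm` stub (which returns a canned answer so the sketch runs offline), and all function names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    """A symbolic constraint ruling out an unrefinable action grounding,
    e.g. not-reachable(block3) or no-stack(plate, mug)."""
    predicate: str
    args: tuple


def query_vlm(image, domain_description):
    """Stand-in for a call to a pretrained vision-language model.
    A real system would send the scene image and domain description
    and receive free-text spatial reasoning; here we return a canned
    answer in a hypothetical 'pred(args); ...' format."""
    return "not-reachable(block3); no-stack(plate, mug)"


def parse_constraints(vlm_answer):
    """Parse 'pred(a, b); ...' clauses into Constraint objects."""
    constraints = []
    for clause in vlm_answer.split(";"):
        clause = clause.strip()
        if not clause:
            continue
        pred, rest = clause.split("(", 1)
        args = tuple(a.strip() for a in rest.rstrip(")").split(","))
        constraints.append(Constraint(pred.strip(), args))
    return constraints


def preplan_constraints(image, domain_description):
    """Pre-planning step: ask the VLM which groundings look infeasible,
    so the task planner can prune those branches before any motion
    planning is attempted."""
    return parse_constraints(query_vlm(image, domain_description))


constraints = preplan_constraints(image=None, domain_description="blocksworld")
```

The key design point, as the abstract describes, is that these constraints are produced *a priori* rather than discovered through refinement failures, so the planner never expends search effort on branches the VLM has already flagged.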
Muyang Yan
Department of Computer Science, Purdue University, West Lafayette, IN 47907, USA
Miras Mengdibayev
Department of Computer Science, Purdue University, West Lafayette, IN 47907, USA
Ardon Floros
Department of Computer Science, Purdue University, West Lafayette, IN 47907, USA
Weihang Guo
Department of Computer Science, Rice University, Houston, TX 77005, USA
Lydia E. Kavraki
Rice University
Robotics, AI, Bioinformatics, Algorithms
Zachary Kingston
Assistant Professor of Computer Science, Purdue University
Robotics, Motion Planning, Manipulation Planning