🤖 AI Summary
In task and motion planning (TAMP), high-level task plans often fail to refine into feasible continuous motions because of abstraction mismatches in the symbolic model, leading to repeated replanning and inefficiency. To address this, we propose a risk-aware pre-planning mechanism that leverages pretrained vision-language models (VLMs): early in the planning process, the VLM performs cross-modal (image + text) commonsense reasoning over task actions to identify plan structures that are inherently unrefinable into feasible motion, and automatically generates kinematic and geometric constraints that prune the infeasible portions of the search space. This is the first work to integrate VLMs into the *pre-planning* phase of TAMP, enabling early alignment between abstract task semantics and embodied feasibility. Evaluated on two complex TAMP benchmarks, our method reduces average planning time by 42%, eliminates refinement failures entirely in several cases, and generalizes well across unseen objects and configurations.
📝 Abstract
In task and motion planning, high-level task planning is done over an abstraction of the world to enable efficient search in long-horizon robotics problems. However, the feasibility of these task-level plans relies on the downward refinability of the abstraction into continuous motion. When a domain's refinability is poor, task-level plans that appear valid may ultimately fail during motion planning, requiring replanning and resulting in slower overall performance. Prior works mitigate this by encoding refinement issues as constraints to prune infeasible task plans. However, these approaches only add constraints upon refinement failure, expending significant search effort on infeasible branches. We propose VIZ-COAST, a method that leverages the common-sense spatial reasoning of large pretrained Vision-Language Models to identify issues with downward refinement a priori, bypassing the need to fix these failures during planning. Experiments on two challenging TAMP domains show that our approach extracts plausible constraints from images and domain descriptions, drastically reducing planning times and, in some cases, eliminating downward refinement failures altogether, while generalizing to a diverse range of instances from the broader domain.
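To make the pre-planning idea concrete, here is a minimal, hypothetical sketch of the loop the abstract describes: query a VLM once before task planning for kinematic/geometric infeasibilities, then use the returned constraints to prune task plans that could never refine into motion. All names (`query_vlm_for_constraints`, `Constraint`, the example domain) are illustrative stand-ins, not the paper's actual API; the real system would call a pretrained VLM on the scene image and domain description.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Constraint:
    """A known-infeasible (action, objects) pairing, e.g. placing a mug
    inside a drawer that is closed. frozen=True makes instances hashable
    so they can live in a set."""
    action: str
    objects: tuple

def query_vlm_for_constraints(image_desc: str, domain_desc: str) -> set:
    """Stand-in for the pre-planning VLM query. In the real method, a
    pretrained VLM reasons over the image and domain text; here we
    hard-code one plausible constraint for illustration."""
    return {Constraint("place-inside", ("mug", "closed-drawer"))}

def violates(plan, constraints) -> bool:
    """A task plan is pruned if any step matches an a-priori constraint."""
    return any(Constraint(action, objects) in constraints
               for action, objects in plan)

def prune_task_plans(plans, constraints):
    """Discard task plans before refinement is ever attempted."""
    return [p for p in plans if not violates(p, constraints)]

constraints = query_vlm_for_constraints("tabletop scene", "kitchen domain")
plans = [
    # Symbolically valid, but geometrically unrefinable: drawer is closed.
    [("pick", ("mug",)), ("place-inside", ("mug", "closed-drawer"))],
    # Feasible: open the drawer first.
    [("pick", ("mug",)),
     ("open", ("drawer",)),
     ("place-inside", ("mug", "open-drawer"))],
]
feasible = prune_task_plans(plans, constraints)
```

The contrast with prior work is in *when* `constraints` is populated: failure-driven approaches would only learn the first plan is infeasible after motion planning fails on it, whereas here the set exists before search begins.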