🤖 AI Summary
This work addresses a critical limitation of existing vision-language models in task planning: they often neglect spatial executability and thus fail to guide real-world robotic manipulation. To bridge this gap, we introduce a novel task termed "spatially grounded long-horizon task planning," establish a benchmark dataset named GroundedPlanBench, and propose the Video-to-Spatially Grounded Planning (V2GP) framework. V2GP leverages real robot demonstration videos to automatically generate hierarchical planning data annotated with spatial grounding, enabling joint optimization of high-level action sequences and low-level spatial interaction points. Experimental results demonstrate that V2GP significantly enhances the spatial executability of generated plans on both GroundedPlanBench and physical robot platforms, advancing task planning toward practical deployment in real-world environments.
📝 Abstract
Recent advances in robot manipulation increasingly leverage Vision-Language Models (VLMs) for high-level reasoning, such as decomposing task instructions into sequential action plans expressed in natural language that guide downstream low-level motor execution. However, current benchmarks do not assess whether these plans are spatially executable, in particular whether they specify the exact spatial locations where the robot should interact to carry out each step, which limits evaluation of real-world manipulation capability. To bridge this gap, we define a novel task of grounded planning and introduce GroundedPlanBench, a newly curated benchmark for spatially grounded long-horizon action planning in the wild. GroundedPlanBench jointly evaluates hierarchical sub-action planning and spatial action grounding (where to act), enabling systematic assessment of whether generated sub-actions are spatially executable for robot manipulation. We further introduce Video-to-Spatially Grounded Planning (V2GP), an automated data generation framework that leverages real-world robot video demonstrations to improve spatially grounded long-horizon planning. Our evaluations reveal that spatially grounded long-horizon planning remains a major bottleneck for current VLMs. Our results demonstrate that V2GP provides a promising approach for improving both action planning and spatial grounding performance, validated on our benchmark as well as through real-world robot manipulation experiments, advancing progress toward spatially actionable planning.
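The abstract does not specify a concrete data format; purely as an illustrative sketch, a "spatially grounded plan" of the kind described (hierarchical sub-actions paired with where-to-act points) might look like the following, where all class and field names are hypothetical and not taken from the paper:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GroundedStep:
    """One sub-action paired with the 2D image point where the robot should act."""
    sub_action: str  # natural-language sub-action, e.g. "grasp the mug handle"
    interaction_point: Tuple[float, float]  # normalized (x, y) in [0, 1] on the input image


@dataclass
class GroundedPlan:
    """A long-horizon plan: an instruction decomposed into grounded sub-actions."""
    instruction: str
    steps: List[GroundedStep]


def is_spatially_grounded(plan: GroundedPlan) -> bool:
    """Check that every sub-action carries an interaction point inside the image."""
    return all(
        0.0 <= x <= 1.0 and 0.0 <= y <= 1.0
        for step in plan.steps
        for (x, y) in [step.interaction_point]
    )


plan = GroundedPlan(
    instruction="put the mug on the shelf",
    steps=[
        GroundedStep("grasp the mug handle", (0.42, 0.63)),
        GroundedStep("place the mug on the shelf", (0.55, 0.21)),
    ],
)
print(is_spatially_grounded(plan))
```

The point of the sketch is the pairing itself: a plan that emits only the natural-language sub-actions would pass a text-planning benchmark but fail the spatial-executability check above.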