NovaPlan: Zero-Shot Long-Horizon Manipulation via Closed-Loop Video Language Planning

📅 2026-02-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a training-free, zero-shot hierarchical framework to address the disconnect between high-level semantic reasoning and low-level execution in long-horizon robotic manipulation. The approach uniquely integrates closed-loop vision-language model (VLM) planning with video generation, leveraging hand poses and object keypoints from generated videos as motion priors. Precise execution is achieved through geometric constraints, while a switching mechanism handles occlusions or depth estimation errors and enables runtime failure detection and autonomous replanning. Experiments on three long-horizon tasks and the Functional Manipulation Benchmark demonstrate that the system robustly performs complex assembly operations, exhibiting strong zero-shot generalization and agile error recovery capabilities.

📝 Abstract
Solving long-horizon tasks requires robots to integrate high-level semantic reasoning with low-level physical interaction. While vision-language models (VLMs) and video generation models can decompose tasks and imagine outcomes, they often lack the physical grounding necessary for real-world execution. We introduce NovaPlan, a hierarchical framework that unifies closed-loop VLM and video planning with geometrically grounded robot execution for zero-shot long-horizon manipulation. At the high level, a VLM planner decomposes tasks into sub-goals and monitors robot execution in a closed loop, enabling the system to recover from single-step failures through autonomous re-planning. To compute low-level robot actions, we extract and utilize both task-relevant object keypoints and human hand poses as kinematic priors from the generated videos, and employ a switching mechanism to choose the better one as a reference for robot actions, maintaining stable execution even under heavy occlusion or depth inaccuracy. We demonstrate the effectiveness of NovaPlan on three long-horizon tasks and the Functional Manipulation Benchmark (FMB). Our results show that NovaPlan can perform complex assembly tasks and exhibit dexterous error recovery behaviors without any prior demonstrations or training. Project page: https://nova-plan.github.io/
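The closed-loop structure described in the abstract can be sketched as follows. This is an illustrative outline only, not the authors' code: the callables passed in (`plan_subgoals`, `ground_subgoal`, `execute`, `verify`) are hypothetical stand-ins for the VLM planner, video-generation grounding, robot controller, and VLM monitor, and `max_replans` is an assumed cap on autonomous re-planning attempts.

```python
# Illustrative sketch (not the authors' implementation) of NovaPlan's
# closed loop: a VLM decomposes the task into sub-goals, each sub-goal is
# grounded via kinematic priors extracted from a generated video (hand
# poses or object keypoints, chosen by a switching mechanism), and a
# failed sub-goal triggers autonomous re-planning.

def novaplan_loop(plan_subgoals, ground_subgoal, execute, verify,
                  max_replans=3):
    """Run sub-goals in a closed loop with autonomous re-planning."""
    subgoals = plan_subgoals()          # high-level VLM decomposition
    replans, i = 0, 0
    while i < len(subgoals):
        hand_prior, keypoint_prior = ground_subgoal(subgoals[i])
        # switching mechanism: fall back to object keypoints when the
        # hand-pose prior is unusable (e.g. occlusion, bad depth)
        prior = hand_prior if hand_prior is not None else keypoint_prior
        execute(subgoals[i], prior)     # geometrically grounded execution
        if verify(subgoals[i]):         # VLM monitors the outcome
            i += 1                      # advance to the next sub-goal
        elif replans < max_replans:
            replans += 1
            subgoals = plan_subgoals()  # autonomous re-planning
            i = 0
        else:
            return False                # give up after repeated failures
    return True
```

Plugging in trivial stubs for the four components exercises the loop end to end; in the paper's setting each stub would instead call the VLM or video model.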
Problem

Research questions and friction points this paper is trying to address.

long-horizon manipulation
zero-shot planning
physical grounding
closed-loop execution
vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot manipulation
closed-loop planning
vision-language models
kinematic priors
long-horizon tasks
👥 Authors

Jiahui Fu
Research Scientist, Boston Dynamics AI Institute
SLAM, Perception, Robotics, Energy Systems

Junyu Nan
Robotics and AI Institute, Carnegie Mellon University; Brown University

Lingfeng Sun
RAI Institute
Robotics, Autonomous Driving

Hongyu Li
Robotics and AI Institute, Carnegie Mellon University; University of Pennsylvania

Jianing Qian
University of Pennsylvania

Jennifer L. Barry
Robotics and AI Institute, Carnegie Mellon University

Kris Kitani
Carnegie Mellon University, Meta FAIR
Computer Vision, AI, Machine Learning

George Konidaris
University of Pennsylvania