Goal2Skill: Long-Horizon Manipulation with Adaptive Planning and Reflection

📅 2026-04-15

🤖 AI Summary
This work addresses a limitation of existing vision-language-action (VLA) systems: in long-horizon, partially observable, multi-stage manipulation tasks, they often lack persistent memory and the ability to recover from errors. The authors propose a dual-system framework that decouples high-level semantic reasoning from low-level motor execution. The high-level planner, a vision-language model (VLM) based agentic module, maintains structured task memory, decomposes goals, verifies outcomes, and performs error-driven correction. The low-level executor, a VLA-based visuomotor controller, carries out each sub-task through diffusion-based action generation conditioned on geometry-preserving filtered observations. The two systems form a closed loop between planning and execution that supports adaptive replanning and online recovery. On representative RMBench tasks, the method achieves a 32.4% average success rate versus 9.8% for the strongest baseline, and ablation studies confirm the importance of structured memory and closed-loop recovery.

📝 Abstract
Recent vision-language-action (VLA) systems have demonstrated strong capabilities in embodied manipulation. However, most existing VLA policies rely on limited observation windows and end-to-end action prediction, which makes them brittle in long-horizon, memory-dependent tasks with partial observability, occlusions, and multi-stage dependencies. Such tasks require not only precise visuomotor control, but also persistent memory, adaptive task decomposition, and explicit recovery from execution failures. To address these limitations, we propose a dual-system framework for long-horizon embodied manipulation. Our framework explicitly separates high-level semantic reasoning from low-level motor execution. A high-level planner, implemented as a VLM-based agentic module, maintains structured task memory and performs goal decomposition, outcome verification, and error-driven correction. A low-level executor, instantiated as a VLA-based visuomotor controller, carries out each sub-task through diffusion-based action generation conditioned on geometry-preserving filtered observations. Together, the two systems form a closed loop between planning and execution, enabling memory-aware reasoning, adaptive replanning, and robust online recovery. Experiments on representative RMBench tasks show that the proposed framework substantially outperforms representative baselines, achieving a 32.4% average success rate compared with 9.8% for the strongest baseline. Ablation studies further confirm the importance of structured memory and closed-loop recovery for long-horizon manipulation.
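The closed loop between planning and execution described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `TaskMemory`, `run_closed_loop`, `succeeded`, and the stub callables are hypothetical names introduced here, and the stubs merely stand in for the VLM-based planner (decompose, verify, correct) and the VLA-based diffusion executor (execute). The retry-on-failure correction policy is likewise an assumption chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class TaskMemory:
    """Hypothetical structured task memory: goal, pending sub-tasks, outcome log."""
    goal: str
    pending: List[str] = field(default_factory=list)
    history: List[Tuple[str, bool]] = field(default_factory=list)


def run_closed_loop(
    goal: str,
    decompose: Callable[[str], List[str]],
    execute: Callable[[str], Any],
    verify: Callable[[str, Any], bool],
    correct: Callable[[str, Any, TaskMemory], List[str]],
    max_steps: int = 20,
) -> TaskMemory:
    """Alternate between planning and execution until the plan is empty."""
    memory = TaskMemory(goal=goal, pending=list(decompose(goal)))
    steps = 0
    while memory.pending and steps < max_steps:
        sub_task = memory.pending.pop(0)
        observation = execute(sub_task)      # low-level executor (stubbed)
        ok = verify(sub_task, observation)   # outcome verification (stubbed)
        memory.history.append((sub_task, ok))
        if not ok:
            # Error-driven correction: the planner proposes recovery sub-tasks,
            # which are placed ahead of the remaining plan.
            memory.pending = list(correct(sub_task, observation, memory)) + memory.pending
        steps += 1
    return memory


def succeeded(memory: TaskMemory) -> bool:
    """The run succeeded if nothing is pending and the last sub-task was verified."""
    return not memory.pending and (not memory.history or memory.history[-1][1])


# --- Stubs standing in for the VLM planner and the VLA executor (assumptions) ---
attempts: Counter = Counter()


def decompose(goal: str) -> List[str]:
    return ["pick block", "place block"]


def execute(sub_task: str) -> dict:
    attempts[sub_task] += 1
    return {"sub_task": sub_task, "attempt": attempts[sub_task]}


def verify(sub_task: str, observation: dict) -> bool:
    # Simulate a failed first grasp so that the recovery path is exercised.
    return not (sub_task == "pick block" and observation["attempt"] == 1)


def correct(sub_task: str, observation: dict, memory: TaskMemory) -> List[str]:
    return [sub_task]  # simplest possible recovery: retry the failed sub-task


result = run_closed_loop("stack the block", decompose, execute, verify, correct)
print(result.history)
print(succeeded(result))
```

In this sketch the outcome log in `TaskMemory.history` persists across sub-tasks, so a correction step can inspect earlier failures; the `max_steps` bound is an added safeguard against endless retries and is not taken from the paper.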
Problem

Research questions and friction points this paper is trying to address.

long-horizon manipulation
partial observability
memory-dependent tasks
multi-stage dependencies
execution failure recovery
Innovation

Methods, ideas, or system contributions that make the work stand out.

dual-system framework
structured task memory
adaptive replanning
closed-loop recovery
diffusion-based action generation
Zhen Liu
Beijing University of Posts and Telecommunications, InspireOmni AI
Xinyu Ning
Beijing University of Posts and Telecommunications, InspireOmni AI
Zhe Hu
InspireOmni AI
Xinxin Xie
Beijing University of Posts and Telecommunications
Weize Li
Beijing University of Posts and Telecommunications
Zhipeng Tang
UMass Amherst
Chongyu Wang
Florida State University
Investment, Real Estate, Sustainability
Zejun Yang
InspireOmni AI
Hanlin Wang
HKUST
Computer vision, Video understanding
Yitong Liu
Beijing University of Posts and Telecommunications
Zhongzhu Pu
InspireOmni AI, Tsinghua University