Long-Horizon Manipulation via Trace-Conditioned VLA Planning

📅 2026-04-23

🤖 AI Summary
Existing vision-language-action (VLA) policies struggle with long-horizon manipulation because steps depend strongly on one another and execution errors accumulate. This work proposes LoHo-Manip, a framework that decouples high-level planning from low-level execution: a progress-aware vision-language model (VLM), invoked in a receding-horizon fashion, predicts the remaining plan as a subtask sequence plus a 2D keypoint trajectory, and the VLA policy follows that trajectory for local control. A lightweight language memory (an explicit done/remaining split of the subtasks) and a visual trace prompt give the system an implicit closed loop: because the remaining task is replanned at each step, failed steps stay in the plan and trajectories are updated without handcrafted recovery logic. Experiments in simulation and on a real Franka robot report gains in success rate, robustness, and out-of-distribution generalization on long-horizon tasks.

📝 Abstract
Long-horizon manipulation remains challenging for vision-language-action (VLA) policies: real tasks are multi-step, progress-dependent, and brittle to compounding execution errors. We present LoHo-Manip, a modular framework that scales short-horizon VLA execution to long-horizon instruction following via a dedicated task-management VLM. The manager is decoupled from the executor and is invoked in a receding-horizon manner: given the current observation, it predicts a progress-aware remaining plan that combines (i) a subtask sequence with an explicit done + remaining split as lightweight language memory, and (ii) a visual trace -- a compact 2D keypoint trajectory prompt specifying where to go and what to approach next. The executor VLA is adapted to condition on the rendered trace, thereby turning long-horizon decision-making into repeated local control by following the trace. Crucially, predicting the remaining plan at each step yields an implicit closed loop: failed steps persist in subsequent outputs, and traces update accordingly, enabling automatic continuation and replanning without hand-crafted recovery logic or brittle visual-history buffers. Extensive experiments spanning embodied planning, long-horizon reasoning, trajectory prediction, and end-to-end manipulation in simulation and on a real Franka robot demonstrate strong gains in long-horizon success, robustness, and out-of-distribution generalization. Project page: https://www.liuisabella.com/LoHoManip
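
The receding-horizon loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `RemainingPlan`, `run_receding_horizon`, `ToyWorld`, and `toy_manager` are hypothetical names, the "manager" is a rule-based stand-in for the task-management VLM, and the "executor" is a stub that fails once on a chosen subtask. The only point is to show how re-predicting the done/remaining split from the current observation makes a failed step persist in the next plan without any explicit recovery branch.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Keypoint = Tuple[float, float]  # 2D image-plane coordinate (assumed normalized)
Observation = Tuple[str, ...]   # toy observation: subtasks visibly completed


@dataclass
class RemainingPlan:
    """Hypothetical container for one manager output."""
    done: List[str]        # subtasks judged complete (language memory)
    remaining: List[str]   # subtasks still to do, in order
    trace: List[Keypoint]  # 2D keypoint trajectory for the next subtask


def run_receding_horizon(
    manager: Callable[[Observation], RemainingPlan],
    executor: Callable[[Observation, str, List[Keypoint]], None],
    observe: Callable[[], Observation],
    max_steps: int = 20,
) -> Tuple[bool, List[RemainingPlan]]:
    """Replan from the current observation before every local execution."""
    history: List[RemainingPlan] = []
    for _ in range(max_steps):
        obs = observe()
        plan = manager(obs)  # fresh done/remaining split each step
        history.append(plan)
        if not plan.remaining:
            return True, history
        # The executor only sees the next subtask and its trace.
        executor(obs, plan.remaining[0], plan.trace)
    return False, history


TASK = ["open drawer", "pick block", "place block in drawer"]


class ToyWorld:
    """Stand-in environment: records completed subtasks, fails once on some."""

    def __init__(self, fail_once_on: List[str]) -> None:
        self.completed: List[str] = []
        self.fail_once_on = set(fail_once_on)

    def observe(self) -> Observation:
        return tuple(self.completed)

    def execute(self, obs: Observation, subtask: str, trace: List[Keypoint]) -> None:
        if subtask in self.fail_once_on:
            self.fail_once_on.discard(subtask)
            return  # simulated execution failure: world state is unchanged
        self.completed.append(subtask)


def toy_manager(obs: Observation) -> RemainingPlan:
    """Rule-based placeholder for the progress-aware VLM."""
    done = [s for s in TASK if s in obs]
    remaining = [s for s in TASK if s not in obs]
    # Placeholder two-point trace; a real manager would predict keypoints.
    trace = [(0.0, 0.5), (0.25 * (len(done) + 1), 0.5)] if remaining else []
    return RemainingPlan(done, remaining, trace)


world = ToyWorld(fail_once_on=["pick block"])
success, history = run_receding_horizon(toy_manager, world.execute, world.observe)

for step, plan in enumerate(history):
    print(step, "done:", plan.done, "| remaining:", plan.remaining)
```

In this toy run the failed "pick block" attempt leaves the observation unchanged, so the next manager call returns the same remaining list and the step is retried. The loop itself contains no failure handling, which is the sense in which the closed loop is implicit.
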

Problem

Research questions and friction points this paper is trying to address.

long-horizon manipulation
vision-language-action
multi-step tasks
execution errors
progress-dependent

Innovation

Methods, ideas, or system contributions that make the work stand out.

long-horizon manipulation
vision-language-action (VLA)
visual trace
receding-horizon planning
closed-loop replanning