🤖 AI Summary
In long-horizon robotic manipulation tasks, erroneous detection of subtask completion often triggers cascading failures, and existing vision-language-action (VLA) models lack intrinsic awareness of task completion state. To address this, we propose a dual-head VLA model explicitly endowed with subtask completion perception: one head generates action commands, while the other, a lightweight detection head, estimates completion status in real time. Our design couples the two heads cooperatively and compares joint and sequential fine-tuning strategies, with either a fully fine-tuned or a frozen backbone, enabling, for the first time, end-to-end completion-state modeling within a VLA framework. Evaluated on salad packing and candy packing tasks, our method significantly improves sequential task success rates and eliminates downstream failures caused by completion misjudgment. The results demonstrate that explicit internal completion perception is critical for robust long-horizon manipulation.
📝 Abstract
Long-horizon robotic manipulation tasks require executing multiple interdependent subtasks in strict sequence, where errors in detecting subtask completion can cascade into downstream failures. Existing Vision-Language-Action (VLA) models such as $\pi_0$ excel at continuous low-level control but lack an internal signal for identifying when a subtask has finished, making them brittle in sequential settings. We propose SeqVLA, a completion-aware extension of $\pi_0$ that augments the base architecture with a lightweight detection head that predicts whether the current subtask is complete. This dual-head design enables SeqVLA not only to generate manipulation actions but also to autonomously trigger transitions between subtasks. We investigate four finetuning strategies that vary in how the action and detection heads are optimized (joint vs. sequential finetuning) and how pretrained knowledge is preserved (full finetuning vs. frozen backbone). Experiments are performed on two multi-stage tasks: salad packing with seven distinct subtasks and candy packing with four distinct subtasks. Results show that SeqVLA significantly outperforms the baseline $\pi_0$ and other strong baselines in overall success rate. In particular, joint finetuning with an unfrozen backbone yields the most decisive and statistically reliable completion predictions, eliminating sequence-related failures and enabling robust long-horizon execution. Our results highlight the importance of coupling action generation with subtask-aware detection for scalable sequential manipulation.
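To make the dual-head control loop concrete, here is a minimal sketch of how a completion-aware policy could drive subtask transitions. All names and interfaces below (`DualHeadOutput`, `run_sequence`, the threshold value, and the toy policy) are illustrative assumptions, not the paper's actual implementation; the real model is a learned extension of $\pi_0$, abstracted here behind a callable.

```python
# Hypothetical sketch of completion-aware sequencing: the action head's output
# drives the robot, while the detection head's probability triggers the switch
# to the next subtask. Interfaces are assumptions, not the paper's code.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class DualHeadOutput:
    action: List[float]      # low-level action command from the action head
    completion_prob: float   # detection head's estimate that the subtask is done

def run_sequence(policy: Callable[[str, int], DualHeadOutput],
                 subtasks: List[str],
                 threshold: float = 0.9,
                 max_steps: int = 100) -> bool:
    """Execute subtasks in strict order; transitions are triggered internally."""
    for subtask in subtasks:
        done = False
        for step in range(max_steps):
            out = policy(subtask, step)
            # (apply out.action to the robot here)
            if out.completion_prob >= threshold:
                done = True   # detection head signals subtask completion
                break
        if not done:
            return False      # subtask timed out -> sequence failure
    return True

# Toy stand-in policy: reports completion after a fixed number of steps.
def toy_policy(subtask: str, step: int) -> DualHeadOutput:
    return DualHeadOutput(action=[0.0],
                          completion_prob=1.0 if step >= 3 else 0.1)

print(run_sequence(toy_policy, ["pick lettuce", "place in bowl"]))  # → True
```

The point of the sketch is the control-flow coupling: because the completion signal comes from the model itself rather than an external detector, a misjudged completion (a premature or missed transition) is exactly the failure mode the detection head is trained to eliminate.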