🤖 AI Summary
In long-horizon vision-language-action (VLA) tasks, existing VLM-based planners rely on manual annotations or heuristic rules to decompose tasks into sub-tasks, creating a distributional mismatch with the training data of the underlying visuomotor policy and degrading performance. To address this, we propose a retrieval-augmented demonstration decomposition method that automatically aligns sub-tasks with the policy's visual feature distribution by matching against the low-level policy's training data, introducing retrieval into hierarchical task decomposition for the first time, without manual annotation or predefined rules. Our approach integrates VLM-based planning, visual feature retrieval, trajectory alignment, and hierarchical decomposition. Evaluated in both simulation and real-world settings, it outperforms state-of-the-art methods, with significant gains in robustness and cross-scene generalization.
📝 Abstract
To tackle long-horizon tasks, recent hierarchical vision-language-action (VLA) frameworks employ vision-language model (VLM)-based planners to decompose complex manipulation tasks into simpler sub-tasks that low-level visuomotor policies can easily handle. Typically, the VLM planner is fine-tuned to learn to decompose a target task. This fine-tuning requires target-task demonstrations segmented into sub-tasks by either human annotation or heuristic rules. However, heuristically segmented sub-tasks can deviate significantly from the training data of the visuomotor policy, which degrades task performance. To address this issue, we propose a Retrieval-based Demonstration Decomposer (RDD) that automatically decomposes demonstrations into sub-tasks by aligning the visual features of the decomposed sub-task intervals with those from the training data of the low-level visuomotor policies. Our method outperforms the state-of-the-art sub-task decomposer on both simulation and real-world tasks, demonstrating robustness across diverse settings. Code and more results are available at rdd-neurips.github.io.
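The core idea, scoring candidate sub-task intervals by how closely their visual features match the low-level policy's training data, can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual algorithm: the function names, the cosine-similarity retrieval metric, and the brute-force single-boundary search are all assumptions for exposition (the real method decomposes full demonstrations into many intervals).

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two feature vectors (epsilon avoids div-by-zero).
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def interval_score(feats, start, end, bank):
    """Score the interval [start, end) of a demonstration by how well its
    mean visual feature matches its nearest neighbour in the low-level
    policy's training feature bank (retrieval step)."""
    mean_feat = feats[start:end].mean(axis=0)
    return max(cosine(mean_feat, b) for b in bank)

def best_split(feats, bank, min_len=2):
    """Brute-force a single sub-task boundary t that maximises the summed
    retrieval scores of the two resulting intervals. `feats` is a (T, D)
    array of per-frame visual features; `bank` is a list of D-dim features
    drawn from the visuomotor policy's training data."""
    T = len(feats)
    best_t, best_s = None, -np.inf
    for t in range(min_len, T - min_len + 1):
        s = interval_score(feats, 0, t, bank) + interval_score(feats, t, T, bank)
        if s > best_s:
            best_t, best_s = t, s
    return best_t, best_s
```

On a toy demonstration whose frames switch from one feature direction to another halfway through, with a bank containing both directions, the recovered boundary falls exactly at the transition, illustrating how retrieval aligns sub-task cuts with the policy's data rather than with hand-written rules.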