RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
In long-horizon vision-language-action (VLA) tasks, existing VLM-based planners rely on manual annotations or heuristic rules for sub-task decomposition, leading to a distributional mismatch with the underlying visuomotor policy's training data and degrading performance. To address this, we propose a retrieval-augmented demonstration decomposition method that automatically aligns sub-tasks with the policy's visual feature distribution by matching against the low-level policy's training data, introducing retrieval into hierarchical task decomposition for the first time, without manual annotation or predefined rules. Our approach integrates VLM-based planning, visual feature retrieval, trajectory alignment, and hierarchical decomposition. Evaluated in both simulation and real-world settings, it outperforms state-of-the-art methods, achieving significant gains in robustness and cross-scene generalization.

📝 Abstract
To tackle long-horizon tasks, recent hierarchical vision-language-action (VLA) frameworks employ vision-language model (VLM)-based planners to decompose complex manipulation tasks into simpler sub-tasks that low-level visuomotor policies can easily handle. Typically, the VLM planner is finetuned to learn to decompose a target task. This finetuning requires target-task demonstrations segmented into sub-tasks by either human annotation or heuristic rules. However, the heuristic sub-tasks can deviate significantly from the training data of the visuomotor policy, which degrades task performance. To address these issues, we propose a Retrieval-based Demonstration Decomposer (RDD) that automatically decomposes demonstrations into sub-tasks by aligning the visual features of the decomposed sub-task intervals with those from the training data of the low-level visuomotor policies. Our method outperforms the state-of-the-art sub-task decomposer on both simulation and real-world tasks, demonstrating robustness across diverse settings. Code and more results are available at rdd-neurips.github.io.
Problem

Research questions and friction points this paper is trying to address.

Finetuning VLM planners requires demonstrations segmented into sub-tasks by human annotation or heuristic rules
Heuristic sub-task segments can deviate from the visuomotor policy's training distribution
This distributional mismatch degrades performance on long-horizon hierarchical tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-based decomposition aligns sub-tasks with policy data
Automatically segments demonstrations using visual feature matching
Improves planner alignment without heuristic or manual annotation
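The retrieval-aligned decomposition summarized above can be sketched as a small dynamic program: score each candidate sub-task interval by how well its pooled visual features retrieve from the low-level policy's training-data feature bank, then choose the segmentation with the best total score. This is an illustrative sketch only; the function names, cosine-similarity retrieval, and DP formulation are assumptions, not the paper's actual algorithm.

```python
import numpy as np

def interval_score(feats, bank):
    # Retrieval step: cosine similarity between the interval's mean
    # feature and its best-matching feature in the training-data bank.
    q = feats.mean(axis=0)
    q = q / (np.linalg.norm(q) + 1e-8)
    b = bank / (np.linalg.norm(bank, axis=1, keepdims=True) + 1e-8)
    return float((b @ q).max())

def decompose(feats, bank, min_len=2):
    # Dynamic program over cut points: pick sub-task intervals whose
    # pooled visual features align best with the policy's training bank.
    # feats: (T, D) per-frame features of one demonstration.
    T = len(feats)
    best = np.full(T + 1, -np.inf)  # best[t] = best score for feats[:t]
    best[0] = 0.0
    back = np.zeros(T + 1, dtype=int)
    for t in range(min_len, T + 1):
        for s in range(t - min_len + 1):  # last interval is feats[s:t]
            if best[s] == -np.inf:
                continue
            cand = best[s] + interval_score(feats[s:t], bank)
            if cand > best[t]:
                best[t], back[t] = cand, s
    # Recover the chosen (start, end) intervals.
    cuts, t = [], T
    while t > 0:
        cuts.append((back[t], t))
        t = back[t]
    return cuts[::-1]
```

Because each well-matched interval contributes its own score, this toy objective favors many short segments; a real system would regularize the number of segments or normalize per-interval scores.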
Authors
Mingxuan Yan (University of California, Riverside)
Yuping Wang (University of California, Riverside; University of Michigan)
Zechun Liu (Meta AI)
Jiachen Li (University of California, Riverside)