🤖 AI Summary
Existing multi-task planning approaches rely heavily on task-specific expert demonstrations or hand-crafted reward functions, which limits their adaptability to novel tasks and makes them sensitive to data quality. To address this, we propose SODP, a two-stage diffusion-based planning framework. First, it pre-trains a diffusion planner on large-scale, task-agnostic, sub-optimal trajectory data to learn a generalizable trajectory distribution; second, it fine-tunes that planner with RL using only a small amount of task-specific reward data. This design removes the need for expert demonstrations and reduces the reward-labeled data required per task. Evaluated on the Meta-World and Adroit benchmarks, SODP outperforms state-of-the-art methods, demonstrating efficient cross-task adaptation from imperfect data.
📝 Abstract
Diffusion models have demonstrated their capability in modeling multi-task trajectories. However, existing multi-task planners or policies typically rely on task-specific demonstrations via multi-task imitation, or require task-specific reward labels to facilitate policy optimization via Reinforcement Learning (RL). Both approaches are costly because of the substantial human effort required to collect expert data or design reward functions. To address these challenges, we aim to develop a versatile diffusion planner that can leverage large-scale inferior data containing task-agnostic sub-optimal trajectories and adapt quickly to specific tasks. In this paper, we propose SODP, a two-stage framework that leverages Sub-Optimal data to learn a Diffusion Planner that generalizes to various downstream tasks. Specifically, in the pre-training stage, we train a foundation diffusion planner that extracts general planning capabilities by modeling the versatile distribution of multi-task trajectories, which can be sub-optimal and have wide data coverage. Then, for downstream tasks, we adopt RL-based fine-tuning with task-specific rewards to quickly refine the diffusion planner so that it generates action sequences with higher task-specific returns. Experimental results on multi-task domains including Meta-World and Adroit demonstrate that SODP outperforms state-of-the-art methods with only a small amount of data for reward-guided fine-tuning.
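To make the pre-training objective more concrete, the following is a minimal sketch of a standard DDPM-style forward noising step and noise-prediction loss applied to a flat action sequence. It is not the authors' implementation: the linear beta schedule, the number of diffusion steps, the function names, and the toy action values are all assumptions chosen for illustration, and the denoising network and the RL fine-tuning stage are omitted.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed, commonly used schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar[t] = product of (1 - beta[s]) for all s <= t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_trajectory(actions, t, alpha_bars, rng):
    """Sample a noisy sequence x_t from q(x_t | x_0); t is a 0-based index.

    Returns the noisy sequence and the Gaussian noise that produced it.
    """
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in actions]
    noisy = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e
             for x, e in zip(actions, eps)]
    return noisy, eps


def noise_prediction_loss(predicted_eps, eps):
    """Mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(predicted_eps, eps)) / len(eps)


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(100))

# Hypothetical 4-step, 1-D action sequence taken from sub-optimal data.
actions = [0.1, -0.3, 0.5, 0.2]
noisy, eps = noise_trajectory(actions, t=50, alpha_bars=alpha_bars, rng=rng)

# Stand-in for an untrained denoiser that always predicts zero noise.
loss_zero_pred = noise_prediction_loss([0.0] * len(eps), eps)
```

In a full pipeline, a learned network would replace the zero predictor and be trained to minimize this loss over many trajectories and diffusion steps. No task labels or rewards enter this objective, which is what allows task-agnostic sub-optimal data to be used at this stage.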