🤖 AI Summary
Diffusion-based planners in offline reinforcement learning suffer from a long-horizon generalization bottleneck, primarily because offline trajectory data lack sufficient quality and diversity. To address this, we propose SCoTS, a reward-free trajectory augmentation framework. SCoTS introduces a state-coverage-oriented trajectory stitching paradigm: it leverages temporal contrastive learning to construct latent representations that preserve temporal distances, and employs a direction-guided novelty metric to drive targeted exploration and the generation of diverse long-horizon trajectories. Crucially, SCoTS requires no reward signals and supports iterative segment retrieval and stitching. On multiple offline goal-conditioned benchmarks, SCoTS significantly enhances the long-horizon reasoning and task generalization of diffusion planners, and when integrated into mainstream offline RL algorithms it yields average performance improvements of 12–28%.
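To make the novelty metric concrete, here is a minimal sketch of how such a score could be computed in a temporal-distance-preserving latent space. The function name, the k-nearest-neighbor coverage term, the cosine directional term, and the weighting `alpha` are all illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def novelty_score(candidate_z, visited_z, direction, k=5, alpha=0.5):
    """Hypothetical novelty metric in latent space (illustrative only).

    Combines (1) a coverage term: mean distance from the candidate
    segment's latent endpoint to its k nearest already-visited latent
    states, and (2) a directional term: cosine alignment of the latent
    displacement with a target exploration direction.
    """
    # Coverage term: mean distance to the k nearest visited latents.
    dists = np.linalg.norm(visited_z - candidate_z, axis=1)
    coverage = np.sort(dists)[:k].mean()
    # Directional term: cosine alignment with the exploration direction.
    step = candidate_z - visited_z[-1]
    align = step @ direction / (
        np.linalg.norm(step) * np.linalg.norm(direction) + 1e-8
    )
    return alpha * coverage + (1 - alpha) * align
```

A candidate that lands far from previously visited latent states *and* moves along the chosen exploration direction scores highest, which is the behavior the summary attributes to the metric.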
📝 Abstract
Diffusion-based generative models are emerging as powerful tools for long-horizon planning in reinforcement learning (RL), particularly with offline datasets. However, their performance is fundamentally limited by the quality and diversity of the training data, which often restricts generalization to tasks outside the training distribution or to longer planning horizons. To overcome this challenge, we propose State-Covering Trajectory Stitching (SCoTS), a novel reward-free trajectory augmentation method that incrementally stitches together short trajectory segments, systematically generating diverse and extended trajectories. SCoTS first learns a temporal-distance-preserving latent representation that captures the underlying temporal structure of the environment, then iteratively stitches trajectory segments guided by directional exploration and novelty to effectively cover and expand this latent space. We demonstrate that SCoTS significantly improves the performance and generalization of diffusion planners on offline goal-conditioned benchmarks requiring stitching and long-horizon reasoning. Furthermore, trajectories augmented by SCoTS also improve widely used offline goal-conditioned RL algorithms across diverse environments.
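The learn-then-stitch procedure described above can be sketched as a simple greedy loop. This is a toy rendering under assumed interfaces: `segments` is a list of short state trajectories, `encode` maps states into a latent space where Euclidean distance approximates temporal distance, and `join_tol` and the min-distance novelty score are illustrative choices, not the paper's actual algorithm:

```python
import numpy as np

def stitch_trajectories(segments, encode, n_steps=10, join_tol=0.1):
    """Toy SCoTS-style stitching loop (illustrative assumptions only).

    Repeatedly extends a trajectory by appending a segment whose latent
    start lies close to the current latent endpoint (so the join is
    temporally plausible) and whose latent endpoint is most novel with
    respect to the states covered so far.
    """
    stitched = [segments[0]]                      # start from an arbitrary segment
    visited = [encode(s) for s in segments[0]]    # latent states covered so far
    for _ in range(n_steps):
        tail = visited[-1]
        best, best_score = None, -np.inf
        for seg in segments:
            z = np.stack([encode(s) for s in seg])
            # Joinable only if the segment starts near the current endpoint
            # in latent (i.e. approximate temporal) distance.
            if np.linalg.norm(z[0] - tail) > join_tol:
                continue
            # Novelty: distance of the segment's endpoint from already-covered states.
            score = min(np.linalg.norm(z[-1] - v) for v in visited)
            if score > best_score:
                best, best_score = seg, score
        if best is None:
            break                                  # no joinable segment remains
        stitched.append(best)
        visited.extend(encode(s) for s in best)
    return np.concatenate(stitched)
```

On a toy 1-D dataset with an identity encoder, the loop chains segments end-to-end, producing a longer trajectory than any single segment in the pool, which is the augmentation effect the abstract describes.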