🤖 AI Summary
To address the slow inference of diffusion-based policies in offline reinforcement learning, this paper proposes Reward-Aware Consistent Trajectory Distillation (RACTD), the first method to explicitly incorporate reward optimization into the single-stage distillation of consistency models, eliminating both reliance on multi-network co-training and dependence on suboptimal expert demonstrations. RACTD achieves high performance while drastically simplifying training: it requires only a single network and generates actions in one step. Evaluated on the Gym MuJoCo benchmark, RACTD outperforms existing state-of-the-art methods by 8.7% in task performance and accelerates inference by up to 142×. The approach thus delivers strong efficiency, architectural simplicity, and generalization across diverse tasks, all without sacrificing policy quality.
📝 Abstract
Although diffusion models have achieved strong results in decision-making tasks, their slow inference speed remains a key limitation. While consistency models offer a potential solution, their applications to decision-making often struggle with suboptimal demonstrations or rely on complex concurrent training of multiple networks. In this work, we propose a novel approach to consistency distillation for offline reinforcement learning that directly incorporates reward optimization into the distillation process. Our method enables single-step generation while maintaining higher performance and simpler training. Empirical evaluations on the Gym MuJoCo benchmarks and long-horizon planning tasks demonstrate that our approach achieves an 8.7% improvement over the previous state of the art while offering up to a 142× inference-time speedup over diffusion counterparts.
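To make the abstract's core idea concrete, here is a minimal toy sketch of what "incorporating reward optimization into the distillation process" could look like: a distillation loss between a one-step student and a diffusion-style teacher, weighted per-sample by reward. Every name below (the `teacher_denoise` stand-in, the linear `student`, the softmax-style reward weighting in `ractd_style_loss`) is a hypothetical illustration of the general recipe under stated assumptions, not the paper's actual loss, architecture, or algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_denoise(noisy_action, t):
    # Stand-in for a pretrained diffusion teacher: shrink the noisy action
    # proportionally to the noise level t (a real teacher is a trained network).
    return noisy_action * (1.0 - t)

def student(noisy_action, w):
    # One-step student: here just a scalar linear map with parameter w,
    # so that "single-step generation" is one function evaluation.
    return noisy_action * w

def ractd_style_loss(w, actions, rewards, t=0.5):
    # Hedged sketch of a reward-aware distillation objective: the usual
    # distillation MSE between student and teacher outputs, weighted
    # per-trajectory by normalized reward so that high-reward data
    # dominates the fit. This weighting scheme is an assumption for
    # illustration, not the paper's exact objective.
    noisy = actions + rng.normal(scale=t, size=actions.shape)
    target = teacher_denoise(noisy, t)
    pred = student(noisy, w)
    weights = np.exp(rewards - rewards.max())  # softmax-style reward weights
    weights /= weights.sum()
    per_sample_mse = np.mean((pred - target) ** 2, axis=-1)
    return float(np.sum(weights * per_sample_mse))

# Toy batch of 8 three-dimensional actions with scalar trajectory rewards.
actions = rng.normal(size=(8, 3))
rewards = rng.normal(size=8)
print(f"reward-weighted distillation loss: {ractd_style_loss(0.2, actions, rewards):.4f}")
```

At inference, only `student` would be called once per action, which is where the single-step speedup over multi-step diffusion sampling comes from.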