Accelerating Diffusion Models in Offline RL via Reward-Aware Consistency Trajectory Distillation

📅 2025-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the slow inference of diffusion-based policies in offline reinforcement learning, this paper proposes Reward-Aware Consistency Trajectory Distillation (RACTD), the first method to explicitly incorporate reward optimization into the single-stage distillation of consistency models, eliminating reliance on multi-network co-training and suboptimal expert demonstrations. RACTD requires only a single network and generates actions in one step, drastically simplifying training. On the Gym MuJoCo benchmark, RACTD outperforms existing state-of-the-art methods by 8.7% in task performance and accelerates inference by up to 142x. The approach thus combines efficiency, architectural simplicity, and strong generalization across diverse tasks without sacrificing policy quality.

📝 Abstract
Although diffusion models have achieved strong results in decision-making tasks, their slow inference speed remains a key limitation. While the consistency model offers a potential solution, its applications to decision-making often struggle with suboptimal demonstrations or rely on complex concurrent training of multiple networks. In this work, we propose a novel approach to consistency distillation for offline reinforcement learning that directly incorporates reward optimization into the distillation process. Our method enables single-step generation while maintaining higher performance and simpler training. Empirical evaluations on the Gym MuJoCo benchmarks and long-horizon planning tasks demonstrate that our approach achieves an 8.7% improvement over the previous state of the art while offering up to a 142x inference-time speedup over diffusion counterparts.
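The core idea, a distillation loss that couples a consistency objective with reward maximization, can be sketched in toy form. This is a minimal illustration under stated assumptions, not the paper's implementation: `student`, `teacher_step`, `q_value`, and the weight `eta` are hypothetical names, and all "networks" are linear stand-ins.

```python
import numpy as np

def student(x_t, t, w):
    """Toy consistency model: maps a noisy action x_t at time t to a
    predicted clean action (a linear map standing in for a network)."""
    return w * x_t / (1.0 + t)

def teacher_step(x_t, t, dt):
    """One step of a frozen teacher diffusion sampler, producing a
    slightly less noisy point x_{t-dt} (toy linear decay)."""
    return x_t * (1.0 - dt / (t + 1e-8))

def q_value(state, action):
    """Stand-in critic: rewards actions close to a target state (toy)."""
    return -np.sum((action - state) ** 2)

def ractd_loss(x_t, t, dt, state, w, eta=0.1):
    """Hypothetical reward-aware consistency-distillation objective:
    a self-consistency term plus a reward term on the one-step action."""
    # Consistency term: student outputs at adjacent points of the same
    # teacher trajectory should agree (the target side would be under
    # stop-gradient in a real implementation).
    x_prev = teacher_step(x_t, t, dt)
    pred = student(x_t, t, w)
    target = student(x_prev, t - dt, w)
    consistency = np.mean((pred - target) ** 2)
    # Reward term: push the single-step generated action toward high Q.
    reward = q_value(state, pred)
    return consistency - eta * reward

x_t = np.array([0.5, -0.3, 0.2, 0.1])   # noisy action sample
loss = ractd_loss(x_t, t=1.0, dt=0.1, state=np.zeros(4), w=1.0)
```

In this sketch, the single loss trains one network end to end, which is what removes the need for concurrent training of separate actor, critic-guidance, and distillation networks.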
Problem

Research questions and friction points this paper is trying to address.

Slow inference speed of diffusion-based policies in offline RL
Reliance on suboptimal demonstrations when applying consistency models to decision-making
Complex concurrent training of multiple networks in existing distillation approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reward-aware consistency trajectory distillation for offline RL
Single-step action generation with higher performance and simpler training
Up to 142x inference speedup over diffusion counterparts