🤖 AI Summary
To address the slow inference of diffusion-based policies in offline reinforcement learning, this paper proposes Reward-Aware Consistent Trajectory Distillation (RACTD), the first method to explicitly incorporate reward optimization into the single-stage distillation of consistency models, eliminating both reliance on multi-network co-training and dependence on suboptimal expert demonstrations. RACTD achieves high performance while drastically simplifying training: it requires only a single network and generates actions in one step. Evaluated on the Gym MuJoCo benchmark, RACTD outperforms existing state-of-the-art methods by 8.7% in task performance and accelerates inference by up to 142×. The approach thus delivers strong efficiency, architectural simplicity, and generalization across diverse tasks, all without sacrificing policy quality.
📝 Abstract
Although diffusion models have achieved strong results in decision-making tasks, their slow inference speed remains a key limitation. While consistency models offer a potential solution, their applications to decision-making often struggle with suboptimal demonstrations or rely on complex concurrent training of multiple networks. In this work, we propose a novel approach to consistency distillation for offline reinforcement learning that directly incorporates reward optimization into the distillation process. Our method enables single-step generation while maintaining higher performance and simpler training. Empirical evaluations on the Gym MuJoCo benchmarks and long-horizon planning tasks demonstrate that our approach achieves an 8.7% improvement over the previous state of the art while offering up to a 142× inference-time speedup over diffusion counterparts.
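To make the abstract's core idea concrete, here is a minimal toy sketch of what "incorporating reward optimization into the distillation process" could look like: a distillation loss between a one-step student and a diffusion-style teacher, weighted per-sample by reward. Every name below (the `teacher_denoise` stand-in, the linear `student`, the softmax-style reward weighting in `ractd_style_loss`) is a hypothetical illustration of the general recipe under stated assumptions, not the paper's actual loss, architecture, or algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_denoise(noisy_action, t):
    # Stand-in for a pretrained diffusion teacher: shrink the noisy action
    # proportionally to the noise level t (a real teacher is a trained network).
    return noisy_action * (1.0 - t)

def student(noisy_action, w):
    # One-step student: here just a scalar linear map with parameter w,
    # so that "single-step generation" is one function evaluation.
    return noisy_action * w

def ractd_style_loss(w, actions, rewards, t=0.5):
    # Hedged sketch of a reward-aware distillation objective: the usual
    # distillation MSE between student and teacher outputs, weighted
    # per-trajectory by normalized reward so that high-reward data
    # dominates the fit. This weighting scheme is an assumption for
    # illustration, not the paper's exact objective.
    noisy = actions + rng.normal(scale=t, size=actions.shape)
    target = teacher_denoise(noisy, t)
    pred = student(noisy, w)
    weights = np.exp(rewards - rewards.max())  # softmax-style reward weights
    weights /= weights.sum()
    per_sample_mse = np.mean((pred - target) ** 2, axis=-1)
    return float(np.sum(weights * per_sample_mse))

# Toy batch of 8 three-dimensional actions with scalar trajectory rewards.
actions = rng.normal(size=(8, 3))
rewards = rng.normal(size=8)
print(f"reward-weighted distillation loss: {ractd_style_loss(0.2, actions, rewards):.4f}")
```

At inference, only `student` would be called once per action, which is where the single-step speedup over multi-step diffusion sampling comes from.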