🤖 AI Summary
In offline reinforcement learning, generative policies face a trade-off between computational efficiency and performance: diffusion models incur high computational costs, while consistency models suffer from limited single-step generation quality. This paper proposes Generative Trajectory Policies (GTP), the first framework unifying diffusion, flow matching, and consistency models under a continuous-time ordinary differential equation (ODE) formulation for trajectory generation. Its core innovation is to formalize generative policy learning as learning the solution map of the underlying ODE. Guided by theoretical analysis, GTP introduces two key improvements: (i) a provably convergent flow matching objective and (ii) a consistency-guided trajectory calibration mechanism, which jointly overcome the inherent limitations of single-step and iterative models. Evaluated on the D4RL benchmark, GTP achieves state-of-the-art performance, attaining perfect scores on the challenging AntMaze tasks while maintaining efficient inference and faithful multimodal behavior generation.
📝 Abstract
Generative models have emerged as a powerful class of policies for offline reinforcement learning (RL) due to their ability to capture complex, multi-modal behaviors. However, existing methods face a stark trade-off: slow, iterative models like diffusion policies are computationally expensive, while fast, single-step models like consistency policies often suffer from degraded performance. In this paper, we demonstrate that it is possible to bridge this gap. The key to moving beyond the limitations of individual methods, we argue, lies in a unifying perspective that views modern generative models, including diffusion, flow matching, and consistency models, as specific instances of learning a continuous-time generative trajectory governed by an Ordinary Differential Equation (ODE). This principled foundation provides a clearer design space for generative policies in RL and allows us to propose Generative Trajectory Policies (GTPs), a new and more general policy paradigm that learns the entire solution map of the underlying ODE. To make this paradigm practical for offline RL, we further introduce two key theoretically principled adaptations. Empirical results demonstrate that GTP achieves state-of-the-art performance on D4RL benchmarks: it significantly outperforms prior generative policies, achieving perfect scores on several notoriously hard AntMaze tasks.
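To make the ODE perspective concrete, the sketch below illustrates the general idea on a toy 1-D problem: a velocity field is trained with a conditional flow-matching objective (regressing the velocity of a straight-line interpolation between noise and data), and actions are then generated by integrating the learned ODE from noise at t=0 to t=1, approximating its solution map. This is a minimal illustration of the underlying framework, not the paper's method: the target action value, the linear velocity model, and all hyperparameters here are hypothetical choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: behaviour-data "actions" are concentrated at a = 2.0
# in a 1-D action space; the policy should transport noise a0 ~ N(0, 1) to
# that action by following a learned continuous-time trajectory.
TARGET_ACTION = 2.0

def features(x, t):
    """Feature map for a tiny linear velocity model v(x, t) = w . [x, t, 1]."""
    return np.stack([x, t, np.ones_like(x)], axis=1)

# --- Flow-matching training: on the straight-line interpolation path
# --- x_t = (1 - t) x0 + t x1, the target velocity is simply x1 - x0.
w = np.zeros(3)
lr, batch, steps = 0.1, 4096, 2000
for _ in range(steps):
    x0 = rng.normal(size=batch)              # noise endpoint of the trajectory
    x1 = np.full(batch, TARGET_ACTION)       # "data" endpoint
    t = rng.uniform(size=batch)              # random time along the trajectory
    xt = (1.0 - t) * x0 + t * x1             # point on the interpolation path
    target = x1 - x0                         # conditional flow-matching target
    phi = features(xt, t)
    grad = 2.0 * phi.T @ (phi @ w - target) / batch
    w -= lr * grad                           # plain gradient descent step

# --- Sampling: integrate the learned ODE dx/dt = v(x, t) with Euler steps,
# --- i.e. approximately evaluate the ODE's solution map from t=0 to t=1.
def sample(n, n_steps=50):
    x = rng.normal(size=n)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = np.full(n, k * dt)
        x = x + features(x, t) @ w * dt
    return x

actions = sample(1000)  # samples concentrate near TARGET_ACTION
```

A single-step consistency-style policy would instead learn to jump directly from x at t=0 to the trajectory's endpoint at t=1; the unifying view in the abstract treats both the iterative integration above and that one-shot map as ways of representing the same underlying ODE solution.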