Scaling Offline RL via Efficient and Expressive Shortcut Models

📅 2025-05-28

🤖 AI Summary

In offline reinforcement learning, diffusion- and flow-based generative policies are expressive but hard to optimize, because drawing an action requires an iterative noise sampling process. SORL addresses this by building its policy on shortcut models, a novel class of generative models, which lets the policy capture complex data distributions while being trained in a simple one-stage procedure. At test time, SORL uses the learned Q-function as a verifier to support both sequential and parallel inference scaling, and its performance improves as test-time compute increases. The authors report strong performance across a range of offline RL tasks. The code is released at nico-espinosadice.github.io/projects/sorl.

📝 Abstract

Diffusion and flow models have emerged as powerful generative approaches capable of modeling diverse and multimodal behavior. However, applying these models to offline reinforcement learning (RL) remains challenging due to the iterative nature of their noise sampling processes, making policy optimization difficult. In this paper, we introduce Scalable Offline Reinforcement Learning (SORL), a new offline RL algorithm that leverages shortcut models - a novel class of generative models - to scale both training and inference. SORL's policy can capture complex data distributions and can be trained simply and efficiently in a one-stage training procedure. At test time, SORL introduces both sequential and parallel inference scaling by using the learned Q-function as a verifier. We demonstrate that SORL achieves strong performance across a range of offline RL tasks and exhibits positive scaling behavior with increased test-time compute. We release the code at nico-espinosadice.github.io/projects/sorl.

Problem

Research questions and friction points this paper is trying to address.

Applying diffusion and flow models to offline RL is challenging because their iterative noise sampling makes policy optimization difficult.

Generative policies for offline RL need to scale in both training and inference; SORL uses shortcut models for this.

The policy should capture complex data distributions while still being trained efficiently in one stage.

Innovation

Methods, ideas, or system contributions that make the work stand out.

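Shortcut models as the policy class, scaling both training and inference in offline RL

One-stage training of a policy that captures complex data distributions

Learned Q-function used as a verifier for sequential and parallel inference scaling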
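
To make the verifier idea concrete, here is a minimal sketch of parallel inference scaling as best-of-N selection: draw several candidate actions from the policy and keep the one the Q-function scores highest. This is an illustration under stated assumptions, not the SORL implementation. The names `best_of_n`, `toy_sample_action`, and `toy_q_value` are hypothetical, and the toy sampler and toy Q-function stand in for a trained shortcut-model policy and a learned critic.

```python
import numpy as np


def best_of_n(state, sample_action, q_value, n, rng):
    """Draw n candidate actions and return the one with the highest Q-value.

    sample_action(state, rng) -> action   (hypothetical policy sampler)
    q_value(state, action)    -> float    (hypothetical learned critic)
    """
    candidates = np.stack([sample_action(state, rng) for _ in range(n)])
    scores = np.array([q_value(state, a) for a in candidates])
    best = int(np.argmax(scores))
    return candidates[best], float(scores[best])


# Toy stand-ins (assumptions for illustration only).
def toy_sample_action(state, rng):
    # Pretend policy: Gaussian noise with the same shape as the state.
    return rng.normal(loc=0.0, scale=1.0, size=state.shape)


def toy_q_value(state, action):
    # Pretend critic: actions closer to the state vector score higher.
    return -float(np.sum((action - state) ** 2))


state = np.array([0.5, -0.2])
rng = np.random.default_rng(0)
action, score = best_of_n(state, toy_sample_action, toy_q_value, n=16, rng=rng)
print("selected action:", action, "Q-value:", score)
```

Increasing `n` spends more test-time compute and can only raise the best score found among the candidates, which is the sense in which a verifier enables parallel scaling. In practice the candidates would be sampled in one batched forward pass rather than a Python loop.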

🔎 Similar Papers

Scaling Offline Model-Based RL via Jointly-Optimized World-Action Model Pretraining