Improving and Accelerating Offline RL in Large Discrete Action Spaces with Structured Policy Initialization

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the challenge of reinforcement learning in large discrete combinatorial action spaces, where exponential search complexity hinders effective exploration. Existing approaches either assume independence among sub-actions, which leads to invalid action combinations, or jointly learn action structure and policy, resulting in slow and unstable training. To overcome these limitations, the authors propose SPIN, a two-stage framework that, for the first time, decouples action structure modeling from policy learning. In the first stage, an Action Structure Model (ASM) is pre-trained to learn the manifold of valid actions; in the second, its representation is frozen while a lightweight policy head is fine-tuned for control. This approach substantially improves stability, sample efficiency, and final performance in offline reinforcement learning, achieving up to a 39% increase in average return and up to 12.8× faster convergence on discrete DM Control benchmarks.

📝 Abstract
Reinforcement learning in discrete combinatorial action spaces requires searching over exponentially many joint actions to simultaneously select multiple sub-actions that form coherent combinations. Existing approaches either simplify policy learning by assuming independence across sub-actions, which often yields incoherent or invalid actions, or attempt to learn action structure and control jointly, which is slow and unstable. We introduce Structured Policy Initialization (SPIN), a two-stage framework that first pre-trains an Action Structure Model (ASM) to capture the manifold of valid actions, then freezes this representation and trains lightweight policy heads for control. On challenging discrete DM Control benchmarks, SPIN improves average return by up to 39% over the state of the art while reducing time to convergence by up to 12.8×.
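The two-stage pattern the abstract describes can be sketched in miniature. This is not the paper's implementation: the linear "ASM" (a PCA-style embedding standing in for the pre-trained action structure model) and the least-squares "policy head" are hypothetical stand-ins chosen only to show the decoupling of stage 1 (learn and freeze an action representation) from stage 2 (fit a lightweight head on top of it).

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 (sketch): "pre-train" an Action Structure Model over a dataset of
# valid joint sub-action combinations. A toy linear embedding (top principal
# directions) stands in for the paper's ASM.
n_subactions, latent_dim = 8, 3
valid_actions = rng.integers(0, 2, size=(256, n_subactions)).astype(float)

centered = valid_actions - valid_actions.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
encoder = vt[:latent_dim].T.copy()   # (n_subactions, latent_dim)
encoder.flags.writeable = False      # emulate freezing the representation

# Stage 2 (sketch): train only a lightweight policy head on the frozen
# features. Plain least squares against toy scalar "returns" stands in for
# the offline RL objective; only the head's weights are learned.
features = valid_actions @ encoder
returns = rng.normal(size=256)
head, *_ = np.linalg.lstsq(features, returns, rcond=None)

print(head.shape)  # only latent_dim head weights are trained in stage 2
```

The point of the sketch is the separation of concerns: the encoder is fixed after stage 1, so stage 2 optimizes a far smaller parameter set over a representation that already respects the structure of valid actions.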
Problem

Research questions and friction points this paper is trying to address.

Offline Reinforcement Learning
Discrete Action Spaces
Combinatorial Actions
Policy Learning
Action Coherence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured Policy Initialization
Offline Reinforcement Learning
Discrete Action Spaces
Action Structure Model
Policy Pre-training