Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action Spaces

📅 2025-09-26

📈 Citations: 0

✨ Influential: 0

career value

237K/year

🤖 AI Summary

To address the scalability challenge of reinforcement learning (RL) policies in large combinatorial action spaces, this paper proposes a novel policy representation framework based on discrete diffusion models (DDMs). Methodologically, it formulates policy learning as a discrete diffusion process—generating action distributions progressively via forward noising and reverse denoising. Crucially, it replaces conventional fixed priors with a differentiable, entropy-regularized dynamic target distribution derived from policy mirror descent. Furthermore, it adopts a decoupled training paradigm, separating DDM pretraining from RL fine-tuning to enhance training stability and sample efficiency. Evaluated on DNA sequence generation, macro-action control, and multi-agent coordination tasks, the approach consistently outperforms state-of-the-art baselines, achieving an average 2.1× improvement in sample efficiency. Notably, it constitutes the first end-to-end co-design of DDMs and policy optimization, bridging generative modeling and RL in combinatorial action settings.

Technology Category

Application Category

📝 Abstract

Reinforcement learning (RL) struggles to scale to large, combinatorial action spaces common in many real-world problems. This paper introduces a novel framework for training discrete diffusion models as highly effective policies in these complex settings. Our key innovation is an efficient online training process that ensures stable and effective policy improvement. By leveraging policy mirror descent (PMD) to define an ideal, regularized target policy distribution, we frame the policy update as a distributional matching problem, training the expressive diffusion model to replicate this stable target. This decoupled approach stabilizes learning and significantly enhances training performance. Our method achieves state-of-the-art results and superior sample efficiency across a diverse set of challenging combinatorial benchmarks, including DNA sequence generation, RL with macro-actions, and multi-agent systems. Experiments demonstrate that our diffusion policies attain superior performance compared to other baselines.

Problem

Research questions and friction points this paper is trying to address.

Scaling reinforcement learning to large combinatorial action spaces

Developing stable diffusion policies for complex action spaces

Improving sample efficiency in combinatorial reinforcement learning benchmarks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Discrete diffusion models as effective policies

Online training process for stable policy improvement

Policy mirror descent for distributional matching problem

🔎 Similar Papers

Diffusion Models Meet Contextual Bandits with Large Action Spaces