Inpainting-Guided Policy Optimization for Diffusion Large Language Models

πŸ“… 2025-09-12
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address low exploration efficiency, sparse rewards, and sample inefficiency in reinforcement learning (RL) for large language models (LLMs), this paper proposes IGPO, the first framework to leverage the inpainting capability of diffusion LLMs (dLLMs) to guide RL exploration. Methodologically, IGPO strategically inserts verified reasoning segments into dLLM generation to steer sampling toward high-reward trajectories; it further combines inpainting-guided online sampling, entropy-based filtering, and concise trajectory distillation to enable a smooth transition from supervised fine-tuning to RL, while restoring effective gradients and mitigating the zero-advantage problem in group-based optimization methods (e.g., GRPO). Experiments demonstrate that IGPO achieves state-of-the-art performance on the GSM8K, Math500, and AMC mathematical reasoning benchmarks, with significant improvements in sample efficiency and training stability.
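The zero-advantage problem mentioned above can be made concrete with a small sketch (illustrative only, not the paper's code): in group-based methods like GRPO, each rollout's advantage is its reward relative to the group mean. When every rollout for a hard prompt fails, all rewards are equal, so all advantages, and hence all policy gradients for that prompt, vanish. Swapping one failed rollout for an inpainting-guided one that succeeds restores a nonzero signal. The function name below is hypothetical.

```python
def group_advantages(rewards):
    """GRPO-style advantage: each sample's reward relative to the group
    mean, normalized by the group's standard deviation."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    if std == 0.0:
        # Every rollout got the same reward: advantages (and hence the
        # policy gradient for this prompt) are identically zero.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]

# Hard prompt: all self-generated rollouts fail -> no learning signal.
print(group_advantages([0, 0, 0, 0]))  # [0.0, 0.0, 0.0, 0.0]

# One inpainting-guided rollout reaches the answer -> nonzero advantages.
print(group_advantages([0, 0, 0, 1]))
```

The point of the sketch is only the degenerate case: a group with identical rewards contributes nothing to the update, which is exactly the sample waste that guided exploration is meant to avoid.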

πŸ“ Abstract
Masked diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive LLMs, offering competitive performance while supporting unique generation capabilities such as inpainting. We explore how inpainting can inform RL algorithm design for dLLMs. Aligning LLMs with reinforcement learning faces an exploration challenge: sparse reward signals and sample waste when models fail to discover correct solutions. While this inefficiency affects LLMs broadly, dLLMs offer a distinctive opportunity--their inpainting ability can guide exploration. We introduce IGPO (Inpainting Guided Policy Optimization), an RL framework that strategically inserts partial ground-truth reasoning traces during online sampling. Unlike providing full solutions, inpainting steers exploration toward promising trajectory spaces while preserving self-generated reasoning, bridging supervised fine-tuning and reinforcement learning. We apply IGPO to group-based optimization methods such as GRPO, where exploration failures cause zero advantages and gradients. IGPO restores meaningful gradients while improving sample efficiency. We also propose supervised fine-tuning on synthetically rewritten concise traces that better align with dLLM generation patterns. With additional techniques including entropy-based filtering, our training recipe yields substantial gains across three mathematical benchmarks--GSM8K, Math500, and AMC--achieving new state-of-the-art results for full-attention masked dLLMs.
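The inpainting mechanism the abstract describes can be sketched as a toy masked-generation loop (hypothetical interfaces; a real dLLM denoises with a learned model over many steps): start from a fully masked sequence, pin a verified partial reasoning fragment at fixed positions, and let the model fill only the remaining positions. The function and parameter names below are illustrative assumptions.

```python
import random

MASK = "<mask>"

def inpaint_sample(length, hint, fill_token_fn, steps=4):
    """Start fully masked, pin verified hint tokens at fixed positions,
    then unmask the remaining positions over several denoising steps.
    `fill_token_fn` stands in for the dLLM's token prediction."""
    seq = [MASK] * length
    for pos, tok in hint.items():          # pinned ground-truth fragment
        seq[pos] = tok
    remaining = [i for i in range(length) if seq[i] == MASK]
    random.shuffle(remaining)
    per_step = max(1, len(remaining) // steps)
    while remaining:
        batch, remaining = remaining[:per_step], remaining[per_step:]
        for i in batch:
            seq[i] = fill_token_fn(seq, i)  # model fills non-pinned slots
        # pinned hint positions are never resampled
    return seq

hint = {1: "x=2", 2: "so"}                  # verified partial trace
out = inpaint_sample(6, hint, lambda s, i: f"t{i}")
assert out[1] == "x=2" and out[2] == "so"   # hint survives decoding
assert MASK not in out                      # everything else self-generated
```

The design point matches the abstract: unlike supplying the full solution, only a fragment is fixed, so the surrounding reasoning remains self-generated while exploration is steered toward promising trajectories.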
Problem

Research questions and friction points this paper is trying to address.

Enhancing RL exploration efficiency for diffusion LLMs
Leveraging inpainting to guide policy optimization strategies
Addressing sparse rewards and sample waste in alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Inpainting-guided RL for diffusion LLMs
Partial ground-truth insertion during sampling
Synthetic concise trace fine-tuning alignment
πŸ”Ž Similar Papers
No similar papers found.