VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation

📅 2026-01-05
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the instability and misalignment that arise when Visual AutoRegressive (VAR) models are trained with reinforcement learning, caused by asynchronous policy conflicts across generation steps. To mitigate spatiotemporal optimization interference, the authors propose an enhanced Group Relative Policy Optimization (GRPO) framework that integrates intermediate reward guidance, a dynamic timestep reweighting mechanism, and a mask propagation algorithm grounded in Reward Feedback Learning (ReFL). By jointly resolving temporal and spatial misalignments, this approach systematically alleviates policy inconsistencies, marking the first method to cohesively tackle the policy conflicts inherent in VAR-based RL. The proposed framework significantly improves sample quality and policy alignment, enabling stable and efficient end-to-end optimization.

๐Ÿ“ Abstract
Visual generation is dominated by three paradigms: AutoRegressive (AR), diffusion, and Visual AutoRegressive (VAR) models. Unlike AR and diffusion, VARs operate on heterogeneous input structures across their generation steps, which creates severe asynchronous policy conflicts. This issue becomes particularly acute in reinforcement learning (RL) scenarios, leading to unstable training and suboptimal alignment. To resolve this, we propose a novel framework to enhance Group Relative Policy Optimization (GRPO) by explicitly managing these conflicts. Our method integrates three synergistic components: 1) a stabilizing intermediate reward to guide early-stage generation; 2) a dynamic time-step reweighting scheme for precise credit assignment; and 3) a novel mask propagation algorithm, derived from principles of Reward Feedback Learning (ReFL), designed to isolate optimization effects both spatially and temporally. Our approach demonstrates significant improvements in sample quality and objective alignment over the vanilla GRPO baseline, enabling robust and effective optimization for VAR models.
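The abstract's second component, dynamic time-step reweighting on top of group-relative advantages, can be illustrated with a minimal sketch. This is not the paper's code: the reward values, the weight schedule (later, finer-scale VAR steps weighted more heavily), the `gamma` parameter, and the omission of GRPO's clipping term are all assumptions made for illustration.

```python
# Illustrative sketch (assumed, not the paper's implementation):
# group-relative advantages as in vanilla GRPO, plus a hypothetical
# dynamic per-timestep reweighting for credit assignment across
# heterogeneous VAR generation steps.
import math

def group_relative_advantages(rewards):
    """Normalize rewards within one sampled group (vanilla GRPO)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8  # epsilon guards against zero variance
    return [(r - mean) / std for r in rewards]

def timestep_weights(num_steps, gamma=0.5):
    """Hypothetical schedule: later (finer-scale) steps get larger
    weight; weights are normalized to sum to num_steps."""
    raw = [(t + 1) ** gamma for t in range(num_steps)]
    scale = num_steps / sum(raw)
    return [w * scale for w in raw]

def weighted_policy_loss(logprob_ratios, advantage, weights):
    """Per-sample surrogate: -A * sum_t w_t * ratio_t.
    GRPO's ratio clipping and KL penalty are omitted for brevity."""
    return -advantage * sum(w * r for w, r in zip(weights, logprob_ratios))

# Usage: one group of 3 samples, 4 generation steps each.
adv = group_relative_advantages([1.0, 2.0, 3.0])
w = timestep_weights(4)
loss_best = weighted_policy_loss([1.0] * 4, adv[2], w)
```

With uniform weights this reduces to the vanilla GRPO surrogate; the schedule only redistributes gradient mass across time steps without changing the total.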
Problem

Research questions and friction points this paper is trying to address.

Visual Autoregressive
Asynchronous Policy Conflicts
Reinforcement Learning
VAR Models
Policy Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Autoregressive
Asynchronous Policy Conflicts
Group Relative Policy Optimization
Reward Feedback Learning
Mask Propagation