MaskFocus: Focusing Policy Optimization on Critical Steps for Masked Image Generation

📅 2025-12-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Reinforcement learning (RL) in masked image generation suffers from low optimization efficiency, high computational cost, and suboptimal results due to multi-step trajectory dependencies. Method: This paper proposes a critical-step-focused RL optimization framework. Its core innovations are: (1) an information-gain evaluation mechanism—novelly based on similarity between intermediate and final images—to dynamically identify high-value sampling steps; and (2) an entropy-driven dynamic routing sampling strategy to suppress interference from irrelevant steps. The framework concentrates RL training exclusively on iterations exhibiting significant information gain, thereby minimizing futile exploration. Results: Evaluated across multiple text-to-image benchmarks, the method improves generation quality (FID reduced by 12.3%) and accelerates convergence (training steps reduced by 37%), all without modifying model architecture. These results validate both the effectiveness of the critical-step optimization paradigm and its generalizability across diverse architectures.

Technology Category

Application Category

📝 Abstract
Reinforcement learning (RL) has demonstrated significant potential for post-training language models and autoregressive visual generative models, but adapting RL to masked generative models remains challenging. The core factor is that policy optimization requires accounting for the probability likelihood of each step due to its multi-step and iterative refinement process. This reliance on entire sampling trajectories introduces high computational cost, whereas natively optimizing random steps often yields suboptimal results. In this paper, we present MaskFocus, a novel RL framework that achieves effective policy optimization for masked generative models by focusing on critical steps. Specifically, we determine the step-level information gain by measuring the similarity between the intermediate images at each sampling step and the final generated image. Crucially, we leverage this to identify the most critical and valuable steps and execute focused policy optimization on them. Furthermore, we design a dynamic routing sampling mechanism based on entropy to encourage the model to explore more valuable masking strategies for samples with low entropy. Extensive experiments on multiple Text-to-Image benchmarks validate the effectiveness of our method.
Problem

Research questions and friction points this paper is trying to address.

Adapting reinforcement learning to masked generative models efficiently
Reducing computational cost by focusing policy optimization on critical steps
Identifying valuable sampling steps using step-level information gain metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Focuses policy optimization on critical steps
Measures step-level information gain via image similarity
Uses dynamic routing sampling based on entropy
🔎 Similar Papers
No similar papers found.
Guohui Zhang
Guohui Zhang
Professor of Civil Engineering, University of Hawaii
Traffic EngineeringITSTraffic DetectionTraffic System ModelingSimulation
H
Hu Yu
University of Science and Technology of China
Xiaoxiao Ma
Xiaoxiao Ma
Oracle, Macquarie University
LLMdeep generative modelsanomaly detectiongraph neural networks
Y
Yaning Pan
Fudan University
H
Hang Xu
University of Science and Technology of China
F
Feng Zhao
University of Science and Technology of China