Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models

📅 2026-05-16

🤖 AI Summary
Reinforcement learning for diffusion-based multimodal large language models faces three challenges: inaccurate importance ratio estimation, neglect of the hierarchical structure inherent in image generation, and misaligned credit assignment caused by uniform reward signals. To address these issues, this work proposes HT-GRPO, the first method to explicitly incorporate the hierarchical nature of image generation into reinforcement learning optimization. It introduces a three-stage Sketch-Then-Paint training paradigm that progresses from global layout to structural details and finally to fine-grained refinement, and it combines a prompt-conditioned importance ratio estimator with a hierarchical credit assignment strategy. With HT-GRPO, the model achieves substantial gains on the GenEval and DPG benchmarks, and six additional metrics show consistent improvements in image quality, aesthetic appeal, and human preference.
📝 Abstract
Diffusion Multi-Modal Large Language Models (dMLLMs) are powerful for image generation, but optimizing them through reinforcement learning (RL) remains a major challenge. One primary difficulty is that a single image can be generated through many different unmasking sequences, which often makes calculating importance ratios intractable. Additionally, existing methods tend to ignore the hierarchical generation process of dMLLMs, where early tokens define the global layout and later tokens focus on local details. By assigning uniform rewards to all tokens, these methods fail to reflect the actual contribution of each token to the final image. To address these issues, we propose Hierarchical Token GRPO (HT-GRPO), which integrates this hierarchy directly into the policy optimization process. Our approach features a Sketch-Then-Paint training scheme that organizes updates into three distinct stages: global, structure, and refinement. We also use a prompt-conditioned estimator to calculate importance ratios starting from a fully masked state. Furthermore, we introduce a Hierarchical Credit Assignment mechanism that prioritizes key structural tokens to ensure accurate reward propagation. Experiments using two popular dMLLM backbones, MMaDA and Lumina-DiMOO, demonstrate that HT-GRPO achieves substantial gains on the GenEval and DPG benchmarks. Evaluations across six additional metrics confirm significant improvements in image quality, aesthetics, and human preference.
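To make the contrast with uniform rewards concrete, here is a minimal sketch of stage-dependent credit assignment. It is an illustration under stated assumptions, not the paper's implementation. The group-normalized advantage is the standard GRPO form. The function names, the equal-thirds stage boundaries, and the stage weights are all hypothetical, since the abstract does not specify them.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: normalize rewards within one sampled group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def assign_stages(unmask_step, num_steps, boundaries=(1 / 3, 2 / 3)):
    """Map each token's unmasking step to a stage index.

    0 = global, 1 = structure, 2 = refinement.
    The equal-thirds boundaries are an assumption made for this sketch.
    """
    frac = np.asarray(unmask_step, dtype=np.float64) / num_steps
    return np.digitize(frac, boundaries)

def hierarchical_token_advantages(rewards, unmask_step, num_steps,
                                  stage_weights=(1.5, 1.0, 0.5)):
    """Scale each sample's advantage per token according to the token's stage.

    rewards:       shape (G,), one scalar reward per generated image
    unmask_step:   shape (G, T), step at which each image token was unmasked
    stage_weights: hypothetical weights; earlier stages get more credit here
    Returns an array of shape (G, T) with per-token advantages.
    """
    adv = group_relative_advantages(rewards)            # (G,)
    stages = assign_stages(unmask_step, num_steps)      # (G, T)
    weights = np.asarray(stage_weights)[stages]         # (G, T)
    return adv[:, None] * weights

# Two sampled images, three tokens each, 12 unmasking steps in total.
rewards = [1.0, 0.0]
unmask_step = [[0, 5, 11], [0, 5, 11]]
token_adv = hierarchical_token_advantages(rewards, unmask_step, num_steps=12)
```

With uniform credit, every token in a sample would receive the same advantage. In this sketch, tokens unmasked early (the layout-defining ones) receive a larger share than tokens unmasked late, which is one simple way to encode the global, structure, and refinement ordering described above.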
Problem

Research questions and friction points this paper is trying to address.

Diffusion Multi-Modal Large Language Models
Reinforcement Learning
Hierarchical Generation
Importance Ratio
Credit Assignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Reinforcement Learning
Diffusion Multi-Modal LLM
Sketch-Then-Paint
Credit Assignment
Importance Ratio Estimation