Synthetic Curriculum Reinforces Compositional Text-to-Image Generation

📅 2025-11-23
🤖 AI Summary
Text-to-image (T2I) generation still struggles with compositional synthesis, which requires modeling multiple objects, diverse attributes, and intricate spatial and semantic relationships while keeping object placement accurate and inter-object interactions coherent. To address this, we propose CompGen, a curriculum-based reinforcement learning framework tailored for compositional T2I generation. CompGen introduces a scene-graph-driven difficulty criterion, an adaptive Markov Chain Monte Carlo (MCMC) graph sampling algorithm that curates progressively more challenging training data, and a curriculum integrated into Group Relative Policy Optimization (GRPO). Experiments show improved compositional generation for both diffusion-based and autoregressive T2I models. Notably, CompGen exhibits distinct scaling curves under different curriculum scheduling strategies, with easy-to-hard and Gaussian sampling scaling better than random sampling. These results support the effectiveness of curriculum RL for compositional T2I systems.

📝 Abstract
Text-to-Image (T2I) generation has long been an open problem, with compositional synthesis remaining particularly challenging. This task requires accurate rendering of complex scenes containing multiple objects that exhibit diverse attributes as well as intricate spatial and semantic relationships, demanding both precise object placement and coherent inter-object interactions. In this paper, we propose a novel compositional curriculum reinforcement learning framework named CompGen that addresses the compositional weakness of existing T2I models. Specifically, we leverage scene graphs to establish a novel difficulty criterion for compositional ability and develop a corresponding adaptive Markov Chain Monte Carlo graph sampling algorithm. This difficulty-aware approach enables the synthesis of curriculum training data that progressively optimizes T2I models through reinforcement learning. We integrate our curriculum learning approach into Group Relative Policy Optimization (GRPO) and investigate different curriculum scheduling strategies. Our experiments reveal that CompGen exhibits distinct scaling curves under different curriculum scheduling strategies, with easy-to-hard and Gaussian sampling strategies yielding superior scaling performance compared to random sampling. Extensive experiments demonstrate that CompGen significantly enhances compositional generation capabilities for both diffusion-based and auto-regressive T2I models, highlighting its effectiveness in improving compositional T2I generation systems.
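To make the idea of difficulty-aware scene-graph sampling concrete, here is a minimal sketch. It is not the authors' algorithm: the abstract does not define the difficulty criterion or the adaptive part of the sampler, so the code assumes a hypothetical difficulty proxy (objects + attributes + relations) and uses plain, non-adaptive Metropolis-Hastings with a symmetric proposal. The names `SceneGraphSpec`, `difficulty`, and `sample_specs` are illustrative only.

```python
import math
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SceneGraphSpec:
    """Hypothetical scene-graph summary that stores counts only."""
    num_objects: int
    num_attributes: int
    num_relations: int


def difficulty(spec: SceneGraphSpec) -> int:
    # Assumed proxy: more objects, attributes, and relations means harder.
    return spec.num_objects + spec.num_attributes + spec.num_relations


def is_valid(spec: SceneGraphSpec) -> bool:
    # At least one object; relations limited to ordered pairs of objects.
    n = spec.num_objects
    return n >= 1 and spec.num_attributes >= 0 and 0 <= spec.num_relations <= n * (n - 1)


def log_target(spec: SceneGraphSpec, target: float, sigma: float) -> float:
    # Unnormalized log-density peaked at the requested difficulty level.
    return -((difficulty(spec) - target) ** 2) / (2 * sigma ** 2)


def propose(spec: SceneGraphSpec, rng: random.Random) -> SceneGraphSpec:
    # Symmetric proposal: change one count by +1 or -1.
    field = rng.choice(["num_objects", "num_attributes", "num_relations"])
    delta = rng.choice([-1, 1])
    return replace(spec, **{field: getattr(spec, field) + delta})


def sample_specs(target, sigma=1.0, steps=2000, burn_in=500, seed=0):
    """Metropolis-Hastings over scene-graph specs near a target difficulty."""
    rng = random.Random(seed)
    current = SceneGraphSpec(1, 0, 0)
    samples = []
    for step in range(steps):
        candidate = propose(current, rng)
        if is_valid(candidate):  # invalid proposals are rejected outright
            log_ratio = log_target(candidate, target, sigma) - log_target(current, target, sigma)
            if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
                current = candidate
        if step >= burn_in:
            samples.append(current)
    return samples
```

Calling `sample_specs(target=8)` yields specs whose assumed difficulty concentrates around 8; each spec would then have to be turned into a concrete scene graph and prompt, which this sketch leaves out.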
Problem

Research questions and friction points this paper is trying to address.

Addresses compositional weakness in text-to-image generation models
Develops difficulty-aware curriculum training for complex scene synthesis
Enhances rendering of multi-object scenes with precise relationships
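
The abstract names three curriculum scheduling strategies (random, easy-to-hard, and Gaussian sampling) without specifying them. The sketch below shows one plausible reading, assuming each prompt carries a numeric difficulty score and that training progress is a number in [0, 1]. The cumulative cutoff for easy-to-hard, the moving Gaussian center, and the function names are assumptions made for illustration, not the paper's definitions.

```python
import math
import random


def sampling_weights(difficulties, progress, strategy, sigma=1.0):
    """Return one unnormalized sampling weight per prompt.

    progress is the fraction of training completed, in [0, 1].
    """
    lo, hi = min(difficulties), max(difficulties)
    if strategy == "random":
        # Baseline: every prompt is equally likely at every stage.
        return [1.0] * len(difficulties)
    if strategy == "easy_to_hard":
        # Assumed form: unlock harder prompts as training progresses.
        cutoff = lo + progress * (hi - lo)
        return [1.0 if d <= cutoff else 0.0 for d in difficulties]
    if strategy == "gaussian":
        # Assumed form: a Gaussian window whose center moves from easy to hard.
        center = lo + progress * (hi - lo)
        return [math.exp(-((d - center) ** 2) / (2 * sigma ** 2)) for d in difficulties]
    raise ValueError(f"unknown strategy: {strategy}")


def sample_batch(prompts, difficulties, progress, strategy, batch_size, rng):
    """Draw a training batch (with replacement) under the chosen schedule."""
    weights = sampling_weights(difficulties, progress, strategy)
    return rng.choices(prompts, weights=weights, k=batch_size)
```

Under these assumptions, easy-to-hard keeps revisiting easy prompts while gradually admitting harder ones, whereas the Gaussian schedule focuses each stage on a band of difficulty and lets easier prompts fade out.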
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curriculum reinforcement learning for text-to-image generation
Difficulty-aware scene graph sampling for training data
Integration with Group Relative Policy Optimization framework
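
For readers unfamiliar with GRPO, its core step is scoring each sample relative to the other samples generated for the same prompt, rather than against a learned value function. The sketch below illustrates only that group-relative advantage computation. The reward values are placeholders for whatever compositional reward the training setup uses, and the choice of population standard deviation and the `eps` constant are assumptions; implementations differ on both.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for four images generated from one prompt.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Samples that beat the group average receive positive advantages and are reinforced; when every sample in a group gets the same reward, all advantages are zero and the group contributes no learning signal.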
Authors

Shijian Wang (Southeast University)
Runhao Fu (Monash University)
Siyi Zhao (Shanghai Jiao Tong University)
Qingqin Zhan (Independent Researcher)
Xingjian Wang (Monash University)
Jiarui Jin (Xiaohongshu; Shanghai Jiao Tong University; University College London): Multimodal Mining, Recommender System, Information Retrieval, Large Language Model
Yuan Lu (I-squared-R): Blockchains, Distributed Computing, Decentralization
Hanqian Wu (Southeast University)
Cunjian Chen (Monash University): Generative AI, Computer Vision, Deep Learning