HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning

📅 2025-11-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing diffusion models often suffer from concept omission and compositional inconsistency when processing complex, multi-object, multi-level text prompts. To address this, we propose HiCoGen, a hierarchical image-generation framework based on stepwise synthesis. First, our approach leverages large language models to decompose intricate prompts into semantically coherent units. Second, it introduces a chain-based synthesis paradigm that iteratively generates sub-images conditioned on the preceding visual context. Third, it incorporates a hierarchical reward mechanism, coupled with a decaying stochasticity sampling strategy, to jointly optimize semantic fidelity and structural coherence. We also introduce HiCoPrompt, a benchmark specifically designed to evaluate hierarchical compositional understanding. On this benchmark, HiCoGen achieves state-of-the-art performance, significantly outperforming prior methods in concept coverage, compositional accuracy, and layout plausibility.
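The hierarchical reward described above could be sketched as a weighted combination of scores at the global, subject, and relationship levels. The weights, scorer inputs, and function name below are illustrative assumptions, not the paper's definitions.

```python
# Hypothetical sketch of a hierarchical reward: combine per-level scores
# (global scene, individual subjects, pairwise relationships) into one
# scalar RL reward. Weights are assumed, not taken from the paper.
def hierarchical_reward(global_score: float,
                        subject_scores: list[float],
                        relation_scores: list[float],
                        weights: tuple[float, float, float] = (0.4, 0.3, 0.3)) -> float:
    def mean(xs: list[float]) -> float:
        # Empty levels (e.g. a prompt with no relations) contribute zero.
        return sum(xs) / len(xs) if xs else 0.0

    w_global, w_subject, w_relation = weights
    return (w_global * global_score
            + w_subject * mean(subject_scores)
            + w_relation * mean(relation_scores))
```

A perfectly matched image at every level would score 1.0 under these assumed weights, since they sum to one.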

📝 Abstract
Recent advances in diffusion models have demonstrated impressive capability in generating high-quality images for simple prompts. However, when confronted with complex prompts involving multiple objects and hierarchical structures, existing models struggle to accurately follow instructions, leading to issues such as concept omission, confusion, and poor compositionality. To address these limitations, we propose a Hierarchical Compositional Generative framework (HiCoGen) built upon a novel Chain of Synthesis (CoS) paradigm. Instead of monolithic generation, HiCoGen first leverages a Large Language Model (LLM) to decompose complex prompts into minimal semantic units. It then synthesizes these units iteratively, where the image generated in each step provides crucial visual context for the next, ensuring all textual concepts are faithfully constructed into the final scene. To further optimize this process, we introduce a reinforcement learning (RL) framework. Crucially, we identify that the limited exploration of standard diffusion samplers hinders effective RL. We theoretically prove that sample diversity is maximized by concentrating stochasticity in the early generation stages and, based on this insight, propose a novel Decaying Stochasticity Schedule to enhance exploration. Our RL algorithm is then guided by a hierarchical reward mechanism that jointly evaluates the image at the global, subject, and relationship levels. We also construct HiCoPrompt, a new text-to-image benchmark with hierarchical prompts for rigorous evaluation. Experiments show our approach significantly outperforms existing methods in both concept coverage and compositional accuracy.
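The Chain of Synthesis (CoS) loop from the abstract can be sketched as follows. `decompose_prompt` and `generate_step` are hypothetical placeholders for the paper's LLM decomposer and diffusion synthesis step; only the control flow (each step conditions on the previous step's image) reflects the described method.

```python
# Minimal sketch of the Chain of Synthesis loop, under assumed interfaces.
def decompose_prompt(prompt: str) -> list[str]:
    # Placeholder for LLM-based decomposition into minimal semantic units;
    # here we simply split on ';' for illustration.
    return [unit.strip() for unit in prompt.split(";") if unit.strip()]

def generate_step(unit: str, context_image):
    # Placeholder for one diffusion synthesis pass conditioned on the
    # preceding visual context (None on the first step). Returns a stand-in
    # "image" that records the conditioning chain.
    return (unit, context_image)

def chain_of_synthesis(prompt: str):
    image = None
    for unit in decompose_prompt(prompt):
        # The image generated in each step is the visual context for the next.
        image = generate_step(unit, image)
    return image
```

In a real system the final `image` would be the composed scene containing all decomposed concepts; here it is just the nested record of steps.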
Problem

Research questions and friction points this paper is trying to address.

Existing diffusion models struggle with complex prompts containing multiple objects
Current models fail to maintain hierarchical structures in generated images
Standard diffusion samplers lack sufficient exploration for effective reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical decomposition of prompts using LLM
Iterative image synthesis with visual context
Reinforcement learning with decaying stochasticity schedule
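The decaying stochasticity schedule in the last bullet could look like the sketch below: in a DDIM-style sampler, the per-step noise scale eta starts high and decays toward zero, so randomness (and thus exploration for RL) is concentrated in the early generation stages, as the paper argues is optimal. The exponential form and `decay_rate` parameter are illustrative assumptions, not the paper's exact schedule.

```python
import numpy as np

# Hypothetical decaying stochasticity schedule: eta_t for sampling steps
# t = 0..num_steps-1, decaying exponentially from eta_max toward zero.
def decaying_eta_schedule(num_steps: int,
                          eta_max: float = 1.0,
                          decay_rate: float = 5.0) -> np.ndarray:
    t = np.linspace(0.0, 1.0, num_steps)  # normalized sampling progress
    return eta_max * np.exp(-decay_rate * t)

# Early steps get nearly full stochasticity, late steps are near-deterministic.
etas = decaying_eta_schedule(10)
```

Each eta_t would then scale the injected noise at step t of the sampler, replacing the constant eta of a standard DDIM sampler.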