Long-Text-to-Image Generation via Compositional Prompt Decomposition

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current text-to-image models are trained mostly on short captions, so they often miss key details when given long, paragraph-length prompts. This work proposes PRISM, which uses a lightweight module to decompose a long prompt into semantic constituents, has a pretrained text-to-image model predict noise for each constituent independently, and merges those predictions into a single denoising step through energy-based fusion. PRISM requires no fine-tuning of the backbone model. Across multiple architectures it performs comparably to models fine-tuned on the same training data, and it outperforms baseline models by 7.4% on prompts over 500 tokens in a public benchmark, indicating better generalization to longer prompts.

📝 Abstract

While modern text-to-image (T2I) models excel at generating images from intricate prompts, they struggle to capture the key details when the inputs are descriptive paragraphs. This limitation stems from the prevalence of concise captions in their training distributions. Existing methods attempt to bridge this gap either by fine-tuning T2I models on long prompts, which generalizes poorly to longer lengths, or by projecting the oversized inputs into normal-prompt space, which compromises fidelity. We propose Prompt Refraction for Intricate Scene Modeling (PRISM), a compositional approach that enables pre-trained T2I models to process long-sequence inputs. PRISM uses a lightweight module to extract constituent representations from the long prompt. The T2I model makes an independent noise prediction for each constituent, and these predictions are merged into a single denoising step using energy-based conjunction. We evaluate PRISM across a wide range of model architectures, showing performance comparable to models fine-tuned on the same training data. Furthermore, PRISM demonstrates superior generalization, outperforming baseline models by 7.4% on prompts over 500 tokens in a challenging public benchmark.
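The abstract does not spell out the fusion rule, so the following is only a minimal sketch of how per-constituent noise predictions could be merged. It assumes the conjunction operator commonly used in composable diffusion work, where each constituent contributes a weighted offset from the unconditional prediction (the same form as classifier-free guidance, summed over constituents). The function name `conjunction_noise`, the per-constituent `weights`, and the use of flat Python lists in place of latent tensors are all illustrative assumptions, not details taken from PRISM.

```python
from typing import List, Optional, Sequence


def conjunction_noise(
    eps_uncond: Sequence[float],
    eps_components: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Merge per-constituent noise predictions into one estimate.

    Assumed rule (not confirmed by the source):
        eps = eps_uncond + sum_i w_i * (eps_i - eps_uncond)
    """
    if weights is None:
        # Hypothetical default: every constituent counts equally.
        weights = [1.0] * len(eps_components)
    if len(weights) != len(eps_components):
        raise ValueError("need exactly one weight per constituent")

    merged = list(eps_uncond)
    for w, eps_c in zip(weights, eps_components):
        if len(eps_c) != len(eps_uncond):
            raise ValueError("all predictions must have the same length")
        # Add this constituent's weighted offset from the unconditional prediction.
        for j, (c, u) in enumerate(zip(eps_c, eps_uncond)):
            merged[j] += w * (c - u)
    return merged


# Two constituents of one long prompt, each with its own noise prediction.
merged = conjunction_noise(
    eps_uncond=[0.0, 0.0],
    eps_components=[[1.0, 1.0], [2.0, 2.0]],
    weights=[0.5, 0.5],
)
```

With a single constituent the rule reduces to ordinary classifier-free guidance, which is why it can sit on top of a pretrained backbone without retraining. In a real sampler the lists would be latent tensors and the merged estimate would feed one scheduler step.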

Problem

Research questions and friction points this paper is trying to address.

long-text-to-image generation

text-to-image models

descriptive paragraphs

prompt length limitation

image fidelity

Innovation

Methods, ideas, or system contributions that make the work stand out.

compositional prompt decomposition

energy-based conjunction