🤖 AI Summary
Long-text prompts enhance fidelity in text-to-image (T2I) generation but severely suppress diversity, yielding repetitive and less creative outputs. To address this, we propose PromptMoG, a training-free Mixture-of-Gaussians (MoG) sampling method in the embedding space that increases sampling entropy and generative diversity while preserving semantic consistency via moment matching and semantic regularization. To rigorously evaluate long-prompt generation, we introduce LPD-Bench, the first dedicated benchmark for this task. Extensive experiments across four state-of-the-art models, SD3.5-Large, Flux.1-Krea-Dev, CogView4, and Qwen-Image, demonstrate that PromptMoG significantly improves image diversity under long prompts without inducing semantic drift. Our approach establishes a new paradigm for controllable, diverse, and high-fidelity T2I generation.
📝 Abstract
Recent advances in text-to-image (T2I) generation have achieved remarkable visual outcomes through large-scale rectified flow models. However, how these models behave under long prompts remains underexplored. Long prompts encode rich content, spatial, and stylistic information that enhances fidelity but often suppresses diversity, leading to repetitive and less creative outputs. In this work, we systematically study this fidelity-diversity dilemma and reveal that state-of-the-art models exhibit a clear drop in diversity as prompt length increases. To enable consistent evaluation, we introduce LPD-Bench, a benchmark designed for assessing both fidelity and diversity in long-prompt generation. Building on our analysis, we develop a theoretical framework that increases sampling entropy through prompt reformulation and propose a training-free method, PromptMoG, which samples prompt embeddings from a Mixture-of-Gaussians in the embedding space to enhance diversity while preserving semantics. Extensive experiments on four state-of-the-art models, SD3.5-Large, Flux.1-Krea-Dev, CogView4, and Qwen-Image, demonstrate that PromptMoG consistently improves long-prompt generation diversity without semantic drift.
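The core idea of sampling prompt embeddings from a Mixture-of-Gaussians can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name `promptmog_sample`, the use of paraphrase embeddings as component means, the isotropic noise scale `sigma`, and the simple first-moment matching are all assumptions; the paper's moment matching and semantic regularization may differ, and the latter is omitted here.

```python
import numpy as np

def promptmog_sample(paraphrase_embs, base_emb, sigma=0.05, rng=None):
    """Illustrative MoG sampling in prompt-embedding space (assumed interface).

    paraphrase_embs: (K, D) embeddings of reformulated prompts, used here
    as mixture component means. base_emb: (D,) original prompt embedding.
    Component means are shifted so the mixture mean equals base_emb
    (a simple first-moment match standing in for the paper's moment
    matching), then one sample is drawn from an isotropic Gaussian
    around a uniformly chosen component.
    """
    rng = np.random.default_rng() if rng is None else rng
    mu = paraphrase_embs.mean(axis=0)
    means = paraphrase_embs - mu + base_emb   # first-moment matching
    k = rng.integers(len(means))              # pick a mixture component
    return means[k] + sigma * rng.normal(size=base_emb.shape)
```

Because the component means are recentered on the original embedding, the expected sampled embedding stays at `base_emb`, so added diversity comes from spread across paraphrases rather than a shift in overall semantics.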