🤖 AI Summary
This work addresses the challenge that existing text-to-image (T2I) generation models struggle to balance semantic alignment with output diversity, which limits user choice and can amplify societal biases. To overcome this, the authors propose a geometry-aware spherical sampling method that, for the first time, explicitly disentangles prompt-relevant from prompt-irrelevant variation directions in the CLIP embedding space. By leveraging orthogonal decomposition and geometric projection of CLIP embeddings, the approach extends the generation trajectory along two orthogonal axes to enhance diversity. The method is compatible with various frozen T2I backbones—including U-Net and DiT—and integrates seamlessly into both diffusion and flow-based generative frameworks. Extensive experiments demonstrate that it significantly improves generation diversity across multiple benchmarks and architectures while preserving image fidelity and semantic alignment with minimal degradation.
📝 Abstract
Despite high semantic alignment, modern text-to-image (T2I) generative models still struggle to synthesize diverse images from a given prompt. This lack of diversity not only restricts user choice but also risks amplifying societal biases. In this work, we enhance T2I diversity through a geometric lens. Unlike most existing methods, which rely primarily on entropy-based guidance to increase sample dissimilarity, we introduce Geometry-Aware Spherical Sampling (GASS) to enhance diversity by explicitly controlling both prompt-dependent and prompt-independent sources of variation. Specifically, we decompose the diversity measure in CLIP embedding space along two orthogonal directions: the text embedding, which captures semantic variation related to the prompt, and an identified orthogonal direction that captures prompt-independent variation (e.g., backgrounds). Based on this decomposition, GASS increases the spread of generated image embeddings projected onto both axes and guides the T2I sampling process via expanded predictions along the generation trajectory. Our experiments on different frozen T2I backbones (U-Net and DiT; diffusion and flow) and benchmarks demonstrate the effectiveness of disentangled diversity enhancement, with minimal impact on image fidelity and semantic alignment.
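The core decomposition described above can be sketched with basic linear algebra: project each image embedding onto the (normalized) text embedding to get the prompt-dependent component, and take the residual as the prompt-independent component. This is a minimal illustrative sketch, not the paper's implementation; the function name, embedding dimension, and use of raw (unnormalized) CLIP embeddings are assumptions.

```python
import numpy as np

def decompose_clip_embedding(img_emb, txt_emb):
    """Split an image embedding into the component along the text
    embedding (prompt-dependent) and the orthogonal residual
    (prompt-independent). Illustrative sketch only."""
    t = txt_emb / np.linalg.norm(txt_emb)   # unit vector along the prompt direction
    parallel = (img_emb @ t) * t            # projection onto the text embedding
    orthogonal = img_emb - parallel         # residual orthogonal to the prompt
    return parallel, orthogonal

# Toy example with random 512-d vectors standing in for CLIP embeddings.
rng = np.random.default_rng(0)
img = rng.standard_normal(512)
txt = rng.standard_normal(512)
par, orth = decompose_clip_embedding(img, txt)

# The two components are orthogonal and sum back to the original embedding.
assert np.isclose(par @ orth, 0.0, atol=1e-6)
assert np.allclose(par + orth, img)
```

Under this view, diversity along each axis can be measured as the spread of the corresponding projections across a batch of generated images, which is the quantity GASS seeks to increase.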