GASS: Geometry-Aware Spherical Sampling for Disentangled Diversity Enhancement in Text-to-Image Generation

📅 2026-02-19
🤖 AI Summary
This work addresses the difficulty that existing text-to-image (T2I) generation models have in balancing semantic alignment with output diversity, a limitation that restricts user choice and risks amplifying societal biases. The authors propose Geometry-Aware Spherical Sampling (GASS), a sampling method that explicitly disentangles, reportedly for the first time, prompt-dependent from prompt-independent directions of variation in the CLIP embedding space. Using an orthogonal decomposition and geometric projections of CLIP embeddings, the approach widens the spread of generated image embeddings along two orthogonal axes and uses the expanded predictions to guide sampling along the generation trajectory. The method works with frozen T2I backbones (U-Net and DiT) in both diffusion and flow-based generative frameworks. Experiments across multiple benchmarks and architectures show improved generation diversity with minimal degradation of image fidelity and semantic alignment.

📝 Abstract
Despite high semantic alignment, modern text-to-image (T2I) generative models still struggle to synthesize diverse images from a given prompt. This lack of diversity not only restricts user choice, but also risks amplifying societal biases. In this work, we enhance T2I diversity through a geometric lens. Unlike most existing methods, which rely primarily on entropy-based guidance to increase sample dissimilarity, we introduce Geometry-Aware Spherical Sampling (GASS) to enhance diversity by explicitly controlling both prompt-dependent and prompt-independent sources of variation. Specifically, we decompose the diversity measure in CLIP embeddings using two orthogonal directions: the text embedding, which captures semantic variation related to the prompt, and an identified orthogonal direction that captures prompt-independent variation (e.g., backgrounds). Based on this decomposition, GASS increases the geometric projection spread of generated image embeddings along both axes and guides the T2I sampling process via expanded predictions along the generation trajectory. Our experiments on different frozen T2I backbones (U-Net and DiT, diffusion and flow) and benchmarks demonstrate the effectiveness of disentangled diversity enhancement with minimal impact on image fidelity and semantic alignment.
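
The decomposition described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: random vectors stand in for CLIP embeddings, the helper names (`unit`, `decompose`, `projection_spread`) are hypothetical, and the prompt-independent direction is approximated here by the first principal component of the residuals, which is an assumption made purely for illustration, since the abstract does not say how that direction is identified.

```python
import numpy as np

def unit(v):
    """L2-normalise along the last axis (embeddings live on the unit sphere)."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def decompose(image_embs, text_emb):
    """Split each image embedding into a part along the text direction
    and a residual that is orthogonal to it."""
    t = unit(text_emb)                 # prompt-dependent axis
    x = unit(image_embs)               # shape (n, d)
    along = x @ t                      # scalar projections onto t, shape (n,)
    residual = x - np.outer(along, t)  # orthogonal to t by construction
    return along, residual, t

def projection_spread(image_embs, text_emb):
    """Measure the spread of projections along both axes.
    ASSUMPTION: the second axis is the top principal direction of the
    residuals; this is a stand-in, not the paper's procedure."""
    along, residual, _ = decompose(image_embs, text_emb)
    centered = residual - residual.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    u = vt[0]                          # unit vector, orthogonal to t
    ortho = residual @ u               # projections onto the second axis
    return along.std(), ortho.std(), u

# Toy data standing in for CLIP image and text embeddings.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(8, 16))
text_emb = rng.normal(size=16)

spread_text, spread_ortho, u = projection_spread(image_embs, text_emb)
print("spread along text axis:", spread_text)
print("spread along orthogonal axis:", spread_ortho)
```

Under these assumptions, a sampler that aims for more diversity would try to increase both printed values for a batch of generated images while keeping the mean projection onto the text axis high enough to preserve prompt alignment.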
Problem

Research questions and friction points this paper is trying to address.

text-to-image generation
diversity enhancement
semantic alignment
societal bias
image diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Geometry-Aware Spherical Sampling
Disentangled Diversity
CLIP Embedding Decomposition
Text-to-Image Generation
Orthogonal Variation
Ye Zhu
Assistant Professor, École Polytechnique
Generative Models, Computer Vision, ML4Astrophysics
Kaleb S. Newman
Department of Computer Science, Princeton University, USA
Johannes F. Lutzeyer
Laboratoire d’Informatique (LIX), CNRS, École Polytechnique, IPP, France
Adriana Romero-Soriano
Fundamental AI Research, Meta
deep learning, machine learning, AI
Michal Drozdzal
FAIR at Meta - Montreal, Canada
Olga Russakovsky
Associate Professor, Princeton University
Computer vision