🤖 AI Summary
Text-to-image diffusion models often sacrifice generation diversity when employing strong spatial guidance (e.g., segmentation or depth maps). This work addresses the problem of achieving fine-grained, subject-aware control without compromising diversity. We propose Depth Geometry Moments (DGM), a novel guidance signal that introduces robust high-order geometric moments into diffusion model conditioning—focusing on local geometric structure of the primary subject rather than global semantics or pixel-level details. Our method integrates learned geometric prior modeling with latent-space conditional guidance. Extensive evaluations across multiple benchmarks demonstrate that DGM significantly improves the trade-off between control accuracy and generation diversity: it enables flexible, stable, and subject-consistent synthesis while preserving image fidelity, thereby overcoming the diversity suppression bottleneck inherent to conventional spatial-map guidance.
📝 Abstract
Text-to-image generation models have achieved remarkable capabilities in synthesizing images, but often struggle to provide fine-grained control over the output. Existing guidance approaches, such as segmentation maps and depth maps, introduce spatial rigidity that restricts the inherent diversity of diffusion models. In this work, we introduce Deep Geometric Moments (DGM) as a novel form of guidance that encapsulates the subject's visual features and nuances through a learned geometric prior. DGMs focus specifically on the subject itself compared to DINO or CLIP features, which suffer from overemphasis on global image features or semantics. Unlike ResNets, which are sensitive to pixel-wise perturbations, DGMs rely on robust geometric moments. Our experiments demonstrate that DGM effectively balance control and diversity in diffusion-based image generation, allowing a flexible control mechanism for steering the diffusion process.