Iterative Object Count Optimization for Text-to-image Diffusion Models

📅 2024-08-21
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
Precise control over object quantity in text-to-image generation remains challenging: supervised methods suffer from poor generalization; denoising-based approaches rely on annotated data and incur high computational costs; and existing differentiable counting models lack robustness to noisy inputs and viewpoint variations while demanding expensive optimization. This paper proposes a zero-shot, plug-and-play online iterative optimization framework that requires no model retraining. It jointly optimizes text embeddings and counting hyperparameters, leveraging both the aggregation capability of a pre-trained counting model and classifier-guided variants to support non-differentiable detection-based counters. Crucially, counting tokens are transferable across images, significantly improving multi-category object count accuracy. The framework further enables rapid switching between counting strategies and generation modalities, enhancing flexibility and efficiency without architectural modifications.

📝 Abstract
We address a persistent challenge in text-to-image models: accurately generating a specified number of objects. Current models, which learn from image-text pairs, inherently struggle with counting, as training data cannot depict every possible object count for every object category. To solve this, we propose optimizing the generated image based on a counting loss derived from a counting model that aggregates an object's potential. Employing an out-of-the-box counting model is challenging for two reasons: first, the model requires a scaling hyperparameter for the potential aggregation that varies depending on the viewpoint of the objects, and second, classifier guidance techniques require modified models that operate on noisy intermediate diffusion steps. To address these challenges, we propose an iterated online training scheme that improves the accuracy of inferred images while altering the text conditioning embedding and dynamically adjusting hyperparameters. Our method offers three key advantages: (i) it can accommodate non-differentiable counting techniques based on detection models, (ii) it is a zero-shot plug-and-play solution facilitating rapid changes to the counting techniques and image generation methods, and (iii) the optimized counting token can be reused to generate accurate images without additional optimization. We evaluate the generation of various objects and show significant improvements in accuracy. The project page is available at https://ozzafar.github.io/count_token.
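The core loop described in the abstract can be sketched in miniature. The snippet below is not the authors' implementation: the frozen diffusion model is replaced by a toy linear "generator", and the counting model by a hypothetical differentiable aggregator that squashes per-pixel object potential with a learned scaling hyperparameter. It only illustrates the structure of the method: jointly optimizing a counting-token embedding and the scale against a counting loss, with the generator's weights untouched.

```python
import torch

torch.manual_seed(0)

# Toy stand-in (assumption): a frozen "generator" mapping a token
# embedding to a flat 64-pixel "image".
generator = torch.nn.Linear(8, 64)
for p in generator.parameters():
    p.requires_grad_(False)

def counting_model(image, scale):
    # Hypothetical differentiable counter: aggregate per-pixel object
    # potential into a soft count via a scaled sigmoid.
    return torch.sigmoid(scale * image).sum() / 8.0

target_count = 5.0
token = torch.zeros(8, requires_grad=True)      # counting-token embedding
log_scale = torch.zeros(1, requires_grad=True)  # scaling hyperparameter
opt = torch.optim.Adam([token, log_scale], lr=0.1)

for step in range(200):
    image = generator(token)
    count = counting_model(image, log_scale.exp())
    loss = (count - target_count) ** 2           # counting loss
    opt.zero_grad()
    loss.backward()                              # grads flow to the token,
    opt.step()                                   # not the frozen generator

final_count = counting_model(generator(token), log_scale.exp()).item()
print(round(final_count, 2))
```

Note that only the token embedding and the scale receive gradients; the generator stays frozen, which is what makes the approach zero-shot and plug-and-play with respect to the underlying image generator.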
Problem

Research questions and friction points this paper is trying to address.

Accurately control object count in text-to-image generation
Improve robustness and image quality in object counting
Reduce computational cost of iterative optimization processes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimizes counting token using outer-loop loss
Introduces detection-driven scaling term
Reuses optimized parameters for new prompts
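The reuse point above can be sketched as follows. All names and shapes here are hypothetical: the idea is simply that a counting token, once optimized, is spliced into the embedding sequence of a new prompt with no further optimization.

```python
import torch

torch.manual_seed(0)

dim = 8
# Result of a prior optimization run (hypothetical placeholder).
optimized_token = torch.randn(dim)

def condition_prompt(prompt_embeds, slot):
    """Splice the pre-optimized counting token into a new prompt's
    embedding sequence at position `slot`, with no extra training."""
    out = prompt_embeds.clone()
    out[slot] = optimized_token
    return out

new_prompt = torch.randn(4, dim)  # embeddings for an unseen prompt
conditioned = condition_prompt(new_prompt, slot=1)

reused = torch.equal(conditioned[1], optimized_token)
print(reused)
```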