YOLO-Count: Differentiable Object Counting for Text-to-Image Generation

📅 2025-08-01
🤖 AI Summary
This work addresses precise, open-vocabulary control of object counts in text-to-image generation, where variation in object scale and spatial distribution degrades count estimation. It proposes a differentiable, open-vocabulary counting model. The model uses a *cardinality map* as its regression target, combined with representation alignment and a hybrid strong-weak supervision strategy. Because the architecture is fully differentiable, its count estimate can supply gradient-based guidance to a generative model. Experiments report state-of-the-art counting accuracy and robust, effective quantity control in text-to-image synthesis.

📝 Abstract
We propose YOLO-Count, a differentiable open-vocabulary object counting model that both tackles general counting challenges and enables precise quantity control for text-to-image (T2I) generation. A core contribution is the 'cardinality' map, a novel regression target that accounts for variations in object size and spatial distribution. Leveraging representation alignment and a hybrid strong-weak supervision scheme, YOLO-Count bridges the gap between open-vocabulary counting and T2I generation control. Its fully differentiable architecture facilitates gradient-based optimization, enabling accurate object count estimation and fine-grained guidance for generative models. Extensive experiments demonstrate that YOLO-Count achieves state-of-the-art counting accuracy while providing robust and effective quantity control for T2I systems.
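The abstract does not define how the cardinality map is built, so the following is only a toy sketch under a stated assumption: each object spreads one unit of count uniformly over its own pixels, which makes the map's sum equal the object count regardless of object size. The function names (`cardinality_map`, `predicted_count`, `count_loss_and_grad`) and the squared-error loss are hypothetical and chosen for illustration; they are not taken from the paper. The sketch shows why a count obtained by summing a map is differentiable and can therefore yield a gradient signal.

```python
# Toy illustration only. The map construction and the loss are assumptions,
# not the paper's definitions.

def cardinality_map(masks, height, width):
    """Spread one unit of count uniformly over each object's mask pixels."""
    cmap = [[0.0] * width for _ in range(height)]
    for mask in masks:  # each mask is a list of (row, col) pixels for one object
        share = 1.0 / len(mask)
        for r, c in mask:
            cmap[r][c] += share
    return cmap


def predicted_count(cmap):
    """The count is the sum of the map, a differentiable function of its entries."""
    return sum(sum(row) for row in cmap)


def count_loss_and_grad(cmap, target):
    """Squared count error and its gradient with respect to every map entry."""
    diff = predicted_count(cmap) - target
    loss = diff ** 2
    # d(loss)/d(cmap[r][c]) = 2 * diff for every entry, because the count is a plain sum.
    grad = [[2.0 * diff for _ in row] for row in cmap]
    return loss, grad


# One 1-pixel object and one 12-pixel object: each contributes exactly 1 to the sum.
small = [(0, 0)]
large = [(r, c) for r in range(2, 5) for c in range(2, 6)]
cmap = cardinality_map([small, large], height=6, width=6)
count = predicted_count(cmap)
loss, grad = count_loss_and_grad(cmap, target=3)
```

In this toy setting the small and the large object contribute equally to the count, and the gradient is nonzero whenever the predicted count differs from the requested one. In a real pipeline the map would come from a neural network and the gradient would be propagated further by an autodiff framework.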
Problem

Research questions and friction points this paper is trying to address.

Develops a differentiable model for open-vocabulary object counting
Enables precise quantity control in text-to-image generation
Introduces a cardinality map to handle variations in object size and spatial distribution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Differentiable open-vocabulary object counting model
Novel cardinality map that accounts for object size and spatial distribution
Representation alignment combined with hybrid strong-weak supervision
Guanning Zeng
Tsinghua University
Xiang Zhang
UC San Diego
Zirui Wang
UC Berkeley
Haiyang Xu
UC San Diego
Zeyuan Chen
UC San Diego
Bingnan Li
University of California, San Diego
Machine Learning, Computer Vision
Zhuowen Tu
Professor, Cognitive Science, Computer Science & Engineering, UC San Diego
Computer Vision, Machine Learning, Deep Learning, Neural Computation