D2D: Detector-to-Differentiable Critic for Improved Numeracy in Text-to-Image Generation

📅 2025-10-22

🤖 AI Summary

Text-to-image diffusion models frequently fail to generate the exact number of objects specified in a prompt. To address this, we propose D2D, a framework that repurposes a pre-trained object detector as a differentiable critic, using its count-by-enumeration ability to refine the noise prior at inference time for better numerical consistency. The method introduces custom activation functions that map detector logits to soft binary indicators, which makes the detector's count differentiable and lifts the restriction to regression-based critics. Evaluated on four benchmarks with SDXL-Turbo, SD-Turbo, and Pixart-DMD, D2D improves counting accuracy by up to 13.7% (on the D2D-Small benchmark), with minimal degradation in image quality and little added computational overhead. The approach grounds diffusion sampling in a detection-driven, differentiable count signal.

📝 Abstract

Text-to-image (T2I) diffusion models have achieved strong performance in semantic alignment, yet they still struggle to generate the number of objects specified in a prompt. Existing approaches typically incorporate auxiliary counting networks as external critics to enhance numeracy. Because these critics must provide gradient guidance during generation, they are restricted to regression-based models, which are inherently differentiable. This excludes detector-based models, which count more accurately but do so by enumeration, a non-differentiable operation. To overcome this limitation, we propose Detector-to-Differentiable (D2D), a novel framework that transforms non-differentiable detection models into differentiable critics, thereby leveraging their superior counting ability to guide generation toward the requested count. Specifically, we design custom activation functions to convert detector logits into soft binary indicators, which are then used to optimize the noise prior at inference time with pre-trained T2I models. Our extensive experiments on SDXL-Turbo, SD-Turbo, and Pixart-DMD across four benchmarks of varying complexity (low-density, high-density, and multi-object scenarios) demonstrate consistent and substantial improvements in object counting accuracy (e.g., a gain of up to 13.7% on D2D-Small, a 400-prompt, low-density benchmark), with minimal degradation in overall image quality and minimal added computational overhead.
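To make the "soft binary indicator" idea concrete, here is a minimal sketch of why counting by enumeration has no useful gradient and how a smooth indicator gives one. The abstract does not specify the paper's activation functions, so a steep sigmoid is used here purely as a stand-in. The names `soft_indicator`, `hard_count`, `soft_count`, `threshold`, and `sharpness` are hypothetical.

```python
import math

def soft_indicator(logit, threshold=0.0, sharpness=10.0):
    """Smooth stand-in for the hard test `logit > threshold`.

    ASSUMPTION: a temperature-scaled sigmoid, not the paper's activation.
    """
    x = sharpness * (logit - threshold)
    # Numerically stable sigmoid: never exponentiates a large positive value.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def hard_count(logits, threshold=0.0):
    # Count by enumeration: piecewise constant in the logits,
    # so its gradient is zero almost everywhere.
    return sum(1 for logit in logits if logit > threshold)

def soft_count(logits, threshold=0.0, sharpness=10.0):
    # Sum of soft indicators: differentiable in every logit.
    return sum(soft_indicator(l, threshold, sharpness) for l in logits)

# Hypothetical per-box confidence logits from a detector.
logits = [4.2, 3.1, -2.5, -5.0, 2.8]
print(hard_count(logits))   # three logits exceed the threshold
print(soft_count(logits))   # very close to the hard count for confident boxes
```

When the detector is confident, the soft count tracks the hard count closely. For logits near the threshold the two diverge, and that is the region where the gradient carries a usable signal.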

Problem

Research questions and friction points this paper is trying to address.

Improving object counting accuracy in text-to-image generation

Transforming non-differentiable detectors into differentiable critics

Enhancing numeracy without compromising image quality

Innovation

Methods, ideas, or system contributions that make the work stand out.

Transforms non-differentiable detectors into differentiable critics

Uses custom activation functions for soft binary indicators

Optimizes noise prior during inference with pretrained models
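The third point, optimizing the noise prior at inference time, can be illustrated with a toy loop. This is a sketch under strong assumptions, not the paper's procedure: `toy_detector_logits` is a hypothetical stand-in for the frozen pipeline (T2I model followed by a detector), the sigmoid indicator is an assumed activation, and the squared-error loss, step count, and learning rate are illustrative choices. Only the noise vector `z` is updated.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def toy_detector_logits(z):
    # HYPOTHETICAL stand-in for "generate an image from noise z, then run
    # the detector": each noise coordinate is treated as one box logit.
    return list(z)

def count_loss_and_grad(z, target, sharpness=2.0):
    """Squared error between the soft count and the target count,
    with its analytic gradient with respect to the noise z."""
    s = [sigmoid(sharpness * l) for l in toy_detector_logits(z)]
    count = sum(s)
    loss = (count - target) ** 2
    # Chain rule: d(loss)/d(z_i) = 2 (count - target) * sharpness * s_i (1 - s_i)
    grad = [2.0 * (count - target) * sharpness * si * (1.0 - si) for si in s]
    return loss, grad

def optimize_noise(z, target, steps=200, lr=0.1):
    # Plain gradient descent on the noise; the "models" stay frozen.
    z = list(z)
    for _ in range(steps):
        _, grad = count_loss_and_grad(z, target)
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
    return z

z0 = [0.5, 0.3, -0.2, -0.4, 0.1]
initial_loss, _ = count_loss_and_grad(z0, target=2)
z_opt = optimize_noise(z0, target=2)
final_loss, _ = count_loss_and_grad(z_opt, target=2)
print(initial_loss, final_loss)  # the count loss shrinks as z is refined
```

In a real pipeline the gradient would come from automatic differentiation through the generator and detector rather than from a hand-derived formula, and a sharper activation would be needed for the soft count to approximate an integer count.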
