🤖 AI Summary
Text-to-image diffusion models frequently fail to accurately realize quantitative descriptions in prompts, revealing a fundamental misalignment between linguistic semantics and visual generation. To address this, we propose a training-free cross-attention modulation method: first, we identify an implicit numerical correctness signal embedded in the cross-attention maps of pretrained diffusion models; second, we dynamically reweight these attention maps based on that signal to explicitly steer the denoising process toward the target object count. Our approach is fully plug-and-play, requiring no architectural modifications, weight updates, or sacrifice in image fidelity. Evaluated across multiple benchmarks, it improves object-counting accuracy by approximately four percentage points, substantially enhancing the semantic faithfulness and controllability of generated images. This work establishes a new, interpretable, and low-overhead paradigm for open-vocabulary numerical understanding in diffusion-based generation.
📝 Abstract
Text-to-image diffusion models generate realistic and coherent images but often fail to follow numerical instructions in text, revealing a gap between language and visual representation. Interestingly, we found that these models are not entirely blind to numbers: they are implicitly aware of their own counting accuracy, as their internal signals shift in consistent ways depending on whether the output meets the specified count. This observation suggests that the model already encodes a latent notion of numerical correctness, which can be harnessed to guide generation more precisely. Building on this intuition, we introduce CountSteer, a training-free method that improves generation of specified object counts by steering the model's cross-attention hidden states during inference. In our experiments, CountSteer improved object-count accuracy by about 4% without compromising visual quality, demonstrating a simple yet effective step toward more controllable and semantically reliable text-to-image generation.
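The core idea of reweighting cross-attention at inference time can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's actual method: the function name `steered_cross_attention`, the per-token `gain` rule, and the choice of boosting the count token's attention weights are all assumptions made for clarity; CountSteer itself derives its steering signal from the model's internal correctness cue rather than a fixed scalar.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def steered_cross_attention(q, k, v, count_token_idx, gain=1.0):
    """Toy cross-attention whose weights on one text token (e.g. the
    numeral in the prompt) are rescaled by `gain`, then renormalized.
    Hypothetical sketch; not the paper's exact steering rule."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)          # (num_queries, num_tokens)
    weights = softmax(logits, axis=-1)
    weights[:, count_token_idx] *= gain    # boost attention to the count token
    weights /= weights.sum(axis=-1, keepdims=True)  # renormalize rows
    return weights @ v                     # (num_queries, head_dim)

# With gain=1.0 this reduces to plain scaled dot-product cross-attention;
# gain > 1.0 shifts attention mass toward the chosen token.
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((5, 8))
v = rng.standard_normal((5, 8))
out = steered_cross_attention(q, k, v, count_token_idx=2, gain=2.0)
```

In a real pipeline, a hook on the UNet's cross-attention layers would apply such a modulation only during selected denoising steps, leaving the model weights untouched.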