CountSteer: Steering Attention for Object Counting in Diffusion Models

Date: 2025-11-14
Citations: 0
Influential: 0
AI Summary
Text-to-image diffusion models frequently fail to accurately realize quantitative descriptions in prompts, revealing a fundamental misalignment between linguistic semantics and visual generation. To address this, we propose a training-free cross-attention modulation method: first, we identify an implicit numerical correctness signal embedded in the cross-attention maps of pretrained diffusion models; second, we dynamically reweight these attention maps based on the signal to explicitly steer the denoising process toward the target object count. Our approach is fully plug-and-play: it requires no architectural modifications or weight updates and does not degrade image fidelity. Evaluated across multiple benchmarks, it improves object counting accuracy by approximately four percentage points, substantially enhancing semantic faithfulness and controllability of generated images. This work establishes a new, interpretable, and low-overhead paradigm for open-vocabulary numerical understanding in diffusion-based generation.

Abstract
Text-to-image diffusion models generate realistic and coherent images but often fail to follow numerical instructions in text, revealing a gap between language and visual representation. Interestingly, we found that these models are not entirely blind to numbers: they are implicitly aware of their own counting accuracy, as their internal signals shift in consistent ways depending on whether the output meets the specified count. This observation suggests that the model already encodes a latent notion of numerical correctness, which can be harnessed to guide generation more precisely. Building on this intuition, we introduce CountSteer, a training-free method that improves generation of specified object counts by steering the model's cross-attention hidden states during inference. In our experiments, CountSteer improved object-count accuracy by about 4% without compromising visual quality, demonstrating a simple yet effective step toward more controllable and semantically reliable text-to-image generation.
Problem

Research questions and friction points this paper is trying to address.

Addressing numerical instruction failures in text-to-image generation
Leveraging internal model signals for counting accuracy improvement
Steering cross-attention states to enhance object count control
Innovation

Methods, ideas, or system contributions that make the work stand out.

Steers cross-attention hidden states during inference
Uses internal signals to guide numerical correctness
Training-free method improves object-count accuracy
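The summary above does not say how the steering signal is computed or applied. The sketch below is therefore only one plausible reading. It assumes a mean-difference steering direction, a common activation-steering recipe that is not confirmed here as the authors' method. All names (`mean_vector`, `steering_direction`, `steer`, `alpha`) and the toy 2-dimensional vectors are hypothetical stand-ins for real cross-attention hidden states.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(states: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized hidden-state vectors."""
    count = len(states)
    return [sum(column) / count for column in zip(*states)]


def steering_direction(correct: Sequence[Vector],
                       incorrect: Sequence[Vector]) -> Vector:
    """ASSUMPTION: direction = mean(correct-count states) - mean(incorrect-count states)."""
    mean_correct = mean_vector(correct)
    mean_incorrect = mean_vector(incorrect)
    return [c - i for c, i in zip(mean_correct, mean_incorrect)]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift one hidden state along the direction; alpha is a hypothetical strength."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy stand-ins for hidden states collected from generations whose
# object count did (correct) or did not (incorrect) match the prompt.
correct_states = [[1.0, 2.0], [3.0, 4.0]]
incorrect_states = [[0.0, 1.0], [2.0, 1.0]]

direction = steering_direction(correct_states, incorrect_states)
steered = steer([0.5, 0.5], direction, alpha=0.5)
print(direction, steered)
```

In a real pipeline the hidden states would be captured from the model's cross-attention layers during denoising, and the shift would be applied at inference without touching the weights, which is what makes this kind of approach training-free. Which layers, which timesteps, and what strength to use are left open by the summary.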
Hyemin Boo
Ewha Womans University, Republic of Korea
Hyoryung Kim
Ewha Womans University, Republic of Korea
Myungjin Lee
Cisco Systems
Networking, Systems
Seunghyeon Lee
S2W
cyber-threat intelligence, cryptocurrency, cybercrime, software-defined networking, network security
Jiyoung Lee
Assistant Professor, Ewha Womans University
Multimodal Learning, Computer Vision, Machine Learning
Jang-Hwan Choi
Ewha Womans University, Republic of Korea
Hyunsoo Cho
Ewha Womans University, Republic of Korea