🤖 AI Summary
Text-to-image diffusion models frequently fail to accurately realize quantitative descriptions in prompts, revealing a fundamental misalignment between linguistic semantics and visual generation. To address this, we propose a training-free cross-attention modulation method: first, we identify an implicit numerical correctness signal embedded in the cross-attention maps of pretrained diffusion models; second, we dynamically reweight these attention maps based on that signal to explicitly steer the denoising process toward the target object count. Our approach is fully plug-and-play, requiring no architectural modifications, weight updates, or sacrifice in image fidelity. Evaluated across multiple benchmarks, it improves object-counting accuracy by approximately four percentage points, substantially enhancing the semantic faithfulness and controllability of generated images. This work establishes a new, interpretable, and low-overhead paradigm for open-vocabulary numerical understanding in diffusion-based generation.
📝 Abstract
Text-to-image diffusion models generate realistic and coherent images but often fail to follow numerical instructions in text, revealing a gap between language and visual representation. Interestingly, we found that these models are not entirely blind to numbers: they are implicitly aware of their own counting accuracy, as their internal signals shift in consistent ways depending on whether the output meets the specified count. This observation suggests that the model already encodes a latent notion of numerical correctness, which can be harnessed to guide generation more precisely. Building on this intuition, we introduce CountSteer, a training-free method that improves generation of specified object counts by steering the model's cross-attention hidden states during inference. In our experiments, CountSteer improved object-count accuracy by about 4% without compromising visual quality, demonstrating a simple yet effective step toward more controllable and semantically reliable text-to-image generation.
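The core idea of reweighting cross-attention at inference time can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's actual method: the function name `steered_cross_attention`, the per-token `gain` rule, and the choice of boosting the count token's attention weights are all assumptions made for clarity; CountSteer itself derives its steering signal from the model's internal correctness cue rather than a fixed scalar.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def steered_cross_attention(q, k, v, count_token_idx, gain=1.0):
    """Toy cross-attention whose weights on one text token (e.g. the
    numeral in the prompt) are rescaled by `gain`, then renormalized.
    Hypothetical sketch; not the paper's exact steering rule."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)          # (num_queries, num_tokens)
    weights = softmax(logits, axis=-1)
    weights[:, count_token_idx] *= gain    # boost attention to the count token
    weights /= weights.sum(axis=-1, keepdims=True)  # renormalize rows
    return weights @ v                     # (num_queries, head_dim)

# With gain=1.0 this reduces to plain scaled dot-product cross-attention;
# gain > 1.0 shifts attention mass toward the chosen token.
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((5, 8))
v = rng.standard_normal((5, 8))
out = steered_cross_attention(q, k, v, count_token_idx=2, gain=2.0)
```

In a real pipeline, a hook on the UNet's cross-attention layers would apply such a modulation only during selected denoising steps, leaving the model weights untouched.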