🤖 AI Summary
This work addresses the insufficient characterization of the multiscale dynamics of cross-attention in existing diffusion models, which limits training-free controllable generation. Treating cross-attention during diffusion as a spatiotemporal signal in latent space, the study reveals, for the first time, a stable time-frequency evolution pattern throughout the denoising process. Building on this insight, the authors propose a plug-and-play inference-time intervention that enables continuous scale control without modifying prompts or model parameters. The approach combines Fourier-domain modulation of attention logits, radial frequency-band reweighting, timestep-aligned scheduling, and an adaptive gating mechanism based on token-assignment entropy. Evaluated on Stable Diffusion, the method effectively redistributes the attention spectrum, producing substantial visual edits while preserving semantic consistency, and shows that entropy acts primarily as an adaptive gain rather than an independent control dimension.
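The core intervention described above (reweighting low- and high-frequency bands of per-token attention logits in the Fourier domain, with an adaptive gate) can be sketched roughly as follows. This is not the authors' implementation: the function names, the single scalar `gate` standing in for the entropy-based gain, the radial cutoff, and the band gains are all illustrative assumptions.

```python
import numpy as np

def radial_frequency_masks(h, w, cutoff=0.25):
    """Split the 2-D DFT grid into low/high bands by normalized radial
    frequency (cutoff is a fraction of the Nyquist frequency)."""
    fy = np.fft.fftfreq(h)[:, None]      # cycles/pixel along y
    fx = np.fft.fftfreq(w)[None, :]      # cycles/pixel along x
    r = np.sqrt(fy**2 + fx**2) / 0.5     # normalize so Nyquist -> 1
    low = (r <= cutoff).astype(float)
    return low, 1.0 - low

def afm_modulate(logits, low_gain, high_gain, gate=1.0):
    """Reweight radial frequency bands of per-token pre-softmax attention
    logit maps (tokens, H, W), then apply the token softmax per location.
    `gate` is a stand-in for the paper's entropy-based adaptive gain."""
    t, h, w = logits.shape
    low, high = radial_frequency_masks(h, w)
    # gate interpolates each band gain toward 1 (no edit) as it closes
    lg = 1.0 + gate * (low_gain - 1.0)
    hg = 1.0 + gate * (high_gain - 1.0)
    spec = np.fft.fft2(logits, axes=(-2, -1))
    edited = np.fft.ifft2(spec * (lg * low + hg * high), axes=(-2, -1)).real
    # softmax over the token axis at each spatial location
    e = np.exp(edited - edited.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)
```

In this reading, a `low_gain > 1` biases token competition toward coarse spatial structure and a `high_gain > 1` toward fine detail; a timestep-aligned schedule would simply vary these gains (or the gate) with denoising progress.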
📝 Abstract
Cross-attention is the primary interface through which text conditions latent diffusion models, yet its step-wise multi-resolution dynamics remain under-characterized, limiting principled training-free control. We cast diffusion cross-attention as a spatiotemporal signal on the latent grid by summarizing token-softmax weights into token-agnostic concentration maps and tracking their radially binned Fourier power over denoising. Across prompts and seeds, encoder cross-attention exhibits a consistent coarse-to-fine spectral progression, yielding a stable time-frequency fingerprint of token competition. Building on this structure, we introduce Attention Frequency Modulation (AFM), a plug-and-play inference-time intervention that edits token-wise cross-attention logits in the Fourier domain before the token softmax: low- and high-frequency bands are reweighted with a progress-aligned schedule and can be adaptively gated by token-allocation entropy. AFM provides a continuous handle to bias the spatial scale of token-competition patterns without retraining, prompt editing, or parameter updates. Experiments on Stable Diffusion show that AFM reliably redistributes attention spectra and produces substantial visual edits while largely preserving semantic alignment. Finally, we find that entropy mainly acts as an adaptive gain on the same frequency-based edit rather than an independent control axis.
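The diagnostic side of the abstract, tracking radially binned Fourier power of a concentration map over denoising, can be illustrated with a short sketch. The function name, bin count, and binning scheme are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def radial_power_profile(conc_map, n_bins=8):
    """Radially binned Fourier power of a token-agnostic attention
    concentration map (H, W): mean |FFT|^2 within annuli of
    normalized radial frequency (0 = DC, 1 = Nyquist)."""
    h, w = conc_map.shape
    power = np.abs(np.fft.fft2(conc_map)) ** 2
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    r = np.sqrt(fy**2 + fx**2) / 0.5          # normalize so Nyquist -> 1
    bins = np.minimum((r * n_bins).astype(int), n_bins - 1)
    totals = np.bincount(bins.ravel(), weights=power.ravel(),
                         minlength=n_bins)
    counts = np.bincount(bins.ravel(), minlength=n_bins)
    return totals / np.maximum(counts, 1)     # mean power per annulus
```

Computing this profile at each denoising step and stacking the results gives the kind of time-frequency fingerprint the abstract describes: a coarse-to-fine progression appears as power shifting from low- to high-frequency bins over steps.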