Fine-Grained Perturbation Guidance via Attention Head Selection

📅 2025-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing attention perturbation methods for diffusion models lack fine-grained localization criteria, particularly struggling to precisely control generation quality and visual attributes in DiT architectures. This work first uncovers functional specialization among attention heads in DiT and proposes HeadHunter—a single-head selection framework based on iterative greedy search—and SoftPAG—a soft interpolation perturbation mechanism enabling continuous intensity adjustment. Together, they enable head-level targeted intervention over structural, stylistic, and textural attributes. The method is compatible with Stable Diffusion 3 and FLUX.1, significantly improving both fidelity and controllability in text-to-image generation. It effectively mitigates oversmoothing and artifacts while supporting compositional visual editing. Experimental results demonstrate superior attribute-specific control and generation quality compared to prior attention-perturbation approaches, establishing a new paradigm for interpretable, head-aware diffusion model editing.

📝 Abstract
Recent guidance methods in diffusion models steer reverse sampling by perturbing the model to construct an implicit weak model and guide generation away from it. Among these approaches, attention perturbation has demonstrated strong empirical performance in unconditional scenarios where classifier-free guidance is not applicable. However, existing attention perturbation methods lack principled approaches for determining where perturbations should be applied, particularly in Diffusion Transformer (DiT) architectures where quality-relevant computations are distributed across layers. In this paper, we investigate the granularity of attention perturbations, ranging from the layer level down to individual attention heads, and discover that specific heads govern distinct visual concepts such as structure, style, and texture quality. Building on this insight, we propose "HeadHunter", a systematic framework for iteratively selecting attention heads that align with user-centric objectives, enabling fine-grained control over generation quality and visual attributes. In addition, we introduce SoftPAG, which linearly interpolates each selected head's attention map toward an identity matrix, providing a continuous knob to tune perturbation strength and suppress artifacts. Our approach not only mitigates the oversmoothing issues of existing layer-level perturbation but also enables targeted manipulation of specific visual styles through compositional head selection. We validate our method on modern large-scale DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1, demonstrating superior performance in both general quality enhancement and style-specific guidance. Our work provides the first head-level analysis of attention perturbation in diffusion models, uncovering interpretable specialization within attention layers and enabling practical design of effective perturbation strategies.
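The SoftPAG perturbation described in the abstract — linearly interpolating a head's attention map toward the identity matrix — can be sketched as below. This is a minimal illustration of the stated interpolation, not the paper's implementation; the function name, array shapes, and the choice of NumPy are assumptions.

```python
import numpy as np

def softpag_attention(attn_map: np.ndarray, alpha: float) -> np.ndarray:
    """Sketch of a SoftPAG-style perturbation (assumed form).

    Linearly interpolates a head's attention map toward the identity
    matrix, with alpha in [0, 1] acting as a continuous knob for
    perturbation strength: alpha=0 leaves the map unchanged, alpha=1
    replaces it with identity attention (each token attends only to
    itself).
    """
    n = attn_map.shape[-1]
    identity = np.eye(n)
    return (1.0 - alpha) * attn_map + alpha * identity
```

Note that because each row of a softmax attention map sums to 1 and the identity is also row-stochastic, the convex combination remains a valid attention map for any alpha in [0, 1] — which is what allows the strength to be tuned continuously without producing invalid attention weights.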
Problem

Research questions and friction points this paper is trying to address.

Determining optimal attention head selection for fine-grained control in DiT architectures
Mitigating oversmoothing issues in layer-level attention perturbation methods
Enabling targeted manipulation of specific visual styles through head selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

HeadHunter iteratively selects individual attention heads via greedy search for fine-grained control
SoftPAG linearly interpolates attention maps toward identity for continuous strength tuning
Targets structural, stylistic, and textural attributes via head-level perturbation
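The iterative greedy search behind HeadHunter can be sketched as a generic loop: at each step, add the candidate head whose inclusion most improves a user-defined objective. Everything here — the function name, the `score_fn` interface, and the stopping rule — is a hypothetical reconstruction of the search pattern named in the summary, under the assumption that head sets are scored by some external quality metric.

```python
def greedy_head_selection(candidate_heads, score_fn, k):
    """Hypothetical greedy loop in the spirit of HeadHunter's
    iterative head search.

    At each of k rounds, evaluates every unselected head and adds
    the one whose inclusion maximizes score_fn(selected_heads),
    where score_fn is an assumed user-supplied objective (e.g. a
    quality or style-alignment metric over generations).
    """
    selected = []
    for _ in range(k):
        best_head, best_score = None, float("-inf")
        for head in candidate_heads:
            if head in selected:
                continue
            score = score_fn(selected + [head])
            if score > best_score:
                best_head, best_score = head, score
        if best_head is None:  # no candidates left to try
            break
        selected.append(best_head)
    return selected
```

In this sketch, compositional control over style or texture would correspond to swapping in different `score_fn` objectives, yielding different head subsets to perturb.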