Group Relative Attention Guidance for Image Editing

📅 2025-10-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing diffusion Transformer (DiT)-based image editing methods lack continuous, fine-grained control over editing strength. This work identifies that the layer-wise bias shared between Query and Key in MM-Attention implicitly encodes the model's intrinsic editing behavior. Leveraging this insight, we propose a fine-tuning-free Group Relative Attention Guidance (GRAG) mechanism: by reweighting the token-wise deltas between Query/Key tokens and their shared bias, we achieve continuous, controllable modulation of editing intensity. Designed natively for DiT architectures, GRAG integrates into mainstream editing frameworks with only four lines of code. Experiments across diverse editing tasks demonstrate that GRAG significantly outperforms classifier-free guidance, yielding smoother, more precise, and higher-fidelity edits. The method is notably simple, broadly applicable across DiT-based editors, and offers strong, interpretable control over editing strength.

📝 Abstract
Recently, image editing based on Diffusion-in-Transformer models has undergone rapid development. However, existing editing methods often lack effective control over the degree of editing, limiting their ability to achieve more customized results. To address this limitation, we investigate the MM-Attention mechanism within the DiT model and observe that the Query and Key tokens share a bias vector that is only layer-dependent. We interpret this bias as representing the model's inherent editing behavior, while the delta between each token and its corresponding bias encodes the content-specific editing signals. Based on this insight, we propose Group Relative Attention Guidance, a simple yet effective method that reweights the delta values of different tokens to modulate the focus of the model on the input image relative to the editing instruction, enabling continuous and fine-grained control over editing intensity without any tuning. Extensive experiments conducted on existing image editing frameworks demonstrate that GRAG can be integrated with as few as four lines of code, consistently enhancing editing quality. Moreover, compared to the commonly used Classifier-Free Guidance, GRAG achieves smoother and more precise control over the degree of editing. Our code will be released at https://github.com/little-misfit/GRAG-Image-Editing.
Problem

Research questions and friction points this paper is trying to address.

Controls editing intensity in diffusion transformer models
Reweights attention deltas for fine-grained editing control
Enables smooth editing degree adjustment without model tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reweights delta values of tokens
Modulates model focus on input
Enables fine-grained editing control
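The abstract describes the core operation as rescaling each token's delta from a shared, layer-dependent bias to modulate how strongly the model attends to the input image versus the editing instruction. A minimal sketch of that reweighting step, assuming the shared bias can be approximated by the mean over the image tokens (the paper's exact bias extraction and the `grag_reweight` helper name are assumptions, not the released implementation):

```python
import numpy as np

def grag_reweight(tokens: np.ndarray, scale: float, img_slice: slice) -> np.ndarray:
    """Rescale each image token's delta from a shared bias.

    tokens:    (batch, seq_len, dim) Query or Key tokens before attention.
    scale:     >1 strengthens focus on the input image, <1 weakens it,
               1.0 leaves the tokens unchanged.
    img_slice: positions of the image tokens in the sequence.
    """
    # Proxy for the shared layer-wise bias: mean over the image-token group.
    bias = tokens[:, img_slice].mean(axis=1, keepdims=True)
    out = tokens.copy()
    # Reweight the content-specific delta, keeping the bias term fixed.
    out[:, img_slice] = bias + scale * (out[:, img_slice] - bias)
    return out
```

In a DiT editing pipeline this would be applied to the Query (and optionally Key) projections inside MM-Attention, which is consistent with the paper's claim of a four-line integration: one bias estimate, one delta, one rescale, one write-back.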
Authors
Xuanpu Zhang — Tianjin University
Xuesong Niu — Institute of Computing Technology; Kuaishou Technology
Ruidong Chen — Tianjin University
Dan Song — Tianjin University
Jianhao Zeng — Tianjin University
Penghui Du — Southern University of Science and Technology
Haoxiang Cao — Kolors Team, Kuaishou Technology
Kai Wu — Kolors Team, Kuaishou Technology
An-an Liu — Tianjin University