Visual Autoregressive Modeling for Instruction-Guided Image Editing

📅 2025-08-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing diffusion-based instruction-guided image editing methods suffer from excessive coupling between edited regions and surrounding context due to global denoising, leading to artifacts and poor instruction adherence. This paper proposes VAREdit, a novel approach that reformulates image editing as a multi-scale visual autoregressive generation task: conditioned on both text instructions and source image features, it progressively generates discrete target image tokens under causal attention. A key innovation is the Scale-Aligned Reference (SAR) module, which aligns source features to each target scale to guide generation, preserving fine-grained details that coarse-level predictions would otherwise lose. On standard benchmarks, VAREdit achieves a GPT-Balance score over 30% higher than state-of-the-art diffusion methods. It edits a 512×512 image in just 1.2 seconds, 2.2× faster than UltraEdit, while significantly improving fidelity, inference efficiency, and instruction controllability.

📝 Abstract
Recent advances in diffusion models have brought remarkable visual fidelity to instruction-guided image editing. However, their global denoising process inherently entangles the edited region with the entire image context, leading to unintended spurious modifications and compromised adherence to editing instructions. In contrast, autoregressive models offer a distinct paradigm by formulating image synthesis as a sequential process over discrete visual tokens. Their causal and compositional mechanism naturally circumvents the adherence challenges of diffusion-based methods. In this paper, we present VAREdit, a visual autoregressive (VAR) framework that reframes image editing as a next-scale prediction problem. Conditioned on source image features and text instructions, VAREdit generates multi-scale target features to achieve precise edits. A core challenge in this paradigm is how to effectively condition the source image tokens. We observe that finest-scale source features cannot effectively guide the prediction of coarser target features. To bridge this gap, we introduce a Scale-Aligned Reference (SAR) module, which injects scale-matched conditioning information into the first self-attention layer. VAREdit demonstrates significant advancements in both editing adherence and efficiency. On standard benchmarks, it outperforms leading diffusion-based methods by over 30% in GPT-Balance score. Moreover, it completes a 512×512 edit in 1.2 seconds, making it 2.2× faster than the similarly sized UltraEdit. The models are available at https://github.com/HiDream-ai/VAREdit.
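The SAR module described in the abstract injects scale-matched source conditioning into the first self-attention layer, since finest-scale source features cannot directly guide coarser predictions. As a rough illustration of the alignment step only, here is a minimal NumPy sketch that uses average pooling as a hypothetical stand-in for the paper's learned alignment (the function name, pooling choice, and shapes are assumptions, not the actual implementation):

```python
import numpy as np

def scale_aligned_reference(source_feat, target_hw):
    """Downsample finest-scale source features to a coarser target
    scale via average pooling. Hypothetical stand-in for the SAR
    alignment; the real module is learned, not a fixed pooling."""
    H, W, C = source_feat.shape
    th, tw = target_hw
    fh, fw = H // th, W // tw  # assumes integer scale factors
    # group spatial positions into th x tw cells and average each cell
    return source_feat.reshape(th, fh, tw, fw, C).mean(axis=(1, 3))

# finest-scale source feature map, e.g. 16x16 positions with 8 channels
src = np.random.rand(16, 16, 8)
ref = scale_aligned_reference(src, (4, 4))
print(ref.shape)  # (4, 4, 8)
```

In this sketch the aligned reference at each scale would then be fed as conditioning alongside the text instruction when predicting that scale's target tokens.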
Problem

Research questions and friction points this paper is trying to address.

Addresses unintended spurious modifications in diffusion-based image editing
Improves adherence to text instructions in image editing
Bridges the scale mismatch in conditioning on source image tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual autoregressive framework for sequential image editing
Scale-Aligned Reference module for multi-scale conditioning
Next-scale prediction with text and image guidance
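The next-scale prediction idea in the bullets above can be sketched as a coarse-to-fine loop: each scale's token map is predicted conditioned on all coarser scales already generated, the text instruction, and scale-aligned source features. The sketch below is purely illustrative; `predict_scale` is a hypothetical placeholder for the VAR transformer, and the scale schedule and codebook size are assumed, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_scale(prev_scales, instruction, src_ref, hw):
    """Placeholder for the VAR transformer: predicts the token map at
    resolution `hw`, attending causally to all coarser scales plus the
    instruction and scale-aligned source features. Here we just emit
    random indices from an assumed codebook of 4096 entries."""
    h, w = hw
    return rng.integers(0, 4096, size=(h, w))

scales = [(1, 1), (2, 2), (4, 4), (8, 8), (16, 16)]  # coarse to fine
instruction = "make the sky sunset orange"
generated = []
for hw in scales:
    src_ref = None  # would be SAR-aligned source features at scale hw
    generated.append(predict_scale(generated, instruction, src_ref, hw))

print(len(generated), generated[-1].shape)  # 5 (16, 16)
```

Because each scale is a single parallel prediction rather than a per-token loop, the number of sequential steps equals the number of scales, which is what makes this paradigm fast relative to diffusion sampling.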