Rethinking Structure Preservation in Text-Guided Image Editing with Visual Autoregressive Models

πŸ“… 2026-03-30
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing text-guided image editing methods based on visual autoregressive (VAR) models suffer from limitations in spatial localization accuracy and structural consistency. This work proposes a novel framework that analyzes the intermediate feature distributions of VAR models to enable precise editing control. Specifically, it introduces a coarse-to-fine token localization strategy coupled with a structure-aware feature injection mechanism, and further incorporates reinforcement learning to achieve adaptive feature fusion. The proposed approach significantly enhances structural preservation and overall output quality, outperforming state-of-the-art methods on both local and global image editing tasks.
πŸ“ Abstract
Visual autoregressive (VAR) models have recently emerged as a promising family of generative models, enabling a wide range of downstream vision tasks such as text-guided image editing. By shifting the editing paradigm from noise manipulation in diffusion-based methods to token-level operations, VAR-based approaches achieve better background preservation and significantly faster inference. However, existing VAR-based editing methods still face two key challenges: accurately localizing editable tokens and maintaining structural consistency in the edited results. In this work, we propose a novel text-guided image editing framework rooted in an analysis of intermediate feature distributions within VAR models. First, we introduce a coarse-to-fine token localization strategy that refines editable regions, balancing editing fidelity and background preservation. Second, we analyze the intermediate representations of VAR models and identify structure-related features; based on these, we design a simple yet effective feature injection mechanism that enhances structural consistency between the edited and source images. Third, we develop a reinforcement learning-based adaptive feature injection scheme that automatically learns scale- and layer-specific injection ratios to jointly optimize editing fidelity and structure preservation. Extensive experiments demonstrate that our method achieves superior structural consistency and editing quality compared with state-of-the-art approaches, across both local and global editing scenarios.
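The feature injection idea described above can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the function name `inject`, the `alphas` dictionary keyed by `(scale, layer)`, and the toy feature values are all assumptions introduced here. It shows the core mechanism of blending source-image features into the editing pass with a per-scale, per-layer injection ratio, which an RL policy would adapt.

```python
# Illustrative sketch (hypothetical names, not the paper's code):
# structure-aware feature injection blends source-image features into
# the editing pass, controlled by a ratio alpha learned per (scale, layer).

def inject(edit_feat, src_feat, alpha):
    """Blend source features into edit features; alpha in [0, 1]."""
    return [(1 - alpha) * e + alpha * s for e, s in zip(edit_feat, src_feat)]

# Per-(scale, layer) injection ratios; in the paper these are learned
# adaptively via reinforcement learning (values here are made up).
alphas = {(0, 0): 0.8, (0, 1): 0.5, (1, 0): 0.3}

edit = [0.0, 1.0, 2.0]   # toy edited-pass features at one (scale, layer)
src  = [1.0, 1.0, 1.0]   # toy source-image features at the same position

fused = inject(edit, src, alphas[(0, 1)])  # -> [0.5, 1.0, 1.5]
```

A higher `alpha` pulls the edited features toward the source image (stronger structure preservation) at the cost of editing fidelity, which is why learning the ratio per scale and layer, rather than fixing one global value, matters.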
Problem

Research questions and friction points this paper is trying to address.

- text-guided image editing
- visual autoregressive models
- structure preservation
- token localization
- structural consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

- visual autoregressive models
- structure preservation
- token localization
- feature injection
- reinforcement learning
πŸ”Ž Similar Papers
No similar papers found.