Discrete Noise Inversion for Next-scale Autoregressive Text-based Image Editing

📅 2025-09-02
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge of achieving high-fidelity, fine-grained text-guided image editing without model fine-tuning. To this end, we propose VARIN, a novel framework that pioneers the integration of noise inversion into Visual Autoregressive (VAR) models. VARIN introduces a Location-aware Argmax Inversion (LAI) mechanism, enabling reversible modeling in the Gumbel noise space and facilitating precise, text-driven edits. Leveraging pseudo-inverse functions over discrete token sequences and structural preservation constraints, VARIN performs accurate content replacement and attribute modification without updating any model parameters. Extensive experiments demonstrate that VARIN consistently preserves the original image's structure, background, and fine details while significantly improving editing accuracy and robustness across diverse text instructions. Quantitative and qualitative evaluations show that VARIN outperforms state-of-the-art zero-shot editing methods in generation quality and fidelity.

πŸ“ Abstract
Visual autoregressive models (VAR) have recently emerged as a promising class of generative models, achieving performance comparable to diffusion models in text-to-image generation tasks. While conditional generation has been widely explored, the ability to perform prompt-guided image editing without additional training is equally critical, as it supports numerous practical real-world applications. This paper investigates the text-to-image editing capabilities of VAR by introducing Visual AutoRegressive Inverse Noise (VARIN), the first noise inversion-based editing technique designed explicitly for VAR models. VARIN leverages a novel pseudo-inverse function for argmax sampling, named Location-aware Argmax Inversion (LAI), to generate inverse Gumbel noises. These inverse noises enable precise reconstruction of the source image and facilitate targeted, controllable edits aligned with textual prompts. Extensive experiments demonstrate that VARIN effectively modifies source images according to specified prompts while significantly preserving the original background and structural details, thus validating its efficacy as a practical editing approach.
Problem

Research questions and friction points this paper is trying to address.

Enables prompt-guided image editing without additional training
Inverts noise for precise reconstruction and targeted text-aligned edits
Preserves original background and structural details during editing
Innovation

Methods, ideas, or system contributions that make the work stand out.

VARIN for autoregressive model image editing
Location-aware Argmax Inversion for noise generation
Inverse Gumbel noises enable precise image reconstruction
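
The idea of inverting argmax sampling into Gumbel noise can be illustrated with a toy sketch. This is not the paper's Location-aware Argmax Inversion: the helper `naive_argmax_inversion`, the `margin` parameter, and the random logits and tokens are all hypothetical stand-ins chosen for illustration. The sketch also assumes the logits are fixed per position, whereas in a real next-scale autoregressive model the logits at later scales depend on earlier tokens.

```python
import numpy as np


def sample_gumbel(shape, rng):
    """Draw standard Gumbel(0, 1) noise via the inverse-CDF transform."""
    u = rng.uniform(low=1e-12, high=1.0, size=shape)
    return -np.log(-np.log(u))


def naive_argmax_inversion(logits, tokens, rng, margin=1e-3):
    """Return noise such that argmax(logits + noise) == tokens in every row.

    Hypothetical pseudo-inverse (NOT the paper's LAI): draw fresh Gumbel
    noise, then raise the noise at the observed token just enough for it
    to win the argmax by `margin`.
    """
    noise = sample_gumbel(logits.shape, rng)
    perturbed = logits + noise
    rows = np.arange(logits.shape[0])
    # How far the observed token is below the current winner (>= 0).
    deficit = perturbed.max(axis=-1) - perturbed[rows, tokens]
    noise[rows, tokens] += deficit + margin
    return noise


rng = np.random.default_rng(0)

# Stand-ins for the model's logits under the source prompt and for the
# discrete tokens obtained by encoding the source image.
source_logits = rng.normal(size=(6, 10))
source_tokens = rng.integers(0, 10, size=6)

# Inversion: recover noise that explains the observed tokens.
noise = naive_argmax_inversion(source_logits, source_tokens, rng)

# Reconstruction: replaying the same noise with the source logits
# reproduces the source tokens exactly.
reconstructed = np.argmax(source_logits + noise, axis=-1)

# Editing: a new prompt is simulated by strongly shifting the logits at
# positions 1 and 4 toward token 7; all other positions keep their logits.
edited_logits = source_logits.copy()
edited_logits[[1, 4], 7] += 50.0
edited_tokens = np.argmax(edited_logits + noise, axis=-1)

print("source       :", source_tokens)
print("reconstructed:", reconstructed)
print("edited       :", edited_tokens)
```

In this toy setting, positions whose logits are unchanged by the edit keep their source tokens because the replayed noise is identical, which loosely mirrors the background and structure preservation described above. Only positions where the edited logits differ enough to overturn the argmax change.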