🤖 AI Summary
Existing text-guided image editing methods regenerate the full image, which wastes computation and degrades structure in regions that were not meant to change. This work proposes an autoregressive local editing paradigm that models editing as Next Editing-token Prediction (NEP), regenerating only the regions to be edited without global reconstruction. The key contributions are: (i) an any-order autoregressive text-to-image (T2I) pre-training framework that enables zero-shot editing of arbitrary regions and can be adapted to NEP; (ii) test-time iterative refinement and token expansion strategies for improved fidelity and flexibility. On widely used image editing benchmarks, the method achieves state-of-the-art performance while avoiding the cost of regenerating non-edited regions. It also preserves the semantic coherence and geometric consistency of the original image, particularly in non-edited areas, better than prior full-image generation approaches.
📝 Abstract
Text-guided image editing involves modifying a source image based on a language instruction and typically requires changes to only small local regions. However, existing approaches generate the entire target image rather than selectively regenerating only the intended editing areas. This results in (1) unnecessary computational costs and (2) a bias toward reconstructing non-editing regions, which compromises the quality of the intended edits. To resolve these limitations, we propose to formulate image editing as Next Editing-token Prediction (NEP) based on autoregressive image generation, where only the regions that need to be edited are regenerated, thus avoiding unintended modifications to the non-editing areas. To enable any-region editing, we propose to pre-train an any-order autoregressive text-to-image (T2I) model. Once trained, it is capable of zero-shot image editing and can be easily adapted to NEP for image editing, where it achieves a new state of the art on widely used image editing benchmarks. Moreover, our model naturally supports test-time scaling (TTS) by iteratively refining its generation in a zero-shot manner. The project page is: https://nep-bigai.github.io/
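To make the idea of regenerating only the editing tokens concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the image is assumed to be already tokenized into a flat list of integer ids, and `next_editing_token_prediction`, `toy_predictor`, the `Predictor` signature, and `mask_id` are hypothetical names introduced only for this example. A real system would replace `toy_predictor` with an any-order autoregressive T2I model.

```python
from typing import Callable, List, Sequence

# Hypothetical predictor signature (an assumption for this sketch):
# (current tokens, position to fill, text instruction) -> new token id.
Predictor = Callable[[List[int], int, str], int]


def next_editing_token_prediction(
    source_tokens: Sequence[int],
    edit_positions: Sequence[int],
    prompt: str,
    predict_token: Predictor,
    mask_id: int = -1,
) -> List[int]:
    """Regenerate only the tokens at edit_positions; copy all others."""
    tokens = list(source_tokens)  # work on a copy, leave the source intact

    # Hide the tokens to be edited so the predictor cannot simply copy them.
    for pos in edit_positions:
        tokens[pos] = mask_id

    # Fill the editing positions one at a time, in the order given.
    # Each prediction can condition on the untouched source tokens and on
    # the editing tokens that were already regenerated.
    for pos in edit_positions:
        tokens[pos] = predict_token(tokens, pos, prompt)

    return tokens


def toy_predictor(tokens: List[int], pos: int, prompt: str) -> int:
    """Stand-in for a trained model: a deterministic dummy token id."""
    return (len(prompt) + pos) % 1024


if __name__ == "__main__":
    source = list(range(100, 116))   # a 4x4 token grid, flattened
    region = [5, 6, 9, 10]           # the central 2x2 block to edit
    edited = next_editing_token_prediction(
        source, region, "add a red hat", toy_predictor
    )
    unchanged = [i for i in range(len(source)) if edited[i] == source[i]]
    print("predictor calls:", len(region), "of", len(source), "tokens")
    print("positions left untouched:", unchanged)
```

In this sketch the predictor is called once per editing token rather than once per image token, and positions outside the region are copied verbatim, which mirrors the two motivations stated in the abstract: lower cost and no unintended changes to non-editing areas. Because the loop accepts positions in any order, it also hints at why an any-order model is needed for arbitrary regions.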