🤖 AI Summary
Existing instruction-driven image editing methods rely heavily on large-scale, high-quality triplets (instruction, source image, edited image) for supervised training, incurring prohibitive computational costs and limiting edit fidelity when instruction semantics are imprecise. This paper proposes a novel paradigm that reformulates image editing as a degenerate temporal process, leveraging single-frame evolution priors learned from video pretraining to enable fine-grained, data-efficient cross-modal collaborative editing. The approach integrates multimodal foundation models, diffusion-based architectures, and video temporal modeling. With only about 1% of the supervision required by state-of-the-art methods, it matches the performance of the best open-source baselines. Key contributions include: (i) the first framing of image editing as a degenerate temporal process; (ii) a substantial reduction in dependence on annotated supervision; and (iii) high-fidelity, controllable editing within a unified instruction-visual joint embedding space.
📝 Abstract
We observe that recent advances in multimodal foundation models have propelled instruction-driven image generation and editing into a genuinely cross-modal, cooperative regime. Nevertheless, state-of-the-art editing pipelines remain costly: beyond training large diffusion/flow models, they require curating massive high-quality triplets of {instruction, source image, edited image} to cover diverse user intents. Moreover, the fidelity of visual replacements hinges on how precisely the instruction references the target semantics. We revisit this challenge through the lens of temporal modeling: if video can be regarded as a full temporal process, then image editing can be seen as a degenerate temporal process. This perspective allows us to transfer single-frame evolution priors from video pre-training, enabling a highly data-efficient fine-tuning regime. Empirically, our approach matches the performance of leading open-source baselines while using only about one percent of the supervision demanded by mainstream editing models.
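The "degenerate temporal process" framing can be made concrete with a toy sketch: treat an edit as a two-frame pseudo-clip in which frame 0 is the fixed source image and frame 1 is iteratively refined by a stand-in for a video model's single-frame evolution prior. All names and the blending rule below are illustrative assumptions, not the paper's actual method or API.

```python
import numpy as np

def edit_as_two_frame_clip(source, init_frame, steps=10, alpha=0.5):
    """Conceptual sketch only: an image edit viewed as a degenerate
    two-frame 'video'. Frame 0 holds the source image; frame 1 starts
    from an arbitrary initialization and is pulled toward the source
    by a toy evolution step standing in for a learned temporal prior."""
    clip = np.stack([source, init_frame])  # shape (2, H, W): a pseudo-clip
    for _ in range(steps):
        # Toy single-frame evolution: blend the edited frame toward
        # the source frame, mimicking temporal consistency pressure.
        clip[1] = alpha * clip[1] + (1 - alpha) * clip[0]
    return clip[1]  # the "edited" frame

source = np.ones((4, 4))
edited = edit_as_two_frame_clip(source, np.zeros((4, 4)), steps=20)
```

In the real pipeline this blending step would be replaced by a pretrained video diffusion model's denoiser conditioned on the instruction; the point of the sketch is only that the two-frame structure lets video priors act on a single-image edit.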