SpecEdit: Training-Free Acceleration for Diffusion based Image Editing via Semantic Locking

📅 2026-05-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing diffusion-based image editing methods suffer from high computational costs, and their dynamic resolution strategies often rely on low-level heuristics that fail to align with semantic editing requirements, leading to structural inconsistencies and redundant computation. This work proposes a training-free dynamic resolution framework employing a “draft-and-verify” mechanism: it first generates a low-resolution semantic draft, then identifies regions requiring editing based on token-level semantic discrepancies, and performs denoising only on these regions at high resolution while retaining the rest at low resolution. The method introduces, for the first time, a semantic locking mechanism that enables precise localization and efficient processing of edit-relevant areas. Experiments demonstrate speedups of up to 10× on Qwen-Image-Edit and 7× on FLUX.1-Kontext-dev, with combined optimizations achieving up to 13× acceleration while preserving high-quality editing results.

📝 Abstract

Diffusion-based image editing offers strong semantic controllability, but remains computationally expensive due to iterative high-resolution denoising over all spatial tokens. Dynamic-resolution sampling reduces this cost by performing early steps at reduced resolution. However, existing approaches prioritize upsampling using low-level heuristics such as edge detection or channel variance, which are weakly aligned with editing semantics and may lead to structural inconsistency. Moreover, spatial regions are often upsampled without verifying whether semantic modification is actually required, resulting in redundant high-resolution computation and accumulated errors. To address these issues, we propose SpecEdit, a training-free dynamic-resolution framework tailored for diffusion-based image editing. SpecEdit follows a draft-and-verify scheme: a low-resolution draft first estimates the semantic outcome, after which token-level discrepancies are used to identify edit-relevant tokens for high-resolution denoising, while the remaining tokens stay at a coarse resolution. Experiments on Qwen-Image-Edit and FLUX.1-Kontext-dev demonstrate up to 10x and 7x acceleration, respectively, while maintaining strong editing quality. SpecEdit is complementary to step distillation and other acceleration techniques, achieving up to 13x speedup when combined with existing methods. Our code is included in the supplementary material and will be released on GitHub.
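
The verify step of the draft-and-verify scheme can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how token-level discrepancy is measured, so the cosine distance, the fixed `threshold`, and all function names below are assumptions chosen only for illustration. The sketch compares source tokens with low-resolution draft tokens, marks the tokens that changed, and maps that mask onto a finer token grid where high-resolution denoising would be applied.

```python
import numpy as np


def token_discrepancy(source_tokens, draft_tokens, eps=1e-8):
    """Per-token cosine distance between two (H, W, C) token grids.

    Cosine distance is an assumed stand-in for the paper's discrepancy measure.
    """
    dot = np.sum(source_tokens * draft_tokens, axis=-1)
    norms = (np.linalg.norm(source_tokens, axis=-1)
             * np.linalg.norm(draft_tokens, axis=-1))
    return 1.0 - dot / (norms + eps)


def select_edit_tokens(source_tokens, draft_tokens, threshold=0.1):
    """Boolean (H, W) mask of tokens whose discrepancy exceeds `threshold`.

    The threshold value is hypothetical, not taken from the paper.
    """
    return token_discrepancy(source_tokens, draft_tokens) > threshold


def upsample_mask(mask, scale):
    """Map a low-resolution token mask onto a grid `scale` times finer."""
    return np.repeat(np.repeat(mask, scale, axis=0), scale, axis=1)


# Toy data: an 8x8 grid of 16-dimensional tokens standing in for latents.
rng = np.random.default_rng(0)
source = rng.normal(size=(8, 8, 16))
draft = source.copy()
draft[2:4, 2:4] = -source[2:4, 2:4]  # pretend the draft edited a 2x2 patch

low_res_mask = select_edit_tokens(source, draft)
high_res_mask = upsample_mask(low_res_mask, scale=2)
fraction = high_res_mask.mean()  # share of tokens needing high-res denoising
print(low_res_mask.sum(), high_res_mask.shape, fraction)
```

In this toy setup only the tokens inside the edited patch are selected, so the fraction of the fine grid that would need high-resolution denoising is small. That is the intuition behind the reported savings, although the actual speedup depends on the model and on how the two resolutions are combined.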

Problem

Research questions and friction points this paper is trying to address.

diffusion-based image editing

dynamic-resolution sampling

semantic consistency

computational efficiency

high-resolution denoising

Innovation

Methods, ideas, or system contributions that make the work stand out.

SpecEdit

semantic locking

training-free acceleration