🤖 AI Summary
To address background distortion and editing failure in diffusion-based instruction editing caused by stochastic noise, this paper proposes ELECT, a zero-shot, unsupervised early seed selection framework. ELECT identifies high-reliability seeds during the initial sampling stage by quantifying inconsistency in background regions within the latent space at early diffusion timesteps, requiring no external validators or additional training. It enables collaborative optimization of seeds and prompts with multimodal large language models (MLLMs) and integrates seamlessly into instruction-guided editing pipelines. Experiments demonstrate that ELECT reduces average computational cost by 41% (up to 61%), significantly improves background consistency and instruction adherence, and raises the editing success rate on previously failed cases to approximately 40%.
📄 Abstract
Despite recent advances in diffusion models, achieving reliable image generation and editing remains challenging due to the inherent diversity induced by stochastic noise in the sampling process. Instruction-guided image editing with diffusion models offers user-friendly capabilities, yet editing failures, such as background distortion, frequently occur. Users often resort to trial and error, adjusting seeds or prompts to achieve satisfactory results, which is inefficient. While seed selection methods exist for Text-to-Image (T2I) generation, they depend on external verifiers, limiting their applicability, and evaluating multiple seeds increases computational cost. To address this, we first establish a multiple-seed image editing baseline using background consistency scores, achieving Best-of-N performance without supervision. Building on this, we introduce ELECT (Early-timestep Latent Evaluation for Candidate Selection), a zero-shot framework that selects reliable seeds by estimating background mismatches at early diffusion timesteps, identifying the seed that retains the background while modifying only the foreground. ELECT ranks seed candidates by a background inconsistency score, filtering out unsuitable samples early based on background consistency while preserving editability. Beyond standalone seed selection, ELECT integrates into instruction-guided editing pipelines and extends to Multimodal Large Language Models (MLLMs) for joint seed and prompt selection, further improving results when seed selection alone is insufficient. Experiments show that ELECT reduces computational costs (by 41 percent on average and up to 61 percent) while improving background consistency and instruction adherence, achieving around 40 percent success rates on previously failed cases, all without any external supervision or training.
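The core selection step described above, ranking seed candidates by a background inconsistency score computed from early-timestep latents and keeping the candidate that best preserves the background, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`background_inconsistency`, `select_seed`), the use of a mean squared latent difference as the score, and the boolean background mask are all assumptions for the sake of a runnable example.

```python
import numpy as np

def background_inconsistency(source_latent, candidate_latent, bg_mask):
    """Hypothetical score: mean squared difference between the source and
    candidate latents, restricted to background positions (bg_mask == True).
    Lower means the candidate better preserves the background."""
    diff = (candidate_latent - source_latent) ** 2
    return float(diff[bg_mask].mean())

def select_seed(source_latent, candidate_latents, bg_mask):
    """Rank candidates (one per seed) by background inconsistency at an
    early diffusion timestep and return (best_index, all_scores)."""
    scores = [background_inconsistency(source_latent, z, bg_mask)
              for z in candidate_latents]
    return int(np.argmin(scores)), scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src = rng.normal(size=(4, 8, 8))          # stand-in for an early latent
    bg_mask = np.ones(src.shape, dtype=bool)
    bg_mask[:, 2:6, 2:6] = False              # central square = foreground
    # Three "seeds": candidates with different amounts of background drift.
    cands = [src + rng.normal(scale=s, size=src.shape) for s in (0.5, 0.05, 0.3)]
    best, scores = select_seed(src, cands, bg_mask)
    print(best, [round(s, 4) for s in scores])
```

In the actual pipeline the candidates would be the partially denoised latents produced by each random seed after only a few early timesteps, so poor seeds can be discarded before paying the full sampling cost.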