🤖 AI Summary
This work addresses text-driven image and video editing, cast as an inpainting problem in which masked regions are reconstructed to remain consistent with both the observed content and the editing prompt. Existing test-time guidance methods for diffusion and flow models approximate the intractable guidance term with computationally expensive vector-Jacobian products (VJPs), limiting their practicality. Building on Moufad et al. (2025), the authors provide theoretical support for that work's VJP-free guidance approximation, which leverages pre-trained diffusion models without any additional training, and substantially extend its empirical evaluation. Experiments on large-scale image and video editing benchmarks demonstrate that this test-time approach matches, and in some cases surpasses, the performance of training-based methods while significantly improving inference efficiency and practical applicability.
📝 Abstract
Text-driven image and video editing can be naturally cast as inpainting problems, where masked regions are reconstructed to remain consistent with both the observed content and the editing prompt. Recent advances in test-time guidance for diffusion and flow models provide a principled framework for this task; however, existing methods rely on costly vector-Jacobian product (VJP) computations to approximate the intractable guidance term, limiting their practical applicability. Building upon the recent work of Moufad et al. (2025), we provide theoretical insights into their VJP-free approximation and substantially extend their empirical evaluation to large-scale image and video editing benchmarks. Our results demonstrate that test-time guidance alone can achieve performance comparable to, and in some cases surpass, training-based methods.
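To make the cost contrast concrete, here is a minimal toy sketch of why VJP-based guidance is expensive and what a VJP-free approximation looks like. In posterior-sampling-style guidance, the data-fidelity gradient is pulled back through the denoiser's Jacobian (a VJP, i.e., a backward pass through the network at every sampling step); a common Jacobian-free shortcut treats that Jacobian as the identity, so the guidance reduces to the masked residual. Everything here is illustrative: the linear "denoiser" `x0_hat`, the identity-Jacobian approximation, and all variable names are assumptions for exposition, not the exact scheme of Moufad et al. (2025).

```python
import numpy as np

# Toy linear "denoiser": x0_hat(x) = W @ x, so its Jacobian is exactly W.
# In a real diffusion model x0_hat is a neural network, and the VJP below
# would require a full backward pass at every sampling step.
rng = np.random.default_rng(0)
d = 4
W = 0.5 * np.eye(d) + 0.1 * rng.standard_normal((d, d))

def x0_hat(x):
    return W @ x

def vjp_guidance(x, y, mask):
    """Exact ascent direction on -0.5 * ||mask * (y - x0_hat(x))||^2.

    Requires the vector-Jacobian product J^T r with the denoiser's
    Jacobian J (here simply W, since the toy denoiser is linear).
    """
    r = mask * (y - x0_hat(x))  # residual on the observed region
    return W.T @ r              # VJP: the expensive step for a network

def vjp_free_guidance(x, y, mask):
    """Jacobian-free approximation: pretend J = I, so the guidance
    is just the masked residual itself (no backward pass needed)."""
    return mask * (y - x0_hat(x))

x = rng.standard_normal(d)                # current noisy iterate
y = rng.standard_normal(d)                # observed (unedited) content
mask = np.array([1.0, 1.0, 0.0, 0.0])     # 1 = observed, 0 = region to edit

g_exact = vjp_guidance(x, y, mask)
g_free = vjp_free_guidance(x, y, mask)
```

In a sampler, either gradient would be added (suitably scaled) to the denoising update so the reconstruction stays consistent with the observed pixels; the VJP-free variant trades a small bias in the direction for the elimination of the backward pass.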