🤖 AI Summary
This work addresses a core challenge in video editing: simultaneously achieving high-quality foreground generation and background consistency. Existing approaches either introduce background artifacts through full-frame injection or overly constrain foreground synthesis by rigidly locking the background. To resolve this trade-off, we propose KV-Lock, a training-free, plug-and-play control mechanism that dynamically links denoising prediction variance—used as a hallucination metric—with the classifier-free guidance (CFG) scale. This linkage enables adaptive modulation of both the fusion ratio of background key-value caches and the CFG strength during inference. Implemented within DiT-based video diffusion models, KV-Lock enables precise attention-level control and consistently outperforms current methods across diverse editing tasks, significantly improving both foreground generation quality and background fidelity.
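The "attention-level control" described above can be pictured as blending cached background key/values with newly generated ones inside each attention layer. A minimal sketch follows; the linear blend, the per-token background mask, and the function name `fuse_kv` are illustrative assumptions, since the summary does not specify the exact fusion rule:

```python
import numpy as np

def fuse_kv(k_new, v_new, k_cached, v_cached, bg_mask, fusion):
    """Blend cached background KVs with freshly generated KVs (assumed linear blend).

    k_new, v_new, k_cached, v_cached: (tokens, dim) arrays.
    bg_mask: (tokens,) array, 1.0 where a token belongs to the background.
    fusion: scalar in [0, 1]; higher means stronger background locking.
    """
    w = fusion * bg_mask[:, None]          # lock only background tokens
    k = w * k_cached + (1.0 - w) * k_new   # background keys pulled toward the cache
    v = w * v_cached + (1.0 - w) * v_new   # same blend for values
    return k, v
```

With `fusion=1.0` the background tokens attend exactly as in the cached (unedited) pass, while foreground tokens (mask 0) are untouched, which matches the plug-and-play, training-free description.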
📝 Abstract
Maintaining background consistency while enhancing foreground quality remains a core challenge in video editing. Injecting full-image information often leads to background artifacts, whereas rigid background locking severely constrains the model's capacity for foreground generation. To address this trade-off, we propose KV-Lock, a training-free framework tailored for DiT-based video diffusion models. Our core insight is that the hallucination metric (variance of denoising predictions) directly quantifies generation diversity, which is inherently linked to the classifier-free guidance (CFG) scale. Building on this, KV-Lock leverages diffusion hallucination detection to dynamically schedule two key components: the fusion ratio between cached background key-values (KVs) and newly generated KVs, and the CFG scale. When hallucination risk is detected, KV-Lock strengthens background KV locking and simultaneously amplifies conditional guidance for foreground generation, thereby mitigating artifacts and improving generation fidelity. As a training-free, plug-and-play module, KV-Lock can be easily integrated into any pre-trained DiT-based model. Extensive experiments validate that our method outperforms existing approaches, achieving improved foreground quality with high background fidelity across various video editing tasks.
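The scheduling logic in the abstract (variance of denoising predictions drives both the KV fusion ratio and the CFG scale in the same direction) can be sketched as follows. The specific variance estimator, the exponential risk mapping, and all constants (`base_cfg`, `base_fusion`, `sensitivity`) are assumptions for illustration, not the paper's actual schedule:

```python
import numpy as np

def hallucination_score(preds):
    """Hallucination metric: mean per-element variance across a set of
    denoising predictions for the same step (assumed estimator)."""
    return float(np.var(np.stack(preds), axis=0).mean())

def kv_lock_schedule(score, base_cfg=7.5, base_fusion=0.5, sensitivity=4.0):
    """Map the hallucination score to (fusion ratio, CFG scale).

    Higher score -> stronger background KV locking AND stronger
    conditional guidance, as described in the abstract.
    """
    risk = 1.0 - np.exp(-sensitivity * score)          # squash score into [0, 1)
    fusion = base_fusion + (1.0 - base_fusion) * risk  # weight on cached background KVs
    cfg = base_cfg * (1.0 + risk)                      # amplified guidance under risk
    return fusion, cfg
```

At zero variance this reduces to the base settings; as prediction variance grows, the fusion ratio approaches 1 (full background locking) while the CFG scale increases, mirroring the coupled behavior the abstract describes.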