🤖 AI Summary
This work addresses a core challenge in video editing: simultaneously achieving high-quality foreground generation and background consistency. Existing approaches either introduce background artifacts through full-frame injection or overly constrain foreground synthesis by rigidly locking the background. To resolve this trade-off, we propose KV-Lock, a training-free, plug-and-play control mechanism that dynamically links denoising prediction variance—used as a hallucination metric—with the classifier-free guidance (CFG) scale. This linkage enables adaptive modulation of both the fusion ratio of background key-value caches and the CFG strength during inference. Implemented within DiT-based video diffusion models, KV-Lock enables precise attention-level control and consistently outperforms current methods across diverse editing tasks, significantly improving both foreground generation quality and background fidelity.
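The "attention-level control" described above can be pictured as blending cached background key/values with newly generated ones inside each attention layer. A minimal sketch follows; the linear blend, the per-token background mask, and the function name `fuse_kv` are illustrative assumptions, since the summary does not specify the exact fusion rule:

```python
import numpy as np

def fuse_kv(k_new, v_new, k_cached, v_cached, bg_mask, fusion):
    """Blend cached background KVs with freshly generated KVs (assumed linear blend).

    k_new, v_new, k_cached, v_cached: (tokens, dim) arrays.
    bg_mask: (tokens,) array, 1.0 where a token belongs to the background.
    fusion: scalar in [0, 1]; higher means stronger background locking.
    """
    w = fusion * bg_mask[:, None]          # lock only background tokens
    k = w * k_cached + (1.0 - w) * k_new   # background keys pulled toward the cache
    v = w * v_cached + (1.0 - w) * v_new   # same blend for values
    return k, v
```

With `fusion=1.0` the background tokens attend exactly as in the cached (unedited) pass, while foreground tokens (mask 0) are untouched, which matches the plug-and-play, training-free description.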
📝 Abstract
Maintaining background consistency while enhancing foreground quality remains a core challenge in video editing. Injecting full-image information often leads to background artifacts, whereas rigid background locking severely constrains the model's capacity for foreground generation. To address this trade-off, we propose KV-Lock, a training-free framework tailored for DiT-based video diffusion models. Our core insight is that the hallucination metric (variance of denoising predictions) directly quantifies generation diversity, which is inherently linked to the classifier-free guidance (CFG) scale. Building on this, KV-Lock leverages diffusion hallucination detection to dynamically schedule two key components: the fusion ratio between cached background key-values (KVs) and newly generated KVs, and the CFG scale. When hallucination risk is detected, KV-Lock strengthens background KV locking and simultaneously amplifies conditional guidance for foreground generation, thereby mitigating artifacts and improving generation fidelity. As a training-free, plug-and-play module, KV-Lock can be easily integrated into any pre-trained DiT-based model. Extensive experiments validate that our method outperforms existing approaches, achieving improved foreground quality with high background fidelity across various video editing tasks.
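The scheduling logic in the abstract (variance of denoising predictions drives both the KV fusion ratio and the CFG scale in the same direction) can be sketched as follows. The specific variance estimator, the exponential risk mapping, and all constants (`base_cfg`, `base_fusion`, `sensitivity`) are assumptions for illustration, not the paper's actual schedule:

```python
import numpy as np

def hallucination_score(preds):
    """Hallucination metric: mean per-element variance across a set of
    denoising predictions for the same step (assumed estimator)."""
    return float(np.var(np.stack(preds), axis=0).mean())

def kv_lock_schedule(score, base_cfg=7.5, base_fusion=0.5, sensitivity=4.0):
    """Map the hallucination score to (fusion ratio, CFG scale).

    Higher score -> stronger background KV locking AND stronger
    conditional guidance, as described in the abstract.
    """
    risk = 1.0 - np.exp(-sensitivity * score)          # squash score into [0, 1)
    fusion = base_fusion + (1.0 - base_fusion) * risk  # weight on cached background KVs
    cfg = base_cfg * (1.0 + risk)                      # amplified guidance under risk
    return fusion, cfg
```

At zero variance this reduces to the base settings; as prediction variance grows, the fusion ratio approaches 1 (full background locking) while the CFG scale increases, mirroring the coupled behavior the abstract describes.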