🤖 AI Summary
This work addresses temporal incoherence in text-to-video diffusion models, which shows up as flickering, drifting, and unstable motion. It links these artifacts to non-smooth temporal diagonals in the model's intermediate self-attention maps: incoherent videos show irregular, fragmented diagonals, while stable motion produces smooth, band-diagonal patterns. Building on this observation, the authors propose TeDiO, a plug-and-play, training-free method that runs at inference time and leaves model weights unchanged. It estimates diagonal smoothness from intermediate-layer self-attention maps, identifies temporally unstable regions, and applies lightweight updates to the latent representations to improve inter-frame consistency, without external motion supervision. Across video diffusion models such as Wan2.1 and CogVideoX, the authors report markedly smoother motion with per-frame visual quality preserved.
📝 Abstract
Recent text-to-video diffusion transformers generate visually compelling frames, yet still struggle with temporal coherence, often producing flickering, drifting, or unstable motion. We show that these failures leave a clear imprint inside the model: incoherent videos consistently exhibit irregular, fragmented temporal diagonals in their intermediate self-attention maps, whereas stable motion corresponds to smooth, band-diagonal patterns. Building on this observation, we introduce TeDiO, a training-free, inference-time method that reinforces temporal consistency by regularizing these internal attention patterns. TeDiO estimates diagonal smoothness, identifies unstable regions, and performs lightweight latent updates that promote coherent frame-to-frame dynamics, without modifying model weights or using external motion supervision. Across multiple video diffusion models (e.g., Wan2.1, CogVideoX), TeDiO delivers markedly smoother motion while preserving per-frame visual quality, offering an efficient plug-and-play approach to improving dynamic realism in modern video generation systems.
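To make the "temporal diagonal" idea concrete, here is a minimal NumPy sketch of one way diagonal smoothness could be measured. It is not TeDiO's actual formulation, which the abstract does not specify. The function names (`frame_attention`, `diagonal_roughness`, `unstable_frames`), the frame-major token layout, the squared-difference roughness score, and the threshold are all assumptions made for illustration.

```python
import numpy as np


def frame_attention(attn, num_frames, tokens_per_frame):
    """Collapse a token-level attention map into a frame-to-frame map.

    Assumes (hypothetically) that `attn` has shape (F*N, F*N) with tokens
    laid out frame-major, i.e. token index = frame * N + spatial_index.
    Entry (i, j) of the result is the average attention mass that query
    tokens in frame i place on key tokens in frame j.
    """
    a = attn.reshape(num_frames, tokens_per_frame, num_frames, tokens_per_frame)
    return a.sum(axis=3).mean(axis=1)


def diagonal_roughness(frame_attn, offset=0):
    """Illustrative score: mean squared first difference along one diagonal.

    Lower values mean a smoother diagonal. This is an assumed metric, not
    the smoothness estimate used in the paper.
    """
    d = np.diagonal(frame_attn, offset=offset)
    if d.size < 2:
        return 0.0
    return float(np.mean(np.diff(d) ** 2))


def unstable_frames(frame_attn, threshold):
    """Return indices t where the main diagonal jumps by more than
    `threshold` between frame t and frame t + 1 (assumed criterion)."""
    d = np.diagonal(frame_attn)
    return np.nonzero(np.abs(np.diff(d)) > threshold)[0]


# Toy comparison: a smooth band-diagonal map vs. one with a fragmented diagonal.
F = 6
smooth = 0.6 * np.eye(F) + 0.2 * (np.eye(F, k=1) + np.eye(F, k=-1))
fragmented = smooth.copy()
np.fill_diagonal(fragmented, [0.6, 0.1, 0.6, 0.1, 0.6, 0.1])

print("smooth roughness:    ", diagonal_roughness(smooth))
print("fragmented roughness:", diagonal_roughness(fragmented))
print("unstable transitions:", unstable_frames(fragmented, threshold=0.3))
```

In a method of this kind, a score like this would presumably serve as the objective for the latent updates, with gradients taken with respect to the latents rather than the weights. That step needs a differentiable model and is left out of the sketch.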