TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion

📅 2026-05-13

🤖 AI Summary
This work addresses temporal incoherence in text-to-video diffusion models, which shows up as flickering, drifting, and unstable motion. It links these artifacts to non-smoothness along the temporal diagonals of intermediate self-attention maps: incoherent videos show irregular, fragmented diagonals, while stable motion produces smooth, band-diagonal patterns. To mitigate the problem, the authors propose TeDiO, a plug-and-play, training-free method that runs at inference time and leaves model weights unchanged. It estimates diagonal smoothness from the self-attention maps of intermediate layers, identifies temporally unstable regions, and applies lightweight updates to the latent representations to improve inter-frame consistency. Experiments on video diffusion models such as Wan2.1 and CogVideoX show markedly smoother motion while per-frame visual quality is preserved.
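To make the notion of a "temporal diagonal" concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's estimator: the frame-major token layout, the helper names `frame_attention` and `diagonal_roughness`, and the use of a mean squared second difference as the roughness score are all hypothetical choices made for this example.

```python
import numpy as np


def frame_attention(attn, num_frames):
    """Collapse a token-level attention map into a frame-to-frame map.

    Assumption: tokens are ordered frame-major, i.e. all N spatial tokens of
    frame 0, then all N tokens of frame 1, and so on.
    attn has shape (F*N, F*N); the result has shape (F, F), where entry
    [i, j] is the attention mass that queries in frame i place on keys in
    frame j, averaged over the queries of frame i.
    """
    total = attn.shape[0]
    if total % num_frames != 0:
        raise ValueError("token count must be divisible by num_frames")
    n = total // num_frames
    blocks = attn.reshape(num_frames, n, num_frames, n)
    return blocks.sum(axis=3).mean(axis=1)


def diagonal_roughness(frame_attn, offset=1):
    """Hypothetical roughness score for one temporal diagonal.

    Takes the diagonal `offset` frames away from the main diagonal and
    returns the mean squared second difference along it. A constant or
    linearly varying diagonal scores 0; an irregular one scores higher.
    """
    diag = np.diagonal(frame_attn, offset=offset)
    if diag.size < 3:
        return 0.0
    return float(np.mean(np.diff(diag, n=2) ** 2))


# Toy data (unnormalized for simplicity): a smooth band-diagonal pattern
# versus one where every other query frame attends half as strongly.
frames, tokens_per_frame = 8, 4
idx = np.arange(frames)
smooth = np.exp(-0.5 * (idx[:, None] - idx[None, :]) ** 2)
scale = np.where(idx % 2 == 0, 1.0, 0.5)
fragmented = smooth * scale[:, None]

# Expand the smooth frame-level pattern to a token-level map and recover it.
token_map = np.kron(smooth, np.full((tokens_per_frame, tokens_per_frame),
                                    1.0 / tokens_per_frame))
recovered = frame_attention(token_map, frames)

print("smooth roughness:    ", diagonal_roughness(recovered))
print("fragmented roughness:", diagonal_roughness(fragmented))
```

The smooth pattern has constant diagonals, so its score is essentially zero, while the alternating pattern scores clearly higher. In a real diffusion transformer the map would come from an intermediate self-attention layer, and the layout of frames and spatial tokens depends on the model.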
📝 Abstract
Recent text-to-video diffusion transformers generate visually compelling frames, yet still struggle with temporal coherence, often producing flickering, drifting, or unstable motion. We show that these failures leave a clear imprint inside the model: incoherent videos consistently exhibit irregular, fragmented temporal diagonals in their intermediate self-attention maps, whereas stable motion corresponds to smooth, band-diagonal patterns. Building on this observation, we introduce TeDiO, a training-free, inference-time method that reinforces temporal consistency by regularizing these internal attention patterns. TeDiO estimates diagonal smoothness, identifies unstable regions, and performs lightweight latent updates that promote coherent frame-to-frame dynamics, without modifying model weights or using external motion supervision. Across multiple video diffusion models (e.g., Wan2.1, CogVideoX), TeDiO delivers markedly smoother motion while preserving per-frame visual quality, offering an efficient plug-and-play approach to improving dynamic realism in modern video generation systems.
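The "lightweight latent updates" described above can be pictured as a gradient step on the latents that lowers a diagonal-roughness loss. The toy sketch below illustrates that idea only. Everything in it is an assumption made for the example: the per-frame latent vectors, the stand-in attention computed directly from them, the loss, the finite-difference gradient, and the backtracking step size. An actual implementation would presumably differentiate through the model's own attention layers with automatic differentiation.

```python
import numpy as np


def temporal_attention(latents):
    """Toy stand-in: frame-to-frame softmax attention from (F, D) latents."""
    _, dim = latents.shape
    logits = latents @ latents.T / np.sqrt(dim)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def roughness_loss(latents, offsets=(1, 2)):
    """Sum of squared first differences along the chosen temporal diagonals."""
    attn = temporal_attention(latents)
    loss = 0.0
    for k in offsets:
        diag = np.diagonal(attn, offset=k)
        loss += float(np.sum(np.diff(diag) ** 2))
    return loss


def numerical_grad(fn, x, eps=1e-5):
    """Central finite-difference gradient; only practical for tiny toy inputs."""
    grad = np.zeros_like(x)
    for index in np.ndindex(x.shape):
        step = np.zeros_like(x)
        step[index] = eps
        grad[index] = (fn(x + step) - fn(x - step)) / (2 * eps)
    return grad


def latent_update(latents, step_size=1.0, max_halvings=20):
    """One descent step on the latents, halving the step until the loss drops.

    Returns the input unchanged if no improving step is found.
    """
    base = roughness_loss(latents)
    grad = numerical_grad(roughness_loss, latents)
    for _ in range(max_halvings):
        candidate = latents - step_size * grad
        if roughness_loss(candidate) < base:
            return candidate
        step_size *= 0.5
    return latents


rng = np.random.default_rng(0)
z = rng.normal(size=(8, 4))  # 8 frames, 4-dimensional toy latents
z_new = latent_update(z)
print("loss before:", roughness_loss(z))
print("loss after: ", roughness_loss(z_new))
```

Model weights are never touched here; only the latents change, which mirrors the training-free, inference-time framing of the abstract. The backtracking loop is a design choice of this sketch so that the update never increases the toy loss.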
Problem

Research questions and friction points this paper is trying to address.

temporal coherence
video diffusion
flickering
motion instability
attention maps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Coherence
Training-Free
Self-Attention Diagonals
Video Diffusion
Inference-Time Optimization