AI Summary
DiT-based video generation models can suffer from weak temporal coherence and degraded visual fidelity. To address this, we propose Enhance-A-Video, a training-free, plug-and-play method that strengthens cross-frame correlations at inference time by using the off-diagonal (non-diagonal) entries of the temporal attention maps, improving temporal consistency and visual quality without modifying model parameters or requiring retraining or fine-tuning. Thanks to its simple design, the method can be applied to most DiT-based video generation frameworks. Experiments across several DiT-based video generation models show promising improvements in both temporal consistency and visual quality.
Abstract
DiT-based video generation has achieved remarkable results, but the enhancement of existing models remains relatively unexplored. In this work, we introduce Enhance-A-Video, a training-free approach to enhance the coherence and quality of DiT-based generated videos. The core idea is to enhance cross-frame correlations based on non-diagonal temporal attention distributions. Thanks to its simple design, our approach can be easily applied to most DiT-based video generation frameworks without any retraining or fine-tuning. Across various DiT-based video generation models, our approach demonstrates promising improvements in both temporal consistency and visual quality. We hope this research can inspire future explorations in video generation enhancement.
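The abstract does not spell out how the non-diagonal temporal attention is used, so the following is only a rough NumPy sketch of one way the idea could be read, not the authors' implementation. It assumes a per-location temporal attention map of shape (frames, frames), measures cross-frame correlation as the mean of its off-diagonal entries, and uses that value to scale the temporal attention output. The function names, the `temperature` parameter, and the clamp at 1 are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_attention_map(q, k):
    """Attention across frames for one spatial location and one head.

    q, k: arrays of shape (frames, dim). Returns a (frames, frames)
    map whose rows sum to 1.
    """
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d), axis=-1)


def cross_frame_intensity(attn):
    """Mean of the off-diagonal entries of a (frames, frames) map.

    Diagonal entries describe a frame attending to itself, so the
    off-diagonal mean is used here as a proxy for cross-frame correlation.
    """
    f = attn.shape[0]
    off_diag = ~np.eye(f, dtype=bool)
    return float(attn[off_diag].mean())


def enhance_temporal_output(attn_out, attn, temperature=1.0):
    """Scale the temporal attention output by the cross-frame intensity.

    `temperature` is a hypothetical strength parameter (an assumption).
    The factor is clamped to at least 1 so the output is never attenuated.
    """
    scale = max(1.0, temperature * cross_frame_intensity(attn))
    return scale * attn_out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames, dim = 8, 16
    q = rng.standard_normal((frames, dim))
    k = rng.standard_normal((frames, dim))
    v = rng.standard_normal((frames, dim))

    attn = temporal_attention_map(q, k)
    out = attn @ v  # ordinary temporal attention output
    enhanced = enhance_temporal_output(out, attn, temperature=float(frames))
    print("cross-frame intensity:", cross_frame_intensity(attn))
    print("output shape:", enhanced.shape)
```

Because the sketch only rescales an output that the model already computes, it leaves the weights untouched, which matches the training-free, plug-and-play framing in the abstract. How such a scale would be wired into a specific DiT implementation depends on that model's attention code.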