ByTheWay: Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way

📅 2024-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-video (T2V) models often produce near-static or distorted outputs, exhibiting structural implausibility, temporal inconsistency, and insufficient motion. This paper proposes ByTheWay, a lightweight, training-free inference-time optimization that introduces zero parameters, incurs no additional memory or sampling overhead, and requires only a single forward pass (<2% latency increase). The core contributions are twofold: (i) it uncovers two intrinsic correlations, between the disparity of temporal attention maps across decoder blocks and temporal distortion, and between the energy of those attention maps and motion amplitude; and (ii) based on these insights, it designs a dual-path optimization: Temporal Self-Guidance, which suppresses inter-block attention disparity, and Fourier-based energy modulation, which enhances motion richness. Evaluated across multiple state-of-the-art T2V models, the method reduces Fréchet Video Distance (FVD) by 18.7% and improves motion score by 32%, significantly boosting visual quality and dynamic fidelity.

📝 Abstract
Text-to-video (T2V) generation models, which offer convenient visual creation, have recently garnered increasing attention. Despite their substantial potential, the generated videos may present artifacts, including structural implausibility, temporal inconsistency, and a lack of motion, often resulting in near-static video. In this work, we have identified a correlation between the disparity of temporal attention maps across different blocks and the occurrence of temporal inconsistencies. Additionally, we have observed that the energy contained within the temporal attention maps is directly related to the magnitude of motion amplitude in the generated videos. Based on these observations, we present ByTheWay, a training-free method to improve the quality of text-to-video generation without introducing additional parameters or increasing memory or sampling time. Specifically, ByTheWay is composed of two principal components: 1) Temporal Self-Guidance improves the structural plausibility and temporal consistency of generated videos by reducing the disparity between the temporal attention maps across various decoder blocks. 2) Fourier-based Motion Enhancement enhances the magnitude and richness of motion by amplifying the energy of the temporal attention maps. Extensive experiments demonstrate that ByTheWay significantly improves the quality of text-to-video generation with negligible additional cost.
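The abstract's first component, Temporal Self-Guidance, reduces the disparity between temporal attention maps across decoder blocks. A minimal NumPy sketch of that idea is below; the function name, the choice of the first block as the guidance reference, and the blending weight `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def temporal_self_guidance(attn_maps, alpha=0.6):
    """Blend each decoder block's temporal attention map toward a
    reference block's map, shrinking inter-block disparity.

    attn_maps: list of arrays of shape (heads, frames, frames), row-stochastic
               over the last axis, ordered from the shallowest decoder block
               to the deepest. (Hypothetical layout for illustration.)
    alpha:     guidance strength in [0, 1]; alpha=0 leaves the maps unchanged.
    """
    reference = attn_maps[0]  # assume the first block supplies the guidance signal
    guided = [reference]
    for a in attn_maps[1:]:
        mixed = (1.0 - alpha) * a + alpha * reference
        # a convex combination of row-stochastic maps is row-stochastic;
        # renormalize anyway to guard against numerical drift
        mixed = mixed / mixed.sum(axis=-1, keepdims=True)
        guided.append(mixed)
    return guided
```

Because the blend is convex, each output row remains a valid attention distribution, and the distance of every block's map to the reference shrinks by a factor of (1 − alpha).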
Problem

Research questions and friction points this paper is trying to address.

Enhance text-to-video quality
Reduce temporal inconsistencies
Improve motion amplitude
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Self-Guidance reduces attention map disparity
Fourier-based Motion Enhancement amplifies map energy
Training-free method enhances video quality without added parameters
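The Fourier-based Motion Enhancement bullet above can be sketched as a 1-D FFT along the temporal axis of the attention map, with the motion-carrying (non-DC) frequency components amplified. The NumPy version below is an assumption-laden illustration: the uniform gain `beta`, preserving only the DC component, and the renormalization step are choices made here for clarity, not the paper's exact adaptive scheme.

```python
import numpy as np

def fourier_motion_enhancement(attn_map, beta=1.5, keep_dc=True):
    """Amplify the high-frequency energy of a temporal attention map.

    attn_map: array (heads, frames, frames), row-stochastic over the last
              axis; the FFT runs along the query-frame axis (axis -2).
    beta:     gain (>1) applied to every non-DC frequency component.
    """
    spectrum = np.fft.fft(attn_map, axis=-2)       # FFT along the temporal axis
    gain = np.full(attn_map.shape[-2], beta)
    if keep_dc:
        gain[0] = 1.0                              # preserve the static (DC) component
    spectrum = spectrum * gain[None, :, None]
    enhanced = np.fft.ifft(spectrum, axis=-2).real
    # clip and renormalize so each row remains a valid attention distribution
    enhanced = np.clip(enhanced, 1e-8, None)
    return enhanced / enhanced.sum(axis=-1, keepdims=True)
```

With a uniform gain on all non-DC frequencies, this is algebraically equivalent to `mean + beta * (attn_map - mean)` (mean taken over the temporal axis), i.e. it scales each map's deviation from its temporal average, which is a simple proxy for boosting motion amplitude.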