🤖 AI Summary
Training-free extension of video diffusion models to long videos often suffers from content drift, temporal inconsistency, and over-smoothed dynamics. Existing global-local methods separate appearance from motion using predefined rules, which is unreliable when the two are tightly coupled. This work proposes FreeSpec, a training-free spectral reconstruction framework that, for the first time, introduces singular value decomposition into long-video extension. FreeSpec adaptively fuses global and local features in the spectral domain: the global branch provides low-rank spectral guidance for long-range consistency, while the local branch serves as a high-rank basis that preserves spatial details and fine-grained temporal variations. By avoiding rigid feature partitioning, the approach improves long-video generation on Wan2.1 and LTX-Video, especially temporal dynamics, while maintaining strong visual quality and temporal consistency.
📝 Abstract
Video diffusion models perform well in short-video synthesis, but their training-free extension to long videos often suffers from content drift, temporal inconsistency, and over-smoothed dynamics. Existing methods improve temporal consistency by combining a global branch with a local branch, but they often further separate appearance consistency from temporal dynamics within each branch using predefined criteria. This assignment is unreliable when appearance and action progression are tightly coupled, as in camera motion and sequential motion. We analyze temporal extension of video from a singular-spectrum perspective and show that enlarged self-attention windows induce spectral concentration: spectral energy becomes dominated by a few low-rank singular directions, preserving coarse structure but suppressing high-rank spatial details and motion-rich temporal variations. To mitigate this problem, we propose FreeSpec, a training-free spectral reconstruction framework for long-video generation. FreeSpec decomposes global and local features with singular value decomposition, using the global branch as low-rank spectral guidance and the local branch as a high-rank reconstruction basis. This spectrum-level fusion avoids the rigid feature partitioning of previous decomposition rules, preserving long-range consistency while better retaining spatial details and temporal dynamics. Experiments on Wan2.1 and LTX-Video demonstrate that FreeSpec improves long-video generation, especially for temporal dynamics, while maintaining strong visual quality and temporal consistency. Project demo: https://fdchen24.github.io/FreeSpec-Website/
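To make the spectral view concrete, the following is a minimal NumPy sketch and not the FreeSpec implementation. It assumes features are flattened to a 2-D matrix (for example tokens by channels) and uses a fixed hard cutoff `rank`, whereas the abstract does not specify how the actual fusion is weighted. The function names `top_k_energy_ratio` and `spectral_fuse`, the `rank` parameter, and the toy inputs are all hypothetical, introduced only for illustration.

```python
import numpy as np


def top_k_energy_ratio(features: np.ndarray, k: int) -> float:
    """Share of squared singular-value energy held by the k largest
    singular values. Values near 1 indicate spectral concentration."""
    s = np.linalg.svd(features, compute_uv=False)  # sorted descending
    energy = s ** 2
    return float(energy[:k].sum() / energy.sum())


def spectral_fuse(global_feat: np.ndarray,
                  local_feat: np.ndarray,
                  rank: int) -> np.ndarray:
    """Illustrative fusion (an assumption, not the paper's rule):
    take the top-`rank` singular components from the global feature
    and the remaining components from the local feature."""
    ug, sg, vgt = np.linalg.svd(global_feat, full_matrices=False)
    ul, sl, vlt = np.linalg.svd(local_feat, full_matrices=False)
    # Low-rank part: coarse, long-range structure from the global branch.
    low = (ug[:, :rank] * sg[:rank]) @ vgt[:rank]
    # High-rank part: fine detail and motion from the local branch.
    high = (ul[:, rank:] * sl[rank:]) @ vlt[rank:]
    return low + high


# Toy data: a nearly rank-1 "global" feature and a full-rank "local" one.
rng = np.random.default_rng(0)
global_feat = (np.outer(rng.standard_normal(32), rng.standard_normal(16))
               + 0.01 * rng.standard_normal((32, 16)))
local_feat = rng.standard_normal((32, 16))

fused = spectral_fuse(global_feat, local_feat, rank=2)
print("global top-2 energy:", top_k_energy_ratio(global_feat, 2))
print("local  top-2 energy:", top_k_energy_ratio(local_feat, 2))
print("fused  top-2 energy:", top_k_energy_ratio(fused, 2))
```

In this sketch, `rank=0` returns the local feature unchanged and a `rank` equal to the smaller matrix dimension returns the global feature, so the cutoff acts as a dial between the two branches. A hard cutoff is the simplest possible choice; an adaptive scheme would choose or weight the components per layer or per denoising step.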