FreeSwim: Revisiting Sliding-Window Attention Mechanisms for Training-Free Ultra-High-Resolution Video Generation

📅 2025-11-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Training Transformer-based video generators end-to-end at ultra-high resolutions is infeasible due to the quadratic computational complexity of self-attention. This paper proposes a training-free, inference-time super-resolution method built on three key components: (1) inward sliding-window attention, which preserves each query's training-scale receptive field while enabling resolution scaling beyond training limits; (2) a dual-path architecture with a cross-attention override strategy, jointly ensuring fine-grained detail fidelity and global spatiotemporal coherence; and (3) a cross-attention caching strategy that substantially accelerates high-resolution inference. Applied directly to pre-trained video diffusion Transformers, the method achieves state-of-the-art performance on VBench, outperforming several training-based baselines, while generating ultra-high-resolution videos with sharp details and consistent spatiotemporal dynamics.

📝 Abstract
The quadratic time and memory complexity of the attention mechanism in modern Transformer-based video generators makes end-to-end training for ultra-high-resolution videos prohibitively expensive. Motivated by this limitation, we introduce a training-free approach that leverages video Diffusion Transformers pretrained at their native scale to synthesize higher-resolution videos without any additional training or adaptation. At the core of our method lies an inward sliding-window attention mechanism, which originates from a key observation: maintaining each query token's training-scale receptive field is crucial for preserving visual fidelity and detail. However, naive local window attention often leads to repetitive content and a lack of global coherence in the generated results. To overcome this challenge, we devise a dual-path pipeline that backs up window attention with a novel cross-attention override strategy, enabling the semantic content produced by local attention to be guided by another branch with a full receptive field, thereby ensuring holistic consistency. Furthermore, to improve efficiency, we incorporate a cross-attention caching strategy for this branch to avoid frequent computation of full 3D attention. Extensive experiments demonstrate that our method delivers ultra-high-resolution videos with fine-grained visual details and high efficiency in a training-free paradigm. Meanwhile, it achieves superior performance on VBench, even compared to training-based alternatives, with competitive or improved efficiency. Code is available at: https://github.com/WillWu111/FreeSwim
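The caching idea in the abstract, recomputing the expensive full-receptive-field branch only occasionally and reusing its result in between, can be illustrated with a minimal sketch. The class name, the step-based refresh schedule, and the cache granularity here are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

class CrossAttentionCache:
    """Toy cache for an expensive attention branch (illustrative sketch).

    Recomputes the wrapped function only every `refresh_every` denoising
    steps; otherwise it returns the cached result, trading a small amount
    of staleness for a large reduction in full-attention computations.
    """

    def __init__(self, refresh_every=4):
        self.refresh_every = refresh_every
        self.cached = None
        self.step = 0

    def __call__(self, compute_fn):
        # Refresh on the first call and then on every refresh_every-th step.
        if self.cached is None or self.step % self.refresh_every == 0:
            self.cached = compute_fn()
        self.step += 1
        return self.cached

if __name__ == "__main__":
    calls = {"n": 0}

    def expensive_full_attention():
        calls["n"] += 1
        return np.ones(3)

    cache = CrossAttentionCache(refresh_every=4)
    for _ in range(8):
        out = cache(expensive_full_attention)
    print(calls["n"])  # the expensive branch ran only at steps 0 and 4
```

With 8 denoising steps and a refresh interval of 4, the expensive branch runs twice instead of eight times; the actual method caches cross-attention maps inside the Transformer rather than a whole function output.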
Problem

Research questions and friction points this paper is trying to address.

Overcoming quadratic complexity in attention for ultra-high-resolution video generation
Preventing repetitive content and ensuring global coherence in generated videos
Achieving training-free synthesis while maintaining visual fidelity and efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Inward sliding window attention for high-resolution generation
Dual-path pipeline with cross-attention override strategy
Cross-attention caching to avoid full 3D computation
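The first innovation above can be sketched in a toy 1D form. This is an illustrative implementation, assuming "inward" means each query's window is clamped to stay fully inside the sequence so that every token, including those near the borders, keeps a full training-scale receptive field; the actual method operates on 3D video tokens rather than this 1D simplification:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def inward_window_attention(q, k, v, window):
    """Sliding-window attention over a 1D token sequence (toy sketch).

    Each query attends to exactly `window` keys. Near the sequence borders
    the window is shifted inward so it always fits inside [0, n), keeping
    every query's receptive field at the training scale instead of
    shrinking it at the edges.
    """
    n, d = q.shape
    out = np.empty_like(v)
    for i in range(n):
        # Clamp the window start so [start, start + window) lies in [0, n).
        start = min(max(i - window // 2, 0), n - window)
        ks = k[start:start + window]  # (window, d)
        vs = v[start:start + window]
        attn = softmax(q[i] @ ks.T / np.sqrt(d))
        out[i] = attn @ vs
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 16, 8
    q, k, v = rng.standard_normal((3, n, d))
    print(inward_window_attention(q, k, v, window=4).shape)  # (16, 8)
```

When `window == n`, every window covers the whole sequence and the sketch reduces to ordinary full attention, which is a useful sanity check for implementations of this kind.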
Yunfeng Wu
School of Artificial Intelligence, Shanghai Jiao Tong University
Jiayi Song
School of Artificial Intelligence, Shanghai Jiao Tong University
Zhenxiong Tan
National University of Singapore
Zihao He
School of Artificial Intelligence, Shanghai Jiao Tong University
Songhua Liu
Shanghai Jiao Tong University
Computer Vision · Machine Learning