FreeSwim: Revisiting Sliding-Window Attention Mechanisms for Training-Free Ultra-High-Resolution Video Generation

📅 2025-11-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Training Transformer-based video generators end-to-end at ultra-high resolutions is infeasible due to the quadratic computational complexity of self-attention. This paper proposes a training-free, inference-time super-resolution method built on three key components: (1) inward sliding-window attention, which preserves each query's training-scale receptive field while enabling resolution scaling beyond training limits; (2) a dual-path architecture with a cross-attention override strategy, jointly ensuring fine-grained detail fidelity and global spatiotemporal coherence; and (3) a cross-attention caching strategy that substantially accelerates high-resolution inference. Applied directly to pre-trained video diffusion Transformers, the method achieves state-of-the-art performance on VBench, outperforming several training-based baselines, while generating ultra-high-resolution videos with sharp details and consistent spatiotemporal dynamics.

📝 Abstract
The quadratic time and memory complexity of the attention mechanism in modern Transformer-based video generators makes end-to-end training for ultra-high-resolution videos prohibitively expensive. Motivated by this limitation, we introduce a training-free approach that leverages video Diffusion Transformers pretrained at their native scale to synthesize higher-resolution videos without any additional training or adaptation. At the core of our method lies an inward sliding-window attention mechanism, which originates from a key observation: maintaining each query token's training-scale receptive field is crucial for preserving visual fidelity and detail. However, naive local window attention often leads to repetitive content and a lack of global coherence in the generated results. To overcome this challenge, we devise a dual-path pipeline that backs up window attention with a novel cross-attention override strategy, enabling the semantic content produced by local attention to be guided by another branch with a full receptive field, thereby ensuring holistic consistency. Furthermore, to improve efficiency, we incorporate a cross-attention caching strategy for this branch to avoid frequent computation of full 3D attention. Extensive experiments demonstrate that our method delivers ultra-high-resolution videos with fine-grained visual details and high efficiency in a training-free paradigm. Meanwhile, it achieves superior performance on VBench, even compared to training-based alternatives, with competitive or improved efficiency. Code is available at: https://github.com/WillWu111/FreeSwim
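The caching idea in the abstract, recomputing the expensive full-receptive-field branch only occasionally and reusing its result in between, can be illustrated with a minimal sketch. The class name, the step-based refresh schedule, and the cache granularity here are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

class CrossAttentionCache:
    """Toy cache for an expensive attention branch (illustrative sketch).

    Recomputes the wrapped function only every `refresh_every` denoising
    steps; otherwise it returns the cached result, trading a small amount
    of staleness for a large reduction in full-attention computations.
    """

    def __init__(self, refresh_every=4):
        self.refresh_every = refresh_every
        self.cached = None
        self.step = 0

    def __call__(self, compute_fn):
        # Refresh on the first call and then on every refresh_every-th step.
        if self.cached is None or self.step % self.refresh_every == 0:
            self.cached = compute_fn()
        self.step += 1
        return self.cached

if __name__ == "__main__":
    calls = {"n": 0}

    def expensive_full_attention():
        calls["n"] += 1
        return np.ones(3)

    cache = CrossAttentionCache(refresh_every=4)
    for _ in range(8):
        out = cache(expensive_full_attention)
    print(calls["n"])  # the expensive branch ran only at steps 0 and 4
```

With 8 denoising steps and a refresh interval of 4, the expensive branch runs twice instead of eight times; the actual method caches cross-attention maps inside the Transformer rather than a whole function output.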
Problem

Research questions and friction points this paper is trying to address.

Overcoming quadratic complexity in attention for ultra-high-resolution video generation
Preventing repetitive content and ensuring global coherence in generated videos
Achieving training-free synthesis while maintaining visual fidelity and efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Inward sliding window attention for high-resolution generation
Dual-path pipeline with cross-attention override strategy
Cross-attention caching to avoid full 3D computation
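The first innovation above can be sketched in a toy 1D form. This is an illustrative implementation, assuming "inward" means each query's window is clamped to stay fully inside the sequence so that every token, including those near the borders, keeps a full training-scale receptive field; the actual method operates on 3D video tokens rather than this 1D simplification:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def inward_window_attention(q, k, v, window):
    """Sliding-window attention over a 1D token sequence (toy sketch).

    Each query attends to exactly `window` keys. Near the sequence borders
    the window is shifted inward so it always fits inside [0, n), keeping
    every query's receptive field at the training scale instead of
    shrinking it at the edges.
    """
    n, d = q.shape
    out = np.empty_like(v)
    for i in range(n):
        # Clamp the window start so [start, start + window) lies in [0, n).
        start = min(max(i - window // 2, 0), n - window)
        ks = k[start:start + window]  # (window, d)
        vs = v[start:start + window]
        attn = softmax(q[i] @ ks.T / np.sqrt(d))
        out[i] = attn @ vs
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 16, 8
    q, k, v = rng.standard_normal((3, n, d))
    print(inward_window_attention(q, k, v, window=4).shape)  # (16, 8)
```

When `window == n`, every window covers the whole sequence and the sketch reduces to ordinary full attention, which is a useful sanity check for implementations of this kind.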
Yunfeng Wu
School of Artificial Intelligence, Shanghai Jiao Tong University
Jiayi Song
School of Artificial Intelligence, Shanghai Jiao Tong University
Zhenxiong Tan
National University of Singapore
Zihao He
School of Artificial Intelligence, Shanghai Jiao Tong University
Songhua Liu
Shanghai Jiao Tong University
Computer Vision · Machine Learning