Transform Trained Transformer: Accelerating Native 4K Video Generation Over 10×

📅 2025-12-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Native 4K (2160×3840) video generation suffers from an efficiency bottleneck due to the quadratic growth of full-attention computational complexity with spatiotemporal resolution. To address this, we propose T3—a plug-and-play strategy for lightweighting Transformers that requires no architectural modification of pretrained models and only modest compute and data. Its core innovations are: (1) multi-scale weight-sharing windowed attention to reduce local modeling redundancy; and (2) axial-preserving hierarchical spatiotemporal blocked full attention, which retains global modeling capability along critical axes. This design enables efficient transfer of attention patterns across resolutions. On 4K-VBench, T3 achieves a +4.29 improvement in VQA score, a +0.08 gain in VTC score, and accelerates native 4K video generation by over 10×—demonstrating substantial gains in both quality and efficiency.

📝 Abstract
Native 4K (2160×3840) video generation remains a critical challenge due to the quadratic computational explosion of full-attention as spatiotemporal resolution increases, making it difficult for models to strike a balance between efficiency and quality. This paper proposes a novel Transformer retrofit strategy termed **T3** (**T**ransform **T**rained **T**ransformer) that, without altering the core architecture of full-attention pretrained models, significantly reduces compute requirements by optimizing their forward logic. Specifically, **T3-Video** introduces a multi-scale weight-sharing window attention mechanism and, via hierarchical blocking together with an axis-preserving full-attention design, can effect an "attention pattern" transformation of a pretrained model using only modest compute and data. Results on 4K-VBench show that **T3-Video** substantially outperforms existing approaches: while delivering performance improvements (+4.29↑ VQA and +0.08↑ VTC), it accelerates native 4K video generation by more than 10×. Project page at https://zhangzjn.github.io/projects/T3-Video
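To make the complexity argument concrete, the sketch below shows generic non-overlapping window attention in NumPy — an illustration of why restricting attention to windows drops the cost from O(n²·d) to O(n·w·d), not the paper's actual multi-scale weight-sharing implementation (the function name and shapes are assumptions for the example).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(q, k, v, window):
    """Self-attention restricted to non-overlapping windows.

    q, k, v: (seq_len, dim); seq_len must be divisible by `window`.
    Each token attends only to the `window` tokens in its own block,
    so the score matrix is (seq_len/window, window, window) rather
    than the full (seq_len, seq_len).
    """
    n, d = q.shape
    assert n % window == 0, "sequence length must be divisible by window"
    qw = q.reshape(n // window, window, d)
    kw = k.reshape(n // window, window, d)
    vw = v.reshape(n // window, window, d)
    scores = qw @ kw.transpose(0, 2, 1) / np.sqrt(d)  # per-window scores
    out = softmax(scores) @ vw                        # per-window mixing
    return out.reshape(n, d)
```

When `window == seq_len` this reduces exactly to ordinary full attention, which is one way to sanity-check such a retrofit against the pretrained model's behavior.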
Problem

Research questions and friction points this paper is trying to address.

Full-attention cost grows quadratically with spatiotemporal resolution, making native 4K generation prohibitively expensive
Existing models struggle to balance efficiency and quality at high resolution
Retrofitting pretrained full-attention Transformers without architectural changes or costly retraining is difficult
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimizes pretrained full-attention Transformer forward logic
Introduces multi-scale weight-sharing window attention mechanism
Uses hierarchical blocking and axis-preserving full-attention design
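The axis-preserving idea above — keeping full attention along critical axes of the spatiotemporal grid while avoiding all-pairs attention over the whole volume — can be sketched with generic axial attention. This is an illustrative stand-in, not the paper's hierarchical blocking scheme; the function names and the (T, H, W, d) layout are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(q, k, v):
    """Plain scaled dot-product self-attention over the last two dims."""
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    return softmax(scores) @ v

def axial_attention(x, axis):
    """Full attention along a single axis of a (T, H, W, d) grid.

    Tokens attend only to tokens sharing their other two coordinates,
    so global modeling along `axis` is preserved while the cost is
    quadratic only in that one axis length.
    """
    x = np.moveaxis(x, axis, -2)              # bring target axis next to dim
    shape = x.shape
    flat = x.reshape(-1, shape[-2], shape[-1])
    out = full_attention(flat, flat, flat)    # one attention per grid line
    return np.moveaxis(out.reshape(shape), -2, axis)
```

Applying such a pass along the temporal axis and then the spatial axes keeps long-range dependencies along each axis while never materializing the full T·H·W × T·H·W score matrix.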
Authors
Jiangning Zhang — Youtu Lab, Tencent
Junwei Zhu — Tencent
Teng Hu — Youtu Lab, Tencent
Yabiao Wang — Youtu Lab, Tencent
Donghao Luo — Youtu Lab, Tencent; Shanghai Jiao Tong University
Weijian Cao — Tencent
Zhenye Gan — Youtu Lab, Tencent
Xiaobin Hu — Tencent Youtu Lab; Technische Universität München (TUM)
Zhucun Xue — Zhejiang University
Chengjie Wang — Youtu Lab, Tencent