Transform Trained Transformer: Accelerating Native 4K Video Generation Over 10×

📅 2025-12-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Native 4K (2160×3840) video generation suffers from an efficiency bottleneck due to the quadratic growth of full-attention computational complexity with spatiotemporal resolution. To address this, we propose T3—a plug-and-play strategy for lightweighting Transformers that requires no architectural modification of pretrained models and only modest compute and data. Its core innovations are: (1) multi-scale weight-sharing windowed attention to reduce local modeling redundancy; and (2) axial-preserving hierarchical spatiotemporal blocked full attention, which retains global modeling capability along critical axes. This design enables efficient transfer of attention patterns across resolutions. On 4K-VBench, T3 achieves a +4.29 improvement in VQA score, a +0.08 gain in VTC score, and accelerates native 4K video generation by over 10×—demonstrating substantial gains in both quality and efficiency.

📝 Abstract
Native 4K (2160×3840) video generation remains a critical challenge due to the quadratic computational explosion of full-attention as spatiotemporal resolution increases, making it difficult for models to strike a balance between efficiency and quality. This paper proposes a novel Transformer retrofit strategy termed **T3** (**T**ransform **T**rained **T**ransformer) that, without altering the core architecture of full-attention pretrained models, significantly reduces compute requirements by optimizing their forward logic. Specifically, **T3-Video** introduces a multi-scale weight-sharing window attention mechanism and, via hierarchical blocking together with an axis-preserving full-attention design, can effect an "attention pattern" transformation of a pretrained model using only modest compute and data. Results on 4K-VBench show that **T3-Video** substantially outperforms existing approaches: while delivering performance improvements (+4.29↑ VQA and +0.08↑ VTC), it accelerates native 4K video generation by more than 10×. Project page at https://zhangzjn.github.io/projects/T3-Video
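To make the complexity argument concrete, the sketch below shows generic non-overlapping window attention in NumPy — an illustration of why restricting attention to windows drops the cost from O(n²·d) to O(n·w·d), not the paper's actual multi-scale weight-sharing implementation (the function name and shapes are assumptions for the example).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(q, k, v, window):
    """Self-attention restricted to non-overlapping windows.

    q, k, v: (seq_len, dim); seq_len must be divisible by `window`.
    Each token attends only to the `window` tokens in its own block,
    so the score matrix is (seq_len/window, window, window) rather
    than the full (seq_len, seq_len).
    """
    n, d = q.shape
    assert n % window == 0, "sequence length must be divisible by window"
    qw = q.reshape(n // window, window, d)
    kw = k.reshape(n // window, window, d)
    vw = v.reshape(n // window, window, d)
    scores = qw @ kw.transpose(0, 2, 1) / np.sqrt(d)  # per-window scores
    out = softmax(scores) @ vw                        # per-window mixing
    return out.reshape(n, d)
```

When `window == seq_len` this reduces exactly to ordinary full attention, which is one way to sanity-check such a retrofit against the pretrained model's behavior.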
Problem

Research questions and friction points this paper is trying to address.

Full-attention cost grows quadratically with spatiotemporal resolution, making native 4K generation prohibitively expensive
Existing models struggle to balance efficiency and quality at high resolution
Retrofitting pretrained full-attention Transformers without architectural changes or costly retraining is difficult
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimizes pretrained full-attention Transformer forward logic
Introduces multi-scale weight-sharing window attention mechanism
Uses hierarchical blocking and axis-preserving full-attention design
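The axis-preserving idea above — keeping full attention along critical axes of the spatiotemporal grid while avoiding all-pairs attention over the whole volume — can be sketched with generic axial attention. This is an illustrative stand-in, not the paper's hierarchical blocking scheme; the function names and the (T, H, W, d) layout are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(q, k, v):
    """Plain scaled dot-product self-attention over the last two dims."""
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    return softmax(scores) @ v

def axial_attention(x, axis):
    """Full attention along a single axis of a (T, H, W, d) grid.

    Tokens attend only to tokens sharing their other two coordinates,
    so global modeling along `axis` is preserved while the cost is
    quadratic only in that one axis length.
    """
    x = np.moveaxis(x, axis, -2)              # bring target axis next to dim
    shape = x.shape
    flat = x.reshape(-1, shape[-2], shape[-1])
    out = full_attention(flat, flat, flat)    # one attention per grid line
    return np.moveaxis(out.reshape(shape), -2, axis)
```

Applying such a pass along the temporal axis and then the spatial axes keeps long-range dependencies along each axis while never materializing the full T·H·W × T·H·W score matrix.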
Authors
Jiangning Zhang — Youtu Lab, Tencent
Junwei Zhu — Tencent
Teng Hu — Youtu Lab, Tencent
Yabiao Wang — Youtu Lab, Tencent
Donghao Luo — Youtu Lab, Tencent; Shanghai Jiao Tong University
Weijian Cao — Tencent
Zhenye Gan — Youtu Lab, Tencent
Xiaobin Hu — Tencent Youtu Lab; Technische Universität München (TUM)
Zhucun Xue — Zhejiang University
Chengjie Wang — Youtu Lab, Tencent