Video Prediction Transformers without Recurrence or Convolution

📅 2024-10-07

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

RNN-based video prediction models are computationally expensive, while CNN-based models suffer from limited receptive fields and poor generalization. This work proposes PredFormer, a pure Transformer architecture built entirely from Gated Transformer blocks, with no recurrent or convolutional components. The paper also provides a comprehensive analysis of 3D attention in the context of video prediction. According to the authors, PredFormer achieves state-of-the-art performance on four standard benchmarks, with improvements in both accuracy and efficiency that position it as a strong baseline for real-world video prediction.

📝 Abstract

Video prediction has witnessed the emergence of RNN-based models led by ConvLSTM, and CNN-based models led by SimVP. Following the significant success of ViT, recent works have integrated ViT into both RNN and CNN frameworks, achieving improved performance. While we appreciate these prior approaches, we raise a fundamental question: Is there a simpler yet more effective solution that can eliminate the high computational cost of RNNs while addressing the limited receptive fields and poor generalization of CNNs? How far can it go with a simple pure transformer model for video prediction? In this paper, we propose PredFormer, a framework entirely based on Gated Transformers. We provide a comprehensive analysis of 3D Attention in the context of video prediction. Extensive experiments demonstrate that PredFormer delivers state-of-the-art performance across four standard benchmarks. The significant improvements in both accuracy and efficiency highlight the potential of PredFormer as a strong baseline for real-world video prediction applications. The source code and trained models will be released at https://github.com/yyyujintang/PredFormer.

Problem

Research questions and friction points this paper is trying to address.

Eliminate high computational cost of RNNs in video prediction

Address limited receptive fields and poor generalization of CNNs

Explore how far a simple, pure Transformer model can go in video prediction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Pure transformer model for video prediction

Gated Transformers replace RNNs and CNNs

Comprehensive analysis of 3D Attention in the context of video prediction
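
To give a rough sense of why the layout of 3D attention matters, the sketch below counts the query-key pairs that a single attention layer scores over a clip of T frames with N patch tokens each. It is an illustrative calculation only: the function name, the three layouts compared, and the clip size are assumptions chosen for this example and are not taken from the PredFormer paper or code.

```python
def attention_pairs(num_frames: int, patches_per_frame: int) -> dict:
    """Count query-key pairs scored by one attention layer (hypothetical helper)."""
    t, n = num_frames, patches_per_frame
    return {
        # Full 3D attention: every token attends to every token in the clip.
        "full_3d": (t * n) ** 2,
        # Spatial-only attention: tokens attend within their own frame.
        "spatial": t * n ** 2,
        # Temporal-only attention: each patch location attends across frames.
        "temporal": n * t ** 2,
    }


# Assumed clip size for illustration: 10 frames, 64 patch tokens per frame.
counts = attention_pairs(num_frames=10, patches_per_frame=64)
# A factorized layer pair applies spatial and temporal attention in sequence.
counts["factorized"] = counts["spatial"] + counts["temporal"]

for name, value in counts.items():
    print(f"{name:>10}: {value:,}")
```

Under these assumed sizes, full attention scores far more pairs than a spatial layer followed by a temporal layer, which is one reason an analysis of how attention is arranged across space and time can affect both accuracy and efficiency.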
