One-Minute Video Generation with Test-Time Training

📅 2025-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Long-form video generation is limited by the inefficiency of self-attention over long contexts in Transformers and by the limited expressiveness of the hidden states in alternative architectures (e.g., Mamba), which leads to narrative incoherence across scenes. Method: The paper proposes the first video generation framework to integrate Test-Time Training (TTT) layers, whose hidden states are themselves learnable neural networks and therefore more expressive. The approach augments a pre-trained 5B-parameter Transformer with TTT layers to improve temporal modeling and cross-scene consistency. Contribution/Results: Evaluated on a newly curated Tom and Jerry storyboard-video paired dataset, the model generates coherent one-minute narrative videos. In a human evaluation of 100 videos per method, it leads Mamba 2, Gated DeltaNet, and sliding-window attention baselines by 34 Elo points. The generated videos show strong narrative coherence, though they still contain visual artifacts, likely due to the limited capability of the pre-trained 5B model.
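To make the TTT mechanism concrete, below is a minimal PyTorch sketch of a TTT layer in which the hidden state is a two-layer MLP updated by one gradient step per token on a self-supervised reconstruction loss. The class name, projection scheme, sizes, and inner loss are illustrative assumptions for exposition, not the paper's exact configuration.

```python
# Minimal sketch of a Test-Time Training (TTT) layer.
# Assumption: the hidden state is a two-layer MLP (w1, w2) that is
# updated per token at test time; the paper's actual design may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TTTLayer(nn.Module):
    def __init__(self, dim: int, hidden: int, inner_lr: float = 1.0):
        super().__init__()
        # Projections defining the inner self-supervised task:
        # reconstruct the "value" view v from the "key" view k.
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.to_q = nn.Linear(dim, dim)  # query used to read the state
        # Learnable initialization of the hidden-state MLP.
        self.w1 = nn.Parameter(torch.randn(dim, hidden) * 0.02)
        self.w2 = nn.Parameter(torch.randn(hidden, dim) * 0.02)
        self.inner_lr = inner_lr

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, dim). Sequential loop for clarity; real
        # implementations batch these updates for efficiency.
        # Must run with autograd enabled (not under torch.no_grad()).
        w1, w2 = self.w1.clone(), self.w2.clone()
        outputs = []
        for t in range(x.size(0)):
            k, v, q = self.to_k(x[t]), self.to_v(x[t]), self.to_q(x[t])
            # Inner-loop step: one gradient step of the hidden-state MLP
            # on the reconstruction loss ||f(k) - v||^2.
            loss = ((F.gelu(k @ w1) @ w2 - v) ** 2).sum()
            g1, g2 = torch.autograd.grad(loss, (w1, w2), create_graph=True)
            w1 = w1 - self.inner_lr * g1
            w2 = w2 - self.inner_lr * g2
            # Output: read the freshly updated state with the query.
            outputs.append(F.gelu(q @ w1) @ w2)
        return torch.stack(outputs)
```

Because the inner updates are built with create_graph=True, outer-loop fine-tuning can backpropagate through them, so the projections and the initialization of the hidden state are themselves learned.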

📝 Abstract
Transformers today still struggle to generate one-minute videos because self-attention layers are inefficient for long context. Alternatives such as Mamba layers struggle with complex multi-scene stories because their hidden states are less expressive. We experiment with Test-Time Training (TTT) layers, whose hidden states themselves can be neural networks, therefore more expressive. Adding TTT layers into a pre-trained Transformer enables it to generate one-minute videos from text storyboards. For proof of concept, we curate a dataset based on Tom and Jerry cartoons. Compared to baselines such as Mamba 2, Gated DeltaNet, and sliding-window attention layers, TTT layers generate much more coherent videos that tell complex stories, leading by 34 Elo points in a human evaluation of 100 videos per method. Although promising, results still contain artifacts, likely due to the limited capability of the pre-trained 5B model. The efficiency of our implementation can also be improved. We have only experimented with one-minute videos due to resource constraints, but the approach can be extended to longer videos and more complex stories. Sample videos, code and annotations are available at: https://test-time-training.github.io/video-dit
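As a rough illustration of how TTT layers could be added to a pre-trained Transformer while initially preserving its behavior, here is a hedged sketch that wraps an existing block and adds a TTT branch behind a zero-initialized learned gate. The class name GatedTTTBlock and the tanh gating scheme are assumptions for exposition, not the paper's exact wiring.

```python
# Hedged sketch: augmenting a pre-trained Transformer block with a
# gated TTT branch. With alpha initialized to zero, the wrapped block
# starts as an exact no-op over the pre-trained computation, and the
# TTT contribution is learned during fine-tuning on long videos.
import torch
import torch.nn as nn

class GatedTTTBlock(nn.Module):
    def __init__(self, block: nn.Module, ttt: nn.Module, dim: int):
        super().__init__()
        self.block = block                 # pre-trained attention + MLP block
        self.ttt = ttt                     # e.g. the TTTLayer sketched above
        self.alpha = nn.Parameter(torch.zeros(dim))  # per-channel gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.block(x)                  # original pre-trained path
        return x + torch.tanh(self.alpha) * self.ttt(x)
```

Zero-initializing the gate is a common trick when grafting new modules onto a pre-trained network: it avoids destroying the pre-trained representations at the start of fine-tuning.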
Problem

Research questions and friction points this paper is trying to address.

Improving long-context video generation with expressive hidden states
Enhancing multi-scene story coherence in one-minute videos
Addressing inefficiency in self-attention for video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-Time Training layers enhance Transformer expressiveness
TTT layers enable one-minute video generation
Neural network hidden states improve multi-scene storytelling