TempoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction

📅 2025-11-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of jointly preserving visual fidelity and motion continuity in long-video generation, this paper proposes a hierarchical frame-rate prediction framework: it first generates a low-frame-rate video that captures the global spatiotemporal structure, then progressively inserts intermediate frames to increase temporal density and refine visual detail. Methodologically, the paper introduces a cross-frame-rate autoregressive mechanism and intra-level bidirectional attention to model long-range temporal consistency. A multi-stage frame-rate escalation strategy enhances inter-frame coherence while preserving the efficiency of parallel synthesis. Evaluated on multiple long-video generation benchmarks, the approach achieves state-of-the-art performance, with significant gains in both visual quality (sharpness, detail preservation, structural integrity) and motion naturalness (optical-flow smoothness and temporal plausibility).

Technology Category

Application Category

📝 Abstract
We present TempoMaster, a novel framework that formulates long video generation as next-frame-rate prediction. Specifically, we first generate a low-frame-rate clip that serves as a coarse blueprint of the entire video sequence, and then progressively increase the frame rate to refine visual details and motion continuity. During generation, TempoMaster employs bidirectional attention within each frame-rate level while performing autoregression across frame rates, thus achieving long-range temporal coherence while enabling efficient and parallel synthesis. Extensive experiments demonstrate that TempoMaster establishes a new state-of-the-art in long video generation, excelling in both visual and temporal quality.
Problem

Research questions and friction points this paper is trying to address.

How can long videos be generated efficiently without sacrificing visual fidelity?
How can long-range temporal coherence be maintained across many frames?
How can frames be synthesized in parallel while details are refined progressively?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Formulates long video generation as next-frame-rate prediction
Progressively increases the frame rate to refine details and motion
Combines bidirectional attention within each frame-rate level with autoregression across levels
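The scheme the abstract describes, a low-frame-rate blueprint refined by autoregressing across frame-rate levels while all frames within a level are processed jointly, can be sketched as a toy loop. This is a hypothetical illustration, not the paper's code: `generate_level`, the tensor shapes, and the noise-based "refinement" are stand-ins for the actual diffusion model and attention layers.

```python
import numpy as np

def generate_level(prev_frames, target_len, rng):
    """Toy stand-in for the generator at one frame-rate level.

    Upsamples the previous level's frames along time (conditioning
    across frame rates), then jointly perturbs all frames of the
    level -- a placeholder for parallel, bidirectional-attention
    refinement within the level.
    """
    if prev_frames is None:
        # Coarsest level: sample a low-frame-rate blueprint from noise.
        return rng.standard_normal((target_len, 8, 8, 3))
    # Cross-frame-rate conditioning: map old frames onto the denser timeline.
    idx = np.linspace(0, len(prev_frames) - 1, target_len).round().astype(int)
    frames = prev_frames[idx]
    # Joint (parallel) refinement of every frame at this level.
    return frames + 0.1 * rng.standard_normal(frames.shape)

def tempo_master_sketch(frame_counts=(4, 8, 16), seed=0):
    """Autoregress across frame-rate levels, coarse to fine."""
    rng = np.random.default_rng(seed)
    video = None
    for n in frame_counts:
        video = generate_level(video, n, rng)
    return video

video = tempo_master_sketch()
print(video.shape)  # (16, 8, 8, 3): final, densest frame-rate level
```

The point of the structure is that each pass over a level is parallel across its frames, while sequential dependence exists only between levels, which is why the method can be both long-range coherent and efficient.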
Authors

Yukuo Ma
Fudan University

Cong Liu
Institute of Artificial Intelligence (TeleAI), China Telecom

Junke Wang
Fudan University
Research area: Computer Vision

Junqi Liu
Institute of Artificial Intelligence (TeleAI), China Telecom

Haibin Huang
Principal Research Scientist at TeleAI
Research areas: Computer Graphics, Computer Vision, Geometric Modeling, 3D Deep Learning

Zuxuan Wu
Fudan University

Chi Zhang
Institute of Artificial Intelligence (TeleAI), China Telecom

Xuelong Li
Institute of Artificial Intelligence (TeleAI), China Telecom