TempoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction

📅 2025-11-16
📈 Citations: 0
Influential: 0

🤖 AI Summary
To address the challenge of jointly preserving visual fidelity and motion continuity in long-video generation, this paper proposes a hierarchical frame-rate prediction framework: it first generates a low-frame-rate video that captures the global spatiotemporal structure, then progressively inserts intermediate frames to raise the frame rate and refine visual detail. Methodologically, it combines autoregression across frame rates with bidirectional attention within each frame-rate level to model long-range temporal consistency. This multi-stage frame-rate escalation improves inter-frame coherence while keeping synthesis parallel and efficient. Evaluated on multiple long-video generation benchmarks, the approach is reported to achieve state-of-the-art performance in both visual quality (sharpness, detail preservation, structural integrity) and motion naturalness (optical-flow smoothness, temporal plausibility).

📝 Abstract
We present TempoMaster, a novel framework that formulates long video generation as next-frame-rate prediction. Specifically, we first generate a low-frame-rate clip that serves as a coarse blueprint of the entire video sequence, and then progressively increase the frame rate to refine visual details and motion continuity. During generation, TempoMaster employs bidirectional attention within each frame-rate level while performing autoregression across frame rates, thus achieving long-range temporal coherence while enabling efficient and parallel synthesis. Extensive experiments demonstrate that TempoMaster establishes a new state-of-the-art in long video generation, excelling in both visual and temporal quality.
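
The coarse-to-fine ordering described in the abstract can be illustrated with a small frame-index schedule. This is only a sketch under stated assumptions: the function name `frame_rate_levels`, the `base_stride` parameter, and the choice to double the frame rate at each level are hypothetical and are not taken from the paper, which states only that generation starts from a low-frame-rate clip and progressively increases the frame rate.

```python
def frame_rate_levels(num_frames, base_stride):
    """Split frame indices 0..num_frames-1 into coarse-to-fine levels.

    Level 0 holds every `base_stride`-th frame (the low-frame-rate blueprint).
    Each later level halves the stride and holds only the frames that are new
    at that frame rate. Doubling per level is an assumption of this sketch.
    """
    if base_stride < 1 or base_stride & (base_stride - 1):
        raise ValueError("base_stride must be a power of two")
    levels = [list(range(0, num_frames, base_stride))]
    stride = base_stride // 2
    while stride >= 1:
        # Frames at odd multiples of `stride` were not produced by any coarser level.
        levels.append(list(range(stride, num_frames, 2 * stride)))
        stride //= 2
    return levels


if __name__ == "__main__":
    for level, frames in enumerate(frame_rate_levels(17, 8)):
        print(f"level {level}: {frames}")
```

With 17 frames and a base stride of 8, level 0 contains frames 0, 8, and 16, level 1 adds frames 4 and 12, and the last level fills in the odd-numbered frames. All frames within one level could in principle be synthesized together, which is the property the abstract refers to as parallel synthesis.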
Problem

Research questions and friction points this paper is trying to address.

Generating long videos efficiently rather than strictly frame by frame
Maintaining long-range temporal coherence across the entire video sequence
Refining visual details and motion continuity while keeping synthesis parallel
Innovation

Methods, ideas, or system contributions that make the work stand out.

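Formulates long video generation as next-frame-rate prediction
Generates a low-frame-rate clip as a coarse blueprint, then progressively increases the frame rate to refine visual details and motion continuity
Uses bidirectional attention within each frame-rate level and autoregression across frame rates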
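
One way to picture the combination of bidirectional attention within a level and autoregression across levels is a block-structured attention mask. The sketch below is an illustration, not the paper's implementation: the helper `level_causal_mask` is hypothetical, and it assumes that each level may attend to itself and to all coarser levels, a detail the source does not specify.

```python
def level_causal_mask(level_sizes):
    """Build a boolean attention mask over positions grouped by frame-rate level.

    `level_sizes[i]` is the number of positions in level i, ordered from the
    coarsest level to the finest. Positions attend freely inside their own
    level (bidirectional) and to every coarser level, but never to finer ones.
    """
    # level_of[p] is the frame-rate level that position p belongs to.
    level_of = [lvl for lvl, size in enumerate(level_sizes) for _ in range(size)]
    # mask[q][k] is True when query position q may attend to key position k.
    return [[k_lvl <= q_lvl for k_lvl in level_of] for q_lvl in level_of]


if __name__ == "__main__":
    for row in level_causal_mask([2, 1]):
        print(["x" if allowed else "." for allowed in row])
```

For two levels of sizes 2 and 1, the two coarse positions see each other but not the finer position, while the finer position sees all three. The mask is causal at the granularity of levels rather than individual frames.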