Next Block Prediction: Video Generation via Semi-Auto-Regressive Modeling

📅 2025-02-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the limitations of autoregressive (AR) methods in video generation—namely, weak bidirectional dependency modeling and slow inference—the paper proposes a semi-autoregressive Next-Block Prediction (NBP) framework. Instead of generating token by token, NBP operates on equal-sized spatiotemporal blocks (e.g., rows or frames), enabling bidirectional attention within each block and parallel prediction of all tokens in the next block. Its core contribution is this block-level semi-autoregressive paradigm, which improves modeling capacity and computational efficiency at the same time. Quantitatively, the framework achieves FVD scores of 103.3 on UCF101 and 25.5 on Kinetics-600, outperforming the vanilla NTP baseline by an average of 4.4; scaling to a 3B-parameter model further improves FVD to 55.3 and 19.5, respectively. For 128×128 video generation, it reaches 8.89 frames per second, an 11× speedup over the token-level AR baseline, demonstrating joint gains in both generation quality and speed.

📝 Abstract
Next-Token Prediction (NTP) is a de facto approach for autoregressive (AR) video generation, but it suffers from suboptimal unidirectional dependencies and slow inference speed. In this work, we propose a semi-autoregressive (semi-AR) framework, called Next-Block Prediction (NBP), for video generation. By uniformly decomposing video content into equal-sized blocks (e.g., rows or frames), we shift the generation unit from individual tokens to blocks, allowing each token in the current block to simultaneously predict the corresponding token in the next block. Unlike traditional AR modeling, our framework employs bidirectional attention within each block, enabling tokens to capture more robust spatial dependencies. By predicting multiple tokens in parallel, NBP models significantly reduce the number of generation steps, leading to faster and more efficient inference. Our model achieves FVD scores of 103.3 on UCF101 and 25.5 on K600, outperforming the vanilla NTP model by an average of 4.4. Furthermore, thanks to the reduced number of inference steps, the NBP model generates 8.89 frames (128x128 resolution) per second, achieving an 11x speedup. We also explored model scales ranging from 700M to 3B parameters, observing significant improvements in generation quality, with FVD scores dropping from 103.3 to 55.3 on UCF101 and from 25.5 to 19.5 on K600, demonstrating the scalability of our approach.
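The attention pattern described in the abstract—bidirectional within a block, causal across blocks—can be sketched as a block-causal attention mask. The helper below is an illustrative reconstruction, not the paper's code; the function name and the toy sizes are assumptions.

```python
import numpy as np

def block_causal_mask(num_blocks: int, block_size: int) -> np.ndarray:
    """Semi-AR attention mask: each token attends bidirectionally to
    tokens in its own block and causally to all earlier blocks.
    True = attention allowed."""
    n = num_blocks * block_size
    block_idx = np.arange(n) // block_size  # block index of each token
    # token i may attend to token j iff j's block does not come after i's
    return block_idx[:, None] >= block_idx[None, :]

mask = block_causal_mask(num_blocks=3, block_size=2)
print(mask.astype(int))
# [[1 1 0 0 0 0]
#  [1 1 0 0 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 1 1]
#  [1 1 1 1 1 1]]
```

Compare with plain NTP, whose mask is strictly lower-triangular at the token level: here tokens 0 and 1 (one block) see each other fully, but neither sees block 1 or 2.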
Problem

Research questions and friction points this paper is trying to address.

Improving video generation speed
Enhancing spatial dependency capture
Reducing inference steps efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-autoregressive framework for video generation
Bidirectional attention within blocks
Parallel token prediction for efficiency
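The efficiency claim above follows from simple arithmetic: NTP needs one forward pass per token, while NBP needs one per block. A minimal sketch, with hypothetical tokenization numbers chosen only for illustration:

```python
def generation_steps(num_tokens: int, block_size: int) -> dict:
    """Forward passes needed to generate a video:
    NTP emits one token per step, NBP one block of block_size tokens."""
    assert num_tokens % block_size == 0
    return {"NTP": num_tokens, "NBP": num_tokens // block_size}

# Hypothetical example: 5 frames, each tokenized to a 16x16 grid,
# with one block = one row of 16 tokens.
steps = generation_steps(num_tokens=5 * 16 * 16, block_size=16)
print(steps)  # {'NTP': 1280, 'NBP': 80}
```

The block size sets the trade-off: larger blocks mean fewer steps, at the cost of predicting more tokens jointly per step.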