Long-Context Autoregressive Video Modeling with Next-Frame Prediction

📅 2025-03-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Long-range temporal modeling in video generation is hindered by visual redundancy, prohibitive computational cost, and poor extrapolation capability of autoregressive frameworks. To address these challenges, we propose Frame AutoRegressive (FAR), a novel autoregressive video generation framework. FAR introduces FlexRoPE—a flexible rotary position embedding enabling 16× temporal sequence length extrapolation—alongside a dual-window attention mechanism that jointly models short-term frame-wise fidelity and unbounded long-term temporal dependencies. Additionally, FAR incorporates visual token compression and strict causal temporal modeling to ensure temporal coherence. Extensive experiments demonstrate that FAR consistently outperforms state-of-the-art methods on both short- and long-video generation benchmarks. It significantly accelerates training convergence and markedly improves long-range temporal consistency. To our knowledge, FAR is the first framework to achieve efficient, scalable, and high-fidelity autoregressive modeling of ultra-long videos.

📝 Abstract

Long-context autoregressive modeling has significantly advanced language generation, but video generation still struggles to fully utilize extended temporal contexts. To investigate long-context video modeling, we introduce Frame AutoRegressive (FAR), a strong baseline for video autoregressive modeling. Just as language models learn causal dependencies between tokens (i.e., Token AR), FAR models temporal causal dependencies between continuous frames, achieving better convergence than Token AR and video diffusion transformers. Building on FAR, we observe that long-context vision modeling faces challenges due to visual redundancy. Existing RoPE lacks effective temporal decay for remote context and fails to extrapolate well to long video sequences. Additionally, training on long videos is computationally expensive, as vision tokens grow much faster than language tokens. To tackle these issues, we propose balancing locality and long-range dependency. We introduce FlexRoPE, a test-time technique that adds flexible temporal decay to RoPE, enabling extrapolation to 16x longer vision contexts. Furthermore, we propose long short-term context modeling, where a high-resolution short-term context window ensures fine-grained temporal consistency, while an unlimited long-term context window encodes long-range information using fewer tokens. With this approach, we can train on long video sequences with a manageable token context length. We demonstrate that FAR achieves state-of-the-art performance in both short- and long-video generation, providing a simple yet effective baseline for video autoregressive modeling.
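The frame-level causality described in the abstract can be pictured as an attention mask. The sketch below is only an illustration under stated assumptions, not the paper's implementation: it assumes each frame is a fixed-size group of tokens, that tokens attend freely within their own frame, and that they may also attend to every earlier frame. The function name `frame_causal_mask` and its parameters are hypothetical.

```python
def frame_causal_mask(num_frames, tokens_per_frame):
    """Build a boolean mask where mask[q][k] is True if query token q may
    attend to key token k.

    Assumption (not from the source): causality is applied per frame, so a
    token sees all tokens of its own frame and of every earlier frame.
    """
    total = num_frames * tokens_per_frame
    mask = []
    for q in range(total):
        q_frame = q // tokens_per_frame  # frame index of the query token
        # A key is visible when its frame is not later than the query's frame.
        row = [(k // tokens_per_frame) <= q_frame for k in range(total)]
        mask.append(row)
    return mask


# Three frames of two tokens each gives a 6 x 6 block lower-triangular mask.
mask = frame_causal_mask(num_frames=3, tokens_per_frame=2)
for row in mask:
    print("".join("1" if visible else "0" for visible in row))
```

By contrast, a token-level causal mask (Token AR in the abstract's terms) would hide later tokens inside the same frame as well; here the causal boundary sits between frames.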

Problem

Research questions and friction points this paper is trying to address.

Improving video generation with long-context autoregressive modeling

Addressing visual redundancy in long-context vision modeling

Reducing computational cost in long-video sequence training
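To make the cost concern concrete, here is a small arithmetic sketch of how a long short-term context keeps the token count manageable. All numbers and names (`short_window`, `tokens_full`, `tokens_compressed`) are illustrative assumptions rather than values from the paper: recent frames are kept at full token resolution, and older frames are represented with fewer tokens each.

```python
def full_resolution_tokens(num_frames, tokens_full):
    """Context length if every frame keeps all of its tokens."""
    return num_frames * tokens_full


def long_short_context_tokens(num_frames, short_window, tokens_full, tokens_compressed):
    """Context length when only the most recent `short_window` frames keep
    full resolution and all older frames use `tokens_compressed` tokens.

    The window size and per-frame token counts are hypothetical.
    """
    short_frames = min(num_frames, short_window)
    long_frames = num_frames - short_frames
    return short_frames * tokens_full + long_frames * tokens_compressed


# Hypothetical setting: 256 frames, 64 tokens per full frame,
# a 16-frame short-term window, 4 tokens per long-term frame.
dense = full_resolution_tokens(256, 64)
mixed = long_short_context_tokens(256, 16, 64, 4)
print(dense, mixed)  # 16384 1984
```

Under these assumed numbers the mixed context is roughly eight times shorter than the dense one, and it grows by only 4 tokens per additional frame instead of 64.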

Innovation

Methods, ideas, or system contributions that make the work stand out.

Frame AutoRegressive (FAR) models temporal dependencies

FlexRoPE enables extrapolation to longer contexts

Balances high-res short-term and low-res long-term contexts
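The abstract says only that FlexRoPE adds flexible temporal decay to RoPE at test time; it does not give the formula. The sketch below therefore swaps in a simple stand-in: a linear penalty on attention logits proportional to frame distance, in the style of ALiBi, with the rotary part omitted entirely. The names `temporal_decay_bias`, `attention_weights`, and `slope` are hypothetical, and the actual FlexRoPE formulation may differ.

```python
import math


def temporal_decay_bias(query_frame, key_frame, slope):
    """Linear penalty that grows with temporal distance (assumed form)."""
    return -slope * abs(query_frame - key_frame)


def attention_weights(logits, query_frame, key_frames, slope):
    """Softmax over logits after adding the temporal decay bias.

    `slope` is the adjustable knob: 0 disables the decay, larger values
    concentrate attention on temporally nearby frames.
    """
    biased = [
        logit + temporal_decay_bias(query_frame, key_frame, slope)
        for logit, key_frame in zip(logits, key_frames)
    ]
    peak = max(biased)  # subtract the max for numerical stability
    exps = [math.exp(b - peak) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]


# With identical raw logits, the decay alone decides where attention goes.
key_frames = [0, 1, 2, 3]
print(attention_weights([0.0] * 4, query_frame=3, key_frames=key_frames, slope=0.0))
print(attention_weights([0.0] * 4, query_frame=3, key_frames=key_frames, slope=0.5))
```

With `slope=0.0` the weights are uniform; with a positive slope they rise monotonically toward the query's own frame, which is the locality bias the abstract motivates for remote context.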

🔎 Similar Papers

Video In-context Learning

2024-07-10 · arXiv.org · Citations: 3
