ReCA: Multi-Shot Long Video Extrapolation via Recursive Context Allocation

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses multi-shot video extrapolation (MSVE) for minute-scale video generation: extending an observed frame or clip into a sequence of cinematically structured shots while preserving the anchor state and advancing narrative intent, all under the finite per-call generation budget of short-video models. The authors propose Recursive Context Allocation (ReCA), an inference-time framework that allocates context hierarchically across planning and generation. ReCA recursively decomposes the task into context-bounded subproblems, invokes a frozen short-video generator at the leaf nodes, and propagates structured state updates across time. The study defines the MSVE task and argues that long-video failure is not merely a limitation of context length but a failure of context allocation. On the newly introduced MSVE-Bench benchmark with the NB-Q protocol, ReCA improves the average normalized score by 8 to 16 percent over the strongest competing controller and improves multi-shot consistency metrics by 28 to 43 percent.

📝 Abstract

Minute-scale cinematic video generation is a central challenge for generative video models. Existing paradigms address only fragments of this challenge: single-shot extrapolation preserves an anchor but lacks cinematic structure, while multi-shot storytelling imposes structure yet remains free to invent its visual states rather than continue an observed one. We define Multi-Shot Video Extrapolation (MSVE), a task that extends an observed frame or clip into a sequence of cinematically structured shots while preserving anchor state and advancing narrative intent. This setting operates under the finite per-call generation budget of short-video models. We identify three coupled bottlenecks: (1) global planners over-specify unsupported details from full screenplays; (2) shot-level prompts dilute task-relevant state when carrying the complete story; and (3) temporal chaining turns generated frames into a lossy memory in which identity, scene, object, and action state decay. MSVE reveals that long-video failure is not merely a limitation of context length, but a failure of context allocation. We propose Recursive Context Allocation (ReCA), an inference-time framework that allocates context hierarchically across planning and generation. ReCA recursively decomposes MSVE into context-bounded subproblems, invokes frozen generators at leaf nodes, and propagates structured state updates across time. To evaluate this setting, we further propose MSVE-Bench and NB-Q, a source-grounded protocol with prompts purpose-built for 3 to 5 minute long-video generation, a regime not addressed by existing short-clip benchmarks. Compared to previous methods, ReCA improves average normalized score by 8 to 16 percent over the strongest competing controller and improves multi-shot consistency metrics by 28 to 43 percent. View the project page at https://reca.vmv.re.
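The recursion described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the state schema, the halving planner `split_intent`, the stand-in `fake_generator`, and the function name `reca` are all hypothetical, chosen only to show the general shape of splitting a span until it fits a per-call budget, calling a frozen generator at each leaf, and passing a structured state forward instead of raw frames.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical structured state, e.g. {"identity": ..., "scene": ...}.
State = Dict[str, str]
Generator = Callable[[str, float, State], State]


@dataclass
class Shot:
    prompt: str       # context-bounded prompt for this leaf only
    seconds: float    # requested duration, within the per-call budget
    state_in: State   # snapshot of the state the generator was given


def split_intent(intent: str, seconds: float) -> List[Tuple[str, float]]:
    """Hypothetical planner stub: halve the span and label each half."""
    half = seconds / 2
    return [(f"{intent} / part 1", half), (f"{intent} / part 2", half)]


def fake_generator(prompt: str, seconds: float, state: State) -> State:
    """Stand-in for a frozen short-video model; returns only a state update."""
    return {"last_prompt": prompt}


def reca(intent: str, seconds: float, state: State, budget: float,
         generate: Generator, shots: List[Shot]) -> State:
    """Recursively split until a span fits the budget, then generate."""
    if seconds <= budget:
        # Leaf node: one generator call with only the local intent and state.
        shots.append(Shot(intent, seconds, dict(state)))
        return {**state, **generate(intent, seconds, state)}
    # Internal node: plan sub-intents and thread the state through them.
    for sub_intent, sub_seconds in split_intent(intent, seconds):
        state = reca(sub_intent, sub_seconds, state, budget, generate, shots)
    return state


if __name__ == "__main__":
    shots: List[Shot] = []
    anchor = {"identity": "woman in red coat"}
    final = reca("heist escape", 40.0, anchor, 10.0, fake_generator, shots)
    for shot in shots:
        print(shot.seconds, shot.prompt)
    print(final)
```

In this toy setup a 40-second request with a 10-second budget yields four leaf calls, and each leaf sees only its own sub-intent plus the state handed down from earlier shots. A real planner would presumably choose shot boundaries and prompts from the narrative rather than halving the span.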

Problem

Research questions and friction points this paper is trying to address.

Multi-Shot Video Extrapolation

Long Video Generation

Context Allocation

Cinematic Structure

Temporal Consistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Recursive Context Allocation

Multi-Shot Video Extrapolation

Long Video Generation