OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory

📅 2025-12-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing multi-shot video generation methods rely on local temporal windows or single-keyframe conditioning, which limits their ability to model long-range cross-shot dependencies and weakens narrative coherence. This paper proposes OneStory, an autoregressive next-shot generation framework that models cross-shot context globally yet compactly. Its two key modules are: (1) a Frame Selection module that builds a compact, semantically relevant global memory from informative frames of prior shots; and (2) an Adaptive Conditioner that uses importance-guided patchification to compress this memory into compact context for direct conditioning. The method builds on a pretrained image-to-video (I2V) model and is fine-tuned on a curated 60K multi-shot dataset with referential captions. Experiments show state-of-the-art narrative coherence in both text- and image-conditioned settings, enabling controllable and immersive long-form video generation.
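The next-shot formulation can be illustrated with a minimal sketch. It only shows the autoregressive control flow: each new shot is conditioned on every frame generated so far rather than on a local window or a single keyframe. The names `generate_story` and `toy_generator` and the list-of-floats frame type are assumptions made for illustration; the toy generator is a stand-in for a video model, not the paper's implementation.

```python
from typing import Callable, List, Sequence

Frame = List[float]   # stand-in for one frame (e.g. a feature vector)
Shot = List[Frame]    # a shot is a short sequence of frames


def generate_story(
    captions: Sequence[str],
    generate_shot: Callable[[str, List[Frame]], Shot],
) -> List[Shot]:
    """Generate shots one at a time, each conditioned on all earlier shots."""
    shots: List[Shot] = []
    for caption in captions:
        # The context is every frame produced so far, not just the last shot.
        context = [frame for shot in shots for frame in shot]
        shots.append(generate_shot(caption, context))
    return shots


def toy_generator(caption: str, context: List[Frame]) -> Shot:
    """Hypothetical stand-in for a video model: emits two dummy frames
    whose first value records how many context frames were visible."""
    base = float(len(context))
    return [[base, float(len(caption))], [base + 1.0, float(len(caption))]]


story = generate_story(
    ["a dog wakes up", "it runs outside", "it finds a ball"],
    toy_generator,
)
# Context size seen by each shot grows as the story unfolds.
print([shot[0][0] for shot in story])  # [0.0, 2.0, 4.0]
```

Passing the full history to every step is what makes the raw context grow with story length, which is why the summary pairs this loop with a compact memory.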

📝 Abstract

Storytelling in real-world videos often unfolds through multiple shots -- discontinuous yet semantically connected clips that together convey a coherent narrative. However, existing multi-shot video generation (MSV) methods struggle to effectively model long-range cross-shot context, as they rely on limited temporal windows or single keyframe conditioning, leading to degraded performance under complex narratives. In this work, we propose OneStory, enabling global yet compact cross-shot context modeling for consistent and scalable narrative generation. OneStory reformulates MSV as a next-shot generation task, enabling autoregressive shot synthesis while leveraging pretrained image-to-video (I2V) models for strong visual conditioning. We introduce two key modules: a Frame Selection module that constructs a semantically relevant global memory based on informative frames from prior shots, and an Adaptive Conditioner that performs importance-guided patchification to generate compact context for direct conditioning. We further curate a high-quality multi-shot dataset with referential captions to mirror real-world storytelling patterns, and design effective training strategies under the next-shot paradigm. Finetuned from a pretrained I2V model on our curated 60K dataset, OneStory achieves state-of-the-art narrative coherence across diverse and complex scenes in both text- and image-conditioned settings, enabling controllable and immersive long-form video storytelling.
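One plausible way to picture a frame selection step is top-k retrieval by similarity. The sketch below scores each prior frame against an embedding of the next shot's caption and keeps the k best in temporal order. The abstract does not say how the Frame Selection module scores frames, so cosine similarity, the `select_frames` helper, and the 2-D toy embeddings are all assumptions for illustration only.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_frames(
    frame_embs: Sequence[Vector], query_emb: Vector, k: int
) -> List[Tuple[int, float]]:
    """Return (frame_index, score) for the k frames most similar to the
    query, kept in temporal (index) order."""
    scored = [(i, cosine(f, query_emb)) for i, f in enumerate(frame_embs)]
    top = sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
    return sorted(top)  # tuples sort by frame index first


# Toy embeddings: frames 0-1 show one scene, frames 2-3 another.
prior_frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
next_caption = [0.0, 1.0]  # hypothetical embedding of the next caption

memory = select_frames(prior_frames, next_caption, k=2)
print([index for index, _ in memory])  # [2, 3]
```

Because k is fixed, the memory size stays constant however many shots precede the current one, which matches the "global yet compact" goal stated in the abstract.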

Problem

Research questions and friction points this paper is trying to address.

How to generate coherent multi-shot videos using an adaptive memory of prior shots

How to model long-range cross-shot context so that narratives stay consistent

How to enable controllable long-form video storytelling from text or image inputs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive next-shot generation for multi-shot videos

Frame Selection module builds global memory from prior shots

Adaptive Conditioner performs importance-guided patchification for compact context
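The idea of importance-guided patchification can be sketched as a token budget: frames judged more important are split into smaller patches (more tokens), while less important frames get larger patches (fewer tokens). The bucketing rule, the helper names `assign_patch_sizes` and `token_count`, and the 32x32 frame size are assumptions; this page does not describe how the Adaptive Conditioner actually assigns patch sizes.

```python
from typing import List, Sequence


def assign_patch_sizes(
    scores: Sequence[float], patch_sizes: Sequence[int]
) -> List[int]:
    """Give the most important frames the smallest patch size.

    Frames are ranked by score and split as evenly as possible across
    `patch_sizes`, which must be ordered from fine to coarse.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    sizes = [0] * len(scores)
    for rank, idx in enumerate(order):
        bucket = rank * len(patch_sizes) // len(scores)
        sizes[idx] = patch_sizes[bucket]
    return sizes


def token_count(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch tokens for one frame."""
    return (height // patch) * (width // patch)


scores = [0.2, 0.9, 0.5, 0.1]          # hypothetical importance per frame
sizes = assign_patch_sizes(scores, patch_sizes=[2, 4])
tokens = [token_count(32, 32, p) for p in sizes]

print(sizes)        # [4, 2, 2, 4]
print(sum(tokens))  # 640, versus 1024 if every frame used patch size 2
```

In this toy setting the context shrinks from 1024 to 640 tokens while the two highest-scoring frames keep full detail.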

🔎 Similar Papers

Synchronized Video Storytelling: Generating Video Narrations with Structured Storyline

2024-05-22 | Annual Meeting of the Association for Computational Linguistics | Citations: 2
