StoryMem: Multi-shot Long Video Storytelling with Memory

📅 2025-12-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of simultaneously achieving cross-shot consistency and cinematic-quality output in long-duration, multi-shot video generation. We propose Memory-to-Video, a novel paradigm that constructs a dynamic keyframe memory bank and injects explicit visual memory via latent concatenation and negative RoPE shifts, requiring only lightweight LoRA fine-tuning. To enhance narrative coherence and aesthetic quality, we introduce semantic keyframe selection and aesthetic preference filtering. Leveraging a pre-trained video diffusion model, shot stitching and memory fusion are performed in the latent space. Evaluated on our newly established ST-Bench benchmark, the method achieves, for the first time, controllable, high-fidelity video storytelling lasting approximately 60 seconds across multiple shots. Quantitative and qualitative results demonstrate significant improvements in cross-shot consistency, prompt fidelity, and aesthetic quality over prior approaches.

📝 Abstract
Visual storytelling requires generating multi-shot videos with cinematic quality and long-range consistency. Inspired by human memory, we propose StoryMem, a paradigm that reformulates long-form video storytelling as iterative shot synthesis conditioned on explicit visual memory, transforming pre-trained single-shot video diffusion models into multi-shot storytellers. This is achieved by a novel Memory-to-Video (M2V) design, which maintains a compact and dynamically updated memory bank of keyframes from historical generated shots. The stored memory is then injected into single-shot video diffusion models via latent concatenation and negative RoPE shifts with only LoRA fine-tuning. A semantic keyframe selection strategy, together with aesthetic preference filtering, further ensures informative and stable memory throughout generation. Moreover, the proposed framework naturally accommodates smooth shot transitions and customized story generation applications. To facilitate evaluation, we introduce ST-Bench, a diverse benchmark for multi-shot video storytelling. Extensive experiments demonstrate that StoryMem achieves superior cross-shot consistency over previous methods while preserving high aesthetic quality and prompt adherence, marking a significant step toward coherent minute-long video storytelling.
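The abstract's iterative Memory-to-Video loop — generate a shot conditioned on stored keyframes, then update the memory bank — can be sketched at a high level as follows. This is a minimal illustration only; `MemoryBank`, `generate_shot`, and the scoring interface are hypothetical placeholders, not the paper's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryBank:
    """Compact, dynamically updated store of keyframe latents (illustrative)."""
    capacity: int = 8
    keyframes: list = field(default_factory=list)

    def update(self, new_keyframes):
        # Keep the most recent entries up to capacity; the paper additionally
        # applies semantic selection and aesthetic filtering before storage.
        self.keyframes = (self.keyframes + new_keyframes)[-self.capacity:]

def select_keyframes(shot_latents, scores, k=2):
    # Pick the k frames with the highest combined semantic/aesthetic score.
    ranked = sorted(zip(scores, shot_latents), key=lambda p: p[0], reverse=True)
    return [latent for _, latent in ranked[:k]]

def generate_story(prompts, generate_shot):
    """generate_shot(prompt, memory_keyframes) -> (shot_latents, frame_scores)."""
    bank = MemoryBank()
    shots = []
    for prompt in prompts:
        # Each shot is synthesized conditioned on the current visual memory.
        latents, scores = generate_shot(prompt, bank.keyframes)
        bank.update(select_keyframes(latents, scores))
        shots.append(latents)
    return shots
```

In this sketch, `generate_shot` stands in for the LoRA-adapted single-shot diffusion model; the loop itself is what turns a single-shot generator into a multi-shot storyteller.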
Problem

Research questions and friction points this paper is trying to address.

Generating multi-shot videos with long-range, cross-shot consistency
Transforming pre-trained single-shot video diffusion models into multi-shot storytellers via memory
Maintaining both cross-shot consistency and cinematic aesthetic quality over minute-long videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterative shot synthesis using visual memory bank
Memory injection via latent concatenation and RoPE shifts
Semantic keyframe selection with aesthetic filtering
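The "negative RoPE shift" idea can be illustrated generically: memory keyframe tokens are concatenated with the current shot's tokens but given negative rotary positions, so attention treats them as context preceding the shot. The sketch below uses standard rotary position embedding; the dimensions, token counts, and the exact offset scheme are illustrative assumptions, not the paper's code.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    # Standard RoPE frequencies; positions may be negative for memory tokens.
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions, inv_freq)  # shape (n, dim/2)

def apply_rope(x, positions):
    # x: (n, dim) token features, dim even; rotate each feature pair by its
    # position-dependent angle.
    n, dim = x.shape
    ang = rope_angles(positions, dim)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Memory keyframe tokens get negative positions; video tokens start at 0,
# so the concatenated sequence places memory "before" the current shot.
mem_tokens = np.random.randn(4, 8)
vid_tokens = np.random.randn(16, 8)
mem_pos = np.arange(-4, 0)   # negative RoPE shift for the memory bank
vid_pos = np.arange(16)
tokens = np.concatenate([apply_rope(mem_tokens, mem_pos),
                         apply_rope(vid_tokens, vid_pos)])
```

The design choice being illustrated: latent concatenation plus a positional offset lets a pre-trained attention stack consume memory frames without architectural changes, which is why only LoRA fine-tuning is needed.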