ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models

πŸ“… 2025-05-12
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Current text-to-video diffusion models generate only single-shot videos, lacking support for multi-shot videos featuring discrete transitions of the same character across diverse backgrounds or actions. To address this, we propose the first end-to-end text-driven multi-shot video generation framework. Our method introduces transition tokens and shot-aware local attention masking to enable fine-grained control over shot count, duration, and content. We further design an automated pipeline that synthesizes multi-shot training data from pre-existing single-shot videos. Leveraging a pretrained video diffusion model, our approach achieves high-quality multi-shot generation with only lightweight fine-tuning (a few thousand steps). Experiments demonstrate significant improvements over baselines in shot consistency, text–video alignment, and controllability. Notably, our method is the first to generate coherent multi-shot videos in a single diffusion inference pass.

πŸ“ Abstract
Current diffusion-based text-to-video methods are limited to producing short video clips of a single shot and lack the capability to generate multi-shot videos with discrete transitions where the same character performs distinct activities across the same or different backgrounds. To address this limitation, we propose a framework that includes a dataset collection pipeline and architectural extensions to video diffusion models to enable text-to-multi-shot video generation. Our approach generates a multi-shot video as a single video with full attention across all frames of all shots, ensuring character and background consistency, and allows users to control the number, duration, and content of shots through shot-specific conditioning. This is achieved by incorporating a transition token into the text-to-video model to control at which frames a new shot begins, together with a local attention masking strategy that restricts the transition token's effect and enables shot-specific prompting. To obtain training data, we propose a novel data collection pipeline that constructs a multi-shot video dataset from existing single-shot video datasets. Extensive experiments demonstrate that fine-tuning a pre-trained text-to-video model for a few thousand iterations is enough for the model to generate multi-shot videos with shot-specific control, outperforming the baselines. More details are available at https://shotadapter.github.io/
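To make the masking idea concrete, below is a minimal sketch of a shot-aware attention mask. The abstract only states that frames attend fully across all shots while local masking restricts the transition token and enables shot-specific prompting; the exact sequence layout (frames, then per-shot text tokens, then one transition token per shot) and the rule tying each transition token to its shot's first frame are illustrative assumptions, not the paper's published implementation.

```python
import numpy as np

def shot_attention_mask(shot_lengths, tokens_per_shot):
    """Hypothetical shot-aware attention mask.

    Sequence layout (assumed): [all video frames] + [per-shot text tokens]
    + [one transition token per shot].
    - Frames attend to all frames (full attention -> character/background
      consistency across shots).
    - Each shot's text tokens attend only to that shot's frames
      (shot-specific prompting).
    - Each transition token attends only to the first frame of its shot,
      localizing where a new shot begins.
    """
    n_frames = sum(shot_lengths)
    n_shots = len(shot_lengths)
    n_text = n_shots * tokens_per_shot
    n = n_frames + n_text + n_shots
    mask = np.zeros((n, n), dtype=bool)

    # Full attention among all video frames, across shot boundaries.
    mask[:n_frames, :n_frames] = True

    starts = np.cumsum([0] + list(shot_lengths[:-1]))
    for i, (start, length) in enumerate(zip(starts, shot_lengths)):
        frames = slice(start, start + length)
        text = slice(n_frames + i * tokens_per_shot,
                     n_frames + (i + 1) * tokens_per_shot)
        trans = n_frames + n_text + i
        # Shot-specific prompting: text tokens <-> only this shot's frames.
        mask[frames, text] = True
        mask[text, frames] = True
        mask[text, text] = True
        # Transition token localized to the shot's first frame.
        mask[trans, start] = True
        mask[start, trans] = True
        mask[trans, trans] = True
    return mask
```

With two shots of 2 and 3 frames and 2 text tokens per shot, the mask lets every frame see every other frame, while the second shot's prompt tokens see only frames 2-4 and its transition token sees only frame 2.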
Problem

Research questions and friction points this paper is trying to address.

Generating multi-shot videos with consistent characters and backgrounds
Enabling shot-specific control over duration and content
Creating transitions between distinct activities in videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Incorporates transition token for shot control
Uses local attention masking strategy
Constructs multi-shot dataset from single-shot videos