🤖 AI Summary
Existing public datasets lack character reference images and multi-shot consistency annotations, hindering coherent, character-controllable animation generation. To address this, we introduce AnimeShooter, the first reference-guided, multi-shot video dataset designed specifically for animation generation, featuring character reference images, hierarchical narrative scripts, and synchronized audio. We propose a three-level semantic annotation framework (story, shot, and audio) and an automated pipeline that ensures visual consistency across shots. Methodologically, we design AnimeShooterGen, which jointly encodes the reference image and previously generated shots, integrating a multimodal large language model (MLLM) with a video diffusion model to enable conditional, shot-by-shot generation. Experiments demonstrate that our approach significantly outperforms baselines in cross-shot visual consistency and fidelity to the character reference, validating AnimeShooter's effectiveness for high-quality, controllable animation generation.
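As a rough illustration of the three-level annotation hierarchy (story, shot, audio), the sketch below models it with Python dataclasses. The field names are illustrative assumptions for readability, not the dataset's actual schema or keys.

```python
# Hypothetical sketch of the story -> shot -> audio annotation hierarchy.
# Field names are assumptions, not the published AnimeShooter schema.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CharacterProfile:
    name: str
    description: str
    reference_image_path: str            # character reference image

@dataclass
class ShotAnnotation:
    shot_id: int
    scene: str                           # scene description for this shot
    characters: List[str]                # characters appearing in the shot
    narrative_caption: str               # story-driven caption
    descriptive_caption: str             # appearance-focused visual caption
    audio_path: Optional[str] = None     # present only in the audio subset
    audio_description: Optional[str] = None
    sound_source: Optional[str] = None

@dataclass
class StoryAnnotation:
    storyline: str
    key_scenes: List[str]
    main_characters: List[CharacterProfile]
    shots: List[ShotAnnotation] = field(default_factory=list)
```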
📝 Abstract
Recent advances in AI-generated content (AIGC) have significantly accelerated animation production. To produce engaging animations, it is essential to generate coherent multi-shot video clips with narrative scripts and character references. However, existing public datasets primarily focus on real-world scenarios with global descriptions and lack reference images for consistent character guidance. To bridge this gap, we present AnimeShooter, a reference-guided multi-shot animation dataset. AnimeShooter features comprehensive hierarchical annotations and strong visual consistency across shots through an automated pipeline. Story-level annotations provide an overview of the narrative, including the storyline, key scenes, and main character profiles with reference images, while shot-level annotations decompose the story into consecutive shots, each annotated with its scene, characters, and both narrative and descriptive visual captions. Additionally, a dedicated subset, AnimeShooter-audio, offers synchronized audio tracks for each shot, along with audio descriptions and sound sources. To demonstrate the effectiveness of AnimeShooter and establish a baseline for the reference-guided multi-shot video generation task, we introduce AnimeShooterGen, which leverages Multimodal Large Language Models (MLLMs) and video diffusion models. The reference image and previously generated shots are first processed by the MLLM to produce representations aware of both the reference and the generation context, which are then used as the condition for the diffusion model to decode the subsequent shot. Experimental results show that the model trained on AnimeShooter achieves superior cross-shot visual consistency and adherence to the reference visual guidance, highlighting the value of our dataset for coherent animated video generation.
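To make the autoregressive conditioning concrete, here is a minimal sketch of the shot-by-shot generation loop the abstract describes: each new shot is decoded from a condition produced by the MLLM from the reference image and the shots generated so far. The `mllm_encode` and `diffusion_decode` callables are hypothetical stand-ins, not AnimeShooterGen's actual interface.

```python
# Minimal sketch of reference-conditioned, shot-by-shot generation.
# `mllm_encode` and `diffusion_decode` are hypothetical callables standing in
# for the MLLM and the video diffusion decoder.
from typing import Callable, List, Sequence

def generate_shots(
    reference_image,                  # character reference image
    shot_captions: Sequence[str],     # one caption per shot to generate
    mllm_encode: Callable,            # (reference, history, caption) -> condition
    diffusion_decode: Callable,       # (condition) -> generated shot (video clip)
) -> List:
    """Autoregressively decode shots, conditioning each one on the reference
    image and all previously generated shots."""
    history: List = []                # previously generated shots (context)
    for caption in shot_captions:
        # The MLLM fuses the reference image, the generation history, and the
        # current shot caption into a context-aware conditioning representation.
        condition = mllm_encode(reference_image, history, caption)
        # The video diffusion model decodes the next shot from that condition.
        shot = diffusion_decode(condition)
        history.append(shot)
    return history
```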