MagicAnime: A Hierarchically Annotated, Multimodal and Multitasking Dataset with Benchmarks for Cartoon Animation Generation

📅 2025-07-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Cartoon animation generation faces core challenges including difficulty in modeling non-human characters, high stylistic and motion diversity, low-fidelity emotional expression, severe domain shift due to abstraction and exaggeration, and scarcity of multimodal annotated data. Method: We introduce MagicAnime—the first hierarchically annotated multimodal cartoon animation dataset—covering four task categories: image-to-video, keypoint-driven animation, audio-to-face, and video-to-face animation, with high-quality samples and fine-grained annotations. We further develop MagicAnime-Bench, a benchmark platform enabling systematic evaluation of controllability, fidelity, and generalization. Annotation efficiency and consistency are enhanced via hybrid human-curated and semi-automatic pipelines. Contribution/Results: Experiments demonstrate state-of-the-art fine-grained control across multiple generation tasks, effectively bridging semantic and motion-domain gaps between cartoon and real-world videos, thereby establishing a robust foundation for controllable, high-fidelity cartoon animation research.

Technology Category

Application Category

📝 Abstract
Generating high-quality cartoon animations multimodal control is challenging due to the complexity of non-human characters, stylistically diverse motions and fine-grained emotions. There is a huge domain gap between real-world videos and cartoon animation, as cartoon animation is usually abstract and has exaggerated motion. Meanwhile, public multimodal cartoon data are extremely scarce due to the difficulty of large-scale automatic annotation processes compared with real-life scenarios. To bridge this gap, We propose the MagicAnime dataset, a large-scale, hierarchically annotated, and multimodal dataset designed to support multiple video generation tasks, along with the benchmarks it includes. Containing 400k video clips for image-to-video generation, 50k pairs of video clips and keypoints for whole-body annotation, 12k pairs of video clips for video-to-video face animation, and 2.9k pairs of video and audio clips for audio-driven face animation. Meanwhile, we also build a set of multi-modal cartoon animation benchmarks, called MagicAnime-Bench, to support the comparisons of different methods in the tasks above. Comprehensive experiments on four tasks, including video-driven face animation, audio-driven face animation, image-to-video animation, and pose-driven character animation, validate its effectiveness in supporting high-fidelity, fine-grained, and controllable generation.
Problem

Research questions and friction points this paper is trying to address.

Addressing multimodal control challenges in cartoon animation generation
Bridging domain gap between real-world videos and cartoon animations
Providing large-scale annotated dataset for diverse animation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchically annotated multimodal dataset
Multi-task cartoon animation benchmarks
High-fidelity controllable generation methods
🔎 Similar Papers
No similar papers found.
S
Shuolin Xu
National Centre for Computer Animation, Bournemouth University
Bingyuan Wang
Bingyuan Wang
The Hong Kong University of Science and Technology (Guangzhou)
Generative AIAffective ComputingImmersive StorytellingCreative IntelligenceCultural Heritage
Zeyu Cai
Zeyu Cai
Institute of Heavy Ion Physics, Peking University
AI for SciencePlasma PhysicsAI AgentsNumber Theory
F
Fangteng Fu
Hong Kong University of Science and Technology (Guangzhou)
Yue Ma
Yue Ma
Bytedance
NLPDialogue SystemLLM
T
Tongyi Lee
Department of Computer Science and Information Engineering, National Cheng Kung University
Hongchuan Yu
Hongchuan Yu
National Centre for Computer Animation, Bournemouth University
Z
Zeyu Wang
Hong Kong University of Science and Technology (Guangzhou); Hong Kong University of Science and Technology