MagicAnime: A Hierarchically Annotated, Multimodal and Multitasking Dataset with Benchmarks for Cartoon Animation Generation

📅 2025-07-27
📈 Citations: 0
Influential: 0

🤖 AI Summary
Problem: Cartoon animation generation faces several core challenges: non-human characters are hard to model, styles and motions are highly diverse, emotional expression is often rendered with low fidelity, abstraction and exaggeration cause a severe domain shift from real-world video, and multimodal annotated data are scarce. Method: The paper introduces MagicAnime, described as the first hierarchically annotated multimodal cartoon animation dataset. It covers four task categories (image-to-video, keypoint-driven animation, audio-to-face, and video-to-face animation) with high-quality samples and fine-grained annotations. The authors also develop MagicAnime-Bench, a benchmark suite for systematic evaluation of controllability, fidelity, and generalization. Annotation efficiency and consistency are improved through hybrid human-curated and semi-automatic pipelines. Contribution/Results: Experiments across the four generation tasks indicate that the dataset supports fine-grained, controllable generation and helps bridge the semantic and motion-domain gaps between cartoon and real-world videos, providing a foundation for controllable, high-fidelity cartoon animation research.

📝 Abstract
Generating high-quality cartoon animations under multimodal control is challenging due to the complexity of non-human characters, stylistically diverse motions, and fine-grained emotions. There is a large domain gap between real-world videos and cartoon animation, since cartoon animation is usually abstract and has exaggerated motion. Meanwhile, public multimodal cartoon data are extremely scarce because large-scale automatic annotation is harder for cartoons than for real-life footage. To bridge this gap, we propose the MagicAnime dataset, a large-scale, hierarchically annotated, multimodal dataset designed to support multiple video generation tasks, together with accompanying benchmarks. The dataset contains 400k video clips for image-to-video generation, 50k pairs of video clips and keypoints for whole-body annotation, 12k pairs of video clips for video-to-video face animation, and 2.9k pairs of video and audio clips for audio-driven face animation. We also build a set of multimodal cartoon animation benchmarks, called MagicAnime-Bench, to support comparison of different methods on these tasks. Comprehensive experiments on four tasks (video-driven face animation, audio-driven face animation, image-to-video animation, and pose-driven character animation) validate its effectiveness in supporting high-fidelity, fine-grained, and controllable generation.
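
The subset sizes quoted in the abstract can be laid out as a small registry to make the imbalance between tasks easier to see. The sketch below is illustrative only: the class name `Subset`, the field names, and the helper `size_ratio_to_largest` are hypothetical and are not part of any released MagicAnime tooling. The sizes are the rounded figures from the abstract, and the abstract does not say whether the subsets overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subset:
    """One MagicAnime subset as described in the abstract (hypothetical schema)."""
    task: str
    contents: str
    approx_size: int  # rounded figure quoted in the abstract


# Sizes copied from the abstract; everything else here is an assumption.
SUBSETS = [
    Subset("image-to-video generation", "video clips", 400_000),
    Subset("whole-body keypoint annotation", "video clip + keypoint pairs", 50_000),
    Subset("video-to-video face animation", "video clip pairs", 12_000),
    Subset("audio-driven face animation", "video + audio clip pairs", 2_900),
]


def size_ratio_to_largest(subsets):
    """Return, per task, how many times smaller it is than the largest subset."""
    largest = max(s.approx_size for s in subsets)
    return {s.task: largest / s.approx_size for s in subsets}


if __name__ == "__main__":
    ratios = size_ratio_to_largest(SUBSETS)
    for s in SUBSETS:
        print(f"{s.task:32s} {s.approx_size:>8,d}  (1/{ratios[s.task]:.1f} of largest)")
```

By these figures the audio-driven subset is roughly two orders of magnitude smaller than the image-to-video subset, which is worth keeping in mind when comparing results across the four benchmark tasks.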

Problem

Research questions and friction points this paper is trying to address.

Addressing multimodal control challenges in cartoon animation generation
Bridging domain gap between real-world videos and cartoon animations
Providing large-scale annotated dataset for diverse animation tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchically annotated multimodal dataset
Multi-task cartoon animation benchmarks
Experimental validation of high-fidelity, controllable generation across four tasks