MoCha: Towards Movie-Grade Talking Character Synthesis

📅 2025-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video generation methods improve motion realism but struggle with character-driven narrative generation, particularly full-body, multi-character coordinated animation. This paper introduces the first cinematic, narrative-oriented, speech-and-text-driven full-body character animation framework. First, the authors propose a speech-video windowed attention mechanism to enhance cross-modal temporal alignment. Second, they introduce a joint training strategy that leverages both speech-labeled and text-labeled video data. Third, they design a structured character-tag prompting template that explicitly models multi-character turn-taking and scene coherence. The method advances full-body motion modeling, multi-character coordination, and narrative consistency, substantially outperforming state-of-the-art approaches. Comprehensive human evaluations and benchmark tests confirm superior performance, establishing a new standard for AI-driven film and animation production.

📝 Abstract
Recent advancements in video generation have achieved impressive motion realism, yet they often overlook character-driven storytelling, a crucial task for automated film and animation generation. We introduce Talking Characters, a more realistic task of generating talking character animations directly from speech and text. Unlike talking head generation, Talking Characters aims to generate the full portrait of one or more characters beyond the facial region. In this paper, we propose MoCha, the first of its kind to generate talking characters. To ensure precise synchronization between video and speech, we propose a speech-video window attention mechanism that effectively aligns speech and video tokens. To address the scarcity of large-scale speech-labeled video datasets, we introduce a joint training strategy that leverages both speech-labeled and text-labeled video data, significantly improving generalization across diverse character actions. We also design structured prompt templates with character tags, enabling, for the first time, multi-character conversation with turn-based dialogue, allowing AI-generated characters to engage in context-aware conversations with cinematic coherence. Extensive qualitative and quantitative evaluations, including human preference studies and benchmark comparisons, demonstrate that MoCha sets a new standard for AI-generated cinematic storytelling, achieving superior realism, expressiveness, controllability, and generalization.
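The window attention idea in the abstract can be sketched as a cross-attention mask in which each video token attends only to speech tokens near its aligned temporal position. The following is a minimal illustration under assumed settings (the window size, token granularity, and linear alignment are assumptions, not the paper's exact design):

```python
def speech_video_window_mask(num_video_tokens, num_speech_tokens, window=2):
    """Build a boolean cross-attention mask restricting each video token
    to a local window of speech tokens around its aligned position.
    Sketch of windowed speech-video attention; parameters are illustrative."""
    mask = [[False] * num_speech_tokens for _ in range(num_video_tokens)]
    for v in range(num_video_tokens):
        # Linearly map the video token index onto the speech-token axis.
        center = round(v * (num_speech_tokens - 1) / max(num_video_tokens - 1, 1))
        lo = max(center - window, 0)
        hi = min(center + window + 1, num_speech_tokens)
        for s in range(lo, hi):
            mask[v][s] = True
    return mask

# Example: 4 video tokens attending over 8 speech tokens, window of 1.
mask = speech_video_window_mask(4, 8, window=1)
```

Tokens outside the window would be masked out (e.g. set to negative infinity before the softmax), keeping lip motion tied to the locally corresponding audio rather than the whole utterance.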
Problem

Research questions and friction points this paper is trying to address.

Generating full-portrait talking characters from speech and text
Aligning speech and video tokens for precise synchronization
Enabling multi-character conversations with cinematic coherence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speech-video window attention for synchronization
Joint training with speech and text data
Structured prompts for multi-character conversations
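The structured character-tag prompting listed above might be assembled roughly as follows; the tag syntax, field names, and `build_prompt` helper are illustrative assumptions, not the paper's exact template:

```python
def build_prompt(scene, characters, turns):
    """Assemble a structured multi-character prompt: a scene description,
    per-character tags, and an ordered list of dialogue turns.
    (Hypothetical sketch; the tag format is an assumption.)"""
    lines = [f"Scene: {scene}"]
    for tag, description in characters.items():
        lines.append(f"[{tag}] {description}")
    for tag, utterance in turns:
        lines.append(f"[{tag}] says: {utterance}")
    return "\n".join(lines)

prompt = build_prompt(
    "A dimly lit cafe at night",
    {"Person1": "a woman in a red coat", "Person2": "an older man with glasses"},
    [("Person1", "Did you see it?"), ("Person2", "I did.")],
)
```

Binding each utterance to an explicit character tag is what lets the model attribute turns to the right character and keep the conversation coherent across shots.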