FunCineForge: A Unified Dataset Toolkit and Model for Zero-Shot Movie Dubbing in Diverse Cinematic Scenes

📅 2026-01-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing approaches to film and television dubbing are constrained by small-scale, sparsely annotated datasets limited to single-speaker monologues, and they rely heavily on lip-region modeling; as a result, they struggle with complex scenes and underperform in lip-sync accuracy, voice-timbre preservation, and emotional expressiveness. This work proposes an end-to-end pipeline for automatically constructing a large-scale Chinese dubbing dataset and introduces, for the first time, a richly annotated, multi-scenario dubbing benchmark. Building on this dataset, the authors develop a zero-shot dubbing framework driven by a multimodal large language model (MLLM) that leverages full-frame visual context, enabling robust performance across diverse cinematic scenarios, including monologues, dialogues, and multi-speaker interactions. Experimental results demonstrate that the method outperforms state-of-the-art approaches in speech quality, lip-sync fidelity, voice-timbre reproduction, and instruction following.

📝 Abstract
Movie dubbing is the task of synthesizing speech from scripts conditioned on video scenes, requiring accurate lip sync, faithful timbre transfer, and proper modeling of character identity and emotion. However, existing methods face two major limitations: (1) high-quality multimodal dubbing datasets are limited in scale, suffer from high word error rates, contain sparse annotations, rely on costly manual labeling, and are restricted to monologue scenes, all of which hinder effective model training; (2) existing dubbing models rely solely on the lip region to learn audio-visual alignment, which limits their applicability to complex live-action cinematic scenes, and they exhibit suboptimal performance in lip sync, speech quality, and emotional expressiveness. To address these issues, we propose FunCineForge, which comprises an end-to-end production pipeline for large-scale dubbing datasets and an MLLM-based dubbing model designed for diverse cinematic scenes. Using the pipeline, we construct the first richly annotated Chinese television dubbing dataset and demonstrate its high quality. Experiments across monologue, narration, dialogue, and multi-speaker scenes show that our dubbing model consistently outperforms SOTA methods in audio quality, lip sync, timbre transfer, and instruction following. Code and demos are available at https://anonymous.4open.science/w/FunCineForge.
Problem

Research questions and friction points this paper is trying to address.

movie dubbing
multimodal dataset
lip sync
timbre transfer
character identity
Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot movie dubbing
multimodal dataset pipeline
MLLM-based dubbing model
lip sync
timbre transfer