🤖 AI Summary
This work addresses the limitations of traditional video translation pipelines, which rely on cascaded stages and often suffer from insufficient semantic fidelity, temporal misalignment, and inconsistent speaker identity and emotion, particularly in multi-speaker scenarios. The authors present a comprehensive survey of video translation built on multimodal large language models (MLLMs). They organize the literature into a three-role taxonomy comprising a Semantic Reasoner, an Expressive Performer, and a Visual Synthesizer, which cover video understanding and multimodal fusion, expressive and controllable speech generation, and high-fidelity lip-synced video generation, respectively. According to the survey, MLLM-based approaches achieve competitive or superior translation quality and stronger robustness in zero-shot and multi-speaker settings while jointly modeling semantic fidelity, timing, speaker identity, and emotional consistency. The survey closes with open challenges and future research directions.
📝 Abstract
Recent developments in video translation have further enhanced cross-lingual access to video content, with multimodal large language models (MLLMs) playing an increasingly important supporting role. With strong multimodal understanding, reasoning, and generation capabilities, MLLM-based video translation systems are overcoming the limitations of traditional cascaded pipelines that separately handle automatic speech recognition, machine translation, text-to-speech, and lip synchronization. These MLLM-powered approaches not only achieve competitive or superior translation quality but also demonstrate stronger robustness in zero-shot settings and multi-speaker scenarios, while jointly modeling semantic fidelity, timing, speaker identity, and emotional consistency. However, despite the rapid progress of MLLMs and extensive surveys on general video-language understanding, a focused and systematic review of how MLLMs empower video translation tasks is still lacking. To fill this gap, we provide the first comprehensive overview of MLLM-based video translation, organized around a three-role taxonomy: 1) Semantic Reasoner, which characterizes how MLLMs perform video understanding, temporal reasoning, and multimodal fusion; 2) Expressive Performer, which analyzes LLM-driven and LLM-augmented techniques for expressive, controllable speech generation; and 3) Visual Synthesizer, which examines different types of video generators for high-fidelity lip-sync and visual alignment. Finally, we discuss open challenges in video understanding, temporal modeling, and multimodal alignment, and outline promising future research directions for MLLM-powered video translation.