TopicVD: A Topic-Based Dataset of Video-Guided Multimodal Machine Translation for Documentaries

📅 2025-05-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal machine translation (MMT) datasets are largely confined to static images or short video clips and lack documentary-grade video spanning diverse domains, which limits their usefulness for real-world translation tasks. To address this, the authors propose TopicVD: the first video-guided MMT dataset designed specifically for documentary translation, covering eight thematic domains and providing bilingual video–subtitle pairs with global contextual annotations to support domain adaptation and long-range context modeling. Methodologically, the paper introduces a topic-driven benchmark, a cross-modal bidirectional attention mechanism for explicit video–text semantic alignment, and a construction pipeline combining hierarchical topic organization, precise video–subtitle temporal alignment, and context preservation. Experiments show that visual cues yield average BLEU gains of 1.8–3.2, that global context contributes a further +2.4 BLEU, and that cross-domain transfer loses 4.7 BLEU, underscoring the need for domain adaptation.
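The summary names a cross-modal bidirectional attention module but the page gives no implementation details. As a rough illustration only, here is a minimal PyTorch sketch of one plausible reading: text attends to video frames, video attends to text tokens, and the two attended views are fused back into the text stream. All dimensions, layer choices, and the pooling/fusion step are assumptions, not the authors' code.

```python
# Hedged sketch of a cross-modal bidirectional attention module.
# Dimensions, layer names, and the fusion step are illustrative assumptions.
import torch
import torch.nn as nn

class CrossModalBidirectionalAttention(nn.Module):
    """Text attends to video frames and video attends to text tokens;
    the two attended views are fused back into the text stream."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.text_to_video = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.video_to_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # text: (batch, n_tokens, d_model); video: (batch, n_frames, d_model)
        t2v, _ = self.text_to_video(query=text, key=video, value=video)
        v2t, _ = self.video_to_text(query=video, key=text, value=text)
        # Pool the video-side view so it can be broadcast over text tokens.
        v2t_pooled = v2t.mean(dim=1, keepdim=True).expand_as(text)
        fused = self.fuse(torch.cat([t2v, v2t_pooled], dim=-1))
        return self.norm(text + fused)  # residual connection

# Usage: fuse subtitle token states with frame features before decoding.
module = CrossModalBidirectionalAttention()
text_states = torch.randn(2, 30, 512)   # e.g., encoded subtitle tokens
frame_feats = torch.randn(2, 16, 512)   # e.g., projected video frame features
out = module(text_states, frame_feats)  # (2, 30, 512)
```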

📝 Abstract
Most existing multimodal machine translation (MMT) datasets are predominantly composed of static images or short video clips, lacking extensive video data across diverse domains and topics. As a result, they fail to meet the demands of real-world MMT tasks, such as documentary translation. In this study, we developed TopicVD, a topic-based dataset for video-supported multimodal machine translation of documentaries, aiming to advance research in this field. We collected video-subtitle pairs from documentaries and categorized them into eight topics, such as economy and nature, to facilitate research on domain adaptation in video-guided MMT. Additionally, we preserved their contextual information to support research on leveraging the global context of documentaries in video-guided MMT. To better capture the shared semantics between text and video, we propose an MMT model based on a cross-modal bidirectional attention module. Extensive experiments on the TopicVD dataset demonstrate that visual information consistently improves the performance of the NMT model in documentary translation. However, the MMT model's performance significantly declines in out-of-domain scenarios, highlighting the need for effective domain adaptation methods. Additionally, experiments demonstrate that global context can effectively improve translation performance. Dataset and our implementations are available at https://github.com/JinzeLv/TopicVD
Problem

Research questions and friction points this paper is trying to address.

Lack of diverse video data for multimodal machine translation
Need for domain adaptation in documentary translation tasks
Improving translation by leveraging global context and cross-modal semantics
Innovation

Methods, ideas, or system contributions that make the work stand out.

TopicVD dataset for video-guided MMT of documentaries
Cross-modal bidirectional attention module
Preserved global context for translation
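The summary above also credits the dataset with precise video–subtitle temporal alignment. As a hedged illustration of what such alignment typically involves, and not the authors' pipeline, the sketch below maps a subtitle's display window onto frame indices; the timestamp units, fps, and sampling stride are assumed for the example.

```python
# Hedged sketch of video-subtitle temporal alignment: pair each subtitle
# with the frames inside its display window. Timestamp format, fps, and
# the frame-sampling rule are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Subtitle:
    start: float  # seconds
    end: float    # seconds
    text: str

def align_subtitle_to_frames(sub: Subtitle, fps: float, stride: int = 8) -> list[int]:
    """Return indices of frames shown while the subtitle is on screen,
    subsampled every `stride` frames to keep the clip short."""
    first = int(sub.start * fps)
    last = max(first, int(sub.end * fps))
    return list(range(first, last + 1, stride))

# Usage: a subtitle displayed from 12.4s to 15.0s in a 25 fps documentary.
sub = Subtitle(start=12.4, end=15.0, text="The glacier retreats each summer.")
frame_ids = align_subtitle_to_frames(sub, fps=25.0)
print(frame_ids)  # [310, 318, 326, 334, 342, 350, 358, 366, 374]
```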
🔎 Similar Papers
No similar papers found.
Jinze Lv
College of Application and Technology, Shenzhen University, China
Jian Chen
College of Application and Technology, Shenzhen University, China
Zi Long
College of Big Data and Internet, Shenzhen Technology University, China
Xianghua Fu
Shenzhen Technology University
Machine Learning, Natural Language Processing
Yin Chen