🤖 AI Summary
This work addresses the challenge of conveying emotion and contextual nuance in subtitle translation from English into five low-resource Indic languages (Hindi, Bengali, Telugu, Tamil, and Kannada), where text alone is often insufficient. The authors compare two lightweight visual grounding strategies: structured visual attribute summaries extracted from a 5-minute sliding window, and free-form text summaries of the visual gaps between subtitles. In a case study on five full-length films, they find that temporal misalignment between subtitles and frames often makes indiscriminate visual grounding ineffective. In contrast, oracle selective grounding, which replaces only the lowest-quality 20–30% of baseline segments with visually enhanced outputs, consistently improves COMET over the text-only baseline while requiring far less visual processing. Of the two strategies, coarse attribute-based summaries prove more robust at recovering scene-level emotion and subtle social and contextual cues.
📝 Abstract
Movie subtitle translation is inherently multimodal, yet text-only systems often miss the visual cues needed to convey emotion, action, and social nuance, especially for low-resource Indic languages (English to Hindi, Bengali, Telugu, Tamil, and Kannada). We present a case study on five full-length films and compare two lightweight visual grounding strategies: structured attribute summaries from a 5-minute sliding window and free-text summaries of inter-subtitle visual gaps. Our analysis shows that temporal misalignment between subtitles and frames is a major obstacle in long-form video, often rendering indiscriminate visual grounding ineffective. However, oracle selective grounding, which replaces only the lowest-quality 20-30% of baseline segments with visually enhanced outputs, consistently improves COMET over the text-only baseline while requiring far less visual processing. Of the two approaches, coarse attribute-based visual context summarization is more robust, capturing scene-level emotion and subtle contextual cues that text alone often misses.
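
The oracle selective grounding step described above can be illustrated with a minimal sketch. It assumes that a per-segment quality score (for example, a segment-level COMET score) is already available for every baseline translation; the abstract does not specify how segments are ranked, so the function name, its parameters, and the tie-breaking behavior are hypothetical and not the authors' implementation.

```python
from typing import List, Sequence


def oracle_selective_grounding(
    baseline: Sequence[str],
    visual: Sequence[str],
    baseline_scores: Sequence[float],
    fraction: float = 0.25,
) -> List[str]:
    """Replace the lowest-scoring share of baseline segments (hypothetical helper).

    baseline        -- text-only translations, one per subtitle segment
    visual          -- visually enhanced translations for the same segments
    baseline_scores -- assumed per-segment quality scores of the baseline
                       (higher is better), e.g. segment-level COMET
    fraction        -- share of segments to replace; the abstract reports 20-30%
    """
    if not (len(baseline) == len(visual) == len(baseline_scores)):
        raise ValueError("all inputs must cover the same segments")
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")

    # Number of segments to swap, rounded down.
    n_replace = int(len(baseline) * fraction)

    # Indices of the lowest-scoring baseline segments (ties keep original order).
    worst = sorted(range(len(baseline)), key=lambda i: baseline_scores[i])[:n_replace]

    # Start from the text-only output and swap in only the selected segments.
    merged = list(baseline)
    for i in worst:
        merged[i] = visual[i]
    return merged
```

Because only the selected segments need visually enhanced translations, a non-oracle variant of this idea would require a quality estimate that is available before visual processing; the sketch above sidesteps that question by taking the scores as given.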