Can Text-to-Video Generation help Video-Language Alignment?

📅 2025-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video-language alignment models rely on automatically constructed negative captions, which can introduce linguistic bias: certain concepts appear only in negative form and are never grounded in a video, and real-world datasets lack the fine-grained variation needed to cover the space of possible negatives. To address this, we propose SynViTA, a framework that leverages controllable synthetic videos to mitigate such bias. SynViTA dynamically weights each synthetic video according to how similar its target caption is to the real caption, and adds a semantic consistency loss that focuses the model on fine-grained caption differences rather than visual noise in the generated videos. The method requires no additional human annotation and improves the model's sensitivity to subtle textual distinctions. Evaluated on four challenging benchmarks—VideoCon, SSv2-Temporal, SSv2-Events, and ATP-Hard—SynViTA improves average performance over existing methods, demonstrating that synthetic videos are a scalable, controllable source of negative samples.

📝 Abstract
Recent video-language alignment models are trained on sets of videos, each with an associated positive caption and a negative caption generated by large language models. A problem with this procedure is that negative captions may introduce linguistic biases, i.e., concepts are seen only as negatives and never associated with a video. While a solution would be to collect videos for the negative captions, existing databases lack the fine-grained variations needed to cover all possible negatives. In this work, we study whether synthetic videos can help to overcome this issue. Our preliminary analysis with multiple generators shows that, while promising on some tasks, synthetic videos harm the performance of the model on others. We hypothesize this issue is linked to noise (semantic and visual) in the generated videos and develop a method, SynViTA, that accounts for it. SynViTA dynamically weights the contribution of each synthetic video based on how similar its target caption is to its real counterpart. Moreover, a semantic consistency loss makes the model focus on fine-grained differences across captions, rather than differences in video appearance. Experiments show that, on average, SynViTA improves over existing methods on the VideoCon test sets and the SSv2-Temporal, SSv2-Events, and ATP-Hard benchmarks, a first promising step toward using synthetic videos when learning video-language models.
Problem

Research questions and friction points this paper is trying to address.

Negative captions introduce linguistic biases in video-language alignment
Existing databases lack fine-grained variations for negative captions
Synthetic videos may help but introduce noise and inconsistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses synthetic videos to address negative caption bias
Introduces SynViTA for dynamic video weighting
Employs semantic consistency loss for fine-grained differences
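The dynamic weighting idea above can be sketched in a few lines. This is a minimal illustration, not the paper's exact formulation: the function names, the softmax form, and the temperature are assumptions; the paper specifies only that each synthetic video's contribution is weighted by how similar its target caption is to the real caption.

```python
import math

def weight_synthetic_videos(caption_sims, temperature=0.5):
    """Hypothetical dynamic weighting via a temperature-scaled softmax.

    caption_sims[i] is the similarity between synthetic video i's target
    caption and the real caption; a higher similarity yields a larger
    weight on that video's alignment loss.
    """
    scaled = [s / temperature for s in caption_sims]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_alignment_loss(per_video_losses, weights):
    """Combine per-synthetic-video alignment losses with the weights."""
    return sum(w * l for w, l in zip(weights, per_video_losses))
```

Under this sketch, a synthetic video whose target caption closely matches the real one dominates the loss, while off-target generations are down-weighted rather than discarded.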