TARA: Simple and Efficient Time Aware Retrieval Adaptation of MLLMs for Video Understanding

📅 2025-12-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video-text retrieval models typically depend on large-scale annotated video data to learn temporal sensitivity. Method: The paper proposes TARA, a lightweight adaptation framework that turns multimodal large language models (MLLMs) into time-aware retrieval models zero-shot, without video frames or explicit temporal modules, via prompt engineering and contrastive learning. Contributions/Results: (1) ChiralBench, the first time-aware retrieval benchmark built around chiral-action (temporally opposite) hard negatives; (2) the first zero-shot temporal transfer of MLLMs; (3) unexpected cross-dimensional generalization, including negation awareness and verb/adverb understanding. Experiments show TARA achieves zero-shot state-of-the-art (SOTA) performance on ChiralBench and also attains SOTA on NegBench and fine-grained semantic understanding tasks.
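The summary describes the recipe only at a high level. The sketch below illustrates the general pattern it names: prompt-based embedding extraction from a language-model backbone plus contrastive training with temporally reversed captions as hard negatives. It is a minimal illustration under stated assumptions, not TARA's actual implementation; the GPT-2 stand-in backbone, prompt template, last-token pooling, and paraphrase positives are all hypothetical choices.

```python
# Minimal sketch of text-only contrastive adaptation with chiral hard negatives.
# Assumptions (not from the paper): GPT-2 as a stand-in for the MLLM backbone,
# a hand-written prompt template, last-token pooling, paraphrases as positives.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in for an MLLM backbone
tokenizer.pad_token = tokenizer.eos_token
encoder = AutoModel.from_pretrained("gpt2")

def embed(texts):
    """Embed captions via a time-aware prompt and last-token pooling (an assumption)."""
    prompts = [f"Describe the action and its temporal order: {t}" for t in texts]
    batch = tokenizer(prompts, padding=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state                        # (B, T, D)
    # Pool the last non-padding token of each sequence.
    last = hidden[torch.arange(hidden.size(0)), batch["attention_mask"].sum(1) - 1]
    return F.normalize(last, dim=-1)

def contrastive_loss(anchor, positive, chiral_negative, temperature=0.07):
    """InfoNCE: each anchor must match its positive against its own chiral
    (temporally reversed) negative and all other in-batch candidates."""
    candidates = torch.cat([positive, chiral_negative], dim=0)         # (2B, D)
    logits = anchor @ candidates.T / temperature                       # (B, 2B)
    labels = torch.arange(anchor.size(0))                              # positive i is column i
    return F.cross_entropy(logits, labels)

# Toy usage: chiral negatives are the same actions with temporal order reversed.
captions = ["a person opens the door", "a hand picks up the cup"]
paraphrases = ["someone opening a door", "someone lifting a cup"]
reversed_captions = ["a person closes the door", "a hand puts down the cup"]
loss = contrastive_loss(embed(captions), embed(paraphrases), embed(reversed_captions))
loss.backward()
```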

📝 Abstract
Our objective is to build a general time-aware video-text embedding model for retrieval. To that end, we propose a simple and efficient recipe, dubbed TARA (Time Aware Retrieval Adaptation), to adapt Multimodal LLMs (MLLMs) into a time-aware video-text embedding model without using any video data at all. To evaluate time-awareness in retrieval, we propose a new benchmark with temporally opposite (chiral) actions as hard negatives and curated splits for chiral and non-chiral actions. We show that TARA outperforms all existing video-text models on this chiral benchmark while also achieving strong results on standard benchmarks. Furthermore, we discover additional benefits of TARA beyond time-awareness: (i) TARA embeddings are negation-aware, as shown on the NegBench benchmark that evaluates negation in video retrieval; (ii) TARA achieves state-of-the-art performance on verb and adverb understanding in videos. Overall, TARA yields a strong, versatile, time-aware video-text embedding model with state-of-the-art zero-shot performance.
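For concreteness, the chiral-pair evaluation the abstract describes can be phrased as a pairwise ranking test: a model passes an example when the true caption outscores its temporally opposite counterpart. The function name and tensor layout below are assumptions for illustration, not the benchmark's official protocol.

```python
# Sketch of chiral-pair evaluation: score each video against its true caption
# and the temporally reversed (chiral) version, and count correct rankings.
import torch
import torch.nn.functional as F

def chiral_accuracy(video_embs, true_caption_embs, chiral_caption_embs):
    """Fraction of videos whose true caption outscores its chiral hard negative.

    video_embs:          (N, D) embeddings of the N evaluation videos
    true_caption_embs:   (N, D) embeddings of the matching captions
    chiral_caption_embs: (N, D) embeddings of the temporally reversed captions
    """
    v = F.normalize(video_embs, dim=-1)
    pos = (v * F.normalize(true_caption_embs, dim=-1)).sum(-1)   # cosine to true caption
    neg = (v * F.normalize(chiral_caption_embs, dim=-1)).sum(-1) # cosine to chiral negative
    return (pos > neg).float().mean().item()

# With random embeddings, a chance-level model scores ~0.5.
N, D = 128, 512
print(chiral_accuracy(torch.randn(N, D), torch.randn(N, D), torch.randn(N, D)))
```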
Problem

Research questions and friction points this paper is trying to address.

Adapting MLLMs into time-aware video-text embedding models without any video data
Evaluating temporal understanding with chiral (temporally opposite) actions as hard negatives
Improving negation, verb, and adverb comprehension in video retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts MLLMs without any video data
Uses temporally opposite (chiral) actions as hard negatives
Achieves state-of-the-art zero-shot performance across temporal, negation, and verb/adverb benchmarks
🔎 Similar Papers
2024-02-20 · International Conference on Machine Learning · Citations: 30