On the Consistency of Video Large Language Models in Temporal Comprehension

📅 2024-11-20

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video large language models (Video-LLMs) exhibit poor temporal consistency in time localization tasks, demonstrating high sensitivity to variations in video content, linguistic queries, and task configurations. To address this, we propose Event-Time Verification Consistency Fine-tuning (ETV-CFT), the first method explicitly enforcing temporal consistency via event-level time verification. Our approach introduces a probe-driven evaluation framework that quantifies model stability in response to initial temporal anchors, and designs an event-level time verification loss coupled with contrastive learning to explicitly regularize temporal coherence—without compromising localization accuracy. Extensive experiments demonstrate that ETV-CFT improves temporal localization consistency by 23.6% on average across multiple Video-LLMs, while delivering consistent performance gains on mainstream benchmarks. The code and datasets are publicly released.

📝 Abstract

Video large language models (Video-LLMs) can temporally ground language queries and retrieve video moments. Yet these temporal comprehension capabilities are neither well studied nor well understood. We therefore conduct a study on prediction consistency, a key indicator of the robustness and trustworthiness of temporal grounding. After the model identifies an initial moment within the video content, we apply a series of probes to check whether the model's responses align with this initial grounding, as an indicator of reliable comprehension. Our results reveal that current Video-LLMs are sensitive to variations in video content, language queries, and task settings, exposing severe deficiencies in maintaining consistency. We further explore common prompting and instruction-tuning methods as potential solutions, but find that their improvements are often unstable. To that end, we propose event temporal verification tuning, which explicitly accounts for consistency, and demonstrate significant improvements in both grounding and consistency. Our data and code are open-sourced at https://github.com/minjoong507/Consistency-of-Video-LLM.
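
The probing idea in the abstract can be illustrated with a minimal sketch. It is not the paper's evaluation code: the names `temporal_iou` and `consistency_rate`, the `(start, end)` moment representation in seconds, and the 0.5 IoU threshold are all assumptions made for illustration. The sketch treats a probe response as consistent when the moment it predicts overlaps sufficiently with the initially grounded moment.

```python
from typing import List, Tuple

# Hypothetical representation: a moment is (start_sec, end_sec).
Moment = Tuple[float, float]


def temporal_iou(a: Moment, b: Moment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def consistency_rate(
    initial: Moment,
    probe_predictions: List[Moment],
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> float:
    """Fraction of probe predictions that agree with the initial grounding.

    A probe prediction counts as consistent when its IoU with the
    initial moment reaches the threshold.
    """
    if not probe_predictions:
        return 0.0
    hits = sum(temporal_iou(initial, p) >= threshold for p in probe_predictions)
    return hits / len(probe_predictions)


if __name__ == "__main__":
    initial = (10.0, 20.0)
    # Predictions obtained after, e.g., rephrasing the query (made-up values).
    probes = [(10.0, 20.0), (12.0, 22.0), (30.0, 40.0)]
    print(consistency_rate(initial, probes))
```

In this made-up example two of the three probe predictions overlap the initial moment enough to count as consistent. A real evaluation would obtain `probe_predictions` by querying the Video-LLM again under varied video content, queries, or task settings.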

Problem

Research questions and friction points this paper is trying to address.

Assessing consistency in Video-LLMs for temporal comprehension.

Identifying deficiencies in Video-LLMs' response consistency.

Proposing event temporal verification to improve consistency.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Probes verify the consistency of the model's temporal grounding.

Event temporal verification tuning enhances consistency.

Open-sourced data and code for reproducibility.

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs