VideoQA in the Era of LLMs: An Empirical Study

📅 2024-08-08
🏛️ International Journal of Computer Vision
📈 Citations: 13 (1 influential)
🤖 AI Summary
This study systematically evaluates video large language models (Video-LLMs) on temporal reasoning, robustness, and interpretability in video question answering (VideoQA). We conduct controlled, multi-benchmark experiments, including temporal localization evaluation, adversarial video perturbation testing, and sensitivity analyses of responses to candidate-answer and question variations. Our key finding is that, despite strong performance on standard VideoQA benchmarks, Video-LLMs exhibit severe deficits in temporal reasoning: they are insensitive to meaningful perturbations of video content yet highly sensitive to minor variations in candidate answers or questions. This paradox challenges the assumption that Video-LLMs possess human-like temporal understanding and generalization capability, and it reveals fundamental limitations in causal reasoning and evidence-based justification inherent to current architectures. Our results provide empirical grounding for future work, particularly the integration of explicit temporal modeling and verifiable, stepwise reasoning mechanisms, to address these deficiencies.
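To make the adversarial video perturbation testing concrete, here is a minimal probe sketch. It assumes a hypothetical Video-LLM interface `answer(frames, question) -> str` (not the paper's actual code); the probe shuffles frame order and removes a QA-relevant segment, then measures how often the answer stays unchanged. Per the findings above, current Video-LLMs would score suspiciously high on such insensitivity.

```python
"""Sketch of an adversarial video perturbation probe (hypothetical API)."""
import random
from typing import Callable, List, Sequence

# Hypothetical model interface: (frames, question) -> answer string.
AnswerFn = Callable[[Sequence, str], str]

def shuffle_frames(frames: List, seed: int = 0) -> List:
    """Destroy temporal order while keeping per-frame content."""
    out = list(frames)
    random.Random(seed).shuffle(out)
    return out

def mask_segment(frames: List, start: int, end: int) -> List:
    """Drop a contiguous (presumably QA-relevant) segment of the video."""
    return frames[:start] + frames[end:]

def perturbation_insensitivity(answer: AnswerFn, frames: List, question: str) -> dict:
    """Fraction of perturbations under which the answer stays unchanged.

    A high score is suspicious: a temporally grounded model *should*
    change its answer when the visual evidence is shuffled or removed.
    """
    baseline = answer(frames, question)
    perturbed = [
        shuffle_frames(frames),
        mask_segment(frames, len(frames) // 4, 3 * len(frames) // 4),
    ]
    unchanged = sum(answer(p, question) == baseline for p in perturbed)
    return {"baseline": baseline, "insensitivity": unchanged / len(perturbed)}

if __name__ == "__main__":
    # Dummy stand-in model that ignores the video entirely (worst case).
    blind_model: AnswerFn = lambda frames, q: "the person opens the door"
    frames = [f"frame_{i}" for i in range(16)]  # placeholder frames
    print(perturbation_insensitivity(blind_model, frames, "What happens first?"))
    # -> insensitivity == 1.0, flagging a video-blind model
```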

📝 Abstract
Video Large Language Models (Video-LLMs) are flourishing and have advanced many video-language tasks. As a golden testbed, Video Question Answering (VideoQA) plays a pivotal role in Video-LLM development. This work conducts a timely and comprehensive study of Video-LLMs' behavior in VideoQA, aiming to elucidate their success and failure modes and to provide insights towards more human-like video understanding and question answering. Our analyses demonstrate that Video-LLMs excel in VideoQA: they can correlate contextual cues and generate plausible responses to questions about varied video content. However, the models falter in handling video temporality, both in reasoning about temporal content ordering and in grounding QA-relevant temporal moments. Moreover, the models behave unintuitively: they are unresponsive to adversarial video perturbations while being sensitive to simple variations of candidate answers and questions, and they do not necessarily generalize better. The findings demonstrate Video-LLMs' QA capability under standard conditions yet highlight their severe deficiencies in robustness and interpretability, suggesting an urgent need for rationales in Video-LLM development.
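The abstract's claim that models are sensitive to simple variations of candidate answers can be probed with an option-reordering check. The sketch below assumes a hypothetical multiple-choice interface `choose(question, options) -> int` (an assumption, not the paper's API); a content-grounded model should select the same option text regardless of where it appears in the list.

```python
"""Sketch of a candidate-answer sensitivity check (hypothetical API)."""
import random
from typing import Callable, List, Sequence

# Hypothetical interface: (question, options) -> index of selected option.
ChooseFn = Callable[[str, Sequence[str]], int]

def option_order_sensitivity(
    choose: ChooseFn, question: str, options: List[str],
    n_trials: int = 10, seed: int = 0,
) -> float:
    """Fraction of random option reorderings on which the chosen *content* flips."""
    rng = random.Random(seed)
    baseline = options[choose(question, options)]
    flips = 0
    for _ in range(n_trials):
        perm = list(options)
        rng.shuffle(perm)
        flips += perm[choose(question, perm)] != baseline
    return flips / n_trials

if __name__ == "__main__":
    # Dummy model with a positional bias: it always picks the first option.
    first_option_model: ChooseFn = lambda q, opts: 0
    opts = ["opens the door", "closes the door", "sits down", "waves"]
    score = option_order_sensitivity(first_option_model, "What does the man do?", opts)
    print(score)  # well above 0.0: the "answer" tracks position, not content
```

A robust model scores near 0.0 here; the paradox reported above is that Video-LLMs can flip on such shallow reorderings while staying unmoved by the video perturbations in the previous sketch.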
Problem

Research questions and friction points this paper is trying to address.

Evaluating Video-LLMs' performance in VideoQA tasks
Identifying failure modes in temporal reasoning and robustness
Assessing interpretability and generalization gaps in Video-LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

A controlled, multi-benchmark probing protocol that tests Video-LLM behavior beyond accuracy
Adversarial video perturbations paired with candidate-answer and question variations to expose robustness gaps
Temporal localization and ordering evaluations that isolate weak temporal grounding (see the temporal-order probe sketch after this list)
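A minimal sketch of a temporal-order probe, assuming a hypothetical yes/no interface `answer_yesno(frames, question) -> bool`: reversing the video must flip the truth of any "does X happen before Y?" question, so a model whose answer does not flip is not reading temporal order.

```python
"""Sketch of a temporal-order probe via time reversal (hypothetical API)."""
from typing import Callable, List, Sequence

# Hypothetical interface: (frames, yes/no question) -> boolean answer.
YesNoFn = Callable[[Sequence, str], bool]

def order_probe(answer_yesno: YesNoFn, frames: List, question: str) -> bool:
    """Return True if the model's answer flips under time reversal.

    For any before/after question, a flip is the *expected* behavior;
    an unchanged answer indicates the model ignores temporal order.
    """
    forward = answer_yesno(frames, question)
    backward = answer_yesno(list(reversed(frames)), question)
    return forward != backward

if __name__ == "__main__":
    # Toy "model": frames are event labels; it answers from actual order.
    def toy(frames, q):  # q: "does A happen before B?"
        return list(frames).index("A") < list(frames).index("B")
    clip = ["A", "x", "B", "y"]
    print(order_probe(toy, clip, "does A happen before B?"))  # True: flips
```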