🤖 AI Summary
This work addresses core challenges in egocentric video question answering (Egocentric Video QA), including long-horizon temporal reasoning, strong viewpoint specificity, and frequent camera motion, by constructing the low-noise QaEgo4Dv2 benchmark. It presents a systematic evaluation of leading multimodal large language models (MLLMs), namely GPT-4o, Gemini-1.5-Pro, Video-LLaVA-7B, and Qwen2-VL-7B-Instruct, under both zero-shot and fine-tuned settings across open-ended (OpenQA) and closed-ended (CloseQA) paradigms. The error analysis identifies spatial reasoning and fine-grained object recognition as critical bottlenecks. Fine-tuned Video-LLaVA-7B and Qwen2-VL-7B-Instruct achieve new state-of-the-art performance, improving on previous results by up to +2.6% in ROUGE/METEOR for OpenQA and +13% in accuracy for CloseQA.
📝 Abstract
Egocentric Video Question Answering (QA) requires models to handle long-horizon temporal reasoning, first-person perspectives, and specialized challenges like frequent camera movement. This paper systematically evaluates both proprietary and open-source Multimodal Large Language Models (MLLMs) on QaEgo4Dv2, a refined dataset of egocentric videos derived from QaEgo4D. Four popular MLLMs (GPT-4o, Gemini-1.5-Pro, Video-LLaVA-7B, and Qwen2-VL-7B-Instruct) are assessed using zero-shot and fine-tuned approaches in both OpenQA and CloseQA settings. We introduce QaEgo4Dv2 to mitigate annotation noise in QaEgo4D, enabling more reliable comparison. Our results show that fine-tuned Video-LLaVA-7B and Qwen2-VL-7B-Instruct achieve new state-of-the-art performance, surpassing previous benchmarks by up to +2.6% ROUGE/METEOR (for OpenQA) and +13% accuracy (for CloseQA). We also present a thorough error analysis, indicating the models' difficulty with spatial reasoning and fine-grained object recognition, which are key areas for future improvement.
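To make the two reported metric families concrete, the following is a minimal sketch of how CloseQA accuracy and a ROUGE-L style overlap score for OpenQA could be computed. It is an illustration under stated assumptions, not the paper's evaluation code: the abstract does not say which ROUGE variant or tokenization is used, so whitespace tokenization and ROUGE-L F1 are assumed here, and the function names `closeqa_accuracy` and `rouge_l_f1` are hypothetical.

```python
from typing import List, Sequence


def closeqa_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of closed-ended questions where the chosen option matches the gold option."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    # Assumption: options are compared case-insensitively after trimming whitespace.
    correct = sum(p.strip().lower() == g.strip().lower() for p, g in zip(predicted, gold))
    return correct / len(gold)


def _lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 between a generated answer and a reference answer.

    Assumption: lowercase whitespace tokenization; the paper's exact
    preprocessing is not given in the abstract.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = _lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, `closeqa_accuracy(["A", "b", "C", "D"], ["A", "B", "D", "D"])` scores three of four questions as correct, and `rouge_l_f1("on the kitchen table", "the kitchen table")` rewards the shared three-token subsequence while penalizing the extra word. METEOR additionally accounts for stemming and synonym matches, so it is usually taken from an existing library rather than reimplemented.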