Advancing Egocentric Video Question Answering with Multimodal Large Language Models

📅 2025-04-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets core challenges in egocentric video question answering (Egocentric Video QA), including long-horizon temporal reasoning, strong viewpoint specificity, and frequent camera motion. It introduces QaEgo4Dv2, a refined version of QaEgo4D with reduced annotation noise, and systematically evaluates four multimodal large language models (MLLMs), GPT-4o, Gemini-1.5-Pro, Video-LLaVA-7B, and Qwen2-VL-7B-Instruct, under zero-shot and fine-tuned settings for both open-ended (OpenQA) and closed-ended (CloseQA) question answering. The error analysis identifies spatial reasoning and fine-grained object recognition as the main bottlenecks. Fine-tuned Video-LLaVA-7B and Qwen2-VL-7B-Instruct set a new state of the art, improving on previous results by up to +2.6% ROUGE/METEOR on OpenQA and up to +13% accuracy on CloseQA.

📝 Abstract

Egocentric Video Question Answering (QA) requires models to handle long-horizon temporal reasoning, first-person perspectives, and specialized challenges like frequent camera movement. This paper systematically evaluates both proprietary and open-source Multimodal Large Language Models (MLLMs) on QaEgo4Dv2, a refined dataset of egocentric videos derived from QaEgo4D. Four popular MLLMs (GPT-4o, Gemini-1.5-Pro, Video-LLaVA-7B, and Qwen2-VL-7B-Instruct) are assessed using zero-shot and fine-tuned approaches for both OpenQA and CloseQA settings. We introduce QaEgo4Dv2 to mitigate annotation noise in QaEgo4D, enabling more reliable comparison. Our results show that fine-tuned Video-LLaVA-7B and Qwen2-VL-7B-Instruct achieve new state-of-the-art performance, surpassing previous benchmarks by up to +2.6% ROUGE/METEOR (for OpenQA) and +13% accuracy (for CloseQA). We also present a thorough error analysis, indicating the models' difficulty in spatial reasoning and fine-grained object recognition, key areas for future improvement.
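To make the two evaluation settings concrete, the following is a minimal sketch of how OpenQA and CloseQA predictions might be scored. It is not the paper's evaluation code. It assumes that CloseQA answers are represented as option indices and that OpenQA answers are compared with a simplified, whitespace-tokenized ROUGE-L F1; the function names `lcs_length`, `rouge_l_f1`, and `closeqa_accuracy` are hypothetical and chosen only for illustration.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-L F1 on lowercased, whitespace-split tokens.

    Assumption: standard toolkits apply their own tokenization and
    stemming, so scores from this sketch will not match them exactly.
    """
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def closeqa_accuracy(predicted: List[int], gold: List[int]) -> float:
    """Fraction of questions whose predicted option index matches the gold one.

    Assumption: each CloseQA answer is encoded as an integer option index.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```

For example, `rouge_l_f1("on the kitchen table", "the kitchen table")` rewards the shared three-token subsequence while penalizing the extra word, and `closeqa_accuracy([0, 2, 1, 3], [0, 2, 3, 3])` counts three of four correct choices.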

Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs on egocentric video QA tasks

Improving dataset quality for reliable model comparison

Addressing spatial reasoning and object recognition challenges

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates proprietary and open-source Multimodal Large Language Models (MLLMs) in zero-shot and fine-tuned settings

Introduces refined dataset QaEgo4Dv2

Fine-tunes Video-LLaVA-7B and Qwen2-VL-7B-Instruct

🔎 Similar Papers

MM-Ego: Towards Building Egocentric Multimodal LLMs