Optimizing Multimodal LLMs for Egocentric Video Understanding: A Solution for the HD-EPIC VQA Challenge

📅 2026-01-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses key challenges faced by multimodal large language models in complex first-person video question answering tasks such as HD-EPIC VQA, including ambiguous queries and answer choices, limited long-horizon temporal reasoning, and unstructured outputs. To overcome these limitations, the authors propose an end-to-end optimization framework that integrates query and option preprocessing, domain-adapted fine-tuning of Qwen2.5-VL, a novel Temporal Chain-of-Thought (T-CoT) prompting mechanism, and a structured post-processing strategy. This integrated approach substantially enhances the model’s capacity for multi-step reasoning over extended video sequences. Evaluated on the HD-EPIC VQA benchmark, the method achieves an accuracy of 41.6%, demonstrating the effectiveness of holistic pipeline optimization for demanding video understanding tasks.
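The summary does not spell out how a T-CoT prompt is assembled, so the following is only an illustrative sketch of what temporally ordered chain-of-thought prompting could look like. The function name `build_tcot_prompt`, the segment format, and the prompt wording are all assumptions made for this example and are not taken from the authors' code.

```python
from typing import List, Tuple


def build_tcot_prompt(
    question: str,
    choices: List[str],
    segments: List[Tuple[float, float]],
) -> str:
    """Assemble a hypothetical temporal chain-of-thought prompt.

    `segments` is a list of (start_sec, end_sec) windows; the model is asked
    to describe each window in chronological order before answering.
    """
    lines = ["You are answering a question about an egocentric video."]
    lines.append("Reason step by step in temporal order:")
    for i, (start, end) in enumerate(segments, 1):
        lines.append(
            f"Step {i}: describe what happens between {start:.1f}s and {end:.1f}s."
        )
    lines.append(f"Question: {question}")
    # Label the answer options (A), (B), ... so the reply can be parsed later.
    for idx, choice in enumerate(choices):
        lines.append(f"({chr(ord('A') + idx)}) {choice}")
    lines.append("Final answer: reply with a single option letter.")
    return "\n".join(lines)


prompt = build_tcot_prompt(
    "What was picked up first?",
    ["knife", "spoon"],
    [(0, 5), (5, 10)],
)
print(prompt)
```

Forcing the model to walk through time windows in order before committing to an option is one plausible way to encourage multi-step reasoning over long videos; the actual prompt used in the paper may differ.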

📝 Abstract

Multimodal Large Language Models (MLLMs) struggle with complex video QA benchmarks like HD-EPIC VQA due to ambiguous queries/options, poor long-range temporal reasoning, and non-standardized outputs. We propose a framework integrating query/choice pre-processing, domain-specific Qwen2.5-VL fine-tuning, a novel Temporal Chain-of-Thought (T-CoT) prompting for multi-step reasoning, and robust post-processing. This system achieves 41.6% accuracy on HD-EPIC VQA, highlighting the need for holistic pipeline optimization in demanding video understanding. Our code and fine-tuned models are available at https://github.com/YoungSeng/Egocentric-Co-Pilot.
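To make the "non-standardized outputs" problem concrete, here is a minimal sketch of a post-processing step that maps free-form model text to a multiple-choice index. It is an assumption-laden illustration, not the authors' implementation: the helper name `extract_choice`, the regular expressions, and the lettered-option convention are all hypothetical.

```python
import re
from typing import Optional

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def extract_choice(raw_output: str, num_choices: int) -> Optional[int]:
    """Return the 0-based index of the chosen option, or None if unparseable."""
    letters = ALPHABET[:num_choices]
    # 1) Prefer an explicit "answer: X" / "answer is (X)" statement.
    #    Only the keywords are case-insensitive; the option letter must be
    #    uppercase so that "answer is a knife" is not read as option A.
    explicit = re.findall(
        rf"(?i:answer)\s*(?:(?i:is)|:)?\s*\(?([{letters}])\)?(?![A-Za-z])",
        raw_output,
    )
    if explicit:
        # Use the last statement, which follows any intermediate reasoning.
        return letters.index(explicit[-1])
    # 2) Otherwise accept the output only if it is just a bare option letter.
    bare = re.fullmatch(rf"\(?([{letters}])\)?[.\s]*", raw_output.strip())
    if bare:
        return letters.index(bare.group(1))
    return None


print(extract_choice("Step 1: ... Step 2: ... Final answer: (C)", 5))
print(extract_choice("B", 5))
print(extract_choice("I am not sure.", 5))
```

Returning `None` for unparseable text, rather than guessing, lets a caller decide on a fallback such as re-prompting the model or choosing a default option.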

Problem

Research questions and friction points this paper is trying to address.

Multimodal LLMs

Egocentric Video Understanding

Video QA

Temporal Reasoning

HD-EPIC VQA

Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Chain-of-Thought

Multimodal LLMs

Egocentric Video Understanding