Optimizing Multimodal LLMs for Egocentric Video Understanding: A Solution for the HD-EPIC VQA Challenge

📅 2026-01-15
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses key challenges faced by multimodal large language models in complex first-person video question answering tasks such as HD-EPIC VQA, including ambiguous queries and answer choices, limited long-horizon temporal reasoning, and unstructured outputs. To overcome these limitations, the authors propose an end-to-end optimization framework that integrates query and option preprocessing, domain-adapted fine-tuning of Qwen2.5-VL, a novel Temporal Chain-of-Thought (T-CoT) prompting mechanism, and a structured post-processing strategy. This integrated approach substantially enhances the model's capacity for multi-step reasoning over extended video sequences. Evaluated on the HD-EPIC VQA benchmark, the method achieves an accuracy of 41.6%, demonstrating the effectiveness of holistic pipeline optimization for demanding video understanding tasks.
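
The summary does not reproduce the authors' actual T-CoT prompt, so the following is only an illustrative sketch of what a temporally ordered chain-of-thought prompt for multiple-choice video QA might look like. The function name `build_tcot_prompt`, the step wording, and the `Answer: <letter>` convention are all assumptions made for this example, not details taken from the paper.

```python
from typing import Sequence

LETTERS = "ABCDEFGHIJ"


def build_tcot_prompt(question: str, choices: Sequence[str]) -> str:
    """Build a hypothetical temporally ordered chain-of-thought prompt.

    The step wording is an assumption for illustration only; it is not
    the prompt used by the authors.
    """
    if not choices or len(choices) > len(LETTERS):
        raise ValueError("expected between 1 and 10 answer choices")

    # Label each option with a letter so the final answer is easy to parse.
    option_lines = [f"{LETTERS[i]}. {c.strip()}" for i, c in enumerate(choices)]

    steps = [
        "1. List the key events in the video in chronological order.",
        "2. State which of these events are relevant to the question.",
        "3. Rule out options that contradict the observed order of events.",
        "4. Finish with a single line of the form 'Answer: <letter>'.",
    ]

    return "\n".join(
        [f"Question: {question.strip()}", "Options:"]
        + option_lines
        + ["Reason step by step:"]
        + steps
    )
```

Ending the prompt with a fixed answer format is a design choice that keeps the model's reasoning free-form while making the final decision straightforward to extract afterwards.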

📝 Abstract
Multimodal Large Language Models (MLLMs) struggle with complex video QA benchmarks like HD-EPIC VQA due to ambiguous queries and options, poor long-range temporal reasoning, and non-standardized outputs. We propose a framework integrating query and choice pre-processing, domain-specific fine-tuning of Qwen2.5-VL, a novel Temporal Chain-of-Thought (T-CoT) prompting scheme for multi-step reasoning, and robust post-processing. This system achieves 41.6% accuracy on HD-EPIC VQA, highlighting the need for holistic pipeline optimization in demanding video understanding. Our code and fine-tuned models are available at https://github.com/YoungSeng/Egocentric-Co-Pilot.
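
The abstract names non-standardized outputs as a failure mode but does not describe the post-processing rules. As a minimal sketch of the general idea, assuming lettered multiple-choice options, the hypothetical helper `parse_choice` below maps free-form model text to a zero-based option index. The regular expressions and the fallback order are assumptions for this example, not the authors' implementation.

```python
import re
from typing import Optional

LETTERS = "ABCDE"


def parse_choice(raw: str, num_choices: int = 5) -> Optional[int]:
    """Return the zero-based index of the chosen option, or None.

    Hypothetical normalizer: first look for an explicit 'Answer: X'
    style statement, then fall back to the last standalone option letter.
    """
    valid = LETTERS[:num_choices]
    text = raw.strip()

    # 1) Explicit statement such as "Answer: C" or "the answer is (B)".
    #    Only the keywords are case-insensitive; the letter must be uppercase
    #    so that the article "a" is not mistaken for option A.
    explicit = re.search(
        rf"(?i:answer)\s*(?:(?i:is)|:)?\s*\(?([{valid}])\)?(?![A-Za-z])", text
    )
    if explicit:
        return valid.index(explicit.group(1))

    # 2) Fallback: the last uppercase option letter that stands alone.
    hits = re.findall(rf"(?<![A-Za-z])([{valid}])(?![A-Za-z])", text)
    if hits:
        return valid.index(hits[-1])

    # Nothing recognizable: let the caller decide how to handle it.
    return None
```

Returning `None` instead of guessing leaves the policy for unparseable outputs (retry, default choice, or discard) to the caller.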
Problem

Research questions and friction points this paper is trying to address.

Multimodal LLMs
Egocentric Video Understanding
Video QA
Temporal Reasoning
HD-EPIC VQA
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Chain-of-Thought
Multimodal LLMs
Egocentric Video Understanding
Domain-specific Fine-tuning
Video QA