Semantic and Visual Evidence for Efficient Long-Video Reasoning: A Solution for the HD-EPIC VQA Challenge

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing multimodal large language models struggle with long-form first-person video understanding because of constrained context lengths and insufficient fine-grained visual grounding, which leads to low performance on the HD-EPIC video question answering (VQA) task. This work proposes a dual-evidence reasoning framework that decouples long-video reasoning into structured semantic evidence and fine-grained visual evidence. The former captures global procedural structure through a coarse-to-fine modeling strategy, while the latter preserves object-centric details by integrating bounding boxes with object-level visual embeddings. During inference, the model dynamically retrieves and fuses both evidence types based on the query. By explicitly and jointly leveraging these two complementary evidence sources, the approach targets efficient and interpretable long-video VQA and achieves competitive performance across multiple subtasks of HD-EPIC-VQA.

📝 Abstract

Understanding long-form egocentric videos remains challenging for multimodal large language models (MLLMs) due to limited context length and insufficient grounding of fine-grained visual details. The recently proposed HD-EPIC benchmark highlights these limitations: even strong long-context models achieve relatively low performance across diverse video question answering tasks. In this paper, we propose a unified framework that decouples long-video reasoning into two complementary forms of evidence: semantic evidence and visual evidence. Semantic evidence captures global procedural structure through a coarse-to-fine extraction pipeline, while object-centric visual evidence preserves fine-grained grounding through bounding boxes and visual embeddings. During inference, we formulate reasoning as a query-conditioned evidence retrieval and integration process, dynamically selecting relevant information from both sources. Our approach achieves competitive performance in the HD-EPIC-VQA Challenge across multiple task categories. More broadly, our results demonstrate that explicitly structuring, retrieving, and integrating semantic and visual evidence is critical for effective long-video understanding with MLLMs.
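
To make the idea of query-conditioned evidence retrieval and integration more concrete, the sketch below shows one way such a step could look. It is not the authors' implementation: the class names (`SemanticEvidence`, `VisualEvidence`), the functions (`retrieve`, `build_prompt`), the stopword list, and the bag-of-words similarity are all hypothetical stand-ins. In particular, the abstract mentions object-level visual embeddings, whereas this toy version matches object labels by text overlap so that it can run with the standard library alone.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass

# Hypothetical stopword list, only to keep the toy similarity meaningful.
STOPWORDS = {"the", "is", "in", "on", "a", "of", "when", "where", "what"}


@dataclass
class SemanticEvidence:
    """A text description of one video segment (assumed structure)."""
    start_s: float
    end_s: float
    description: str


@dataclass
class VisualEvidence:
    """An object observation with a bounding box (assumed structure)."""
    time_s: float
    label: str
    box: tuple  # (x1, y1, x2, y2) in pixels


def tokens(text):
    """Lowercased bag of words without stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, semantic, visual, k=2):
    """Select the top-k items from each evidence store for this query."""
    q = tokens(query)
    sem = sorted(semantic, key=lambda e: cosine(q, tokens(e.description)),
                 reverse=True)[:k]
    # Keep only objects that share at least one token with the query.
    scored = [(cosine(q, tokens(v.label)), v) for v in visual]
    matched = [pair for pair in scored if pair[0] > 0]
    matched.sort(key=lambda pair: pair[0], reverse=True)
    return sem, [v for _, v in matched[:k]]


def build_prompt(query, sem, vis):
    """Integrate both evidence types into one text prompt for an MLLM."""
    lines = ["Semantic evidence:"]
    lines += [f"- [{e.start_s:.0f}s-{e.end_s:.0f}s] {e.description}" for e in sem]
    lines.append("Visual evidence:")
    lines += [f"- {v.label} at {v.time_s:.0f}s, box={v.box}" for v in vis]
    lines.append(f"Question: {query}")
    return "\n".join(lines)
```

Under these assumptions, calling `retrieve` with a question and then passing the result to `build_prompt` yields a compact context that holds only the segments and objects relevant to that question, rather than the full video. A real system would likely replace the token-overlap scoring with learned text and visual embeddings.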

Problem

Research questions and friction points this paper is trying to address.

long-video reasoning

multimodal large language models

visual grounding

video question answering

egocentric videos

Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic evidence

visual evidence

long-video reasoning