🤖 AI Summary
Existing multimodal large language models are limited in long-form first-person video understanding due to constrained context lengths and insufficient fine-grained visual grounding, resulting in subpar performance on the HD-EPIC video question answering (VQA) task. This work proposes a dual-evidence reasoning framework that decouples long-video reasoning into structured semantic evidence and fine-grained visual evidence. The former captures global procedural structure through a coarse-to-fine modeling strategy, while the latter preserves object-centric details by integrating bounding boxes with object-level visual embeddings. During inference, the model dynamically retrieves and fuses both evidence types based on the query. This approach is the first to explicitly and jointly leverage complementary evidence sources, enabling efficient and interpretable long-video VQA, and achieves competitive performance across multiple subtasks of HD-EPIC-VQA.
📝 Abstract
Understanding long-form egocentric videos remains challenging for multimodal large language models (MLLMs) due to limited context length and insufficient grounding of fine-grained visual details. The recently proposed HD-EPIC benchmark highlights these limitations: even strong long-context models achieve relatively low performance across diverse video question answering tasks. In this paper, we propose a unified framework that decouples long-video reasoning into two complementary forms of evidence: semantic evidence and visual evidence. Semantic evidence captures global procedural structure through a coarse-to-fine extraction pipeline, while object-centric visual evidence preserves fine-grained grounding through bounding boxes and visual embeddings. During inference, we formulate reasoning as a query-conditioned evidence retrieval and integration process, dynamically selecting relevant information from both sources. Our approach achieves competitive performance in the HD-EPIC-VQA Challenge across multiple task categories. More broadly, our results demonstrate that explicitly structuring, retrieving, and integrating semantic and visual evidence is critical for effective long-video understanding with MLLMs.
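The query-conditioned retrieval and integration step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class names, the cosine-similarity scoring, the top-k selection, the toy three-dimensional embeddings, and the text format of the fused context are all assumptions made for illustration. The abstract does not specify how evidence is scored or fused.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SemanticEvidence:
    """Hypothetical record for one procedural step extracted from the video."""
    start: float            # segment start time in seconds
    end: float              # segment end time in seconds
    text: str               # short description of the step
    embedding: List[float]  # toy embedding (assumed, not from the paper)


@dataclass
class VisualEvidence:
    """Hypothetical record for one detected object."""
    time: float                              # timestamp in seconds
    label: str                               # object name
    box: Tuple[float, float, float, float]   # normalized bounding box
    embedding: List[float]                   # toy object-level embedding


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(items, query_emb: List[float], k: int):
    """Keep the k items whose embeddings are most similar to the query."""
    ranked = sorted(items, key=lambda e: cosine(query_emb, e.embedding), reverse=True)
    return ranked[:k]


def build_context(query_emb, semantic, visual, k_sem: int = 2, k_vis: int = 2) -> str:
    """Retrieve from both evidence stores and fuse them into one text context."""
    sem = sorted(top_k(semantic, query_emb, k_sem), key=lambda e: e.start)
    vis = sorted(top_k(visual, query_emb, k_vis), key=lambda e: e.time)
    lines = [f"[{e.start:.0f}-{e.end:.0f}s] {e.text}" for e in sem]
    lines += [f"[{e.time:.0f}s] {e.label} at box {e.box}" for e in vis]
    return "\n".join(lines)


# Toy data: made-up segments, objects, and embeddings.
semantic = [
    SemanticEvidence(0.0, 60.0, "wash vegetables", [1.0, 0.0, 0.0]),
    SemanticEvidence(60.0, 180.0, "chop onions", [0.0, 1.0, 0.0]),
    SemanticEvidence(180.0, 300.0, "fry onions in pan", [0.0, 0.8, 0.6]),
]
visual = [
    VisualEvidence(75.0, "knife", (0.1, 0.2, 0.3, 0.4), [0.0, 0.9, 0.1]),
    VisualEvidence(200.0, "pan", (0.5, 0.5, 0.9, 0.9), [0.0, 0.2, 0.9]),
    VisualEvidence(10.0, "sink", (0.0, 0.0, 0.5, 0.5), [1.0, 0.0, 0.1]),
]
query = [0.0, 1.0, 0.2]  # stands in for an embedded question about chopping

context = build_context(query, semantic, visual)
print(context)
```

In this sketch the fused context would be passed to an MLLM together with the question. Keeping the two stores separate means the procedural summary and the object-level grounding can each be retrieved with its own budget (`k_sem`, `k_vis`), rather than competing for the same context window.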