EgoCoT-Bench: Benchmarking Grounded and Verifiable Operation-Centric Chain of Thought Reasoning for MLLMs

📅 2026-05-19
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Existing first-person video benchmarks struggle to evaluate models' fine-grained, action-centric reasoning capabilities and lack mechanisms to verify whether such reasoning is grounded in explicit spatiotemporal evidence. To address this gap, this work proposes the first multimodal large language model benchmark that supports verifiable, fine-grained chain-of-action reasoning. Leveraging a spatiotemporal scene graph (STSG)-guided data generation framework augmented with expert annotations, the authors construct a dataset of 3,172 question-answer pairs spanning perception, retrospection, prediction, and high-level reasoning. Experiments reveal that while current models often produce correct answers, their explanations frequently misalign with actual evidence, underscoring the benchmark's unique value in assessing reasoning groundedness and providing a reliable platform for future research.
📝 Abstract
The rapid development of Multimodal Large Language Models (MLLMs) has led to growing interest in egocentric video understanding, specifically the ability of MLLMs to recognize fine-grained hand-object interactions, track object state changes over time, and reason about manipulative processes in dynamic environments from a first-person perspective. However, existing egocentric video benchmarks suffer from limited grounded rationale evaluation: they offer little support for fine-grained operation-centric reasoning and rarely examine whether model rationales are grounded in explicit spatio-temporal evidence. To address this gap, we introduce EgoCoT-Bench, a fine-grained egocentric benchmark for grounded and verifiable operation-centric reasoning with explicit step-by-step rationale annotations. EgoCoT-Bench comprises 3,172 verifiable QA pairs over 351 egocentric videos, organized into four task groups (perception, retrospection, anticipation, and high-level reasoning) with 12 sub-tasks in total. The benchmark is constructed through a spatio-temporal scene graph (STSG)-guided generation framework and is further refined by human annotators to ensure correctness, egocentric relevance, and fine-grained quality. Experimental results show that egocentric fine-grained reasoning remains difficult and further reveal that many multimodal models reach the correct answer while citing evidence that is inconsistent with it. We hope EgoCoT-Bench can serve as a useful testbed for grounded and verifiable reasoning in egocentric video understanding. Project page and supplementary materials are available at: https://dstardust.github.io/EgoCoT/.
Problem

Research questions and friction points this paper is trying to address.

egocentric video understanding
grounded reasoning
operation-centric reasoning
verifiable rationale
multimodal large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

EgoCoT-Bench
operation-centric reasoning
grounded chain-of-thought
spatio-temporal scene graphs
egocentric video understanding