Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a gap in how egocentric video understanding is evaluated: existing benchmarks score only the final output, typically a text answer, and do not assess intermediate reasoning steps, so they miss much of a model's visual reasoning ability. The authors introduce Minerva-Ego, a benchmark of challenging multi-step, multimodal questions built on high-quality egocentric video sources, with spatiotemporally dense, human-annotated reasoning traces. Each trace is annotated with spatiotemporal masks of the objects needed to answer the question, which makes it possible to analyze when and where a model needs to look. Benchmarking shows that state-of-the-art models remain well below human performance, and that prompting frontier models with hints about 'where' and 'when' to look substantially improves their performance.

📝 Abstract

Video reasoning models are a core component of egocentric and embodied agents. However, standard benchmarks for assessing models provide only evaluation of the output (e.g. the answer to a question), without evaluation of intermediate reasoning steps, and most provide answers only in the text domain. We introduce Minerva-Ego, a benchmark for evaluating complex egocentric visual reasoning. We extend recent high-quality video data sources recorded from egocentric / embodied settings with a set of challenging, multi-step multimodal questions and spatiotemporally-dense human-annotated reasoning traces. Benchmarking experiments show that state-of-the-art models still have a large gap to human performance. To investigate this gap in detail, we annotate each reasoning trace in the dataset with the objects of interest required to solve the question, as spatiotemporal mask annotations. Through extensive evaluations, we identify that prompting frontier models with hints of 'where' and 'when' to look yields substantial improvements in performance. Minerva-Ego can be downloaded at https://github.com/google-deepmind/neptune.
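
To make the idea of 'where' and 'when' hints concrete, the following is a minimal sketch of how such hints might be turned into a text prompt. It is not the authors' method or the benchmark's data format: the `Hint` fields, the bounding-box representation (a mask reduced to a normalized box), and the prompt wording are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Hint:
    """One hypothetical hint: an object of interest, when it appears, and where."""
    label: str
    start_s: float  # start of the relevant time span, in seconds
    end_s: float    # end of the relevant time span, in seconds
    box: Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS (fractions of a second are truncated)."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def build_prompt(
    question: str,
    hints: List[Hint],
    use_when: bool = True,
    use_where: bool = True,
) -> str:
    """Assemble a question prompt, optionally adding 'when' and 'where' hints."""
    lines = [f"Question: {question}"]
    if hints and (use_when or use_where):
        lines.append("Hints:")
        # List hints in temporal order so the prompt follows the video timeline.
        for h in sorted(hints, key=lambda h: h.start_s):
            parts = [h.label]
            if use_when:
                parts.append(
                    f"look between {format_timestamp(h.start_s)} "
                    f"and {format_timestamp(h.end_s)}"
                )
            if use_where:
                x0, y0, x1, y1 = h.box
                parts.append(f"in region ({x0:.2f}, {y0:.2f}, {x1:.2f}, {y1:.2f})")
            lines.append("- " + ", ".join(parts))
    lines.append("Answer:")
    return "\n".join(lines)
```

Toggling `use_when` and `use_where` independently gives a simple way to compare no hints, temporal hints only, spatial hints only, and both, which is one plausible way to set up the kind of ablation the abstract describes.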

Problem

Research questions and friction points this paper is trying to address.

egocentric video understanding

video reasoning

spatiotemporal reasoning

multimodal question answering

reasoning evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

egocentric video understanding

spatiotemporal reasoning

multimodal question answering