Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos

📅 2025-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language models (MLLMs) frequently rely on static frames or spurious commonsense shortcuts when reasoning about egocentric videos, failing to capture fine-grained temporal dynamics. Method: We introduce EgoTempo, the first egocentric video question-answering benchmark explicitly designed to enforce strong temporal dependency: its questions require integrating information across the full video sequence, precluding non-temporal shortcuts. The evaluation framework requires complete video input, uses temporally sensitive question templates, and includes systematic ablation analyses. Results: Experiments reveal a substantial performance drop across mainstream MLLMs on EgoTempo (average accuracy below 40%), exposing critical deficiencies in temporal reasoning. We publicly release the dataset, annotations, and code to establish a standardized benchmark and guide future advances in video temporal understanding.

📝 Abstract
Understanding fine-grained temporal dynamics is crucial in egocentric videos, where continuous streams capture frequent, close-up interactions with objects. In this work, we bring to light that current egocentric video question-answering datasets often include questions that can be answered using only a few frames or commonsense reasoning, without being necessarily grounded in the actual video. Our analysis shows that state-of-the-art Multi-Modal Large Language Models (MLLMs) on these benchmarks achieve remarkably high performance using just text or a single frame as input. To address these limitations, we introduce EgoTempo, a dataset specifically designed to evaluate temporal understanding in the egocentric domain. EgoTempo emphasizes tasks that require integrating information across the entire video, ensuring that models must rely on temporal patterns rather than static cues or pre-existing knowledge. Extensive experiments on EgoTempo show that current MLLMs still fall short in temporal reasoning on egocentric videos, and thus we hope EgoTempo will catalyze new research in the field and inspire models that better capture the complexity of temporal dynamics. Dataset and code are available at https://github.com/google-research-datasets/egotempo.git.
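The diagnostic described above, showing that models score highly from text alone or a single frame, amounts to an input-ablation protocol. The sketch below illustrates that idea in a minimal form; `answer_question`, the example schema, and the condition names are hypothetical stand-ins, not the paper's actual harness.

```python
# Minimal sketch of an input-ablation evaluation for video QA, assuming a
# hypothetical `answer_question(question, frames)` model call and examples of
# the form {"question": str, "frames": [frame, ...], "answer": str}.
from typing import Callable, Dict, Sequence


def evaluate_ablations(
    examples: Sequence[dict],
    answer_question: Callable[[str, Sequence[str]], str],
) -> Dict[str, float]:
    """Return accuracy under three input conditions.

    text_only:    no visual input (tests commonsense shortcuts)
    single_frame: only the first frame (tests static cues)
    full_video:   all frames (tests genuine temporal reasoning)
    """
    conditions = {
        "text_only": lambda frames: [],
        "single_frame": lambda frames: frames[:1],
        "full_video": lambda frames: list(frames),
    }
    accuracy = {}
    for name, select in conditions.items():
        correct = sum(
            answer_question(ex["question"], select(ex["frames"])) == ex["answer"]
            for ex in examples
        )
        accuracy[name] = correct / len(examples)
    return accuracy
```

A small gap between the text-only and full-video scores is the symptom the paper flags in existing benchmarks; a temporally grounded benchmark should make the ablated conditions collapse toward chance.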
Problem

Research questions and friction points this paper is trying to address.

Current datasets lack temporal depth in egocentric video analysis.
MLLMs perform well with minimal input, ignoring video context.
No existing benchmark forces models to integrate information across the full video.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces EgoTempo for temporal understanding evaluation
Emphasizes integrating information across entire video
Highlights MLLMs' shortcomings in temporal reasoning