EASG-Bench: Video Q&A Benchmark with Egocentric Action Scene Graphs

📅 2025-06-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited temporal reasoning of large language models (LLMs) and vision-language models (VLMs) on egocentric videos. To this end, it introduces the first dynamic video question-answering benchmark grounded in Egocentric Action Scene Graphs (EASGs): spatio-temporally aligned, fine-grained dynamic scene graphs from which Q&A pairs are generated that capture complex actor–action–object spatiotemporal relations. The authors further establish a systematic multimodal evaluation protocol. Key contributions: (1) the first integration of structured action scene graphs into video QA evaluation; (2) empirical identification of a >32% performance drop for LLMs/VLMs on temporal-ordering questions, exposing a critical gap in long-horizon temporal understanding; and (3) release of the complete dataset, annotations, and evaluation code to support reproducible video–language reasoning research.
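To make the QA-generation idea concrete, here is a minimal sketch of how a temporal-ordering question could be derived from two temporally grounded action annotations. The `ActionNode` structure and `temporal_order_question` helper are hypothetical simplifications for illustration; the actual EASG annotation schema and generation pipeline (see the linked repository) are richer.

```python
from dataclasses import dataclass

# Hypothetical, simplified stand-in for an EASG action node;
# the real annotations carry full graph structure, not just a tuple.
@dataclass
class ActionNode:
    verb: str          # e.g. "cut"
    obj: str           # e.g. "onion"
    start_s: float     # segment start time in seconds
    end_s: float       # segment end time in seconds

def temporal_order_question(a: ActionNode, b: ActionNode) -> dict:
    """Turn two temporally grounded action nodes into a before/after QA pair."""
    first, second = (a, b) if a.start_s < b.start_s else (b, a)
    return {
        "question": (
            f"Does the camera wearer {first.verb} the {first.obj} "
            f"before or after they {second.verb} the {second.obj}?"
        ),
        "answer": "before",
    }

qa = temporal_order_question(
    ActionNode("cut", "onion", 3.0, 7.5),
    ActionNode("wash", "pan", 12.0, 18.0),
)
# qa["answer"] == "before"
```

Because the answer is fixed by the timestamps rather than by surface wording, questions of this kind directly probe the long-horizon temporal ordering ability that the benchmark finds lacking in current models.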

📝 Abstract
We introduce EASG-Bench, a question-answering benchmark for egocentric videos where the question-answering pairs are created from spatio-temporally grounded dynamic scene graphs capturing intricate relationships among actors, actions, and objects. We propose a systematic evaluation framework and evaluate several language-only and video large language models (video-LLMs) on this benchmark. We observe a performance gap in language-only and video-LLMs, especially on questions focusing on temporal ordering, thus identifying a research gap in the area of long-context video understanding. To promote the reproducibility of our findings and facilitate further research, the benchmark and accompanying code are available at the following GitHub page: https://github.com/fpv-iplab/EASG-bench.
Problem

Research questions and friction points this paper is trying to address.

Evaluates video QA models on dynamic scene graphs
Identifies performance gap in temporal understanding
Provides benchmark for egocentric video analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic scene graphs for video Q&A
Evaluation framework for video-LLMs
Open-source benchmark for reproducibility
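A systematic evaluation framework of this kind typically reports accuracy broken down by question category, which is how the temporal-ordering gap surfaces. The sketch below uses simple exact-match scoring as an assumption for illustration; the paper's actual protocol may use a different matching scheme (e.g. an LLM judge).

```python
from collections import defaultdict

def per_category_accuracy(results):
    """Aggregate exact-match accuracy per question category.

    `results` is a list of (category, predicted, gold) tuples.
    Exact match is an illustrative simplification of the
    benchmark's actual scoring protocol.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for cat, pred, gold in results:
        totals[cat] += 1
        hits[cat] += int(pred.strip().lower() == gold.strip().lower())
    return {cat: hits[cat] / totals[cat] for cat in totals}

acc = per_category_accuracy([
    ("temporal", "before", "before"),
    ("temporal", "after", "before"),
    ("object", "onion", "onion"),
])
# acc["object"] == 1.0, acc["temporal"] == 0.5
```

Comparing these per-category scores between language-only and video-LLMs is what reveals the disproportionate drop on temporal-ordering questions reported above.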