EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports

📅 2026-04-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video-language models show limited perception and reasoning capabilities in fast-paced, information-dense virtual esports scenarios, and no specialized benchmark exists to evaluate them there. To bridge this gap, this work introduces the first egocentric video question-answering benchmark tailored to first-person shooter (FPS) esports competitions. Using a six-stage pipeline, it derives 1,745 high-quality question-answer pairs from professional match footage and organizes them along two decoupled dimensions: cognitive capability and domain-specific esports knowledge. Systematic evaluation on this benchmark shows that even the best state-of-the-art model reaches only 71.58% accuracy, with pronounced deficiencies in tactical reasoning and in understanding fine-grained in-game actions. These findings position the benchmark as a guide for model improvements and for downstream applications in competitive gaming.

📝 Abstract

While video large language models (Video-LLMs) excel at understanding slow-paced, real-world egocentric videos, their capabilities in high-velocity, information-dense virtual environments remain under-explored. Existing benchmarks focus on daily activities and lack a rigorous testbed for evaluating fast, rule-bound reasoning in virtual scenarios. To fill this gap, we introduce EgoEsportsQA, a video question-answering (QA) benchmark for grounding perception and reasoning in expert esports knowledge. We curate 1,745 high-quality QA pairs from professional matches across 3 first-person shooter games via a scalable six-stage pipeline. These questions are structured into a two-dimensional decoupled taxonomy: 11 sub-tasks in the cognitive capability dimension (covering perception and reasoning levels) and 6 sub-tasks in the esports knowledge dimension. Comprehensive evaluations of state-of-the-art Video-LLMs reveal that current models still fail to achieve satisfactory performance, with the best model reaching only 71.58% accuracy. The results expose notable gaps across both axes: models are stronger at basic visual perception than at deep tactical reasoning, and they grasp overall macro-progression better than fine-grained micro-operations. Extensive ablation experiments demonstrate the intrinsic weaknesses of current Video-LLM architectures. Further analysis suggests that our dataset not only reveals the connections between real-world and virtual egocentric domains, but also offers guidance for optimizing downstream esports applications, thereby fostering the future advancement of Video-LLMs in various egocentric environments.
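The two-dimensional taxonomy means each question carries one label per axis, so accuracy can be reported separately along either axis. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code. It assumes answers are scored by exact match on a chosen option, which the abstract does not specify, and every field name and label value (`QARecord`, `cognitive_task`, `perception/example_a`, and so on) is hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class QARecord:
    """One evaluated question. All field names and label values are hypothetical."""
    cognitive_task: str   # label on the cognitive capability axis
    knowledge_task: str   # label on the esports knowledge axis
    predicted: str        # option chosen by the model (assumed exact-match scoring)
    answer: str           # ground-truth option


def overall_accuracy(records):
    """Fraction of records whose prediction equals the ground truth."""
    return sum(r.predicted == r.answer for r in records) / len(records)


def accuracy_by(records, axis):
    """Return {label: accuracy} grouped by the given axis attribute name."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        label = getattr(r, axis)
        total[label] += 1
        correct[label] += int(r.predicted == r.answer)
    return {label: correct[label] / total[label] for label in total}


# Toy records with placeholder labels; these are not data from the benchmark.
records = [
    QARecord("perception/example_a", "knowledge/example_x", "A", "A"),
    QARecord("perception/example_a", "knowledge/example_y", "B", "B"),
    QARecord("reasoning/example_b", "knowledge/example_x", "C", "D"),
    QARecord("reasoning/example_b", "knowledge/example_y", "A", "A"),
]

print(overall_accuracy(records))
print(accuracy_by(records, "cognitive_task"))
print(accuracy_by(records, "knowledge_task"))
```

Because the same records are grouped twice, once per axis, a model can look strong on one axis label and weak on another without the two breakdowns interfering with each other.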

Problem

Research questions and friction points this paper is trying to address.

egocentric video

video question answering

esports

video-LLMs

reasoning benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

Egocentric Video

Video Question Answering

Esports Benchmark