🤖 AI Summary
This work addresses the challenge of distinguishing genuine glitches from legitimate anomalous behaviors in open-ended video game recordings by proposing the GliDe framework. GliDe integrates a game-aware contextual memory, a multi-agent debate-and-reflection mechanism, and an event-level temporal localization module, enabling multi-perspective reasoning and precise glitch verification. The study introduces VideoGlitchBench, the first benchmark supporting joint evaluation of semantic understanding and temporal localization, and designs a glitch detection architecture with agent-based reasoning capabilities. Experiments show that GliDe significantly outperforms existing foundation models in both semantic accuracy and temporal localization precision, achieving state-of-the-art performance and highlighting the inherent difficulty this task poses for multimodal large language models.
📝 Abstract
Open-ended video game glitch detection aims to identify glitches in gameplay videos, describe them in natural language, and localize when they occur. Unlike conventional game glitch understanding tasks, which have largely been framed as image-level recognition or closed-form question answering, this task requires reasoning directly over continuous gameplay videos about game-specific dynamics such as mechanics, physics, rendering, animation, and expected state transitions, and distinguishing true glitches from unusual but valid in-game events. To support this task, we introduce VideoGlitchBench, the first benchmark for open-ended video game glitch detection with temporal localization. VideoGlitchBench contains 5,238 gameplay videos from 120 games, each annotated with detailed glitch descriptions and precise temporal spans, enabling unified evaluation of semantic understanding and temporal grounding. We further propose GliDe, an agentic framework with three key components: a game-aware contextual memory for informed reasoning, a debate-based reflector for multi-perspective glitch detection and verification, and an event-level grounding module that recovers complete glitch intervals from fragmented temporal evidence. We also design a task-specific evaluation protocol that jointly measures semantic fidelity and temporal accuracy. Experiments show that this task remains highly challenging for current multimodal models, while GliDe achieves substantially stronger performance than the corresponding vanilla-model baselines.
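To make the joint evaluation idea concrete, here is a minimal sketch of one plausible scoring rule: a prediction counts only if its description is judged semantically correct *and* its predicted span overlaps the annotated span by enough temporal IoU. The function names and the 0.5 threshold are illustrative assumptions, not the paper's actual protocol.

```python
def temporal_iou(pred, gold):
    """IoU of two (start, end) time intervals in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

def joint_score(pred_span, gold_span, semantically_correct, iou_thresh=0.5):
    """Credit a prediction only when the glitch description is correct
    AND its temporal span overlaps the annotation sufficiently.
    (Threshold and gating are assumptions for illustration.)"""
    if not semantically_correct:
        return 0.0
    return 1.0 if temporal_iou(pred_span, gold_span) >= iou_thresh else 0.0
```

Gating temporal credit on semantic correctness prevents a model from scoring by localizing the right moment while describing the wrong glitch, which is the failure mode a unified protocol is meant to expose.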