RESP: Reference-guided Sequential Prompting for Visual Glitch Detection in Video Games

📅 2026-04-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language models (VLMs) are not robust at detecting visual glitches in real-world gameplay videos, largely because of scene variation. This work proposes RESP, a multi-frame framework that reframes glitch detection as a within-video comparison between a reference frame and a test frame. RESP sequentially prompts a VLM with reference/test frame pairs and aggregates the noisy frame-level predictions into a stable video-level decision, without any fine-tuning. The core contribution is reference-guided prompting, in which each test frame is paired with a reference frame selected from earlier in the same video. The authors also introduce RefGlitch, a synthetic dataset of manually labeled reference/test frame pairs with balanced coverage of five glitch types. Experiments across five VLMs and three datasets (one synthetic, two real-world) show that reference guidance consistently strengthens frame-level detection and that the improved frame-level evidence carries over to video-level triage.

📝 Abstract

Visual glitches in video games degrade player experience and perceived quality, yet manual quality assurance cannot scale to the growing test surface of modern game development. Prior automation efforts, particularly those using vision-language models (VLMs), largely operate on single frames or rely on limited video-level baselines that struggle under realistic scene variation, making robust video-level glitch detection challenging. We present RESP, a practical multi-frame framework for gameplay glitch detection with VLMs. Our key idea is reference-guided prompting: for each test frame, we select a reference frame from earlier in the same video, establishing a visual baseline and reframing detection as within-video comparison rather than isolated classification. RESP sequentially prompts the VLM with reference/test pairs and aggregates noisy frame predictions into a stable video-level decision without fine-tuning the VLM. To enable controlled analysis of reference effects, we introduce RefGlitch, a synthetic dataset of manually labeled reference/test frame pairs with balanced coverage across five glitch types. Experiments across five VLMs and three datasets (one synthetic, two real-world) show that reference guidance consistently strengthens frame-level detection and that the improved frame-level evidence reliably transfers to stronger video-level triage under realistic QA conditions. Code and data are available at: https://github.com/PipiZong/RESP_code.git
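
The pipeline described in the abstract (pair each test frame with an earlier reference frame, query the VLM on each pair in sequence, then aggregate the frame-level answers) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the reference frame is chosen or how predictions are aggregated, so the fixed look-back `ref_offset`, the vote-fraction rule `vote_threshold`, and the function names `detect_video_glitch` and `judge` are all assumptions made here. The `judge` callable is a hypothetical stand-in for a real VLM call.

```python
from collections.abc import Callable, Sequence
from typing import Any

# Hypothetical stand-in for a VLM query: given (reference_frame, test_frame),
# return True if the test frame appears to contain a glitch.
PairJudge = Callable[[Any, Any], bool]


def detect_video_glitch(
    frames: Sequence[Any],
    judge: PairJudge,
    ref_offset: int = 1,
    vote_threshold: float = 0.5,
) -> tuple[bool, list[bool]]:
    """Return (video_level_flag, frame_level_votes).

    Assumptions (not taken from the paper):
    - the reference for frame i is frame i - ref_offset of the same video;
    - the video is flagged when the fraction of frame-level "glitch"
      votes is at least vote_threshold.
    """
    if ref_offset < 1:
        raise ValueError("ref_offset must be >= 1 so the reference precedes the test frame")

    votes: list[bool] = []
    # Sequential prompting: one reference/test pair per query, in video order.
    for i in range(ref_offset, len(frames)):
        reference = frames[i - ref_offset]
        votes.append(bool(judge(reference, frames[i])))

    if not votes:
        # Too few frames to form any reference/test pair.
        return False, votes

    # Aggregate noisy frame-level predictions into one video-level decision.
    glitch_ratio = sum(votes) / len(votes)
    return glitch_ratio >= vote_threshold, votes
```

In practice `judge` would wrap a prompt that shows the VLM both images and asks whether the test frame deviates from the reference in a way that suggests a glitch. A lower `vote_threshold` favors recall, which may suit QA triage where missed glitches cost more than false alarms.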

Problem

Research questions and friction points this paper is trying to address.

visual glitch detection

video games

vision-language models

video-level analysis

quality assurance

Innovation

Methods, ideas, or system contributions that make the work stand out.

reference-guided prompting

visual glitch detection

vision-language models