How Far Can VLMs Go for Visual Bug Detection? Studying 19,738 Keyframes from 41 Hours of Gameplay Videos

📅 2026-03-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the labor-intensive and error-prone task of visual defect detection in long-term video game quality assurance. It presents the first systematic evaluation of off-the-shelf vision-language models (VLMs) for visual bug detection on a large-scale, real-world industrial QA dataset comprising 100 gameplay videos (41 hours in total) with 19,738 annotated keyframes. The work evaluates two zero-shot prompt enhancement strategies: a secondary judge model that re-evaluates the primary VLM's outputs, and metadata-augmented prompting that retrieves historical bug reports. Experimental results show that the single-prompt baseline already achieves 0.72 accuracy and 0.50 precision, while the enhancements yield only marginal gains at the cost of substantially increased computational overhead and output instability, revealing critical limitations of current approaches for practical deployment.

📝 Abstract
Video-based quality assurance (QA) for long-form gameplay video is labor-intensive and error-prone, yet valuable for assessing game stability and visual correctness over extended play sessions. Vision-language models (VLMs) promise general-purpose visual reasoning capabilities and thus appear attractive for detecting visual bugs directly from video frames. Recent benchmarks suggest that VLMs can achieve promising results in detecting visual glitches on curated datasets. Building on these findings, we conduct a real-world study using industrial QA gameplay videos to evaluate how well VLMs perform in practical scenarios. Our study samples keyframes from long gameplay videos and asks a VLM whether each keyframe contains a bug. Starting from a single-prompt baseline, the model achieves a precision of 0.50 and an accuracy of 0.72. We then examine two common enhancement strategies used to improve VLM performance without fine-tuning: (1) a secondary judge model that re-evaluates VLM outputs, and (2) metadata-augmented prompting through the retrieval of prior bug reports. Across **100 videos** totaling **41 hours** and **19,738 keyframes**, these strategies provide only marginal improvements over the simple baseline, while introducing additional computational cost and output variance. Our findings indicate that off-the-shelf VLMs are already capable of detecting a certain range of visual bugs in QA gameplay videos, but further progress likely requires hybrid approaches that better separate textual and visual anomaly detection.
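The pipeline described in the abstract (sample keyframes from a long video, ask a VLM whether each frame shows a bug, then optionally have a judge model re-evaluate flagged frames) could be sketched as below. This is a minimal illustration, not the authors' implementation: the uniform sampling interval, the `Keyframe` record, and the stub `vlm`/`judge` callables are all assumptions standing in for real video decoding and model API calls.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Keyframe:
    # Hypothetical record; a real pipeline would carry pixel data too.
    video_id: str
    timestamp_s: float

def sample_keyframes(duration_s: int, interval_s: int, video_id: str) -> List[Keyframe]:
    """Uniformly sample one keyframe every `interval_s` seconds (assumed policy)."""
    return [Keyframe(video_id, float(t)) for t in range(0, duration_s, interval_s)]

def detect_bugs(
    frames: List[Keyframe],
    vlm: Callable[[Keyframe], bool],
    judge: Callable[[Keyframe, bool], bool],
) -> List[Keyframe]:
    """Stage 1: the primary VLM flags candidate bug frames.
    Stage 2: a secondary judge re-evaluates only the flagged frames,
    mirroring the paper's judge-model enhancement strategy."""
    flagged = []
    for frame in frames:
        verdict = vlm(frame)
        if verdict and judge(frame, verdict):
            flagged.append(frame)
    return flagged

# Stub models for demonstration; real calls would hit a VLM endpoint.
frames = sample_keyframes(duration_s=10, interval_s=2, video_id="v1")
flagged = detect_bugs(
    frames,
    vlm=lambda f: f.timestamp_s >= 4.0,       # primary model flags late frames
    judge=lambda f, v: f.timestamp_s != 6.0,  # judge overturns one verdict
)
print([f.timestamp_s for f in flagged])  # → [4.0, 8.0]
```

The point of the two-stage shape is that the judge runs only on frames the primary model already flagged, which caps its extra cost; the paper reports that even so, the added compute buys only marginal accuracy gains.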
Problem

Research questions and friction points this paper is trying to address.

visual bug detection
vision language models
gameplay video QA
visual anomaly detection
long-form video analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Language Models
Visual Bug Detection
Gameplay Video QA
Metadata-Augmented Prompting
Real-World Evaluation