🤖 AI Summary
Manual review of gameplay-rich bug reports in game development is inefficient and poorly scalable due to the sheer volume of video evidence.
Method: We propose the first vision-language semantic matching framework for game bug video understanding, integrating FFmpeg-based keyframe extraction with the GPT-4o multimodal model to achieve fine-grained semantic alignment and ranking between video frames and natural language bug descriptions—enabling automatic identification of the most representative bug frames.
Contribution/Results: This work pioneers the application of vision-language models in game QA, supporting cross-modal, semantics-driven keyframe selection. Evaluated on a real-world FPS game dataset, our approach achieves 0.89 accuracy and 0.79 F1 score, demonstrating strong generalization across common defect categories—including lighting, physics, and UI bugs—while significantly reducing manual review effort.
📝 Abstract
Modern game studios deliver new builds and patches at a rapid pace, generating thousands of bug reports, many of which embed gameplay videos. To verify and triage these bug reports, developers must watch the submitted videos. This manual review is labour-intensive, slow, and hard to scale. In this paper, we introduce an automated pipeline that reduces each video to a single frame that best matches the reported bug description, giving developers instant visual evidence that pinpoints the bug.
Our pipeline begins with FFmpeg for keyframe extraction, reducing each video to a median of just 1.90% of its original frames while still capturing the bug moment in 98.79% of cases. These keyframes are then evaluated by a vision-language model (GPT-4o), which ranks them by how well they match the textual bug description and selects the most representative frame. We evaluated this approach on real-world developer-submitted gameplay videos and JIRA bug reports from a popular First-Person Shooter (FPS) game. The pipeline achieves an overall F1 score of 0.79 and an accuracy of 0.89 for the top-1 retrieved frame. Performance is highest for the Lighting & Shadow (F1 = 0.94), Physics & Collision (0.86), and UI & HUD (0.83) bug categories, and lowest for Animation & VFX (0.51).
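The two pipeline stages described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it builds an FFmpeg command that keeps only I-frames (one common way to extract keyframes) and a hypothetical text prompt for asking a vision-language model to rank the extracted frames against the bug description. The function names, filter settings, output pattern, and prompt wording are all illustrative assumptions.

```python
import subprocess  # used only if you uncomment the run() call below


def keyframe_cmd(video_path: str, out_pattern: str) -> list[str]:
    """Build an ffmpeg command that keeps only I-frames (keyframes).

    The select filter drops every frame whose picture type is not 'I';
    -vsync vfr prevents ffmpeg from duplicating frames to preserve the
    original timestamps. (Illustrative settings, not the paper's exact ones.)
    """
    return [
        "ffmpeg", "-i", video_path,
        "-vf", "select='eq(pict_type,I)'",
        "-vsync", "vfr",
        out_pattern,
    ]


def ranking_prompt(bug_description: str, n_frames: int) -> str:
    """Hypothetical prompt for a vision-language model that receives the
    keyframes as image inputs alongside this text."""
    return (
        f"Bug report: {bug_description}\n"
        f"You are shown {n_frames} keyframes from a gameplay video. "
        "Rank the frames by how well each one visually matches the reported "
        "bug, and return the index of the single most representative frame."
    )


cmd = keyframe_cmd("bug_clip.mp4", "frames/key_%04d.png")
prompt = ranking_prompt("Muzzle flash renders through walls", n_frames=12)
# subprocess.run(cmd, check=True)  # requires ffmpeg on PATH
```

In a real pipeline, the extracted frames would be attached to a multimodal API request together with the prompt, and the model's returned index would identify the frame to surface to developers.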
By replacing video viewing with an immediately informative image, our approach dramatically reduces manual effort and speeds up triage and regression checks, offering practical benefits to quality assurance (QA) teams and developers across the game industry.