Do Vision Language Models Understand Human Engagement in Games?

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study presents the first systematic evaluation of whether vision-language models (VLMs) can infer a player's implicit psychological state, specifically engagement, from first-person shooter gameplay videos. Using the GameVibe Few-Shot dataset, which spans nine games, the authors assess three VLMs under six prompting strategies, including zero-shot prediction, theory-guided prompts (grounded in Flow, GameFlow, Self-Determination Theory, and the MDA framework), and retrieval-augmented prompting, on two tasks: pointwise engagement prediction and pairwise detection of engagement change. Zero-shot methods consistently underperform a simple per-game majority-class baseline; retrieval augmentation offers limited gains in certain pointwise settings, while pairwise prediction remains persistently challenging across strategies. Theory-guided prompts fail to reliably improve performance and may instead reinforce superficial heuristics. Together, these results reveal a significant gap between VLMs' perceptual capabilities and genuine understanding of deep psychological states.
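
To make the two evaluation tasks concrete, here is a minimal sketch of pointwise and pairwise scoring against a per-game majority-class baseline. The integer engagement encoding (0 = low, 1 = medium, 2 = high), the function names, and the use of plain accuracy are illustrative assumptions, not the paper's exact protocol.

```python
from collections import Counter

def majority_baseline(train_labels):
    """Per-game baseline: always predict the engagement label that is
    most frequent in that game's training windows."""
    return Counter(train_labels).most_common(1)[0][0]

def pointwise_accuracy(preds, labels):
    """Pointwise task: one engagement label per gameplay window."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def change_direction(labels):
    """Direction of engagement change between consecutive windows:
    -1 (down), 0 (same), +1 (up)."""
    return [(b > a) - (b < a) for a, b in zip(labels, labels[1:])]

def pairwise_accuracy(change_preds, labels):
    """Pairwise task: the model predicts the change direction directly;
    ground truth is derived from consecutive window labels."""
    truth = change_direction(labels)
    return sum(p == y for p, y in zip(change_preds, truth)) / len(truth)

# Tiny illustration with the assumed 0/1/2 encoding:
labels = [1, 1, 2, 0]
baseline = majority_baseline(labels)               # -> 1
print(pointwise_accuracy([baseline] * 4, labels))  # 0.5
print(pairwise_accuracy([0, 0, 0], labels))        # always "no change" -> 1/3
```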

📝 Abstract
Inferring human engagement from gameplay video is important for game design and player-experience research, yet it remains unclear whether vision-language models (VLMs) can infer such latent psychological states from visual cues alone. Using the GameVibe Few-Shot dataset across nine first-person shooter games, we evaluate three VLMs under six prompting strategies, including zero-shot prediction, theory-guided prompts grounded in Flow, GameFlow, Self-Determination Theory, and MDA, and retrieval-augmented prompting. We consider both pointwise engagement prediction and pairwise prediction of engagement change between consecutive windows. Results show that zero-shot VLM predictions are generally weak and often fail to outperform simple per-game majority-class baselines. Memory- or retrieval-augmented prompting improves pointwise prediction in some settings, whereas pairwise prediction remains consistently difficult across strategies. Theory-guided prompting alone does not reliably help and can instead reinforce surface-level shortcuts. These findings suggest a perception-understanding gap in current VLMs: although they can recognize visible gameplay cues, they still struggle to robustly infer human engagement across games.
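
The abstract's three strategy families can be illustrated as prompt scaffolds. Everything below is a hypothetical sketch: the prompt wording, the `retrieve` similarity-search helper, and the clip `description`/`label` fields are assumptions, not the paper's actual prompts or retrieval setup.

```python
# Zero-shot: the model sees only the clip and the question.
ZERO_SHOT = (
    "You are shown a short first-person shooter gameplay clip. "
    "Rate the player's engagement as low, medium, or high."
)

# Theory-guided: the same question, grounded in one of the cited frameworks
# (here Flow; GameFlow, SDT, or MDA would be phrased analogously).
THEORY_GUIDED = ZERO_SHOT + (
    " Judge this through the lens of Flow theory: does the apparent "
    "challenge match the player's apparent skill, and does the player "
    "seem fully absorbed in the action?"
)

def retrieval_augmented_prompt(query_clip, support_set, retrieve, k=3):
    """Retrieval-augmented: prepend the k labeled support clips most similar
    to the query as in-context examples. `retrieve` stands in for whatever
    similarity search the pipeline uses."""
    shots = "\n".join(
        f"Example clip: {clip['description']} -> engagement: {clip['label']}"
        for clip in retrieve(query_clip, support_set, k=k)
    )
    return f"{shots}\n\n{ZERO_SHOT}"
```

As the abstract reports, only the retrieval-augmented family yielded (limited) pointwise gains; the theory-guided wording alone did not reliably help and could reinforce surface-level shortcuts.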
Problem

Research questions and friction points this paper is trying to address.

vision-language models
human engagement
gameplay video
psychological states
player experience
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language models
human engagement
gameplay video analysis
theory-guided prompting
retrieval-augmented prompting