🤖 AI Summary
This study addresses the underexplored challenge of modeling continuous player engagement in first-person shooter (FPS) game videos using pre-trained multimodal large language models (MLLMs). Method: Leveraging 80 minutes of annotated gameplay footage from the GameVibe corpus, we conduct over 2,400 ablation experiments, constructing multimodal prompts that combine frame-level video sampling with textual instructions, and applying temporal smoothing and normalization to the ground-truth signals to enhance label reliability. Contribution/Results: To our knowledge, this is the first work to adapt MLLMs for fine-grained, continuous affective annotation, disentangling the effects of model architecture, scale, modality fusion strategy, prompt design, and ground-truth processing. While MLLMs generally underperform human annotators in absolute accuracy, specific model-prompt combinations (e.g., GPT-4V with temporally augmented prompting) significantly outperform baselines in short-term trend prediction. We establish a reproducible, LLM-driven affective computing benchmark, revealing both fundamental limitations and promising opportunities for modeling dynamic, subjective user experience.
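The ground-truth processing step can be illustrated with a minimal sketch; the moving-average window and min-max normalization scheme below are assumptions for illustration, not necessarily the paper's exact settings.

```python
import numpy as np

def smooth_and_normalize(engagement, window=5):
    """Temporal smoothing + normalization of a continuous engagement trace.

    Assumed scheme: moving-average smoothing followed by min-max scaling;
    the window size is a hypothetical parameter, not the paper's setting.
    """
    kernel = np.ones(window) / window
    smoothed = np.convolve(engagement, kernel, mode="same")
    lo, hi = smoothed.min(), smoothed.max()
    if hi == lo:  # flat trace: avoid division by zero
        return np.zeros_like(smoothed)
    return (smoothed - lo) / (hi - lo)

# Example: a noisy per-second engagement trace from one annotator
trace = np.array([0.2, 0.4, 0.35, 0.8, 0.75, 0.9, 0.6])
print(smooth_and_normalize(trace, window=3))
```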
📝 Abstract
Can out-of-the-box pretrained Large Language Models (LLMs) successfully detect human affect when observing a video? To address this question, for the first time, we comprehensively evaluate the capacity of popular LLMs to annotate and predict continuous affect labels for videos when prompted with a sequence of text and video frames in a multimodal fashion. In particular, we test the ability of LLMs to correctly label changes in in-game engagement across 80 minutes of annotated videogame footage from 20 first-person shooter games of the GameVibe corpus. We run over 2,400 experiments to investigate the impact of LLM architecture, model size, input modality, prompting strategy, and ground truth processing method on engagement prediction. Our findings suggest that while LLMs rightfully claim human-like performance across multiple domains, they generally fall behind in capturing continuous experience annotations provided by humans. We examine some of the underlying causes of this relatively poor overall performance, highlight the cases where LLMs exceed expectations, and draw a roadmap for the further exploration of automated emotion labelling via LLMs.
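For readers curious how such a frame-plus-text multimodal prompt might be assembled, here is a minimal sketch assuming the OpenAI Python SDK; the model name, frame file names, and instruction wording are hypothetical and not taken from the paper.

```python
import base64
from openai import OpenAI  # assumed SDK; any MLLM API accepting image inputs works

client = OpenAI()

def engagement_prompt(frame_paths, instruction):
    """Build one multimodal message: a text instruction followed by
    base64-encoded gameplay frames sampled from the video."""
    content = [{"type": "text", "text": instruction}]
    for path in frame_paths:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode("utf-8")
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})
    return [{"role": "user", "content": content}]

# Hypothetical sampled frames and instruction, for illustration only
messages = engagement_prompt(
    ["frame_000.jpg", "frame_001.jpg"],
    "Rate the player's engagement in these gameplay frames on a 0-1 scale, "
    "and state whether it increased or decreased versus the previous window.",
)
response = client.chat.completions.create(model="gpt-4o", messages=messages)
print(response.choices[0].message.content)
```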