🤖 AI Summary
Large language models (LLMs) deployed in video games—e.g., for NPC behavior, opponent modeling, and scenario generation—risk propagating societal biases, undermining fairness and game balance. To address this, we introduce FairGamer, the first LLM bias evaluation benchmark specifically designed for gaming contexts. It encompasses three core gaming scenarios and defines six quantitative tasks, alongside a novel bias metric, $D_{\text{lstd}}$, measuring lexical standard deviation across socially sensitive dimensions. Leveraging scenario-based simulations that integrate real-world cultural backgrounds with fictional game content—and conducting cross-genre experiments—we systematically demonstrate, for the first time, that LLMs exhibit isomorphic sociocultural biases in both real and virtual settings. Empirical results reveal severe imbalance: e.g., Grok-3 yields an average $D_{\text{lstd}}$ of 0.431, confirming bias stems from intrinsic model limitations. FairGamer establishes a standardized, empirically grounded framework for trustworthy AI assessment in games.
📝 Abstract
Leveraging their advanced capabilities, Large Language Models (LLMs) demonstrate vast application potential in video games--from dynamic scene generation and intelligent NPC interactions to adaptive opponents--replacing or enhancing traditional game mechanics. However, the trustworthiness of LLMs in these applications has not been sufficiently explored. In this paper, we reveal that models' inherent social biases can directly damage game balance in real-world gaming environments. To this end, we present FairGamer, the first bias evaluation benchmark for LLMs in video game scenarios, featuring six tasks and a novel metric, $D_{\text{lstd}}$. It covers three key scenarios in games where LLMs' social biases are particularly likely to manifest: Serving as Non-Player Characters, Interacting as Competitive Opponents, and Generating Game Scenes. FairGamer utilizes both reality-grounded and fully fictional game content, covering a variety of video game genres. Experiments reveal that: (1) decision biases directly cause game balance degradation, with Grok-3 (average $D_{\text{lstd}}$ score = 0.431) exhibiting the most severe degradation; (2) LLMs demonstrate isomorphic social/cultural biases toward both real- and virtual-world content, suggesting that their biases may stem from inherent model characteristics. These findings expose critical reliability gaps in LLMs' gaming applications. Our code and data are available at an anonymous GitHub repository: https://github.com/Anonymous999-xxx/FairGamer .
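The abstract describes $D_{\text{lstd}}$ as a lexical standard deviation over socially sensitive dimensions, where a higher score indicates a more imbalanced (biased) distribution of model decisions. The exact FairGamer formulation is not given here, so the following is only a minimal illustrative sketch under one plausible reading: take the model's selection counts across the groups of a sensitive dimension, normalize them to frequencies, and report their standard deviation (0 means perfectly balanced choices). The function name `lexical_std_bias` is a hypothetical helper, not the benchmark's API.

```python
import numpy as np

def lexical_std_bias(counts):
    """Illustrative D_lstd-style score (assumed form, not the official
    FairGamer definition): standard deviation of normalized selection
    frequencies across the groups of one sensitive dimension.

    counts: iterable of non-negative selection counts, one per group.
    Returns 0.0 when every group is chosen equally often; larger values
    indicate a more skewed, i.e. more biased, decision distribution.
    """
    freqs = np.asarray(counts, dtype=float)
    freqs = freqs / freqs.sum()          # normalize counts to frequencies
    return float(np.std(freqs))          # spread around the uniform mean

# A perfectly balanced NPC dialogue-choice distribution scores 0.0,
# while concentrating all choices on one group scores much higher.
balanced = lexical_std_bias([10, 10, 10, 10])   # -> 0.0
skewed = lexical_std_bias([40, 0, 0, 0])
```

Under this reading, comparing scores across models (as with Grok-3's reported average of 0.431) amounts to comparing how far each model's decision frequencies drift from uniform across the benchmark's sensitive dimensions.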